This article was sponsored by PagerDuty. Thank you for supporting the sponsors who make SitePoint possible.
The word DevOps is a portmanteau of two words: development and operations, and it’s a relatively new term used in agile system administration.
In the past, developers would build products, services and infrastructure, and then the responsibility for maintaining them would shift to sysadmins.
Instead, DevOps emphasises communication and collaboration between developers and IT system operators and integrates people better into the software workflow — from development to production — keeping a closer connection to agile values and principles.
When you’re providing SaaS, customers expect 24/7 uptime. When your marketing team is selling your product as an indispensable piece of software, your customers won’t know what to do when it goes down. Adopting a DevOps approach means everyone is invested in writing good code and keeping uptime strong.
No matter what the team structure, keeping systems up at all times inevitably means having your DevOps team on-call. Problems don’t come up according to a schedule, so you’ll need a team prepared to deal with them at all hours.
Being on-call can be daunting to a new team member. Solving problems and fixing issues is easy during the day when you have caffeine in the bloodstream, music on the stereo, and team members you can call on.
But it’s a little different when you’re awoken at 2am by an SMS and need to quickly take action to fix an issue that’s potentially costing your company a lot of money every second.
Below are some tips I’ve picked up over the years of being on-call for SitePoint and other organizations.
Generally we make sure everyone who is on-call is aware of those on the same roster. Once an alert goes out, the person who responds first is tasked with fixing the issue directly, while others can help by offering advice and information to make the fix easier.
This approach avoids a situation where multiple developers, all trying to fix the same issue, unintentionally make the problem worse, or create new problems. This can be an implicit agreement, as we have, or someone can take the lead at the time. Either way, having a person “on point” keeps things clear and prevents further issues.
Other implicit cues can help your dev team to be aware of a situation. I’ll only ever log in to our chat client out of office hours to address an issue, so if someone sees me online, they’ll know I’m busy fixing a problem and can offer help.
Knowing other people are going to be woken up by a second or third alert if I don’t fix the problem quickly also acts as a strong motivator.
It’s also important to cultivate a culture of collaboration for DevOps. If someone is taking control of a fix, but is out of their depth and feeling under pressure, it’s important that they know they can ask for help from others without being grilled for being slow with a fix.
I take my laptop with me almost everywhere I go when I’m on-call. The exception is when I know I’ll be popping out for a short trip and access to a machine set up for work will only be a few minutes away. I’m quite used to it now, and it doesn’t hinder me.
Once you’re in the habit of ensuring a laptop is always close by, you don’t even think of it. The most frustrating aspect is if I want to go to the pool, where I may not hear alerts on my phone, or go for a run. If it’s the latter, I can still do so, but I’ll have to run laps close to home, rather than a long run that takes me far away from a computer.
This may sound obvious, but make sure your machine can cope with the tasks you’ll be asking of it. Does it have the credentials and certificates that you need? Does it have Bluetooth connectivity to enable tethering? If your on-call machine isn’t one you normally use, make sure it’s up-to-date and test it regularly.
Make sure your house is on-call-friendly before an issue arises. You’ll need a consistent physical environment to make sure you can navigate it in the dark while stressed and confused after sleep.
Some nights I’ve stayed up late and gone to bed, only to be awoken by an SMS 20 minutes later. That’s the worst possible time.
When I’m woken up in a haze, barely able to open my eyes or stand, at least I’ll know exactly where my phone is, where my computer is, and I can make it there without really having to think about it. From there I can set about fixing whatever problem has arisen.
Another tip: Choose a dark background for your desktop wallpaper and keep the brightness turned down — or use an app like Redshift — to avoid being blinded as you log in to fix an issue.
Early on in my employment at SitePoint, when I first took over the sysadmin duties, almost every single day I was woken up at the very early hours of the morning by an alarm. I was a complete wreck by the end of it. That’s a pretty big motivator to fix things.
After working on the underlying issues, now I get an alert maybe once a month outside of business hours.
These days our infrastructure is such that issues are more likely to present themselves while everyone is in the office, when developers push new code.
Improving your infrastructure — and putting it to the test outside of a real problem — means you can be confident in your systems and know what to do when something goes wrong.
Having knowledge of your organization’s dependencies and how they relate can help you to quickly understand the root cause of an issue. A lot of times I’m able to solve problems quicker because I know how the system is laid out and so I know what to check first. Of course, the hardest problems are the ones you don’t expect.
Related to this: Make sure your documentation is up to date and covers all the bases. Bad documentation can hinder rather than help.
Of course, keeping yourself sane while on-call is easier when you have tools that get out of your way and help you focus on the things that matter.
PagerDuty is an operations performance platform aimed at giving you a single view of your infrastructure, meaning events and incidents can be handled by a team spread across the world, with everyone aware of issues as they come up.
This level of situational awareness extends further, with the service offering detailed analytics measuring team and system performance for incident response, as part of their enterprise plan. With tools like these you can help improve your team’s mean time to acknowledge an incident, as well as its mean time to resolve.
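To make those two metrics concrete, here is a minimal sketch of how mean time to acknowledge and mean time to resolve could be calculated by hand. The incident records and the `mean_minutes` helper are made up for illustration; this is not how PagerDuty's analytics are implemented.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 4), datetime(2024, 1, 1, 2, 30)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 2), datetime(2024, 1, 8, 14, 20)),
]


def mean_minutes(records, start_index, end_index):
    """Average gap in minutes between two timestamps of each incident."""
    gaps = [
        (record[end_index] - record[start_index]).total_seconds() / 60
        for record in records
    ]
    return mean(gaps)


mtta = mean_minutes(incidents, 0, 1)  # trigger -> acknowledge
mttr = mean_minutes(incidents, 0, 2)  # trigger -> resolve
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking both numbers over time shows whether changes to the roster or the infrastructure are actually helping.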
During an incident, alerts come through email, phone call, push notification, or through integrations with other services. The service has continuous routing and automatic escalation to make sure every alert is given the attention it warrants.
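The idea behind automatic escalation can be sketched in a few lines. This is only an illustration under simple assumptions: the `next_responder` function, the roster names, and the 15-minute timeout are hypothetical and do not reflect PagerDuty's actual routing logic.

```python
def next_responder(roster, minutes_since_alert, ack_timeout_minutes=15):
    """Return who should be notified after this many minutes without an acknowledgement.

    Each roster level gets ack_timeout_minutes to respond before the alert
    moves to the next level; the last level keeps the alert indefinitely.
    """
    level = int(minutes_since_alert // ack_timeout_minutes)
    return roster[min(level, len(roster) - 1)]


# Hypothetical escalation order.
roster = ["primary", "secondary", "manager"]

print(next_responder(roster, 5))    # still within the primary's window
print(next_responder(roster, 20))   # primary did not acknowledge in time
```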
PagerDuty has more than 100 integrations with services like AppDynamics, Crashlytics, New Relic and Sensu. But if a service you use isn’t on the list, the PagerDuty API can work with any system that can make an HTTP API call or send an email.
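As a rough sketch of what "make an HTTP API call" can look like, the snippet below builds a trigger event and posts it as JSON using only the Python standard library. The endpoint and field names follow my understanding of PagerDuty's Events API v2, so check them against the current documentation before relying on them; the routing key, summary, and source values are placeholders.

```python
import json
import urllib.request

# Assumed Events API v2 endpoint; verify against the current PagerDuty docs.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON body for a 'trigger' event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_event(event):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Placeholder values; a real routing key comes from your own service integration.
event = build_trigger_event("YOUR_ROUTING_KEY", "Disk usage above 95%", "web-01.example.com")
print(json.dumps(event, indent=2))
# send_event(event)  # uncomment once a real routing key is in place
```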
When it comes to making on-call easier, PagerDuty has a wealth of scheduling options, with Follow-the-Sun schedules for global teams, meaning each team member in a given location can work during business hours (never wake up at 2am again!). There are also options for secondary on-call rosters to automatically escalate an incident if the first person does not respond (it happens!).
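A Follow-the-Sun schedule boils down to asking "where is it currently business hours?" Here is a minimal sketch of that check. The team locations, the fixed UTC offsets (which ignore daylight saving), and the 9-to-5 window are all assumptions for illustration, not PagerDuty's scheduling model.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical team locations with fixed UTC offsets (daylight saving ignored).
TEAM = {"Melbourne": 10, "London": 0, "San Francisco": -8}


def on_call_locations(now_utc, team=TEAM, start_hour=9, end_hour=17):
    """Return the locations where the local time falls within business hours."""
    available = []
    for location, offset_hours in team.items():
        local_time = now_utc.astimezone(timezone(timedelta(hours=offset_hours)))
        if start_hour <= local_time.hour < end_hour:
            available.append(location)
    return available


# 16:00 UTC is 2am in Melbourne under the assumed offset, so London takes the alert.
print(on_call_locations(datetime(2024, 1, 1, 16, 0, tzinfo=timezone.utc)))
```

With only three locations there are still hours when nobody is in the office, which is where a secondary roster or a traditional overnight shift fills the gap.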
The service is also smart about avoiding the “crying wolf” alerts, sending one alert for each incident in a service you’re responsible for, and only when that incident requires urgent action. If multiple services are generating alerts at the same time, PagerDuty will bundle the alerts and notify you once (you’ll still be able to see each individually).
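The bundling idea can be illustrated with a simple time-window grouping. This is a sketch of the general technique only: the `bundle_alerts` function, the five-minute window, and the sample alerts are hypothetical, and PagerDuty's real grouping rules are not described in this article.

```python
def bundle_alerts(alerts, window_seconds=300):
    """Group alerts that arrive within window_seconds of the first alert in a bundle.

    Each bundle would produce a single notification, while the individual
    alerts stay available for inspection.
    """
    bundles = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if bundles and alert["time"] - bundles[-1][0]["time"] <= window_seconds:
            bundles[-1].append(alert)
        else:
            bundles.append([alert])
    return bundles


# Hypothetical alerts; "time" is seconds since the first alert fired.
alerts = [
    {"service": "web", "time": 0},
    {"service": "db", "time": 60},
    {"service": "cache", "time": 1000},
]

bundles = bundle_alerts(alerts)
print(f"{len(alerts)} alerts, {len(bundles)} notifications")
```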
Once you’ve resolved an incident and have caught your breath, you can dive into what went wrong with the service’s detailed event timelines, and then look for root causes or trends with its analytics services.
Being on-call can be a daunting experience for a new DevOps team member. But with the right approach, a culture of collaboration, knowledge of the infrastructure, and the right tools, the experience of a 2am wake-up call can be manageable, and you can solve issues without losing too much sleep — or sanity.
How do you manage being on-call? Do you have any tips? Have you tried PagerDuty? Let us know in the comments below.
Translated from: https://www.sitepoint.com/beginners-guide-being-on-call/