This article was sponsored by VictorOps. Thank you for supporting the sponsors who make SitePoint possible.
What happens when you’re notified that your mission-critical application or website has gone down or isn’t working correctly? Sure, most teams have an extensive array of services that tell them when something isn’t right, but what do you do when it comes time to implement a solution? You need a plan, and the ability to implement it quickly. The DevOps approach may appear simple, but in practice it’s complicated. Fortunately, there are services that make the process easier.
One of these services is VictorOps. VictorOps is there to help make DevOps easier, in all aspects of the on-call process. They do this by providing an extensive array of features, each one aimed towards better managing the inner workings of a diverse DevOps team.
The VictorOps platform includes features like on-call management, incident notification, advanced timeline capabilities, team collaboration, and alert annotations and transformations. Each of these features can be customized and can prove useful for any team, regardless of size or capability. When VictorOps is supporting your team, many of the challenges faced by DevOps engineers begin to fade away (alert fatigue is real!), allowing your team to return to a productive and helpful state.
Each incident has a lifecycle of its own, from the moment the alert comes in to the post-mortem held when it’s over. With that in mind, let’s take a look at how VictorOps helps you work through the incident lifecycle and solve the problem faster.
Sending your alerts to VictorOps
Alerts come through all the time, and while some contain important information, many are irrelevant and unhelpful. The first thing VictorOps helps you achieve in the incident lifecycle, then, is ensuring through advanced routing that the right people are alerted to issues within their problem domain.
Advanced routing allows you to programmatically alert on-call team members to issues that need attention. Once VictorOps comes across an alert you’ve defined as critical, it begins the paging process in exactly the manner you’ve set up, and that process is itself open to plenty of customization and user-specific settings. Users can be paged about incoming alerts via push notifications, SMS, email, and phone, all at specific, predetermined intervals.
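To make the routing and paging idea concrete, here is a minimal Python sketch. It is not the VictorOps API: the `routing_key` field, the team names, the `PagingStep` type, and the interval values are all assumptions made up for illustration. It only shows the general shape of mapping an alert to a team and paging over several channels at predetermined intervals until someone acknowledges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagingStep:
    delay_minutes: int  # minutes after the alert arrives
    channel: str        # "push", "sms", "email", or "phone"


# Hypothetical per-user paging policy: escalate through channels over time.
DEFAULT_POLICY = [
    PagingStep(0, "push"),
    PagingStep(5, "sms"),
    PagingStep(10, "phone"),
]

# Hypothetical mapping from an alert's routing key to the team that owns it.
ROUTES = {"database": "dba-team", "frontend": "web-team"}


def route_alert(alert, routes, default_team="ops-team"):
    """Return the team responsible for this alert, or a fallback team."""
    return routes.get(alert.get("routing_key"), default_team)


def channels_due(policy, minutes_elapsed, acknowledged):
    """List the channels that should have fired by now; none once acknowledged."""
    if acknowledged:
        return []
    return [step.channel for step in policy if step.delay_minutes <= minutes_elapsed]
```

For example, `route_alert({"routing_key": "database"}, ROUTES)` picks `"dba-team"`, and six minutes into an unacknowledged alert `channels_due(DEFAULT_POLICY, 6, False)` reports that the push and SMS pages are due.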
The Transmogrifier is a recently launched VictorOps feature that greatly increases the value an average alert can provide. It allows alerts to be escalated when certain conditions are met, annotated with documentation and developer-specific notes, and much more. You can view a detailed overview of the Transmogrifier here.
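The kind of rule described above (match a condition, then escalate and annotate) can be sketched in a few lines of Python. This is only an illustration of the concept, not the Transmogrifier's actual rule syntax: the rule layout, the field names, and the runbook URL are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rule: when the message mentions a full disk, raise the
# severity and attach a runbook link for whoever gets paged.
RULES = [
    {
        "match": {"field": "message", "pattern": "*disk*full*"},
        "set": {"severity": "CRITICAL"},
        "annotate": {"runbook": "https://wiki.example.com/runbooks/disk-full"},
    },
]


def transform(alert, rules):
    """Return a copy of the alert with every matching rule applied."""
    result = dict(alert)
    annotations = dict(result.get("annotations", {}))
    for rule in rules:
        condition = rule["match"]
        value = str(result.get(condition["field"], "")).lower()
        # Wildcard match against the lower-cased field value.
        if fnmatchcase(value, condition["pattern"]):
            result.update(rule.get("set", {}))
            annotations.update(rule.get("annotate", {}))
    result["annotations"] = annotations
    return result
```

Here an alert whose message is "Disk /var is FULL on db01" would come out with its severity set to `CRITICAL` and a `runbook` annotation attached, while an unrelated alert passes through with no annotations added.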
After filtering through your alerts and notifying the developers on call, the VictorOps timeline is there to help you see the full scope of the incident as it unfolds. This timeline is accessible from desktop and mobile devices, allowing you to help solve issues in and outside of the office. The timeline is also multi-threaded, which means that you can use it to gain situational awareness of other alarms from your systems that may be contributing to the issue, rather than seeing only the limited information relevant to a single alert. Think of the timeline as the main point of focus in the VictorOps platform. It shows all alerts coming from your system, who’s being paged, and the conversation around problem identification and resolution.
Members of your DevOps team can see the Incident Pane, which is a distilled view of the critical alarms in their system. From there they can acknowledge an issue or re-route it to one or more teams. They can also filter the Incident Pane by items that are paging them, items that are paging teams they’re on, or all paging events.
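The three Incident Pane filters can be pictured as simple predicates over the list of open incidents. The sketch below is an assumption-laden illustration in Python, not VictorOps code: the `Incident` type, its fields, and the view names `"mine"`, `"my_teams"`, and `"all"` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    id: int
    summary: str
    paged_users: set = field(default_factory=set)  # users currently being paged
    paged_teams: set = field(default_factory=set)  # teams currently being paged


def filter_incidents(incidents, view, user, user_teams):
    """Filter incidents the way the three pane views are described."""
    if view == "all":
        return list(incidents)
    if view == "mine":
        return [i for i in incidents if user in i.paged_users]
    if view == "my_teams":
        teams = set(user_teams)
        # Keep incidents paging at least one team the user belongs to.
        return [i for i in incidents if i.paged_teams & teams]
    raise ValueError(f"unknown view: {view}")
```

A user on the hypothetical `dba-team` would see, under `"my_teams"`, every incident paging that team, even when the page itself went to a colleague.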
During the entire incident lifecycle, VictorOps provides extensive communication tools to ensure your team can work together. That includes Twitter conventions like @messaging and chat platform integration (although VictorOps has taken this concept further with @@messaging, which allows you to ping an entire team in your chat). Speaking of chat, VictorOps offers robust integration with whatever chat client your company uses, including bidirectional integration with Slack and HipChat. Users can even chat within specific incidents so that their notes become part of the incident resolution log.
Even though an alert may be resolved, the incident lifecycle has not ended. It’s always important to collect information on how your team handles alerts, so that improvements can be made when necessary. That’s why VictorOps provides users with the Post-Mortem tool. This tool allows you to pull a section of the timeline for use in retrospectives and in reporting on SLAs for internal and external stakeholders.
VictorOps supports ‘continuous documentation’ via Incident Frequency and Post-Mortem reports, which facilitate discussion around whether all alerts are actionable and, if so, whether the Runbooks and Triage documentation are up to date.
So let’s say you’re all set up on the VictorOps platform and your first alert comes through. What happens? Well, before the alert even gets to you, the Transmogrifier has been hard at work ensuring that the right alerts get to the right people, and that all of the information you need to solve the problem comes with them. You might even be able to stop there, simply because the Transmogrifier handles so much for you that once-difficult problems get solved in minutes. But let’s pretend that this alert is notifying you of a particularly challenging error. Using the custom filters that the Transmogrifier provides, a few other team members are notified of the issue, ensuring that all of the right people on your team are in on the firefight. So what’s next?
The next and most helpful thing to do would be to visit the VictorOps Timeline. This is where you can get a bird’s-eye view of the incident as it unfolds. Since this particular issue is a big one, you’ll probably be getting a few other alerts and warnings related to it. Not to worry, though: the Incident Pane lets you see these coming from a mile away, so instead of getting confused and potentially wasting resources, you can disregard the new alerts, knowing they’ll disappear once the larger issue is resolved.
It’s a good thing your lead developers have access to the Incident Pane, because some of your team members are realizing that they’ll need assistance from a few other developers. Your leads can quickly page more team members, bringing more support into the firefight. But how are you all staying in touch with one another? VictorOps chat integration, of course! In the past, a lot of these issues would be handled over email, leading to confusion and poor response times. But now you have the power of VictorOps at your fingertips, and with it comes a host of great communication tools, ensuring that all of your team members are on the same page.
Eventually (quickly, hopefully!) the alert is resolved, with minimal downtime and an overall positive experience for your team members. But we can’t stop just yet! At tomorrow morning’s scrum, you’ll want to go over the issue with your team, detailing what went wrong, how it was fixed, and what can be done better next time to prevent it. That’s where VictorOps’ Post-Mortem tool comes in. With the Post-Mortem tool, you can pull the most relevant section of the Alert Timeline to show the most critical events of the entire alert lifecycle. Using this information, you’re able to help your team form a plan to ensure that the issue you’ve solved today doesn’t become an issue you’ll have to solve again tomorrow.
Using VictorOps allows for better communication, planning, and post-mortems before, during, and after every incident. In the example I gave, a DevOps team with a basic plan and the tools to put it into effect managed to solve an issue in a much more organized, streamlined way than traditional approaches allow. The VictorOps platform enables not only faster response times but also faster resolution times. Most importantly, VictorOps is there for your on-call team, providing a host of features to ensure their productivity while limiting alert fatigue. If you’re interested in how VictorOps will work for your team, click here to try it out, free, for 14 days!
To see how VictorOps can help you through the entire incident lifecycle, check out their guide that breaks each phase down individually.
Original article: https://www.sitepoint.com/alert-post-mortem-manage-outage-victorops/