“The best laid schemes of mice and men, Gang aft agley”
Last week here at SitePoint we were very proud to relaunch the brand new SitePoint.com.
However, those who were looking at the website the week before would have noticed we actually launched the week prior, only to roll back after some problems. Yes, despite rigorous testing of a perfectly functional staging and production deployment that had been in use for over a week, our best laid plans certainly went agley.
We’d like to share some lessons we learned or had reinforced through the experience, to help out those who might be relaunching existing websites.
No one wants a launch to go badly. But sometimes it does, and when things are going south, you want to be able to quickly flip to a status page that is a little more pleasant for your visitors to see than a horrible server error message.
The current SitePoint status page is hosted on GitHub Pages, which gives us an externally hosted page that shouldn’t be affected by any main site downtime.
Test that you can switch to your status page quickly.
Rolling back to a previous version, whilst not desirable, should always be an option. As our new setup was being run on totally different infrastructure, we could quickly and safely roll back to our old site by changing a few DNS entries. If you need to run migrations over the existing data set, make sure you’ve taken a snapshot before you start your migrations.
Once we decided we needed more time to address the filesystem errors, being able to roll everything back and get some sleep was very important.
Never put yourself in a position where your ONLY option is to fix a broken setup.
Load testing is really important, and before we did our first launch, we used the excellent Loader.io to do some benchmarking against the current site and the new setup. This allowed us to spot some caching inefficiencies and correct them, getting the new SitePoint to consistently hit DOMContentLoaded in under two seconds, which is a threefold improvement over the old site!
Unfortunately, one area in which we failed was load testing over multiple pages. Not all of your visits are going to hit the same page, so your load testing should reflect this as well. Visiting multiple pages also puts all parts of your technology stack under test. In our case, the part of our stack that fell over wasn’t in use on the homepage, so load testing there was never going to show the critical problem that we found out about very quickly when we pushed the go live button.
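As a rough illustration of spreading load across several pages, here is a minimal Python sketch using only the standard library. It is not how Loader.io works and not what we ran; the paths in PATHS, the function names and the default numbers are all made up for the example, and a real test would use a proper load testing tool.

```python
import http.client
import random
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical paths: choose pages that exercise different parts of your stack.
PATHS = ["/", "/search/?q=css", "/an-article/", "/category/php/"]


def fetch(url, timeout=10):
    """Request one URL; return (status, seconds). Status is None on any failure."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except (OSError, http.client.HTTPException):
        # Covers HTTP errors, timeouts, refused connections and broken responses.
        status = None
    return status, time.perf_counter() - start


def run_load_test(base_url, paths, total=100, workers=10, fetcher=fetch):
    """Spread `total` requests randomly across `paths`; summarise results per path."""
    chosen = [random.choice(paths) for _ in range(total)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda path: fetcher(base_url + path), chosen))

    summary = {}
    for path, (status, seconds) in zip(chosen, results):
        entry = summary.setdefault(path, {"ok": 0, "failed": 0, "slowest": 0.0})
        entry["ok" if status == 200 else "failed"] += 1
        entry["slowest"] = max(entry["slowest"], seconds)
    return summary
```

Calling run_load_test with the address of your own staging server and PATHS returns a per-path count of successes and failures, which makes it obvious when one section of the site falls over while the homepage stays healthy.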
As developers, we take a certain pride in knowing that what we build can take all kinds of stresses and load, and hate to think of our application falling over – that’s only natural. But, do you know how much load your application can take before it starts to split at the seams? And which part of your application will feel it first?
While we were rebuilding the shared storage part of the technology stack, we hit our deployment with a huge amount of traffic, until it fell over. This allowed us to know how much traffic we could sustain (well over 10 times our regular traffic), what part of the stack fell over when under that pressure (the load balancers) and what we would have to do and how long it would take to get it working again (around 15 minutes).
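The idea of pushing traffic up until something breaks can be sketched as a simple ramp. This is only an outline under assumptions: error_rate_at is a hypothetical function you would supply that runs a load test at a given level and returns the fraction of failed requests, and the starting level, multiplier and threshold are arbitrary.

```python
def find_breaking_point(error_rate_at, start=10, factor=2, limit=10_000,
                        max_error_rate=0.05):
    """Multiply the load level by `factor` until the error rate exceeds the threshold.

    Returns (last_level_that_held, first_level_that_failed). The second value
    is None if the deployment survived every level up to `limit`.
    """
    level = start
    last_ok = 0
    while level <= limit:
        if error_rate_at(level) > max_error_rate:
            return last_ok, level
        last_ok = level
        level *= factor
    return last_ok, None
```

The two numbers bracket the breaking point; finding out which part of the stack failed, and how long recovery takes, still means watching the servers while the test runs.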
This kind of insight allows us to forward plan where we need to make improvements in our deployment, and we’ve already started our plans to reduce the complexity of our technology stack.
This might sound like the most obvious advice in the world, right? I mean, who launches on a Friday or just before you’re about to head home? Right? Right?
Unfortunately, almost all of us have made this mistake at least once in our career. We test things for days and days, work like madmen to get them out the door before a deadline, and before you know it, it’s 4pm. Your boss says to you, “We ready to go?”, and you reply with the kind of optimism that really should have been blunted by many years of experience. “Sure, we’re ready to go!”
So you push the go live button, things creak and strain, and look to be working fine. Congratulations are distributed all round and everyone goes home. A few hours pass, and then, everything starts happening.
After testing for numerous days, we pushed the button around 4.30pm on Wednesday afternoon, Melbourne time. That’s ahead of most of the time zones our users are in, from a few hours ahead of South East Asia through to 17 hours ahead of San Francisco.
The first signs that something was up came around 7.30pm, when people first started reporting slowdowns and random disconnects. Then the disconnects became less random and more common, and before we knew it, the whole site was unresponsive. After some diagnosing, it was found that our shared storage solution running DRBD had locked up, causing anything that accessed files on it to also lock up. Eventually this meant all Apache threads became locked up and no more requests were served.
We worked on this problem for a few hours, trying to unlock the filesystem, and by around midnight the website was up and running again – for about 10 minutes. One of the DRBD nodes had a kernel bug that prevented any further saving, and at around 2.30am the tough call was made to roll back to the old website.
After spending Thursday and Friday working on a different solution to WordPress’ shared storage conundrum, we had another potential opportunity to launch the website on Monday afternoon. However, not wanting to make the same mistake twice, the decision was made to launch first thing Tuesday morning. This proved to be a wise move, as inevitably there were small things that needed fixing up, and this was much easier to do with the whole day ahead, rather than after hours post launch.
In this age of launching applications from cloud services such as AWS and RackSpace Cloud, it is vitally important that you can bring up new servers with an absolute minimum of effort. Generally this means you’ve either baked a prebuilt ISO/AMI, and/or you use some combination of Chef, Babushka, Puppet etc.
For our new deployment we decided to use Salt, which allows us to fire up new app/proxy/search/database nodes in minutes, and have them ready to slide into the stack as painlessly as possible.
As we re-tested our deployment, we made sure we were able to destroy and bring up new instances while the system was under stress testing. Once the site was live, we wouldn’t be able to ask all visitors to stop looking at it for a designated time period!
One of the biggest failings of our first attempt at launch was not understanding the consequences of a lockup on our shared storage node. Whilst we mitigated this by replacing that part of the infrastructure completely, we then went to great lengths to test what would happen if other parts of the setup went missing.
Of course, if you remove the database server, everything is going to fall over pretty quickly! But what happens when Memcached is no longer around? Or the ElasticSearch server disappears? By removing these nodes we confirmed some level of resilience. Without Memcached, performance drops dramatically but the site still survives, meaning we have a window to get a new server operational. Without ElasticSearch we fall back to the default WordPress search which, while not as quick or nice, still works.
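The fallback behaviour described here follows a common pattern: try the fast service, and drop back to the slower one if it can’t be reached. The Python sketch below shows the shape of that pattern only. It is not the WordPress, Memcached or ElasticSearch code; every function name is hypothetical, and ConnectionError stands in for whatever exception a real client library raises when its server is gone.

```python
def cached_fetch(key, cache_get, load_from_database):
    """Return the cached value, or hit the database if the cache is down or empty."""
    try:
        value = cache_get(key)
    except ConnectionError:
        # Cache node is missing: slower, but the page can still be built.
        value = None
    if value is None:
        value = load_from_database(key)
    return value


def search(query, primary_search, fallback_search):
    """Use the dedicated search server, or the basic built-in search if it is gone."""
    try:
        return primary_search(query)
    except ConnectionError:
        return fallback_search(query)
```

Pulling a node out under test is how you find out whether code paths like these actually exist in your stack, rather than an error page.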
This kind of testing lets you perform practical dev-ops tasks such as bringing up new app nodes and adjusting configuration requirements. A model to consider is the Chaos Monkey introduced by Netflix to test system resilience and breakdown response times by randomly disabling production instances.
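In the same spirit, a single round of "remove a node and see what happens" might look like the Python sketch below. It is a toy under stated assumptions, not Netflix’s Chaos Monkey: disable, restore and site_is_healthy are hypothetical callables you would wire up to your own provisioning tool and health check.

```python
import random


def chaos_round(nodes, disable, restore, site_is_healthy, rng=random):
    """Disable one randomly chosen node, check the site, then restore the node.

    Returns (victim, survived) so the outcome can be logged per node.
    """
    victim = rng.choice(list(nodes))
    disable(victim)
    try:
        survived = site_is_healthy()
    finally:
        # Always bring the node back, even if the health check itself blows up.
        restore(victim)
    return victim, survived
```

Run against a staging stack, repeated rounds build up a list of which nodes the site can lose and which ones take everything down with them.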
It is an unfortunate part of life that not all eventualities can be accounted for, and no matter how much you plan, some things might go wrong. It’s vitally important that if this does happen, a team can band together and fix the problem quickly and efficiently without any finger pointing or blame laying.
SitePoint is fantastic in this regard, and as soon as issues started to present themselves, a ready and willing army of workers, including alumni, came and tirelessly helped debug and engineer a different plan of attack for the eventual re-relaunch.
Also important is the engagement that you have with your customers. We are lucky enough to have a loyal and understanding userbase, and the feedback through the downtime and restructuring was almost all positive, with fellow developers understanding the troubles that can sometimes happen during a big deploy. Having said that, we also never tried to hide the mistakes we made, and did everything we could to make sure the second launch was a success.
While the main thrust of these lessons may seem basic – test everything, don’t deploy at danger times – it is easy to gloss over some of the most obvious things if you are confident in your setup. As developers, we are often amazingly optimistic about what we believe is achievable, and this can flow on to our faith in our infrastructure setups, leading us to ignore or put aside well known guidelines.
Translated from: https://www.sitepoint.com/8-things-learned-relaunching-sitepoint/