Did you get the feeling last Tuesday that the internet broke? You weren’t alone. Much of Amazon Web Services (AWS) became unavailable, which meant lame jokes and GIFs like the one below representing “the cloud is down” (thanks, Giphy) flooded our office chat, and most likely office chats around the country.
Though none of our services went down, we did provide support for customers using external services that were affected, such as Mailchimp. Other sites like Medium, Trello and Slack were unresponsive as well. Though no one was dramatically affected, this outage does raise concerns, especially as more and more businesses rely on the internet to operate.
First, how did this happen?
Last week Amazon released this blog post detailing exactly what caused multiple servers to go down. Essentially…a typo…yes, a typo caused a large chunk of the internet to fail. While this may seem shocking, human error is actually a common cause of outages. GitLab, for example, recently had an outage caused by an administrator accidentally deleting a production database. In Amazon’s case, someone working on a problem with the S3 (Simple Storage Service) billing system entered a command incorrectly while trying to take a small set of S3 subsystem servers offline. The command removed far more servers than intended, and the affected subsystems had to be fully restarted. Until that restart finished, S3 was completely unable to service requests.
Amazon says it is changing the tool used to remove server capacity so that employees can’t remove as many servers at one time, or as quickly, as they could before. This is one of several changes meant to ensure that a simple human error won’t have such a huge impact in the future.
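To make the idea of a safer capacity-removal tool concrete, here is a minimal sketch in Python. It is not Amazon’s tool or anything from their write-up. The class name, the limits, and the numbers are all hypothetical, and it only covers “how many at once” and “never below a minimum”, not how fast removals happen.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass
class RemovalGuard:
    # Hypothetical limits; real values would depend on the system.
    min_servers: int      # never shrink the fleet below this size
    max_per_request: int  # largest batch a single command may remove

    def check(self, current_servers: int, requested: int) -> int:
        """Return the number of servers left after removal, or raise."""
        if requested <= 0:
            raise CapacityRemovalError("requested count must be positive")
        if requested > self.max_per_request:
            raise CapacityRemovalError(
                f"refusing to remove {requested} servers at once "
                f"(limit is {self.max_per_request})"
            )
        remaining = current_servers - requested
        if remaining < self.min_servers:
            raise CapacityRemovalError(
                f"removal would leave {remaining} servers "
                f"(minimum is {self.min_servers})"
            )
        return remaining


guard = RemovalGuard(min_servers=80, max_per_request=5)
print(guard.check(current_servers=100, requested=3))  # prints 97

try:
    # Simulate a mistyped "3" that became "30".
    guard.check(current_servers=100, requested=30)
except CapacityRemovalError as err:
    print("blocked:", err)
```

The point of a check like this is that the typo still happens, but the tool refuses to act on it instead of taking a large part of the fleet offline.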
This outage also brings up the issue of a centralized vs. decentralized web and what happens when too much of the web is hosted in one place.
As of 2012, Amazon was estimated to host around 1% of the internet; five years later, it holds about 42% of the cloud market (by revenue). For this reason an outage at Amazon, or at any one company that hosts a large portion of the web, will impact a huge number of websites. However, if the web were more decentralized, something that the founders of the web would actually like to see, we’d be far less likely to run into a situation where vast sections of the internet are down at once.
While “the internet broke” might be a bit of an overreaction, the aftereffects of server outages are not something to take lightly. What web applications and websites does your company rely on, and could your business keep functioning if those services went down? Do you have a plan in case something like this were to happen? These are all things to consider as the web continues to grow and as we all move more and more to the cloud.
If your organization is truly worried about what would happen during a large-scale outage, it’s important that your business-critical infrastructure not rely on a single provider. For example, one of our clients has their infrastructure spread out between five (almost six) different locations across multiple geographic regions. This means that if one location goes down, they may temporarily lose access to some of their information, but the blow is softened and they stay online. This solution, though effective, is costly. So if this is the approach you want to take, know that it’s not cheap; if it’s truly necessary, though, the cost is easier to justify.
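As a rough illustration of what spreading infrastructure across locations buys you, here is a small Python sketch of picking the first location that is still up. It is not how our client’s setup works. The location names and the health check are made up, and a real check would probe the service over the network rather than consult a set.

```python
from typing import Callable, Iterable, Optional


def first_healthy(
    locations: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first location whose health check passes, else None."""
    for location in locations:
        try:
            if is_healthy(location):
                return location
        except Exception:
            # Treat a failing health check as "down" and try the next one.
            continue
    return None


# Hypothetical location names, listed in order of preference.
LOCATIONS = ["provider-a-east", "provider-b-west", "provider-c-central"]

# Simulate an outage at the preferred provider.
down = {"provider-a-east"}
print(first_healthy(LOCATIONS, lambda loc: loc not in down))  # prints provider-b-west
```

With a single provider, that list has one entry, and an outage there leaves nothing to fall back on.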
While it’s possible to keep your data in different locations to help minimize the effect of outages, if you are completely reliant on SaaS products like Google Apps, there is really not a whole lot you can do when they go down. (SaaS stands for Software as a Service; essentially, you’re paying for a license to use the software.) Part of the trade-off for the convenience and lower cost is that you don’t own or maintain the infrastructure that runs them. So, during an outage, the most you can do is keep refreshing the status page and hope they get everything figured out soon, like we ended up doing last Tuesday.