Why do so many put their resources in AWS us-east-1 when that's the only one (that I'm aware of) that has ever gone down?

shalafi@lemmy.world · edit-2 3 months ago

criss_cross@lemmy.world · 3 months ago

I feel like this is a bit of a loaded question that needs peeling back, as you can see from other replies.

For some context, I worked at AWS for a few years and worked at other companies of various sizes. I’m not breaking NDAs or anything here but I say this to say I’ve had a fair amount of exposure to this problem.

Also one thing I want to clear up. Other regions do go down for various services. You just don’t hear about them because they don’t have the catastrophic blast radius that us-east-1 does.

So let’s start with the external company part: “Why don’t others put resources in other regions?” Let’s say you’re starting a new company. You’ve prototyped and built a web app. Most likely that app consists of the following components:

a server to respond to requests
a database
a worker for asynchronous tasks (think Sidekiq/Active Job for the Rails folks)

Often at the start all of these are just on the same box. This works up until you have a large web presence and suddenly can’t throw enough hardware at the problem to make it go away (for those that want the more technical term, throwing bigger hardware at one box is called “vertical scaling”).

So cool, you want to take this app and make it multi-region now. There are a lot of gotchas that come with this and can bite you if you weren’t already accounting for them. I’ll list a few but there are numerous ones:

If you have your server write to a temp file and then read it in something else (like a worker) this won’t work when they’re not on the same box anymore. You need to put this in something like S3 or some intermediary that both have access to.
You have to be careful with how you partition requests to specific regions and make sure there isn’t anything local that’s gonna break if a user decides to access your app in one region, then takes a vacation into another.
The big thing: If you’re used to having 1 centralized database there are bad assumptions you can make that are hard to break out of in code. A big one is this example:

POST /comment # writing a new comment
GET /story # loads all comments

If you work in a single-DB setting that GET will always get you the new comment a user made, but when you start doing some of the techniques to horizontally scale (read replicas, move to DynamoDB, etc), THAT’S NO LONGER A GUARANTEE. You may be reading from a completely separate box that hasn’t gotten your change yet. There are guards + patterns for handling this but it’s not cheap to switch to these for old parts of the code base.
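To make the stale-read problem concrete, here’s a minimal sketch in Python. Everything in it is made up for illustration: `Primary`, `Replica`, `get_story` and the version token are hypothetical names, and replication is a manual method call instead of a real async process. The guard shown is one common “read your own writes” pattern (fall back to the primary if the replica is behind the version you last wrote), not the only way to do it.

```python
class Primary:
    """Single writable database; each write bumps a version counter."""

    def __init__(self):
        self.comments = []
        self.version = 0

    def write(self, comment):
        self.comments.append(comment)
        self.version += 1
        return self.version  # token the caller can use to detect stale reads


class Replica:
    """Read replica that only knows what has been replicated so far."""

    def __init__(self, primary):
        self.primary = primary
        self.comments = []
        self.version = 0

    def replicate(self):
        # In a real system this happens asynchronously, some time after the write.
        self.comments = list(self.primary.comments)
        self.version = self.primary.version


def get_story(primary, replica, min_version=0):
    """Read from the replica unless it is behind the version the caller needs."""
    source = replica if replica.version >= min_version else primary
    return list(source.comments)


primary = Primary()
replica = Replica(primary)

token = primary.write("first!")                             # POST /comment
naive = get_story(primary, replica)                         # GET /story, no guard: stale
guarded = get_story(primary, replica, min_version=token)    # guard falls back to primary

replica.replicate()                                         # replication catches up
caught_up = get_story(primary, replica, min_version=token)  # replica is safe to use again
```

The catch is that every old call site that assumed “write then read just works” has to start passing that token around (or something like it), which is where the cost comes from.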

If you’re a startup that’s taking the plunge, this is a large cost between re-architecting and code changes. It’s one you should absolutely incur. But it’s not cheap.

If you’re a massive company that’s not doing this then you’re playing with fire and deserve the pain you’ve wrought. Though I’d say most of the time large companies are doing this; there’s just one small globally hosted service that no one thought was important that actually was a critical part of the tool chain.

So let’s say your company has taken that cost and has done everything by the book. You can still get boned even when you don’t think you are.

At AWS there are a handful of “core” services. These services are the critical building blocks of everything at AWS. Think EC2, Lambda, S3, and DynamoDB. A lot of internal and external training works towards having SDEs build almost everything with these key components (at least in some parts of AWS; there are others that use different tool chains. It’s a massive company, so I can’t pigeonhole everyone here). If you read a lot of their marketing slop you can see they encourage customers to use these as well.

Even if you don’t use any of the above directly, there’s a good chance that a service you do use depends on one of them anyway. I’ll give an example everyone can check. Let’s say you are starting out and building a brand new service. You wanna keep it simple and just make an EC2 box to keep your dependencies small. You write some code in CDK (Amazon’s newer IaC tool) to make this box and go to make your first deploy. One of the first steps in this process is taking your artifacts and writing them to an S3 bucket in your region. If you wanna make deploys you now have an S3 dependency.

So if one of these massively goes down in a region it’ll most likely take other things with it.
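Here’s a rough Python sketch of how that blast radius spreads. The dependency map is entirely hypothetical (the service names and edges are made up for the example and are not a claim about how AWS is wired internally); the point is just that anything depending on the failed service, directly or through something else, goes with it.

```python
from collections import deque

# Hypothetical map: service -> services it needs in order to work.
DEPENDS_ON = {
    "web-app": ["ec2", "deploy-pipeline"],
    "deploy-pipeline": ["s3"],   # e.g. deploy artifacts get written to a bucket
    "worker": ["lambda", "dynamodb"],
    "lambda": ["s3"],            # assumed edge, purely for illustration
    "ec2": [],
    "s3": [],
    "dynamodb": [],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


s3_outage = blast_radius("s3", DEPENDS_ON)
dynamodb_outage = blast_radius("dynamodb", DEPENDS_ON)
```

In this toy map the web app never touches S3 itself, but it still ends up in `s3_outage` because its deploy pipeline does.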

Now let’s say you’re one of the companies that are doing all the right things and have a perfect region failover plan. Well you can still get hosed as there are certain services (like IAM and I think Route 53?) that are still globally hosted in us-east-1. Now if us-east-1 goes down your IAM goes down. And now you have issues even when you did everything by the book. I think they are trying to get rid of that issue but I have no idea (and I wouldn’t say even if I did lol). Even if it’s not us-east-1 I can guarantee there’s probably some other small things in other hosted regions that would have catastrophic effects like this.

TL;DR - shit’s hard. You can do everything right and still get fucked by this.