Since it arrived in 2004, Facebook has faced its share of unwelcome stories in the media, and it seems that security and privacy issues are a repeating theme. But when was the last time you heard about Facebook’s site crashing?
With more than 500 million active users worldwide, half of whom log in daily, Facebook chalks up some impressive statistics. It has a cache with 2 trillion objects in it, which it accesses 100 million times a second. Facebook processes 4 trillion feed actions every day. It stores 100 billion photo files.
Users collectively spend more than 20 billion minutes on the site daily and upload more than 2 billion items of content to Facebook every week.
Reliability is important to Facebook, and at the same time, it’s been one of the site’s biggest challenges as it scales, Robert Johnson, director of engineering at Facebook, said recently during a presentation at LISA 2010, a Usenix conference held in San Jose, Calif.
Over the years, Facebook has developed a set of principles that govern how it thinks about reliability, Johnson said. One of these is to always move fast — Johnson’s team does a full release of everything on the site once a week. The code is changing all the time, which means that its operations and infrastructure are always changing.
Can Facebook’s practices point the way for other megasites, especially those that need to cope with constant change? TechNewsWorld spoke with Johnson to discover how his team ensures the social networking giant remains up and running.
TechNewsWorld: You said the site’s constantly changing and you do a full release of everything on the site once a week. How do you cope with that?
It turns out that this works really well for us, because each of those weekly releases is a release of the application code, not of the database software or things like that.
In terms of scaling, we learn things we didn’t know from one week to the next because we see problems we’ve never seen before.
Weekly, we’ll do some manual testing against the code and run live automated tests, and we’ll do a canary rollout where we roll the changes out to a small number of machines first. A handful of servers is enough to turn up any problems.
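The canary step Johnson describes can be sketched as a simple deploy gate: push to a handful of servers, check a health metric, and only then roll out to the rest. This is a minimal illustration, not Facebook's tooling; the `deploy` and `error_rate` helpers here are hypothetical stand-ins for real deployment and monitoring systems.

```python
import random

def deploy(version, hosts):
    """Hypothetical stand-in for pushing a build to a set of hosts."""
    for h in hosts:
        print(f"deploying {version} to {h}")

def error_rate(hosts):
    """Hypothetical stand-in for a monitoring query; returns a fixed value here."""
    return 0.001

def canary_rollout(version, fleet, canary_size=5, threshold=0.01):
    """Deploy to a few servers first; abort if errors spike, else roll out fully."""
    canary = random.sample(fleet, canary_size)
    deploy(version, canary)
    if error_rate(canary) > threshold:
        raise RuntimeError("canary failed; aborting rollout")
    # Canary looks healthy: roll the change out to the rest of the fleet.
    deploy(version, [h for h in fleet if h not in canary])

fleet = [f"web{i:03d}" for i in range(20)]
canary_rollout("v2.7", fleet)
```

The point of the pattern is that a bad release hits five machines, not the whole site.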
We find that if we do this weekly rollout we never push out a whole lot of things at once. If you batch up a whole lot of things and push them out at once, that lets you do a lot of tests, but we found that if something then goes wrong, it goes wrong badly. We prefer small, incremental changes so that if something goes wrong, we can fix it quickly.
Our general strategy is to make frequent, small changes. The production people like doing this. I like it because, when you have a problem in production, you usually spend the bulk of your time resolving it, and if you’ve only changed one thing, it’s a lot easier to figure out what went wrong.
So even when we do really big changes, we try to break them up into a series of small changes. So when we take a big back end and replace it, we bring both the old and new systems up in parallel and test the new one, slowly ramp the new one up and, once we’re comfortable with the new one working, we turn off the old one.
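Running the old and new systems in parallel and slowly ramping traffic to the new one can be sketched as weighted routing. This is an illustrative sketch only; the `make_router` function and the "old"/"new" labels are assumptions, not Facebook's actual routing layer.

```python
import random

def make_router(new_fraction):
    """Return a router that sends the given fraction of requests to the new backend."""
    def route(request):
        return "new" if random.random() < new_fraction else "old"
    return route

# Ramp schedule: both backends run in parallel while the new one takes an
# increasing share of traffic. At 1.0 the old system can be turned off.
for fraction in (0.01, 0.05, 0.25, 1.0):
    route = make_router(fraction)
    sample = [route(None) for _ in range(10_000)]
    print(f"{fraction:.2f} of traffic -> new: {sample.count('new')} of 10000 requests")
```

Because the old system keeps running until the ramp completes, turning the change back is one routing flip rather than a redeploy.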
In order to have this work, you have to have things very well-instrumented. The key to weekly rollouts is to get very good measurements in place.
Detecting that something is wrong is something you have to automate. Once it comes to figuring out what is wrong and why, you get a person to look at the data.
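The division of labor Johnson describes, automated detection with human diagnosis, implies instrumentation that can flag an anomalous metric on its own. A minimal sketch of such a detector, assuming a simple deviation-from-recent-mean rule (the rule itself is an assumption, not Facebook's):

```python
import statistics

def detect_anomaly(series, window=10, sigma=3.0):
    """Flag the latest point if it deviates sharply from the recent mean.

    The 10% floor on the threshold keeps a flat (zero-variance) history
    from flagging trivial jitter. If the mean is near zero the floor
    vanishes and any deviation flags; fine for a sketch, not production.
    """
    recent = series[-window - 1:-1]
    mean = statistics.mean(recent)
    threshold = max(sigma * statistics.pstdev(recent), 0.1 * abs(mean))
    return abs(series[-1] - mean) > threshold
```

An alert from something like this pages a person; the detector never decides what to do about it.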
TNW: You said at LISA 2010 that automatic failover is not always a good idea and that it’s best to have a human make the final decision when it comes to failover. Why?
It’s really important to have a person handle that last step. Typically, the worst failure is that something small goes wrong and causes a lot of things to cascade, and often those things that cascade are systems that are supposed to fix something. So they make things worse.
We make it easy to change things, so changing things on 1,000 servers requires one key press. But you still want a person to understand the problem and make a decision because we don’t want a ripple effect from making a change.
So, in enough places, you need a human looking at things who can understand what the consequences of an action will be. Automated systems often kick in in unexpected ways; these are decisions that aren’t hard for a person to make, but that automated systems can’t make easily.
TNW: Do you make multiple backups of your system before releasing changes, or set up restore points? Is that a necessary overhead?
Yes, for pretty much everything that can be backed up and restored. That’s really key when you follow this policy of making incremental changes: you’ve got to be able to flip a change back.
You make a small change and know why it broke, but you’ve got to be able to take it back. We put a pretty significant engineering effort into making any changes forward- and backward-compatible.
[When we make] any change to the state of the running system, we make sure we can flip that change back easily, and flip back and forth between the two seamlessly until we’re sure everything works.
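One common way to make a data change forward- and backward-compatible in the way Johnson describes is to write both the old and new representations during the transition, so either version of the code can read the data. The key names and dict-as-datastore here are hypothetical, purely to illustrate the pattern:

```python
def write_profile(store, user_id, name):
    """Write both formats during the transition, so old and new readers both work."""
    store[f"user:{user_id}"] = {"name": name}           # old format
    store[f"user2:{user_id}"] = {"full_name": name}     # new format

def read_profile(store, user_id, use_new=True):
    """Read the new format when present; fall back to the old one."""
    if use_new and f"user2:{user_id}" in store:
        return store[f"user2:{user_id}"]["full_name"]
    return store[f"user:{user_id}"]["name"]
```

Flipping `use_new` back and forth is safe at any point, because neither reader depends on the other format having been removed.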
TNW: If you do a full release of everything on the site once a week, how do you cope? With massive virtualization?
We don’t do any virtualization. There’s not really a reason to do it in terms of efficiency because everything runs on multiple servers, and in terms of deployment and change management, doing virtualization is one option, but we didn’t go that route. We kick the idea around sometimes.
TNW: You said at LISA 2010 that your site has a typical Web 2.0 architecture. Every page is dynamic and is assembled out of hundreds or even thousands of pieces of information stored on Memcache servers all over the infrastructure. However, Memcache is where you have a lot of your scaling problems. Why do you have scaling problems with Memcache?
It takes the brunt of the load. The way we architected our site, everything happens in real time, and every time users hit a page, they pull in very large amounts of data. It’s all dynamic, so you can hardly pre-build anything.
You have to pull this massive amount of data across the network, which requires a lot of bandwidth, and you get peak loads on the network; a lot of the work is just handling that massive flow. You can’t simply partition the data, say by having all of a person’s friends live on one server, because everybody’s friends with everybody else.
TNW: You said that you don’t agree with the idea of not making mistakes again. You said the only way to not make mistakes is to not do anything, which is not a good way to live life. So, instead of focusing on how to not make that mistake again, you want to focus on how to make the mistake again and not have it be catastrophic.
How do you focus on this — how to make a mistake again and not have it be catastrophic? What do you do?
The philosophy is to essentially find places that take problems and amplify them, and there are a few different classes of that. The reason we always put a human in the middle of making recovery decisions is that we found some systems are very good at taking some problems and making them disappear, but they take others and make them worse.
Part of this is trying to understand how the system works in different failure scenarios, and another part is testing in different failure scenarios. Often problems look just like bugs. You want to root these out by testing them.
Another general technique is to separate things as much as possible. Anything you can separate, you should.
We can’t do that with our Memcache cluster, but we have several copies of that so they can fail independently, and we can test code by only deploying it to one of these clusters so we have a way of quickly detecting problems. A lot of it is keeping things contained.
TNW: What structure do you have for figuring out what to test? You can’t just conduct tests at random; you must systematize them in some way.
The easiest way is around machine failures. They turn out to be a pretty good proxy for a lot of things. We go in and run tests on low-traffic days: firewall off a couple of servers, log on to the machines, and put up iptables rules so they stop talking on the network. We do other simple things, too; when you’re taking backups of databases, reload the backups all the time.
We essentially have a list of things we walk through testing and do one at a time. It’s much better to find a problem that way because you have people sitting there who know what you did and it doesn’t cascade as much as when it happens at random times.
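A drill from that list can be sketched as cutting a couple of servers off and checking that the service still answers. This is a toy simulation of the idea, not real tooling: a production drill would apply actual iptables rules, and the `serve` health check here is a hypothetical stand-in.

```python
import random

def serve(hosts):
    """Hypothetical health check: the site is up if any host can answer."""
    return "ok" if hosts else "down"

def run_failure_drill(fleet, kill=2):
    """Cut a couple of servers off and verify the service survives.

    In a real drill the victims would be firewalled off with iptables;
    here we just exclude them from the serving pool.
    """
    victims = set(random.sample(fleet, kill))
    survivors = [h for h in fleet if h not in victims]
    assert serve(survivors) == "ok", "site went down during the drill"
    return victims

victims = run_failure_drill([f"db{i}" for i in range(8)])
print("drilled:", sorted(victims))
```

Walking through failures one at a time like this, with people watching, is what keeps a test failure from cascading the way a real one would.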
TNW: You said systems must be robust to make mistakes. What do you mean by that?
You want to keep failures contained, so try to set things up so you don’t turn the failure of one machine into the failure of many machines. Having fast restart times turns out to be really helpful. Any particular component within our system that goes down, we try to pull it back up within a minute or two. We don’t always succeed.
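The fast-restart goal Johnson mentions is usually handled by a supervisor that pulls a failed component back up quickly instead of leaving it down. A minimal sketch, assuming a hypothetical callable component and a fixed restart budget:

```python
import time

def supervise(component, max_restarts=3, backoff=0.01):
    """Restart a failed component quickly to keep the outage contained."""
    for attempt in range(max_restarts):
        try:
            return component()
        except Exception as err:
            print(f"component failed ({err}); restarting in {backoff}s")
            time.sleep(backoff)
    # Fast automatic restarts failed; escalate to a person, per the
    # human-in-the-loop philosophy above.
    raise RuntimeError("component did not come back; paging a human")
```

The restart loop handles the common case within seconds; anything it can't fix becomes a human decision rather than an automated cascade.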