Hyperscaling Without Hyperventilating
Nov 23, 2010 5:00 AM PT
Scaling a website up to the size of a Twitter or a Facebook brings its own set of problems.
These problems aren't always like those that crop up in moderately large sites -- they're unique to these massive, hyperscale systems.
That's partly because IT administrators and managers have to rely on new and sometimes emerging solutions to manage megasites, as traditional applications and tools can't handle jobs on that scale, Charles King, principal analyst at Pund-IT, told TechNewsWorld.
Megasites need a high degree of risk control. They require a high degree of automation if they are to work well. Running them means measuring everything you can.
Running a megasite also requires the ability to restore it rapidly from a crash.
"We work on a model where we realize we tend to lose a significant amount of infrastructure," John Adams, operations engineer at Twitter, said in a presentation at LISA 2010, the 24th Large Installation Systems Administration conference, held in San Jose, Calif., earlier this month.
"We're OK with having stuff go down -- provided we have fast ways of recovering from the failure," he said.
The Numbers at Twitter
Twitter has more than 165 million users who conduct 700 million searches a day, Adams said.
The site processes about 90 million tweets a day, which is "well over" 1,000 tweets a second, Adams stated. "Over the summer, we had the World Cup and the NBA finals and saw traffic go up to over 3,000 tweets a second," he added.
A "tweet" is the short message, 140 characters or fewer in length, that Twitter members send out. When a tweet comes in, it's sent off to 4 million people, Adams said.
Keeping On Keeping On
Sites as large as Twitter's require a different approach to building and deploying apps.
"We know things don't work the first time," Adams said. "You have to plan and build everything more than once."
The seat-of-the-pants approach works best when hyperscaling a site. "You have to scale your site using the best available technologies," Adams said.
That often means coming up with creative workarounds. For example, a lot of things IT relies on in Unix, such as Cron, fail when used on an extremely large scale. Cron is a time-based job scheduler in Unix-like operating systems. "Many machines executing the same thing cause micro outages across the site," Adams said "So you start doing tricks where you end up putting in random sleeps, or stagger jobs using a random hostname."
The Importance of Being Inventive
The inability of existing tools to handle hyperscaled sites also requires IT to use new tools or write some themselves. For example, Twitter's IT staff wrote an open source tool called "Peep" that lets users figure out what the optimal size for a Memcached memory slab is. Memcached is a general-purpose distributed memory caching system.
Twitter uses terabytes of Memcached, but its IT department learned that storing too much data in Memcached will cause data loss. Peep helps Twitter IT avoid overloading Memcached.
In the past six months, Twitter IT invented a sharding framework it calls "Gizzard." This lets them expand a dataset across hundreds of hosts, relieving system administrators from having to deal with sharding.
Database sharding is a method of horizontal partitioning in a database or search engine. Each partition is called a "shard."
"Sharding is a hard problem for people -- you have to come up with a way of distributing the data across the network and a lot of the schemes don't work," Adams said.
Being inventive can be a strain on staff. "Having to rely on new or emerging solutions to get the job done tends to require a very steep learning curve," Pund-IT's King warned.
Managing the Process
Everything must be automated when you hyperscale a site, Shawn Edmonson, director of product management at rPath, told TechNewsWorld. "With more than 10,000 servers, everything must be automated," he said. "So you need to think through every task that might arise and how you will automate it. How will you retire old systems? How will you roll back mistaken changes? How will you update the kernel?"
Configuration management is key to automation, Twitter's Adams said. "The sooner you start using configuration management, the less work you're going to do in the future and the less mistakes you're going to make," he pointed out.
"There's a tendency in hyperscale environments for small problems to grow larger almost inconceivably quickly," Pund-IT's King stated. "That's one reason automated processes and practices play such critical roles in these environments."
That's because a megasite is a monoculture, Edmonson explained. "An obscure Memcached bug might annoy one of the 100 heterogeneous systems at the average enterprise," he pointed out. "No big deal. But, in a giant monoculture, that bug will either affect every system or none. This magnifies the impact of bad change control and software drift."
Going Back to the Center
Companies that run a large IT infrastructure must have a centralized machine database, Twitter's Adams said. "Once you have that database in place, you can handle asset management, you can handle machine intake," he elaborated. "It lets you filter and list machines and find asset data."
Knowledge of asset data helps speed updates. Twitter uses the BitTorrent peer-to-peer system for deploying apps and updates. BitTorrent finds the profiles of the servers to be updated in Twitter's centralized machine database. It takes 30 to 60 seconds to update more than 1,000 servers through BitTorrent, Adams said. "This is a wonderful, legal use of P2P," he added.
Speed of deployment is crucial to megasites if they are to handle spikes in demand, Edmonson said. He recommends virtual machine image cloning as the best technology for this. However, imaging combined with continuous delivery requires automated image generation from a repeatable build process, once again requiring automation, he pointed out.
Keeping Down the Risk
Hand in hand with automation comes risk control. "With megasites, the application doesn't support the company; the application is the company," Edmonson said. A serious bug at any level of the stack -- the operating system, the middleware or the app -- could take down the whole service and, possibly, the whole enterprise, he explained.
That requires system architects to think through a fast but controlled lifecycle for the entire stack. "If admins can deploy changes at any level without going through an explicit release cycle, something awful's bound to happen," Edmonson explained.
Twitter's solution is to check and verify before any changes to the database are allowed. It uses a review board app and a version control system together with the Puppet configuration management system. If an admin wants to make changes, he first posts them on the review board and it draws comment from various people.
Then, Twitter runs various consistency checks and decides whether or not to accept the proposed changes. "I would say that a fair amount of our outages in the first year were human error because we didn't have these tools in place," Adams said.
This system also prevents people from logging into servers to make changes. "If someone logs into machines, something's wrong," Adams pointed out.
Gotta Know Your Variables
When running a complex megasite, it's important to get as much data as possible.
"Instrumenting the world pays off," Twitter's Adams said. "This is going to change the way that we manage systems so we graph everything." Twitter monitors 30,000 to 40,000 points in its infrastructure and graphics and reports critical metrics in as near to real time as possible, Adams said.
"We monitor latency, memory leaks, network usage," Adams stated. "When you collect this data for long periods of time you can do forecasting. We have this thing we call the 'Twitpocalypse,' when tweets pass two to the power of 32, and we can forecast when we'll hit the next one within a few hours of its occurrence." Two to the 32nd power is about 4.3 billion.
This type of forecasting is also possible with disk space or network loads, Adams said.
Twitter has a dashboard on which it puts up the top 10 metrics of the site.
"Measuring datacenter performance and responding to issues in near-real time are the name of the game today," Pund-IT's King said. "Profiling, forecasting and disarming problems before they occur is the next order of business."