sturdysolutions

Foursquare’s 11-Hour Downtime: What Went Wrong

In Uncategorized on October 10, 2010 at 3:10 am

Foursquare was down for a grand total of 11 hours yesterday, all because of a pesky database shard.

In a detailed post-mortem, a Foursquare engineer describes the “embarrassing and disappointing” issues that brought the service to its knees and kept it down for the bulk of the day yesterday.

Database sharding is a kind of horizontal partitioning: a table’s rows are split across multiple servers (the shards), so each shard holds a subset of the rows rather than a subset of the columns. Foursquare’s database is powered by MongoDB, which scales horizontally using shards.
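For readers who want to see the idea in miniature, here is a rough Python sketch of horizontal partitioning. It is a generic illustration, not how MongoDB routes documents internally and not Foursquare's setup: the shard count, the `user_id` shard key, the hash-based routing, and the sample checkins are all assumptions made up for this example.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 2  # hypothetical shard count, chosen only for illustration


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(rows, num_shards: int = NUM_SHARDS):
    """Split whole rows across shards; each row lands on exactly one shard."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(row["user_id"], num_shards)].append(row)
    return shards


# Invented sample data, not real Foursquare checkins.
checkins = [
    {"user_id": "alice", "venue": "cafe"},
    {"user_id": "bob", "venue": "gym"},
    {"user_id": "alice", "venue": "park"},
]
shards = partition(checkins)
```

Every row is stored whole on a single shard, and all rows with the same key end up together. That is also why a poorly distributed key can pile a disproportionate amount of data onto one shard.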

Apparently, the service’s problems began yesterday morning when engineers noticed a disproportionate number of checkins being stored on one “shard” of the company’s database system. As that shard became more and more overloaded, engineers tried to introduce a new shard to the system to even out the load.

Unfortunately, for reasons unknown, adding the new shard took out the entire service, from the website to the mobile apps. Every time the team tried to restart the site, the original shard overloaded again and brought the works down with it.

In the end, Foursquare had to re-index the shard, which took a grueling five hours. The company is happy to report no data was lost.

For the future, Foursquare is looking into preventing similar overloading issues, setting up safeguards and employing “artful degradation” to ensure one database error won’t take down the entire service. It’s also working directly with the fine folks at MongoDB to ensure greater stability.
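The idea of degrading instead of failing outright can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Foursquare's actual code: `ShardUnavailable`, `fetch_recent_checkins`, `render_profile`, and the `shard_is_up` flag are hypothetical names invented for this example.

```python
class ShardUnavailable(Exception):
    """Raised by the hypothetical data layer when a shard cannot serve requests."""


def fetch_recent_checkins(user_id, shard_is_up):
    # Hypothetical stand-in for a real database call.
    if not shard_is_up:
        raise ShardUnavailable(f"shard for {user_id} is down")
    return [f"{user_id} checked in at cafe"]


def render_profile(user_id, shard_is_up=True):
    """Build a profile page, degrading instead of failing if checkins are unavailable."""
    page = {"user": user_id, "degraded": False}
    try:
        page["checkins"] = fetch_recent_checkins(user_id, shard_is_up)
    except ShardUnavailable:
        # Serve the rest of the page with a notice rather than an error page.
        page["checkins"] = []
        page["degraded"] = True
        page["notice"] = "Recent checkins are temporarily unavailable."
    return page
```

With a sick shard, `render_profile("alice", shard_is_up=False)` still returns a page, just one with an empty checkin list and a notice, so a single database error does not have to take every feature down with it.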

On the end-user side of things, this incident has prompted the creation of a new blog, status.foursquare.com, as well as promises of hourly tweets from @4sqsupport during times of crisis.

Although these social media services seem to experience severe waves of downtime more often than we’d like to report, we’re grateful for the recent trend of developer-focused post-mortem analyses. They give armchair systems engineers food for thought, and they help the rest of us understand what makes a good and popular service a reliable one as well.

Although 11 hours is a heck of a lot of downtime, this outage could have been worse. At least no checkins or, god forbid, users’ accounts were lost in the data/shard shuffle.

What do you make of Foursquare’s engineering predicament yesterday?

UPDATE: Whoopsie! Looks like Foursquare is down yet again.
