How Battlehouse Games Achieves 99.99% Uptime on AWS

Dan Maas
11 min read · Jan 8, 2019

In my previous article I explained how Battlehouse Games cut cloud costs by 50% by optimizing AWS usage. Today’s story focuses on how we achieve nearly perfect 24/7 availability of our web-based games.

Battlehouse operates a portfolio of massively-multiplayer strategy titles on Facebook, serving over 5 million accounts with non-stop battle action in games like Thunder Run: War of Clans.

Traditionally, massively-multiplayer games suffer frequent service interruptions. For example, CCP’s EVE Online goes off-line for one hour of scheduled maintenance every single day. Blizzard’s World of Warcraft has a weekly maintenance window that often imposes several hours of downtime, especially around major content updates.

At Battlehouse, we believe that online games should meet the same level of 24/7 availability users have come to expect from popular sites like Facebook and YouTube.

Using modern web application architecture, we’ve demonstrated how to bring games up to the same level of robustness and reliability as these leading consumer sites.

I’ve been managing Battlehouse’s core infrastructure since the launch of our first game in 2012. I’m proud to report that we’ve achieved over 99.9% uptime since then, and during the just-ended year of 2018, we added another “nine” to that figure!

In this article I’ll explain how we designed our back-end cloud architecture to achieve 99.99% service uptime across hundreds of major feature deployments and updates, with robust defenses against platform instability and external attacks.

Games as a Web App

Battlehouse offers free-to-play games over the open web using HTML5 standards. Players connect to titles like Thunder Run using a JavaScript browser app that communicates with back-end servers running on Amazon AWS.

Although this architecture is similar to many modern web apps, our multiplayer games bring some unusual operational challenges beyond what a typical consumer app faces:

  • Long play sessions: Our most dedicated players stay connected to the game for several hours at a time. Even a brief interruption in server responsiveness can have a negative impact on gameplay. For example, during intense battles, dropping connection to the game server for more than a couple of seconds can cause a frustrating, unexpected defeat.
  • Frequent updates: We deploy new features and game content at least once per week, some of which require one-way migrations, like database schema changes.
  • Game-wide rule changes: Some updates affect the game-wide rules for player-vs-player combat. These updates must be deployed simultaneously to all players. It would create unfair advantages if some players gained access to the update earlier than others.

In the following sections, I’ll describe the methods we developed to tackle these special challenges.

Game Data Hot Reloads

Our game engine encapsulates title-specific game data, like the stats of units, buildings, and enemy forces, in a set of JSON files that are loaded when a server instance starts up. We added a server-side API that allows us to deploy a new version of this game data and “hot load” it into a running server without disconnecting any players. (To avoid problems with inconsistent data, we take special care to keep both the “old” and “new” data sets in memory, and switch freshly-connected players over to the “new” data atomically.)
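As a rough sketch (the class and method names here are illustrative, not our actual engine API), the hot-reload mechanism works something like this:

```python
import json
import threading

class GameDataStore:
    """Holds every loaded version of the JSON game data.

    Existing sessions keep the version they started with; only
    freshly-connected players see the newly loaded data set.
    """

    def __init__(self, path):
        self._lock = threading.Lock()
        self._versions = {}           # version id -> parsed game data
        self._current_version = None
        self.hot_reload(path)

    def hot_reload(self, path):
        """Load a new game data file and atomically make it the default."""
        with open(path) as f:
            data = json.load(f)
        version = data.get("version", path)
        with self._lock:
            self._versions[version] = data
            self._current_version = version   # atomic switch for new sessions
        return version

    def data_for_new_session(self):
        """Called at login time; pins the session to a single version."""
        with self._lock:
            return self._current_version, self._versions[self._current_version]
```

A session created before the reload keeps reading the data set it was pinned to, so nothing changes underneath a player mid-battle; old versions can be discarded once their last session ends.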

This hot-reload system allows us to deploy most minor updates without any churn among the back-end server instances. Bigger updates, like ones that add new engine code, do require us to stop and re-launch servers, but the majority of day-to-day updates can be handled with this simple hot-reload function.

Time-release Features

Early on, we released new game content, like new units and levels, by pushing a single large update to the full set of back-end servers. This forced us to make a coordinated “stop the world” deployment where all players had to be disconnected during the time it took to reload all server instances.

To reduce downtime, we adopted the “rolling” update strategy used by many common web services. In a rolling update, back-end instances are shut down, updated, and restarted one by one, so there’s never a moment when all users have to be disconnected simultaneously. Furthermore, instances running old code can be kept alive in a “draining” state until their last connected client disappears, which means there is no need to forcefully disconnect anyone during the entire roll-out process.

However, we could not use a rolling update strategy for new game content, because at any given time there would be some players able to use the updated content and some players without it, creating unfair advantages.

We solved this problem by adding “time-release” feature flags throughout the game data. Our game engine includes a domain-specific programming language called “predicates” that allows designers to enable or disable game content based on criteria like player levels or unlocked achievements. By adding an “absolute time” predicate, designers can arrange for game content to deploy in a hidden state, and then automatically become visible to players at a specific moment in the future.
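As an illustration only (our real predicate syntax differs in detail), a time-gated unit and a tiny evaluator might look like this:

```python
import time

# Hypothetical game-data entry: the unit ships with the update,
# but stays hidden until the release timestamp passes.
NEW_UNIT = {
    "name": "plasma_tank",
    "show_if": {"predicate": "ABSOLUTE_TIME", "at_least": 1548979200},  # 2019-02-01 00:00 UTC
}

def eval_predicate(pred, player, now=None):
    """Evaluate a (simplified) designer predicate against a player."""
    now = time.time() if now is None else now
    kind = pred["predicate"]
    if kind == "ABSOLUTE_TIME":
        return now >= pred["at_least"]
    if kind == "PLAYER_LEVEL":
        return player["level"] >= pred["at_least"]
    if kind == "AND":
        return all(eval_predicate(p, player, now) for p in pred["subpredicates"])
    raise ValueError("unknown predicate: %s" % kind)

# The unit is already deployed to every server, but invisible until release time.
visible = eval_predicate(NEW_UNIT["show_if"], {"level": 10})
```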

This “time-release” strategy liberates us by de-coupling the deployment of a server update from the player-visible release of the feature. Today, we apply time-release predicates to the vast majority of content updates, enabling us to perform a rolling server update days, or even weeks, in advance of a feature actually becoming visible to players.

Rolling Update Management

Rolling updates are controlled through our in-house “PCHECK” backplane, which manages the steps of launching new server instances, routing players to them, then closing routes to old instances and switching them off when the last connections drain away.

PCHECK managing a rolling server update. Old instances are marked in red. They won’t receive any new connections, and will be shut down once the last existing connection closes. Newly-started instances are in green, and will receive all freshly-connecting players.

PCHECK’s roots go back to 2012, when there weren’t many off-the-shelf systems to accomplish this. If we were designing this system from scratch today, we’d consider building atop an existing ingress/deployment management system like Envoy, Spinnaker, or HAProxy.
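In outline, the sequence PCHECK automates looks roughly like the following (the backplane helper functions are hypothetical placeholders for our internal APIs):

```python
def rolling_update(backplane, new_version):
    """Sketch of a zero-disconnect rolling update."""
    # 1. Launch replacement instances running the new code.
    new_instances = backplane.launch_instances(version=new_version)
    backplane.wait_until_healthy(new_instances)

    # 2. Route all freshly-connecting players to the new instances.
    old_instances = backplane.current_instances()
    backplane.route_new_logins_to(new_instances)

    # 3. Mark old instances as "draining": no new connections,
    #    but existing sessions continue undisturbed.
    for inst in old_instances:
        backplane.mark_draining(inst)

    # 4. Shut each old instance down only after its last session closes.
    for inst in old_instances:
        backplane.wait_for_zero_connections(inst)
        backplane.terminate(inst)
```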

Backwards- and Forwards-compatible Migrations

Mars Frontier has operated continuously since January 2012, serving over 2 million game accounts

During rolling updates, different server instances can be running different versions of our engine code, all while connected to the same back-end database. This adds special constraints when it comes to developing new features that change internal data structures or database formats. We spend extra engineering effort to limit the disruptions from running older and newer versions of the game simultaneously.

The main requirement is that all updates must be backwards-compatible. This means that newly-written code must be able to load and operate on game objects that were written by an earlier version of the server software. We accomplish this by dis-allowing any deletions or changes to the meaning of existing properties on game objects. New features are added by inventing new properties, and treating old objects as having some reasonable default value whenever an object is missing the new property. For cases where this is awkward or impossible, we added the ability to run a one-time “hook” function each time a player account is loaded. This hook can perform any one-time updates or re-formatting of obsolete data structures.
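A simplified illustration of both techniques, using hypothetical field names:

```python
CURRENT_FORMAT_VERSION = 7

def upgrade_hook(account):
    """One-time migration, run whenever an account saved by older code is loaded."""
    if account.get("format_version", 1) < CURRENT_FORMAT_VERSION:
        # Example: an old flat field becomes a structured one.
        account["shield_state"] = {"active": bool(account.get("shield", 0))}
        account["format_version"] = CURRENT_FORMAT_VERSION
    return account

def load_account(raw):
    account = upgrade_hook(dict(raw))
    # Backwards compatibility: existing properties are never deleted or
    # repurposed; new features treat a missing property as a sensible default.
    account.setdefault("alliance_chat_muted", False)   # property added in a later release
    return account
```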

We also require that updates are forwards-compatible, as much as possible. During rolling deployments, there is a chance that an old server instance might read some player data that has been written out by another instance running newer code. This can happen, for example, when a player who is still connected to an old instance attacks a base that was just written out by a new instance. To avoid data loss, we wrote our object serialization code to preserve any properties it doesn’t recognize, writing them out with the same values as they were read in with.
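For forwards compatibility, the serializer simply round-trips anything it doesn’t recognize. A minimal sketch:

```python
KNOWN_FIELDS = {"player_id", "level", "resources"}

def deserialize(raw):
    obj = {k: raw[k] for k in KNOWN_FIELDS if k in raw}
    # Keep everything written by newer server code, even if this
    # (older) server doesn't understand it yet.
    obj["_unknown"] = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    return obj

def serialize(obj):
    out = {k: v for k, v in obj.items() if k in KNOWN_FIELDS}
    out.update(obj.get("_unknown", {}))   # write unknown properties back unchanged
    return out
```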

In the vast majority of cases, these techniques allow us to deploy rolling updates with zero downtime. Very rarely, when a forwards- or backwards-compatible update is impractical, we schedule a brief window to “stop the world,” logging out every player so that we can ensure there is no mixing of old and new code touching live player data.

Database Availability

As described in my earlier article on AWS cost optimization, we use a “Hot/Cold” distinction in our storage architecture. “Hot” data needs to be highly available with low latency, but has a limited working size that grows only linearly with the number of connected players. “Cold” data grows without bound, but doesn’t need to be as fast or reliable. This distinction is very useful when it comes to designing our data storage for high uptime.

Hot data, like the contents of player home bases and armies, needs to be available any time the game is operating. Any outage lasting more than a few seconds will severely disrupt gameplay. We meet this high availability requirement by using MongoDB as our main back-end, deployed as a replica set on three redundant EC2 instances in different AWS availability zones. If any single instance goes off-line, another one takes over as primary within a few seconds. MongoDB replica sets also allow rolling updates for database engine upgrades, storage compactions, and instance hardware swap-outs. So far, we’ve never had to take the entire MongoDB cluster off-line for any maintenance.
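For example, connecting through PyMongo with an explicit replica set name lets the driver discover the current primary and follow a failover automatically (hostnames here are placeholders):

```python
from pymongo import MongoClient

# Three members in different availability zones; after an election,
# the driver re-routes writes to whichever member becomes primary.
client = MongoClient(
    "mongodb://db-az1.internal,db-az2.internal,db-az3.internal/game",
    replicaSet="rs0",
    w="majority",                    # acknowledge writes on a majority of members
    serverSelectionTimeoutMS=5000,   # fail fast if no primary is reachable
)
player_bases = client.game.player_bases
```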

We also store some “hot” data in Amazon S3, which functions less well than MongoDB as a highly-available storage back-end. In fact, it’s our single largest source of unscheduled downtime (read on for details). We plan to eventually migrate this part of our storage system to PostgreSQL running on RDS, or a self-hosted Postgres cluster.

Cold data, like historical battle logs and game scores, is not vital for gameplay, and our servers are designed to operate normally even if a cold storage back-end is unavailable. For example, if the historical score database is off-line, players see a blank result if they search for old leaderboard scores, but otherwise they can play normally. This gives us the freedom to use simple, inexpensive storage options, like RDS-hosted PostgreSQL or S3. We are also free to take cold storage off-line to perform long-running maintenance tasks, like optimizing bulky SQL tables or migrating RDS software versions, without disrupting live gameplay. We don’t need to spend engineering resources on the complex problem of building high-availability storage for an ever-growing “cold” data set.
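In code, this amounts to wrapping every cold-storage query in a fallback, roughly like this (the function names are illustrative):

```python
import logging

log = logging.getLogger("coldstore")

def get_leaderboard_history(cold_db, player_id):
    """Return historical scores, or an empty result if cold storage is down."""
    try:
        return cold_db.query_scores(player_id)
    except Exception:
        # A cold-storage outage must never break live gameplay;
        # the client simply shows a blank history panel.
        log.exception("cold storage unavailable, returning empty history")
        return []
```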

Causes of Downtime

Despite all of the above measures, we still incur some brief periods of unscheduled downtime. Here are the top causes:

Amazon S3 Outages

S3 plays a key role in player account data storage. Objects are loaded and saved to S3 every time a player logs in or out. Any failure of the S3 API can therefore stop users from accessing our games.

S3 is normally very reliable, and simple retry-on-error logic takes care of the vast majority of failed requests. However, every couple of months, we encounter periods of ten minutes or more where S3 fails all requests against a particular storage bucket.
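The retry logic itself is nothing special; a minimal version with exponential backoff might look like this (bucket and key names are placeholders):

```python
import time
import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3")

def load_player_blob(bucket, key, attempts=5):
    """Fetch a player object from S3, retrying transient failures with backoff."""
    for attempt in range(attempts):
        try:
            resp = s3.get_object(Bucket=bucket, Key=key)
            return resp["Body"].read()
        except (ClientError, BotoCoreError):
            if attempt == attempts - 1:
                raise                    # give up; the caller treats this as an outage
            time.sleep(2 ** attempt)     # back off 1s, 2s, 4s, ...
```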

S3’s Service Level Agreement implies a very high level of reliability. In our experience though, S3 failures have much more impact than the numbers suggest, because they can occur in a correlated fashion, affecting all of the objects in an entire bucket for an extended period of time.

As of today, we have no mitigation for these bucket-wide S3 failures, so they are the #1 reason for unscheduled downtime. In the future, we could address this problem by migrating player data storage to a different back-end that has more controlled reliability, like a self-managed Postgres cluster.

By the way, the original decision to use S3 was based on an early guess at our usage pattern that turned out to be incorrect. We planned for a very high number of player accounts — tens of millions — that wouldn’t interact with each other very much, whereas in reality, our games evolved to serve a smaller number of player accounts, but with much greater player-to-player interaction. A simple SQL or NoSQL database is a better choice for this usage pattern than S3.

EC2 Instance Hang-ups

Individual EC2 instances are generally reliable, but tend to suffer from unexplained system freezes or hardware degradation events once every few machine-years of operation. With a large enough fleet of EC2 instances, the chance for any one instance to lock up becomes a significant concern.

Our current game API server code is not very robust against a failure of the underlying EC2 VM instance. When an instance fails, we have to spin up a new one (a mostly automated process now, thanks to Terraform), adjust traffic routing, and perform some manual recovery steps. We hope to address this problem in the future by re-factoring the server code more along the lines of the Twelve-Factor App philosophy, which would enable us to containerize and manage disposable servers across a pool of redundant instances.

Human Error

Site reliability studies show that human mistakes, like configuration errors, are among the top five causes of service outages.

Battlehouse is not immune to human error! Even though our deployment process is almost fully automated, the remaining manual steps still leave enough room for the occasional mis-click or skipped step to temporarily interrupt access to a game title. Most often, this happens when we forget to enable the routing of new logins to a freshly-launched group of server instances before shutting off an obsolete group. Fortunately, we receive alerts about this situation within a couple of minutes, thanks to a monitoring system we built using Amazon Route53 Health Checks and CloudWatch alarms.
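A simplified version of that monitoring setup in boto3 (the domain, health-check path, and SNS topic are placeholders):

```python
import uuid
import boto3

route53 = boto3.client("route53")
# Route 53 health-check metrics are published to CloudWatch in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

hc = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "game.example.com",
        "ResourcePath": "/ping",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

cloudwatch.put_metric_alarm(
    AlarmName="game-login-endpoint-down",
    Namespace="AWS/Route53",
    MetricName="HealthCheckStatus",
    Dimensions=[{"Name": "HealthCheckId", "Value": hc["HealthCheck"]["Id"]}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```

If the health check fails for two consecutive minutes, the alarm fires and pages us through the SNS topic.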

DDoS Attacks

Gamers can be extremely passionate about their virtual allies and enemies. We’ve experienced several cases where players launched DDoS attacks against our servers in an attempt to gain unfair advantages or to hinder other players. This is on top of the random undirected attacks that all public sites face these days.

Most of these attacks are not strong enough to bother our core ingress pipeline, which consists of an AWS Application Load Balancer feeding HTTP connections through redundant HAProxy servers. Once, several years ago, we suffered a sophisticated attack that overwhelmed the AWS load balancer and rendered the HAProxy machines unresponsive at the kernel level. We solved this by placing CloudFlare in front of the whole stack, which easily deflects these attacks before they touch AWS. We haven’t had a single case of downtime due to DDoS attempts since we started using CloudFlare.

Future Work

While we’re very happy with the high level of reliability we’ve achieved, there is always room for improvement. Top areas of concern are S3-based account data storage and the legacy, non-twelve-factor architecture of our game API server software. Fixing these issues would open up the option to add even stronger redundancy, like running a shared pool of game servers across multiple availability zones to be robust against complete zone outages. At some point, we’d like to pay down this technical debt, but for now, the top priority is developing and releasing better content for our gamers to enjoy!
