Playground LTM Postmortem
We recently stood up our Playground LTM on June 27th at about 4 AM EDT. Shortly after launch, an overload of our matchmaking service caused both the default modes and Playground to fall over. We worked to get the service to where it needed to be, and were finally able to roll the mode back out on the evening of July 2nd.
Our matchmaking is built on something called the Matchmaking Service (MMS), which is responsible for facilitating the “handshake” between players looking to join a match and an available dedicated server open to host that match. Each node in the matchmaking cluster keeps a large list of open dedicated servers that it can work with, randomly distributed by region to keep a roughly proportional amount of free servers for each. Players that connect to MMS request a server for their region, MMS assigns that player to a node, and the node picks a free server for the requested region from its list.
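As a sketch of that handshake (class and function names here are illustrative assumptions, not Epic's actual code), a node's local bookkeeping might look like:

```python
import random
from collections import defaultdict

random.seed(0)  # deterministic for the example

class MMSNode:
    """One matchmaking node; keeps its own per-region list of open servers."""

    def __init__(self):
        self.free_servers = defaultdict(list)

    def register_server(self, region, server_id):
        self.free_servers[region].append(server_id)

    def assign(self, region):
        """Pick a free server for the requested region from the local list."""
        servers = self.free_servers[region]
        if servers:
            return servers.pop()
        return None  # local list exhausted; would have to ask other nodes

# Open servers are distributed randomly across nodes so each keeps a
# roughly proportional share of free capacity per region.
nodes = [MMSNode() for _ in range(4)]
for i in range(100):
    random.choice(nodes).register_server("NA-East", f"server-{i}")

# A connecting player is assigned to a node, which picks a server for them.
player_node = random.choice(nodes)
match_server = player_node.assign("NA-East")
```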
Since Playground mode makes matches for parties of 1-4 players instead of 100, it requires between 25 and 100 times as many matches as normal, depending on party size. While we could pack virtual servers a bit tighter per physical CPU for Playground mode, we still had to use 15 times as many servers as we had been running for the other modes. We were able to secure the total server capacity, but it meant the list that each node had to manage was suddenly 15 times as long as well.
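The multiplier falls directly out of the match sizes; a quick back-of-the-envelope check:

```python
BR_MATCH_SIZE = 100  # players per standard Battle Royale match

# A Playground match holds a single party of 1-4, so serving the same
# number of players takes 25-100x as many matches.
multipliers = [BR_MATCH_SIZE / party_size for party_size in (1, 2, 3, 4)]
# solo parties -> 100x as many matches; full parties of 4 -> 25x
```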
When an MMS node can’t find a free server for the requested region within its own list, it has to go ask all of the other nodes for a spare one by reading from each of their local lists. When you’re a node and your list is suddenly 15 times longer, it slows you down. When you have to go check all of the other lists and each one is also 15 times longer, it slows you down up to 15 times per node, which can translate to computation times that are orders of magnitude longer than normal. When we released Playground, the overwhelming demand quickly exhausted the local lists for MMS nodes far faster than the system could refresh them. Each node was running to every other node to request extra servers that just weren’t there yet, or at the very least took a long time to pick out of the non-local lists. The long compute times caused the CPU to end up with a backlog of pending requests, resulting in a feedback loop that eventually caused the system to grind to a halt.
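A toy cost model (the node count and list lengths here are invented for illustration) shows why the miss path blows up:

```python
def lookup_cost(list_len, node_count, local_hit):
    """Relative cost of finding a free server: a local hit scans only the
    node's own list, while a miss also scans every other node's list."""
    if local_hit:
        return list_len
    return list_len + (node_count - 1) * list_len

# Normal operation: local hit against a normal-length list.
normal = lookup_cost(list_len=1_000, node_count=16, local_hit=True)

# Playground land rush: local miss, with every list now 15x longer.
overloaded = lookup_cost(list_len=15_000, node_count=16, local_hit=False)
# overloaded is 240x normal in this model -- orders of magnitude slower.
```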
What did we do to fix it?
The first thing we did after disabling the mode was to split Playground MMS onto its own service cluster. This was necessary not only to keep a traffic jam from affecting the base game modes, but also to allow us to iterate on and tweak the service as often as we needed while we worked to get Playground back online. We tried increasingly dramatic re-architecting, testing at each stage until we reached the acceptance criteria to re-release the mode.
Once we identified the root of the problem as the exhaustion of sessions from local lists, the solution was to give the cluster the ability to bulk rebalance sessions from other nodes to ensure repeated lookups were not necessary. With the system constantly shifting regional capacity from nodes with an excess to nodes that might be running low, the odds of a node running dry for a particular region and having to search outside its local list have been drastically reduced. While not an issue right now in the primary Fortnite Battle Royale game modes, this is an upgrade we are bringing over to the main MMS cluster as well to future-proof the system.
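A minimal sketch of that idea (the low-water threshold, data layout, and function name are assumptions; the real service rebalances across nodes and regions continuously rather than in a one-shot pass):

```python
def rebalance(node_lists, low_water=5):
    """Bulk-shift free servers for one region from over-provisioned nodes
    to nodes running low, so lookups rarely miss the local list."""
    for needy in node_lists:
        while len(needy) < low_water:
            # Donate from whichever node currently has the most spare capacity.
            donor = max(node_lists, key=len)
            if len(donor) <= low_water:
                return  # nothing left to shift without starving the donor
            needy.append(donor.pop())

# Four nodes' local lists for one region; the last node has run dry.
node_lists = [
    ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"],
    ["s9", "s10", "s11", "s12", "s13", "s14"],
    ["s15", "s16", "s17", "s18", "s19", "s20", "s21"],
    [],
]
rebalance(node_lists)
# After rebalancing, every node holds at least low_water free servers again.
```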
We pushed the load-testing process to the limits during our MMS restructuring, because the scale of what we were trying to simulate was so far beyond normal usage or testing patterns. We needed to spin up many millions of theoretical users and hurl them at our Playground MMS system in a big, crashing wave in an attempt to strain our new session rebalancer. While the tweak - test - evaluate cycle took several hours per loop, it allowed us to develop and refine the rebalance behavior to a point where we felt it could stand up to the traffic, as well as to identify and fix edge-case bugs that could have torpedoed the effort to bring Playground back online.
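In spirit, the load harness works something like the following (a drastically scaled-down, hypothetical stand-in; the real tests drove many millions of synthetic users against the actual Playground MMS endpoints rather than sleeping):

```python
import concurrent.futures
import random
import time

def simulated_player(region):
    """One synthetic matchmaking request; a real harness would hit the
    matchmaking endpoint here instead of sleeping."""
    time.sleep(random.uniform(0, 0.01))  # jitter within the arrival wave
    return f"requested {region}"

# Hurl the whole burst at once rather than ramping up gradually, to stress
# the land-rush path that exhausts the nodes' local lists.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(simulated_player, ["NA-East"] * 1_000))
```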
What have we learned?
In short, we learned a lot about our own matchmaking system and its failure points. We planned and prepared for what we thought to be the maximum sustained matchmaking throughput and capacity based on the size of our player base (plus a healthy buffer), but didn’t properly anticipate the edge case of the initial “land rush” of players exhausting local lists.
The restart of the mode itself was an additional learning experience. We opted to bring back Playground in small steps by individual regions and platforms, with the goal of reducing the initial load on the system so we could scale into it. We actually achieved the opposite, as players swapped into regions that had the mode re-enabled and forced us to slow the rollout while we dealt with capacity issues. The silver lining is that we now have much better visibility into the total available cloud resources in Asia than ever before, and we want to give a shoutout to our cloud partners for working with us to ensure we could quickly adjust!
The process of getting Playground stable and in the hands of our players was tougher than we would have liked, but was a solid reminder that complex distributed systems fail in unpredictable ways. We were forced to make significant emergency upgrades to our Matchmaking Service, but these changes will serve the game well as we continue to grow and expand our player base into the future.