POSTMORTEM OF SERVICE OUTAGE 4/11/2018 - 4/12/2018

26.4.2018
By The Epic Team

On April 11th, 2018, we experienced an extended outage coinciding with the release of Fortnite 3.5. The outage blocked all logins for all players to our platform. We know many millions of you were excited about dropping from the Battle Bus with your friends, and it was a long time to wait to check out our 3.5 release. We sincerely apologize for the downtime.

We’re sharing more technical details in this post to give you a better understanding of what went wrong, what we did to fix it, and how we can prevent issues like this from happening in the future.

Overview
At a high level, the Mongo database that backs our account service experienced severe performance degradation, which in turn caused our account service to be unavailable from April 11th 20:44 UTC to April 12th 14:30 UTC.  This means all login requests, account creations, and queries for account information failed.  With no way to query account information, there was no way to validate logins and accounts or process login requests. 

At the first sign of major impact (20:44 UTC), multiple engineers began to investigate the issue. Over the course of the next 17 hours, we attempted to bring services back online 7 different times. This outage left a deep scar on service availability:

[Chart: service availability during the outage]

Over the course of the incident we found several contributing factors that led up to and extended our outage period: call patterns that caused cache pressure, connection storms that knocked database members offline, node performance issues that made things worse, and certain calls that read from the primary.  We made many different changes to try to restore the account service; this post covers the main issues and the changes we made.

Call Pattern Changes in Fortnite 3.5 Causing DB Cache Pressure
We had several call pattern changes in Fortnite 3.5 that slowly, over time, impacted database cache utilization.  Deep investigation into Mongo logs and support from the Mongo database teams helped us uncover this. However, it took us many hours to reach this conclusion.  A specific change in a call pattern caused us to reach new cache limits and cache evictions, which resulted in increased I/O to the point of performance degradation. This also uncovered a misreporting of I/O metrics in our database monitoring.  We are correcting this and it is covered in the “What’s Next” section below.

[Chart: Figure 1]

Figure 1: Unexpected, excessive HTTP calls to the API in question after the client release.  This aligns with the pattern in the graph in Figure 2 below.


We didn’t detect this immediately after release because it was a slow, multi-hour increase on a less-trafficked API.  With so much going on in our databases, such as members going offline as noted below, identifying this as a culprit took a significant amount of investigation.

Over the span of several hours, we experienced an increase in cache reads from disk into memory. This indicated the dataset had grown larger than what fit into RAM, requiring data to be moved in and out of the cache, as shown in the below chart for a replset (MongoDB replica set) secondary (following the leftmost brown line).


[Chart: Figure 2]

Figure 2: The above chart shows cache pressure on the cluster primary following the deployment of v3.5 and continuing into the period of service instability.  Vertical lines represent instance restarts and leadership changes.  The blue line trends up as cache read pressure increases, until the instance can no longer keep up with both request and replication load.  The green line shows writes to disk, which also contributed to I/O issues.


Upon taking a closer look at cache behavior, we saw a sudden steep increase in the rate at which dirty cache was being accumulated (as seen in Figure 3). This was caused by heavy mutations of and reads from the dataset; at our tipping point, I/O increased to the point of saturation.

[Chart: Figure 3]

Figure 3: The above chart (times in PST) shows steady dirty cache growth (data written to RAM but not yet written to disk) on the cluster primary during the initial event.  This is a clear indicator that DB mutations were occurring at a rate that exceeded server capacity.  At approximately 29 GB of unsaved data, the mongod begins to have communication issues with the rest of the cluster.


We found that this cache pressure was sufficient to overload the current standing primary of our replica set, triggering leader elections.  
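
To make this concrete, here is a minimal sketch of how the same cache signals (bytes in cache, dirty bytes, pages read into cache, pages evicted) can be sampled from serverStatus with pymongo. The host name is hypothetical, and the stat field names are those reported by WiredTiger in MongoDB 3.x; they may differ between versions.

```python
# Minimal sketch: sample WiredTiger cache metrics from serverStatus (pymongo).
# Host name is hypothetical; stat names are as reported by MongoDB 3.x WiredTiger.
from pymongo import MongoClient

client = MongoClient("mongodb://db-primary.example.com:27017/")  # hypothetical host

cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

used_bytes  = cache["bytes currently in the cache"]
dirty_bytes = cache["tracked dirty bytes in the cache"]
max_bytes   = cache["maximum bytes configured"]
pages_read  = cache["pages read into cache"]
pages_evict = cache["unmodified pages evicted"] + cache["modified pages evicted"]

print(f"cache used:  {used_bytes / 2**30:.1f} GiB of {max_bytes / 2**30:.1f} GiB")
print(f"cache dirty: {dirty_bytes / 2**30:.1f} GiB")
print(f"pages read into cache: {pages_read}, pages evicted: {pages_evict}")
```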

We gathered logs and metrics and provided them to MongoDB Support for analysis.  Their investigation started with what occurred in the hours leading up to our main availability issues.  They were able to confirm that our last Fortnite release dramatically changed the cache behavior, with page eviction occurring much more frequently due to increased reads into cache. This continued to progress until the cluster could no longer support the load, at which point queries started taking longer to process, causing the client application to create a surge of new TCP connections.

[Chart: Figure 4]

Figure 4: MongoDB’s Professional Services team provided the above chart of our cluster’s issues between the application deployment (A) and the DB beginning to have availability issues (B).  This chart also shows the increase in cache pressure; in particular, the number of evicted pages on the primary increases to 4.5x the baseline before the issues start.


From this point onward, our efforts turned to applying our lessons learned from partnering with several groups within MongoDB, primarily the Professional Services and Technical Services teams. Since the launch of Fortnite we have learned and applied many best practices and performance improvements in order to get the best scaling and availability characteristics out of MongoDB; our growth required it! 

Connection Storms
Our short connection timeouts to Mongo resulted in rapid connection churn, and when database performance degraded, the number of connections from the application to the database spiked rapidly. These spikes caused us to hit our maximum configured connection limits. When maximum connections were reached, even intra-replset communication failed, causing nodes to drop in and out of a healthy state. These connection storms caused a cascading failure across our Mongo replica set.
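
For illustration only (these are not our production values), this is roughly what bounding driver-side connections looks like in pymongo, so that a slow database turns into queued or failed requests rather than an unbounded pile of new TCP connections:

```python
# Sketch of bounding driver-side connections (pymongo); values are placeholders.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db.example.com:27017/",  # hypothetical host
    maxPoolSize=100,          # cap connections per host instead of growing without bound
    waitQueueTimeoutMS=1000,  # fail fast when the pool is exhausted instead of piling on
    connectTimeoutMS=2000,    # bound how long establishing a new connection may take
)
```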

We bumped socketTimeoutMS up from 200 ms to 30 seconds to help mitigate connection thrashing on the DB side during issues like this. Previously, it had served as our primary timeout setting to mitigate slow-query hiccups; on average, the majority of our lookup queries are handled in under 10 ms.

We also started passing ‘maxTimeMS = 500ms’ with queries as a new query timeout setting.  Rather than the client preemptively closing the connection after 200 ms, the database terminates the operation after the maxTimeMS interval while keeping the connection open, which helps mitigate connection storms during performance issues such as these.
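
A minimal sketch of the two timeout changes described above, using pymongo; the URI, database, and collection names are hypothetical:

```python
# Sketch of the timeout changes described above (pymongo); names are hypothetical.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db.example.com:27017/",  # hypothetical host
    socketTimeoutMS=30000,  # raised from 200 ms so slow queries no longer drop the socket
)

accounts = client.account_db.accounts  # hypothetical database/collection

# Let the server abort long-running operations instead: maxTimeMS terminates the
# query after 500 ms while keeping the connection open.
docs = list(accounts.find({"accountId": "some-account-id"}).max_time_ms(500))
```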

We also moved to mongos (the MongoDB shard router), which has more connection controls for accessing mongod (the primary daemon process for the MongoDB system) than the DB driver does, to try to improve our Mongo connection handling.
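
A sketch of what that looks like from the application side; the mongos host names are hypothetical, and the rest of the application code is unchanged:

```python
# Sketch: connect through mongos (shard routers) instead of directly to mongod.
# Host names are hypothetical.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongos-1.example.com:27017,mongos-2.example.com:27017/",
    socketTimeoutMS=30000,
)

# Queries are routed by mongos, which manages its own pooled connections to the
# underlying mongod processes.
accounts = client.account_db.accounts
```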
