
ongoing outage - technical update

08-08-2017 05:35 PM in Announcements
Quick update on outage(s) today. Sorry!!!

3:30 AM - planned outage for release of v42

5:30 AM - services back online after successful smoke test

5:3x AM - services buckle under load

By granting chests to existing players (some with several hundred) we inadvertently ended up DDOS'ing ourselves: the chest-opening process caused a 200x increase in backend load.

We were unable to mitigate the issue while the services were running, and the degradation in quality forced us to take them back down.

7:20 AM - unable to mitigate the issue while the system is running, resulting in us taking downtime

We changed the code to bulk-open chests, drastically reducing backend load, and deployed the change.
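To make that change a bit more concrete, here is a minimal sketch (not our actual MCP code; the profile_service object and its load/save calls are hypothetical) of the difference between opening chests one at a time and bulk-opening them:

```python
# Hypothetical sketch of per-chest vs. bulk chest opening.

def open_chests_individually(profile_service, player_id, chest_ids):
    # One read-modify-write round trip per chest: N chests -> N backend hits.
    for chest_id in chest_ids:
        profile = profile_service.load(player_id)
        profile.apply_chest_rewards(chest_id)
        profile_service.save(player_id, profile)

def open_chests_bulk(profile_service, player_id, chest_ids):
    # Apply all rewards in memory, then persist once: N chests -> 1 read + 1 write.
    profile = profile_service.load(player_id)
    for chest_id in chest_ids:
        profile.apply_chest_rewards(chest_id)
    profile_service.save(player_id, profile)
```

For a player sitting on several hundred chests, that is the difference between hundreds of backend operations and two.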

9:40 AM - services are back online

Service recovery was short-lived: our DB setup got into a bad state between the primary and its replicas, and overall load was still too high for our infrastructure.

11:00 AM - services offline

We are using the downtime to upgrade the DB to the latest version for an estimated 2x increase in the load we can handle, to bring over recent profile-handling optimizations from our experience with Fortnite, and to compress our profiles to reduce load and combat the 2x increase in profile size introduced by v42.

We expect this (*fingers crossed*) to allow us to handle the load, but we also need to ensure we get the DB back into a synchronized state before we can go live.
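As a rough illustration of the compression part (a minimal sketch only, assuming profiles can be serialized to JSON; this is not the actual storage code):

```python
import json
import zlib

def compress_profile(profile: dict) -> bytes:
    # Serialize and compress the profile before writing it to the DB.
    return zlib.compress(json.dumps(profile).encode("utf-8"))

def decompress_profile(blob: bytes) -> dict:
    # Reverse the process when reading the profile back.
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Repetitive profile data (inventories, card lists, chests) compresses well,
# which is what lets us claw back the 2x size increase from v42.
profile = {"player_id": "1234", "inventory": ["card"] * 500, "chests": []}
blob = compress_profile(profile)
assert decompress_profile(blob) == profile
assert len(blob) < len(json.dumps(profile))
```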

We are running into a bottleneck where a single person is responsible for all the remaining work (no pressure…).

1:30 PM - technical update

My apologies for the delay in getting information out to everyone!

We are expecting the outage to persist for a while longer and will do a proper post mortem like we did with Fortnite’s recent outage here.

We are also accelerating work to shard the DB.
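Sharding here just means splitting profiles across multiple databases so no single DB takes all the load. A minimal sketch of the idea (the shard count and hashing scheme are illustrative, not our actual design):

```python
import hashlib

SHARD_COUNT = 4  # illustrative; the real number would be sized to the load

def shard_for(player_id: str) -> int:
    # A stable hash of the player id decides which DB shard owns the profile,
    # so reads and writes spread evenly across shards.
    digest = hashlib.sha1(player_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % SHARD_COUNT

# All operations for a given player always route to the same shard.
print(shard_for("player-1234"))
```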

2:15 PM - technical update

We have a list of tasks to complete, but no good way to provide a meaningful (aka accurate) update. We are also talking about aggressively limiting the rate of new users when we bring services online again, as we are not sure whether our current changes will be sufficient.
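The rate limiting being discussed is essentially a waiting room in front of login that only admits so many players per second. A minimal sketch of that kind of throttle (names and numbers are illustrative, not our actual implementation):

```python
import time

class AdmissionThrottle:
    """Admit at most `rate_per_second` players out of the waiting room."""

    def __init__(self, rate_per_second: float):
        self.rate = rate_per_second
        self.allowance = rate_per_second
        self.last_check = time.monotonic()

    def try_admit(self) -> bool:
        # Refill the allowance based on elapsed time (token-bucket style),
        # then spend one token per admitted player.
        now = time.monotonic()
        self.allowance = min(self.rate, self.allowance + (now - self.last_check) * self.rate)
        self.last_check = now
        if self.allowance >= 1.0:
            self.allowance -= 1.0
            return True
        return False  # player stays in the queue

# Start conservatively; the multiplier can be raised (3x, 5x, ...) as load allows.
throttle = AdmissionThrottle(rate_per_second=10)
```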

3:15 PM - technical update

We are working through issues getting the new MCP build deployed (it is currently failing a unit test) so we can test the DB upgrade in our testing environment.

We have a few additional operational items in-progress as well.

Once the MCP deploy succeeds, it should take roughly 20 minutes to verify that the DB upgrade caused no harm. If testing is successful we will roll the changes to the Live (production) environment, sanity test, and start bringing folks back in slowly.

4:15 PM - technical update

MCP is being deployed to the live testing environment. QA will sanity-test the changes there (20 min). Assuming nothing goes wrong (and it did previously), this will not be the longest pole.

The DB update needs to finish in the production environment (unknown duration), followed by verifying that we have a valid backup (unknown) and enabling compression (seconds).

Once that is done we deploy to live (20 min) and do final testing (10 min).

The times aren't additive, but going live will be at least 30 minutes after the DB is updated and the backup is verified.

5:30 PM - technical update

Sadly, not much to report other than that we are trying to parallelize as much as possible to reduce the time until we are back up.

5:45 PM - technical update

DB changes / updates are mostly done, and we are now getting ready to do deploys and testing. If everything goes well (and it rarely does) we should be online in an hour.

6:00 PM - testing begins in staging / live testing environment

We are testing changes in our staging environment. This is an important step, given the scope of changes we made for v42, to ensure that we are not breaking your profiles. Some tweaks to the production DB are still ongoing; once testing in staging and those changes are done, we will deploy the MCP changes to the live / production environment.
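The profile check itself is conceptually simple: run each test profile through the new build and make sure nothing changes. A minimal sketch of that kind of verification (the profile_service API is hypothetical):

```python
import copy

def find_broken_profiles(profile_service, player_ids):
    # Returns ids of any profiles that do not survive a save/load round trip
    # through the new build (e.g. after the compression change).
    broken = []
    for player_id in player_ids:
        original = copy.deepcopy(profile_service.load(player_id))
        profile_service.save(player_id, original)    # exercises the new write path
        reloaded = profile_service.load(player_id)   # exercises the new read path
        if reloaded != original:
            broken.append(player_id)
    return broken
```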

6:30 PM - testing in staging successful

Testing was successful in the staging environment, though we are currently unable to verify cross-play there.

We are in the process of making last-minute tweaks to the DB, which will be followed by deploying MCP (there is a dependency), testing, and enabling the waiting room to let players in. This has felt about 30 minutes out for the last 90 minutes, so I'm not sure how accurate my estimates are going to be here.

7:10 PM - update

Take this estimate with a huge grain of salt. We have around 30 minutes of DB work left on our production DB, then the MCP deploy will take 20 minutes, followed by testing, so we are back to being about an hour out :-/ My apologies for not providing enough detail to make more sense of these updates, as they all read very similarly.

7:45 PM - DB work completed!

DB work has completed. MCP is currently being deployed to live (usually takes 20 min), which will be followed by some testing (usually around 10 min), after which we will start letting players in at an aggressively throttled rate to ensure we don't fall over right away.

8:00 PM - MCP deployed, final QA testing begins

MCP was successfully deployed and QA is verifying that the change is not wreaking havoc with profiles.

8:30 PM - back online

We're back online, throttling players to monitor load. My apologies for the long wait and potentially bumpy ride ahead while we continue to monitor and investigate issues!

9:00 PM - increasing rate we are allowing players back in

Queue times should now be accurate. We increased the rate by 3x, so queue times should see a noticeable improvement.

9:15 PM - increasing rate we are allowing players back in

We increased the rate to 5x the original rate. Please try restarting your client / launcher if your queue time is longer than 30 minutes, as you might be running into an issue with our queue not waiting properly.

9:25 PM - increasing rate to 7x original rate

9:50 PM - increasing rate to 10x original rate

10:00 PM - draining queues

We are increasing the rate to 15x the original rate. We are also past peak queue size and are steadily draining the queues.

10:15 PM - increasing rate to 20x original rate
#1
08-08-2017 05:38 PM
Technology needs to step up. We so primitive.
#2
08-08-2017 05:40 PM
It's not a technology issue.

It's a planning and organization issue.
#3
08-08-2017 05:41 PM
That poor guy... bet you a hundred thousand rep it's his first day too
#4
08-08-2017 05:42 PM
Thanks for the update, keep up the hard work. <3
#5
08-08-2017 05:43 PM
100 free coins for every hour the servers are down? XD
#6
08-08-2017 05:44 PM
DDOS'd by chests. GG Epic.

That's funny, I hope you get your servers running soon. And don't blame the guy responsible for it too hard. I think he knows what he has done and will sleep badly tonight.
#7
08-08-2017 05:44 PM
No sql ftw!!!!
#8
08-08-2017 05:45 PM
Quote Originally Posted by Lord Hazanko:
It's not a technology issue.

It's a planning and organization issue.

/signed

Hope they get wiser...
#9
08-08-2017 05:45 PM
Thanks for the explanation! This kind of honesty and openness is appreciated and insightful!