
ongoing outage - technical update



  • ongoing outage - technical update

    Quick update on outage(s) today. Sorry!!!

    3:30 AM - planned outage for release of v42

    5:30 AM - services back online after successful smoke test

    5:3x AM - services buckle under load

    By granting chests to existing players (some had several hundred) we inadvertently ended up DDoS'ing ourselves with a 200x increase in backend load from the chest-opening process.

    We were unsuccessful in mitigating this issue with the services running, and the degradation in quality forced us to take the services back down.

    7:20 AM - unable to mitigate issue while system is running resulting in us taking downtime

    We changed code to bulk open chests drastically reducing backend load and deployed the change.
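    The bulk-open fix described above can be sketched roughly like this (hypothetical Python, not Epic's actual backend code): instead of issuing one profile save per chest, roll all of a player's pending chests first and persist the result in a single write.

```python
import random

LOOT_TABLE = ["card", "coins", "emote", "skin"]

def open_chest():
    """Roll loot for a single chest (placeholder logic)."""
    return random.choice(LOOT_TABLE)

def open_chests_individually(profile, count):
    """Naive approach: one profile write per chest -> `count` writes."""
    writes = 0
    for _ in range(count):
        profile.setdefault("loot", []).append(open_chest())
        writes += 1  # each append would be a separate backend save
    return writes

def open_chests_bulk(profile, count):
    """Batched approach: roll everything, then save once -> 1 write."""
    loot = [open_chest() for _ in range(count)]
    profile.setdefault("loot", []).extend(loot)
    return 1  # a single backend save for the whole batch

profile = {}
assert open_chests_individually(profile, 300) == 300
assert open_chests_bulk(profile, 300) == 1
```

    The write count, not the loot logic, is what multiplies backend load: several hundred chests per player at one write each is how a grant turns into a flood of requests.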

    9:40 AM - services are back online

    Service recovery was short lived and our DB setup got into a bad state between primary and replicas. Overall load was still too high for our infrastructure.

    11:00 AM - services offline

    We are using the downtime to upgrade the DB to the latest version for an estimated 2x increase in the load we can handle, bringing over recent optimizations for profile handling from our experience with Fortnite, and working to compress our profiles to combat the 2x increase in profile size introduced by v42.
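    A minimal sketch of the profile-compression idea, assuming JSON-serialized profiles (the serialization format, zlib, and the compression level are illustrative, not what Epic actually uses):

```python
import json
import zlib

def save_profile(profile: dict) -> bytes:
    """Serialize and compress a profile for storage."""
    raw = json.dumps(profile).encode("utf-8")
    return zlib.compress(raw, 6)

def load_profile(blob: bytes) -> dict:
    """Decompress and deserialize a stored profile."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Game profiles are repetitive (inventories, stacked items), so they
# compress well -- which is what offsets the 2x size growth from v42.
profile = {"hero": "Sparrow", "chests": ["daily"] * 200, "level": 42}
blob = save_profile(profile)
assert load_profile(blob) == profile
assert len(blob) < len(json.dumps(profile).encode("utf-8"))
```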

    We expect this (*fingers crossed*) to allow us to handle the load, but also need to ensure we get the DB back into a synchronized state before we can go live.

    We are running into a bottleneck where a single person is responsible for all the remaining work (no pressure…).

    1:30 PM - technical update

    My apologies for the delay in getting information out to everyone!

    We are expecting the outage to persist for a while longer and will do a proper post mortem like we did with Fortnite’s recent outage here.

    We are also accelerating work to shard the DB.
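    Sharding splits accounts across multiple databases so no single machine carries all the load. A common shape for this, sketched hypothetically (the shard count and routing scheme are assumptions for illustration):

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(account_id: str) -> int:
    """Stable hash -> shard index; the same id always maps to the
    same shard, so a player's profile always lives in one place."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

assert shard_for("player-123") == shard_for("player-123")
assert all(0 <= shard_for(f"p{i}") < NUM_SHARDS for i in range(1000))
```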

    2:15 PM - technical update

    We have a list of tasks to complete, but no good way to provide a meaningful (aka accurate) update. We are also talking about aggressively limiting rate of new users when we bring services online again as we are not sure whether our current changes will be sufficient.
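    Aggressively limiting the rate of incoming users could look like a token bucket in front of login; anyone denied waits in the queue. All names and numbers here are illustrative, not Epic's implementation:

```python
import time

class LoginThrottle:
    """Token bucket: refills at `logins_per_second`, holds at most
    `burst` tokens; each admitted login spends one token."""

    def __init__(self, logins_per_second: float, burst: int):
        self.rate = logins_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a login may proceed, else keep it queued."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

throttle = LoginThrottle(logins_per_second=5, burst=10)
allowed = sum(throttle.allow() for _ in range(100))
assert 10 <= allowed <= 11  # burst drains, then almost nothing passes
```

    Raising the admit rate later (the 3x, 5x, 10x steps below) is just increasing `logins_per_second` while watching backend load.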

    3:15 PM - technical update

    We are running into and working through issues getting the new MCP build deployed (currently failing a unit test) so we can test the DB upgrade in our testing environment.

    We have a few additional operational items in-progress as well.

    It should be roughly 20 minutes to test that DB upgrade caused no harm after MCP deploy succeeds, and if testing is successful we will roll changes to Live (production) environment, sanity test, and start bringing folks back in slowly.

    4:15 PM - technical update

    MCP is being deployed to live testing environment. QA will sanity test changes there (20 min). Assuming nothing goes wrong (and it did previously) this will not be the longest pole.

    DB update needs to finish in production environment (unknown), followed by ensuring we have a valid backup (unknown), and enabling of compression (seconds).

    Once that is done we deploy to live (20 min) and do final testing (10 min).

    The times aren't additive; expect at least 30 min after the DB is updated and the backup is verified.
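    The rollout arithmetic above, spelled out: deploy and final testing run in sequence only after the DB work and backup finish, so they set the floor on remaining time.

```python
deploy_min = 20       # MCP deploy to live
final_test_min = 10   # final sanity testing
remaining_after_db = deploy_min + final_test_min
assert remaining_after_db == 30  # "at least 30 min after DB + backup"
```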

    5:30 PM - technical update

    Sadly not much to update other than us trying to parallelize as much as possible to reduce time to being back up again.

    5:45 PM - technical update

    DB changes / updates are mostly done, now getting ready to do deploys and testing. If everything goes well (and it rarely does) we should be online in an hour.

    6:00 PM - testing begins in staging / live testing environment

    We are testing changes in our staging environment. This is an important step given the scope of changes we made for v42, to ensure that we are not breaking your profiles. Some tweaks to the production DB are still ongoing; once staging tests pass and those changes are done, we will deploy MCP changes to the live / production environment.

    6:30 PM - testing in staging successful

    Testing was successful in staging environment. We are currently limited from verifying cross-play in this environment though.

    We are in process of doing last minute tweaks to DB which will be followed by deployment of MCP (there is a dependency), testing, and enabling waiting room to let players in. This has felt about 30 minutes out for 90 minutes so not sure how accurate my estimates are going to be here.

    7:10 PM - update

    Take this estimate with a huge grain of salt. We have around 30 min of DB work left on our production DB, then the MCP deploy will take 20 min, followed by testing, so we are back to being about an hour out :-/ My apologies for not providing sufficient detail to make more sense of these updates, as they all read very similar.

    7:45 PM - DB work completed!

    DB work has completed. MCP is currently being deployed to live (usually takes 20 min) which will be followed by some testing (usually takes around 10 min) after which we will start letting players in at an aggressively throttled rate to ensure we don't fall over right away.

    8:00 PM - MCP deployed, final QA testing begins

    MCP was successfully deployed and QA is verifying that change is not wreaking havoc with profiles.

    8:30 PM - back online

    We're back online, throttling players to monitor load. My apologies for the long wait and potentially bumpy ride ahead while we continue to monitor and investigate issues!

    9:00 PM - increasing rate we are allowing players back in

    Queue times should be accurate. We increased rate by 3x so queue times should see noticeable improvement.

    9:15 PM - increasing rate we are allowing players back in

    We increased rate to 5x original rate. Please try to restart your client / launcher if your queue time is longer than 30 min, as you might be running into an issue of our queue not waiting properly.

    9:25 PM - increasing rate to 7x original rate

    9:50 PM - increasing rate to 10x original rate

    10:00 PM - draining queues

    Increasing rate to 15x of original rate. We are also past peak queue size and are steadily draining queues.

    10:15 PM - increasing rate to 20x original rate
    Last edited by [EPIC] Daniel Vogel; 08-09-2017, 02:16 AM.

  • #2
    Technology needs to step up. We so primitive.


    • #3

      It's not a technology issue.

      It's a planning and organization issue.


      • #4
        That poor guy... bet you a hundred thousand rep it's his first day too :D


        • #5

          Thanks for the update, keep up the hard work. <3


          • #6

            100 free coins for every hour the servers are down ? :) XD


            • #7
              DDOSSED by Chests. GG Epic.

              That's funny. I hope you get your servers running soon. And don't blame the guy responsible for it too hard; I think he knows what he has done and will sleep badly tonight.


              • #8
                No sql ftw!!!!


                • #9

                  Originally posted by Lord Hazanko:
                  It's not a technology issue.

                  It's a planning and organization issue.


                  Hope they get wiser...


                  • #10

                    Thanks for the explanation! This kind of honesty and openness is appreciated and insightful!


                    • #11

                      This is kinda funny, visualizing archives catching fire block by block as updates cascade through epic DB servers. Thanks for the update guys, I'll keep this in mind for the enterprise application I manage. Plan harder we must.


                      • #12
                        Thank you for the details! It is actually interesting to know how exactly it all went **** up :D


                        • #13
                          Keep it up. 1st month founder here and I'm not going anywhere. I got 2 play 1 game b4 the issues and I LOVED IT


                          • #14

                            up you go my friend ✋


                            • #15

                              Thanks for the transparency and hard work getting it back online! I managed to play one bot match! Hype levels in the red zone right now!