RCA of the 11/7 outage

TL;DR

  • Around 11:50 AM UTC-7 there was a sudden, large spike of 500 errors coming from the server. I first noticed it around 12:15 PM UTC-7, and our head mod @Krodmandoon flagged it at 12:32 PM UTC-7. Because I was in the middle of work meetings, I was unable to respond beyond some basic debugging.

  • The high-level look I was able to get between being on camera suggested the servers were overwhelmed, so I shut the forums down until I could dig into the issue that night, hoping the downtime would give the servers a chance to recover from whatever was overwhelming them.

  • I restarted the servers around 8:10 PM UTC-7. They appeared to stabilize, but 500 errors were still happening on and off. By this point the logs I was able to review pointed to the back-end DB being overwhelmed, so I assumed the errors were due to the DB still being overtaxed. I set the forums to read-only mode and decided to monitor for the rest of the night and check whether things stayed stable in the morning.

  • Around 11:00 PM UTC-7 I identified a bad CloudFront setting left over from an earlier debugging attempt made while I was still at work. Once it was reverted, most of the remaining 500 errors cleared up.

  • Around 8:40 AM UTC-7 on 11/8 I verified everything was still working and took the site out of read-only mode. The root issue was that we hit an I/O limit on the SQL DB, which caused AWS to throttle the DB and in turn made the forum unresponsive.

  • We may need to increase the DB disk size so we have a larger IOPS burst pool to absorb spikes in I/O. This could add roughly $92/month to the site's operating costs.

  • The response was slow because the outage collided with my work day.

So What Happened?

When I first identified the issue, my initial thought was that the K8s cluster was experiencing a cascading failure due to load. This happens when one or more pods fail for whatever reason and the remaining pods can't handle the current traffic, so they also fail before the bad pods can be brought back up.

Usually the cluster will eventually bring itself back up on its own, but forcing a restart of all the pods fixes it nine times out of ten. This time, though, restarting the pods did not help. They would come back up and show as stable, but the 500 errors persisted and the pods would begin to crash again after 5-10 minutes. Nothing had changed on our side as far as I knew, which suggested the issue was happening outside of the K8s cluster.
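
For context, "forcing a restart of all the pods" just means doing a rolling restart of the deployment. A minimal sketch of what that looks like is below; the deployment and namespace names are placeholders for illustration, not our actual resource names.

```python
# Minimal sketch of a rolling restart of every pod in a deployment.
# The deployment/namespace names are hypothetical placeholders.
import subprocess

def restart_pods(deployment: str = "forum-web", namespace: str = "forums") -> None:
    # `kubectl rollout restart` recreates the pods one at a time, which is
    # usually enough to break a cascading-failure loop caused by load.
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}",
         "--namespace", namespace],
        check=True,
    )
    # Block until the replacement pods report Ready (or the rollout fails).
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         "--namespace", namespace],
        check=True,
    )

if __name__ == "__main__":
    restart_pods()
```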

An External Issue

I could not fully rule out the cluster, but with the limited amount of digging I was able to do between meetings, the only explanation I could come up with was that something was happening on AWS's side or with one of their services.

At first I thought something might have caused the pods to lose access to the backing SQL DB, so I opened it up temporarily to see if that would solve the issue, but no dice. I then made some changes to CloudFront to see if the problem was there (this is important for later), but once again the pods kept failing and the site kept returning 500 errors.

By this time I was starting another meeting that required me to keep my camera on, so I decided to just shut the site down, since I was fairly sure something had gotten overloaded, and give it time to recover until I could get back to it. While doing that I noticed the DB's session count was really high: usually it sits below 1, but it was at about 33. I also took the opportunity to restart the DB to clear out anything that might be hanging.
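
For anyone who wants to see the same graph, the session number comes straight out of CloudWatch. Below is a rough boto3 sketch; the instance identifier is made up, and I'm assuming the figure maps to the DBLoad (average active sessions) metric rather than the raw connection count.

```python
# Rough sketch of pulling the DB "session" metric from CloudWatch.
# "forum-db" is a hypothetical RDS instance identifier. I'm assuming the number
# in question is DBLoad (average active sessions); use "DatabaseConnections"
# instead if you want the raw connection count.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DBLoad",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "forum-db"}],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,               # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].strftime("%H:%M"), round(point["Average"], 1))
```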

An Exhausted DB

Due to my work and some prior engagements, I was not able to really start looking into the issue until much later that night. Once I did, I noticed some very odd things in the logs and in the graphs.
[graph: the DB's IOPS burst balance over time]

One error that started popping up around that time was that we had run out of IOPS burst balance. For those who don't know, AWS allows us a certain number of reads and writes per second. We can go above that number, but only for short periods of time; those are the smaller dips you see in the graph. Once you stay over the limit for too long, AWS throttles you, which can make accessing the data very slow.
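
To put rough numbers on it (assuming the volume is gp2, which is the storage type burst balance applies to, and around 100 GB, the size mentioned later in the thread):

```python
# Back-of-the-envelope gp2 burst math. The 100 GB size is assumed; the
# baseline/burst/credit figures are the standard published gp2 values.
volume_gb = 100
baseline_iops = max(100, 3 * volume_gb)   # gp2 baseline: 3 IOPS per GB (min 100)
burst_iops = 3000                         # gp2 burst ceiling for small volumes
credit_bucket = 5_400_000                 # I/O credits available when full

# How long can the DB sit at the full 3,000 IOPS before the bucket is empty?
burst_seconds = credit_bucket / (burst_iops - baseline_iops)
print(f"~{burst_seconds / 60:.0f} min of sustained burst")          # ~33 minutes

# How long does an empty bucket take to refill if the DB goes mostly idle?
refill_seconds = credit_bucket / baseline_iops
print(f"~{refill_seconds / 3600:.0f} hours to refill from empty")   # ~5 hours
```

That refill rate is also roughly why the balance needed the better part of the day to come back once the site was shut down.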

None of the other logs had any noticeable errors, at least none that did not point back to the DB, and the DB logs were clean outside of AWS's warnings about the exhausted burst balance. With that information, the burst balance looked like the most likely cause of the pod crashes and the 500 errors.

Rebooting In Read Only

Once I thought the burst balance had recovered enough, I decided to bring the forums back online and monitor the DB. Luckily this seemed to work, to an extent. The pods came back up and stayed stable! It looked like things might finally be fixed, but while the 500 errors went down, they were still showing up quite frequently.

I figured this was likely because the burst balance was still recovering, so I decided to put the forums in read-only mode and let it recover further.

While the forums were in read-only mode, some of the 500 errors were still bothering me, mainly the fact that I was getting errors loading some styles and JS when accessing the site in incognito mode. Some of that could be explained by the DB issues, but too many requests were failing for that to make sense, and it did not seem to be improving. After looking through the logs again and double-checking my settings, I finally came across the CloudFront setting I had changed for debugging earlier that day. Once I set it back to what it was supposed to be, most of the remaining 500 errors went away.

I left the site in read-only mode for the rest of the night to help the burst balance recover.

The Next Day

Once I got up, and before I had to log on for work, I checked the systems: everything looked stable and the burst balance had fully recovered. I brought the site out of read-only mode and continued to monitor things on the side as I worked through the day.

So What Caused This?

After reviewing everything I could, I am confident in saying the main issue was the IOPS burst balance being exhausted on the SQL DB. This caused data access to grind to a halt as AWS throttled the DB. That in turn caused the pods to report as unhealthy, which caused K8s to restart them in an attempt to fix the issue.
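
To make that chain of events concrete, here is a purely illustrative sketch (not our actual code) of how a throttled DB turns into "unhealthy" pods: the health check runs a trivial query with a short deadline, the throttled volume makes even that query stall, and Kubernetes restarts a pod that was otherwise fine.

```python
# Illustrative only: a liveness check that fails when the DB is slow.
# trivial_query() stands in for running something like `SELECT 1` through the
# app's connection pool; here the DB latency is simulated with a sleep.
import time

PROBE_DEADLINE_S = 2.0  # hypothetical probe budget

def trivial_query(simulated_db_latency_s: float) -> None:
    time.sleep(simulated_db_latency_s)

def liveness_probe(simulated_db_latency_s: float) -> bool:
    start = time.monotonic()
    trivial_query(simulated_db_latency_s)
    # Elapsed time, not query correctness, is what fails the probe here.
    return (time.monotonic() - start) < PROBE_DEADLINE_S

print(liveness_probe(0.05))  # healthy DB   -> True, pod keeps running
print(liveness_probe(5.0))   # throttled DB -> False, kubelet restarts the pod
```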

Even though I am fairly sure the IOPS exhaustion caused the failure, I am not fully sure what exhausted the burst balance in the first place. Looking at the historical graphs, we do hit loads fairly frequently that bring us down to about 30% balance, and those loads have been slowly increasing over time as the site grows. My current theory is that a particularly heavy load during one of those spikes exhausted the balance. That slowed the DB down and caused sessions to queue up, which explains the high session count I saw in the dashboard. As the burst balance replenished, the queued sessions would just consume it again, keeping the DB effectively soft-locked.

I am still not sure what caused this particular spike, but since we don't have advanced logging turned on for cost-saving reasons, this is about as far as I can go: neither the DB logs nor the pod logs gave me any useful information.

How Can This Be Mitigated?

Looking at the historical data I have available, IOPS burst balance exhaustion is only going to become more likely over time. To counter this we may need to upgrade our disk size, which is the only way to increase both the IOPS available to us and the burst balance.

Unfortunately, this might increase our storage costs from $23/month to about $115/month, as my light research so far shows AWS does not increase the IOPS until you pass the 500GB mark. I still need to look more into GP2 vs GP3 to know for sure, and we may just live with the risk since we are planning to move to GCP next year.
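
For the curious, the cost figures are just straight-line arithmetic off our current bill; I have not re-verified the per-GB rate against AWS's pricing pages.

```python
# Simple arithmetic behind the $23 -> $115 estimate. The per-GB rate is
# inferred from the current bill, not quoted from AWS pricing pages.
current_gb, upgraded_gb = 100, 500
rate_per_gb_month = 23 / current_gb               # ~$0.23 per GB-month

current_cost = rate_per_gb_month * current_gb     # $23/month today
upgraded_cost = rate_per_gb_month * upgraded_gb   # ~$115/month after the bump
print(f"increase: ~${upgraded_cost - current_cost:.0f}/month")  # ~$92/month
```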

Why Was The Response So Slow?

The short answer is that it was just bad timing. As many of you know, my day job is as a professional software engineer, and it helps me cover the operating costs of the site that are not covered by donations from our generous Patreon supporters. Because of this, I am sometimes limited in when I can respond to issues with the site, especially during the work day.

What made this worse is that Tuesdays are the end of the sprint at the company I work for, so I was in calls and meetings all day. To compound the issue, the new leadership from the company that acquired us is pushing for remote workers to have their webcams on in all meetings, which makes it even harder for me to respond when sudden issues like this pop up.

Unfortunately, since I am the only one available to fix these issues, responses to incidents will sometimes be delayed. Luckily the automated systems we have set up so far have done a good job of keeping things in line most of the time, but every so often we will have incidents like this one that they cannot handle.

Conclusion

The 11/7 issue that resulted in an extended outage of the forums looks to have come down to us overloading the DB. We have a way to help prevent a repeat, but it is somewhat expensive, increasing our operating costs by about $92/month, so we are weighing whether to do it now or wait until we move to GCP next year. Also, due to policy changes at my day job and bad timing, I was not able to dig into the issue until much later in the day.

I know this was a long one, but hopefully it proved interesting to those who like to peek behind the curtain. I would also like to thank everyone for the patience and understanding shown during the outage!

20 Likes

This might just be a coincidence, but I heard that Twitch streams were down yesterday as well (Nov 7). I wonder if that had any relevance?

2 Likes

One possibility for the spike could have been splash from some sort of DoS attack, but I can't say for sure. I don't think it was an internal AWS issue, but it's possible something went wrong on their side that caused session queuing on our end and simulated a larger load on the DB.

Hard to say for sure though due to our limited logging capabilities.

2 Likes

Just wanted to share that I appreciate your thorough, transparent report. I work in software dev but haven't done much managing of AWS services directly, so I find info like this pretty neat. I know you mentioned viewing the historical data, but is it possible to infer when the IOPS were due to be maxed out by our increased user base?
If that's the case, maybe you can calculate how much runway the upgrade would grant us, because I'd hate to see a repeat of this making the costs spiral out of control. I'll definitely re-sub to give what little help I can toward paying for the site.

6 Likes

Goodness me, you certainly had a hectic few days to say the least, but I am glad the site is back up again!

As for my input after the conclusion: I say wait until the move so as not to add more complications for now. It likely could have been a DDoS attack that somehow swept up the site that day. Also, if you're planning on giving the site more disk space, then perhaps we should discuss adding a second ad slot to offset the additional costs. I think people here would not want to make you work even more, since your day job already sounds demanding enough.

1 Like

Were I a betting clam, I'd guess the increase in traffic was due to web crawlers (ByteDance's is particularly bad-mannered) plus a somewhat media-heavy embedded title-bar announcement thing (the one about all the Halloween games?). It shows up on every page, even for users who aren't logged in.

Maybe just don't show it to users who aren't logged in? If that's an option.

1 Like

If the load were consistent, yes, but as you can see on the graph above it comes in waves that look fairly regular, and it's hard to estimate accurately without more data. Also, burst IOPS is meant for short runs of sudden load rather than sustained load, which makes it harder to calculate the runway.

That being said, the reason 500GB is the magic number is that at that size you can scale IOPS and burst IOPS independently from the disk size using GP3, which would let us adapt without spiking our costs. The only reason it hits us so hard in this case is that we would have to go from 100GB to 500GB to get there, which is a large jump in cost even though our DB is not even using 30GB yet.

And thank you for wanting to rejoin our Patreon; every bit helps! :slight_smile:

Yeah, I think for right now I'll just keep a close eye on it and deal with it as it comes up. If it starts becoming more frequent before our switchover next year, then maybe I'll start considering it more seriously.

A fair point about the web crawlers. The banner, though, more than likely does not add much DB stress, especially since it's cached in Redis and the media comes directly from the CDN since it's all hosted in an S3 bucket.

1 Like