Around 11:50AM UTC-7 there was a sudden and large spike of 500 errors coming from the server. I first noticed it about 12:15PM UTC-7 and then it was noticed at 12:32PM UTC-7 by our head mod @Krodmandoon. Due to being in the middle of work meetings I was unable to respond beyond some basic debugging.
The high level look I was able to get between having to be on camera made it look like the servers where overwhelmed so I decided to shut the forums down until I could look further into the issue that night hoping it would allow the servers some time to rest and recover from whatever was overwhelming them.
I restarted the servers around 8:10PM UTC-7. The servers looked to have stabilized but was still seeing 500 errors happening on and off. By this time the logs I was able to look over pointed to the back end DB being overwhelmed so I assumed this was due to the DB still being over taxed. So I set the forums to read only mode and decided to monitor for the rest of the night and check if it stayed stable in the morning.
Around 11:00PM UTC-7 I identified a bad config in cloudfront from an earlier debugging attempt while I was still at work. Once corrected most of the 500 issues seemed to have been corrected.
Around 8:40AM UTC-7 on 11/8 I verified everything looked to have still been working and took the site out of read only mode. I verified the root issue is we hit an IO limit on the SQL DB that caused AWS to throttle the DB which in turn made the forum unresponsive.
We may need to increase the DB disk size so we can have a larger IOPS burst pool to deal with spikes in IO operations. This may add an additional $92/month to the sites operating costs.
The reaction was slow due to work interference.
When I first identified the issue my initial thoughts was the K8 cluster was experiencing a cascading failure due to load. This happens when one or more of the pods fail for one reason or another but the remaining pods cant handle the current traffic causing them to fail before the bad pods could be brought back up.
Usually, the cluster will eventually be able to bring itself back up on its own but forcing a restart of all the pods fixes it 9 times out of 10. This time though restarting the pods did not help. As soon as they came back up they would show as stable but the 500 issues persisted and they would begin to crash after 5-10mins. Nothing changed to my knowledge which indicated this was likely an issue happening outside of the K8 cluster.
I could not fully rule out the cluster but the limited about of digging I was able to do between meetings the only way I could explain it is something was happening on AWS’s side or with one of their services.
At first I thought maybe something caused the pods to loose access to the backing SQL DB so I opened it up temporarily to see if that would solve the issue but no dice. I think made some changes to CloudFront to see if maybe the issues was there (this is important for latter) but once again the pods kept failing and the site kept returning 500 errors.
By this time I was starting another meeting that required me to keep my camera on during it so I decided to just shut the site down since I was fairly sure something got overloaded and give it time to rest until I could get back to it. While doing that I noticed the DB’s session count was really high. Usually its less than 1 but it was at about 33. I decided to also take this time to restart the DB to clear it of anything that may be hanging.
Due to my work and some prior engagements I was not able to really start looking into the issue until much latter that night. Once I did I noticed some very odd things in the logs and looking over the graphs.
One error that started popping up around the time was that we ran out our IOPS burst balance. Basically for those who dont know, AWS allows us a certain number of reads and writes per second. We can go above this number but only for short periods of time, which are those smaller dips you see in the graph. But once your go over that limit for too long AWS will throttle you which can make accessing the data very slow.
All other logs did not have any noticeable errors, at least none that did not point the the DB and the DB logs where clean outside of AWS’s warnings about exhausting the burst balance. With this info that looked to be the most likely issue causing the pods to crash and the 500 errors.
Once I thought the Burst Balance has recovered enough I deiced to bring the forums back online and monitor the DB. Luckily this seemed to work to an extent. The pods came back up and stayed stable! It looks like it may have finally been fixed, and while the 500 errors went down I was still getting 500 errors quite frequently.
I thought it was likely because the Burst Balance was still recovering so I decided to put the forums in read only mode and let it recover further.
Wile the forums where in Read Only mode some of the 500 errors where still bothering me. Mainly the fact that I was getting errors loading some styles and JS when trying to access the site in incognito mode. While some of this could be explained by the DB issues too many where failing to make sense and it did not seem to be improving. After looking through the logs again and double checking my settings I finally came across that setting I set for debugging earlier that day. Once I set the setting back to what it was supposed to be it seemed to fix most of the remaining 500 errors.
I left the server in read only mode to help the burst balance to recover for the rest of the night.
Once I got up and before I had to get on for work I checked the systems to ensure everything was still and everything looked stable and the Burst Balance was fully recovered. I brought the site out of read only mode and continued to monitor everything on the side as I worked through the day.
After reviewing all I could I am confident is saying the main issue was the IOPS burst balance being exhausted on the SQL DB. This caused data access to grind to a halt as AWS throttled the data access on the server. This in turned caused the pods to report as unhealthy which caused K8 to restart them in an attempt to fix the issue.
Even though I am fairly sure the IOPS caused the failure I am not fully sure what exhausted the burst balance. Looking at the historical graphs it looks like we do hit loads fairly frequently that bring us to about 30% balance and its been slowly increasing over time as the site grows. My current theory is a particularly heavy load during one of those spikes caused us to exhaust the burst balance. This then slowed the DB down and caused sessions to queuing up which explains the high number of sessions I saw in the dashboard. As the burst balance replenished the queued sessions would just consume it again keeping the DB effectively soft locked.
I am still not sure what may have caused this sudden spike but since we dont have advanced logging on for cost saving reasons this is about as far as I can go since neither the logs from the DB or Pods gave me any useful info.
Looking at the historical data I have available it looks like the chance of IOPS burst balance exhaustion happening is going to become more and more likely. To counter this we may need to upgrade our disk size which is the only way to increase the IOPS available to us along with the burst balance.
Unfortunately, this might increase our storage costs from $23/month to about $115/month as my light research so far shows AWS does not increase the IOPS until you pass the 500GB mark. I still need to look more into the GP2 vs GP3s to know for sure though and we may just deal with it since we are planning on trying to move onto GCP next year.
The short answer was it was just bad timing. As many of you know I work professionally as a Software Engineer as my day job and this helps me cover the operating costs of the site that are not covered by donations from our generous patreons. Due to this I sometimes am limited on when I can respond to issues with the site, especially during the work day.
What made this worse though is Tuesdays are the end of the sprint for the company I work for so I was in calls and meetings all day. To compound the issue further, the new leadership from the company that acquired us is pushing for remote workers to have their web cams on in all meetings which makes it even harder for me to respond when sudden issues like this pop up.
Unfortunately, since I am the only one available to fix these issues now sometimes responses to these incidents will be delayed. Luckily the automated systems we have setup so far have done a good job keeping the system in line most of the time, but we will have incidents like this they can not respond to every so often.
The issue experienced on 11/7 that resulted in an extended outage of the forums looks to have been linked to us overloading the DB. We have a way to help fix this but it will be a bit expensive increasing our operating costs by about $92/month so are considering if we want to even do this or wait until we move onto GCP next year. Also, due to changes of policy at my day job and bad timing I was not able to look into the issue until much later in the day.
I know this was a long one but hopefully it proved to be a bit interesting to those who like to peek behind the curtain. And I would like to thank everyone for the patience and understanding that was displayed during the outage!