Today’s Sporadic Outages – A Postmortem

Today our website went down at approximately 3:30 pm Pacific. Several attempts to bring it back online succeeded only briefly, and as a result the website was unavailable for a few hours.

It’s back online now. We apologize for any inconvenience the outage may have caused. Here’s what happened and the steps we’re taking to ensure it doesn’t recur:

Cause:

1. Our database, MySQL, was shutting down. There was no indication of why: all logs related to MySQL were ‘clean’. Data verification also came back consistent, so there were no database-level inconsistencies that would cause MySQL to suddenly ‘die’.

2. Restarting MySQL would get the site back up and running, but after a while MySQL would shut down again. An application without its database isn’t of much use!

3. After a more thorough review of the complete system, the cause became clear. The Operating System (Linux) was shutting down MySQL.

4. The system was also running low on available memory.

5. It turns out that when Linux runs low on memory, its out-of-memory (OOM) killer picks the process with the largest memory footprint and shuts it down to free up memory. In our case the real memory hog was the Ruby on Rails application, but since it is split across multiple instances, no single Rails process qualified as the ‘single process with the maximum memory footprint’, and MySQL was chosen instead. That said, the memory profile of the Rails app came as a surprise to us, as we profile its memory regularly.

6. Further research showed that a recently released feature, once in use, leaked memory for certain types of input. We’ve since put it under close scrutiny and brought the memory footprint back down into the healthy range. As of now, the server and application are running smoothly.
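
To illustrate point 5, here is a small Python sketch of why MySQL ended up as the victim even though Rails was using more memory overall. The process names and memory figures are hypothetical, not measurements from our server, and the model is deliberately simplified: the real kernel weighs more than raw memory use when it picks a process.

```python
from collections import defaultdict

# Hypothetical resident memory figures in MB; not measurements from the incident.
processes = [
    ("mysqld", 900),
    ("rails_worker", 600),
    ("rails_worker", 600),
    ("rails_worker", 600),
    ("rails_worker", 600),
    ("monit", 20),
]


def largest_single_process(procs):
    """Return the (name, rss_mb) entry with the largest individual footprint."""
    return max(procs, key=lambda p: p[1])


def memory_by_application(procs):
    """Sum the footprint of every process that shares the same name."""
    totals = defaultdict(int)
    for name, rss_mb in procs:
        totals[name] += rss_mb
    return dict(totals)


victim = largest_single_process(processes)
totals = memory_by_application(processes)
biggest_app = max(totals, key=totals.get)

print(f"largest single process: {victim[0]} ({victim[1]} MB)")
print(f"largest application overall: {biggest_app} ({totals[biggest_app]} MB)")
```

With these made-up numbers, the single largest process is the database, while the application using the most memory in total is the group of Rails workers. That gap is what hid the real culprit from us at first.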

Steps to ensure this doesn’t recur:

1. While memory profiling is a time-consuming activity and not always perfect in a controlled environment, we’ll be undertaking more of it moving forward.

2. We’ll be putting our server monitoring tools, specifically monit, to more aggressive use by enabling its ability to bounce processes that go out of whack. If this situation were to recur, monit will restart Passenger (our Rails app server) before Linux steps in and decides to shut down one of the most critical parts of the application, i.e. MySQL.
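
The idea behind step 2 can be sketched in a few lines of Python. This is not monit configuration and not our actual setup; it is a minimal illustration of the rule "restart the app server if its total memory stays above a limit for several checks in a row". The class name, the 2000 MB limit, the three-check window, and the sample readings are all assumptions made up for the example.

```python
class MemoryWatchdog:
    """Decide when an application should be restarted for using too much memory.

    limit_mb and cycles are hypothetical tuning values, not settings from our server.
    """

    def __init__(self, limit_mb, cycles):
        self.limit_mb = limit_mb
        self.cycles = cycles
        self.breaches = 0  # consecutive checks above the limit

    def check(self, total_rss_mb):
        """Record one sample; return True when a restart should be triggered."""
        if total_rss_mb > self.limit_mb:
            self.breaches += 1
        else:
            self.breaches = 0  # a healthy reading resets the streak
        if self.breaches >= self.cycles:
            self.breaches = 0  # start counting afresh after the restart
            return True
        return False


# Hypothetical total memory readings (MB) for all Rails workers combined.
watchdog = MemoryWatchdog(limit_mb=2000, cycles=3)
samples = [1500, 2100, 2200, 1900, 2100, 2300, 2400, 1600]
decisions = [watchdog.check(s) for s in samples]
print(decisions)
```

Requiring several consecutive breaches keeps a short spike from triggering a restart, while a sustained climb, like the leak described above, still gets caught before the system runs out of memory.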

Again, I’m sorry for any inconvenience this may have caused, and thankful for your continued support and business. I can be reached at syed.ali at ezofficeinventory.com for any concerns.

Regards,

Ali
CEO
EZOfficeInventory.com