User reports site outage
From a user email:
At about 5:30 pm on Monday, I received this error message when trying to access material on the Commons (and the Commons directly):
Error establishing a database connection
I had previously been on the Commons site.
This is in line with the chance outages I've occasionally experienced myself, but it's our first user report of one. André, can you look at the logs and let us know whether you notice anything?
#1 Updated by local admin over 6 years ago
I received an alert for this:
***** Nagios 2.12 *****
Notification Type: PROBLEM
Service: mySQL
Host: commons.gc.cuny.edu
Address: commons.gc.cuny.edu
State: CRITICAL
Date/Time: Mon Sept 3 17:33:45 EDT 2012
Additional Info: Too many connections
I would be very surprised if this wasn't an offshoot of http://redmine.gc.cuny.edu/issues/1962. Solving this will more than likely necessitate solving that as well. Let me see what I can dig up.
#6 Updated by local admin over 6 years ago
Boone Gorges wrote:
Great tip - I haven't heard of Nagios!
Dude, seriously, Nagios is so awesome. Working with it is one of the more satisfying, non-sucky parts of the day.
If you can easily send your config files, it'd give me a huge leg up. Thanks for offering :)
Done. Let me know if I can help you implement it and get it running.
#7 Updated by local admin over 6 years ago
- Status changed from Assigned to Resolved
I looked at the historical data for the database service, and it seems that while we're averaging well under the default limit of 151 connections, there are sporadic spikes that hit that ceiling and more than likely cause these brief connection failures. We should keep optimizing on two fronts: reducing the application's database utilization and continuing to tune mySQL for performance. We should probably track both of these efforts on http://redmine.gc.cuny.edu/issues/1962.
As an immediate fix, I increased the connection limit to 500, which should prevent the failed connections while we dive deeper into the optimization/tuning. Gonna mark this one resolved for now, but let's re-open it if 500 proves insufficient or if anything unexpected causes the problem to persist.
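
For anyone following along, here is a rough Python sketch of the kind of check described above: compare sampled connection counts against a limit, and see how the average can sit well under the ceiling while a few spikes still hit it. The function name `summarize_connections` and the sample values are hypothetical and invented for illustration; they are not the actual Commons monitoring data, and how the samples get collected is left open.

```python
from statistics import mean

DEFAULT_MAX_CONNECTIONS = 151  # default limit cited in this comment
RAISED_MAX_CONNECTIONS = 500   # raised limit cited in this comment


def summarize_connections(samples, limit):
    """Summarize sampled connection counts against a connection limit.

    Hypothetical helper: returns the average, the peak, and how many
    samples reached or exceeded the limit.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    return {
        "average": mean(samples),
        "peak": max(samples),
        "at_limit": sum(1 for s in samples if s >= limit),
    }


# Hypothetical samples, made up for illustration only.
hypothetical_samples = [40, 55, 38, 151, 47, 60, 151, 42]

before = summarize_connections(hypothetical_samples, DEFAULT_MAX_CONNECTIONS)
after = summarize_connections(hypothetical_samples, RAISED_MAX_CONNECTIONS)

print("against 151:", before)
print("against 500:", after)
```

One caveat with this sketch: while the limit is 151, observed counts are capped at 151, so the true size of a spike is unknown until the limit is raised. That is why it's worth re-checking the peak against 500 once new data comes in.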