Bug #21034
closed
Commons database server is down
0%
Description
According to Hetrixtools, it's been down since 4:55pm ET: https://hetrixtools.com/report/uptime/65b1dc51c2df4bf4b5cf78ba6e7bb4b6/.
The DB server is probably rebooting and should be back up in about an hour.
Files
Updated by Colin McDonald 2 months ago
Hi Ray, thanks for noticing and posting. Do you think we should just wait an hour and see if this is resolved on its own, rather than contacting CUNY IT right now?
Updated by Raymond Hoh 2 months ago
The database server usually takes between 1.5 and 2 hours to reboot, judging from some of the recent downtimes in July: https://hetrixtools.com/report/uptime/2024-07/65b1dc51c2df4bf4b5cf78ba6e7bb4b6/#downtime_when. I'll check back in about half an hour to see if the Commons is back up.
Updated by Raymond Hoh 2 months ago
- Status changed from New to Resolved
The Commons is back up online.
Updated by Raymond Hoh 2 months ago
Boone, I've added a few robots.txt changes to wp-content/mu-plugins/cac-functions.php on production but haven't committed them yet:
User-agent: AhrefsBot
User-agent: ClaudeBot
User-agent: DataForSeoBot
User-agent: DotBot
User-agent: GPTBot
User-agent: SeekportBot
User-agent: SemrushBot
User-agent: MJ12bot
Crawl-delay: 10
This asks these bots to wait 10 seconds between page requests when crawling a domain. If we want to be more drastic, we can block them entirely.
The list is based on looking at some of the recent access logs with 'Bot' in the user agent.
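As a rough sanity check of the rule group quoted above, the following sketch feeds it to Python's standard-library robots.txt parser and reports the crawl delay each agent would see. It is only an illustration: the helper name `load_rules` is made up here, `Googlebot` is included purely as an example of an agent that is not on the list, and whether a given bot actually honors `Crawl-delay` is up to that bot.

```python
from urllib.robotparser import RobotFileParser

# The rule group from this ticket, one directive per line.
ROBOTS_TXT = """\
User-agent: AhrefsBot
User-agent: ClaudeBot
User-agent: DataForSeoBot
User-agent: DotBot
User-agent: GPTBot
User-agent: SeekportBot
User-agent: SemrushBot
User-agent: MJ12bot
Crawl-delay: 10
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text (hypothetical helper, not part of the Commons codebase)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Listed agents should report a delay of 10 and still be allowed to fetch pages;
# an unlisted agent should report no delay at all.
for agent in ("GPTBot", "SemrushBot", "Googlebot"):
    print(agent, rules.crawl_delay(agent), rules.can_fetch(agent, "/"))
```

Because the group contains no Disallow lines, the listed bots are slowed down rather than blocked, which matches the intent described above.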
Updated by Boone Gorges about 2 months ago
Thanks, Ray. I think it's OK to make these robots.txt changes permanent. I would be more wary of total blocks on any user agent, unless we had a pretty clear indication that those specific agents were causing specific performance problems.
Updated by Raymond Hoh about 2 months ago
I've committed the robots change in https://github.com/cuny-academic-commons/cac/commit/acdb50bc1c2087f57ef5d0b81d4ec5b764429e38.
> I would be more wary of total blocks on any user agent, unless we had a pretty clear indication that those specific agents were causing specific performance problems.
That's sensible, though I'd personally like to block all SEO and marketing-related bots :)
Updated by Raymond Hoh about 2 months ago
- Status changed from Resolved to In Progress
Reopening this one.
Looking through the PHP error log, it looks like once again Editoria11y is the culprit:
[Tue Sep 24 12:05:01 2024] [notice] [pid 66265] sapi_apache2.c(349): [client 182.48.91.196:38982] WordPress database error Query execution was interrupted for query SELECT option_value FROM wp_34426_options WHERE option_name = 'uninstall_plugins' LIMIT 1 made by require('wp-blog-header.php'), require_once('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), include_once('/plugins/editoria11y-accessibility-checker/editoria11y.php'), register_uninstall_hook, get_option
Editoria11y's uninstall routine loops through the first 10,000 sites on the network to try to remove all Editoria11y tables.
When the site is back up, we'll need to adjust this uninstall routine.
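For anyone repeating this kind of log review, here is a minimal sketch of how plugin names could be pulled out of "WordPress database error" lines and tallied. It assumes the log lines look like the one quoted above, with a " made by " call chain that includes '/plugins/<slug>/' paths; the function names `plugin_slugs` and `tally` are invented for this example.

```python
import re
from collections import Counter

# Matches plugin paths such as '/plugins/<slug>/file.php' inside a call chain.
PLUGIN_RE = re.compile(r"/plugins/([^/']+)/")


def plugin_slugs(log_line: str) -> list[str]:
    """Return plugin slugs named in the 'made by' chain of a database error line."""
    if "WordPress database error" not in log_line:
        return []
    _, sep, chain = log_line.partition(" made by ")
    return PLUGIN_RE.findall(chain) if sep else []


def tally(lines) -> Counter:
    """Count how many error lines mention each plugin (each line counted once per plugin)."""
    counts = Counter()
    for line in lines:
        counts.update(set(plugin_slugs(line)))
    return counts


# The log line quoted in this ticket.
SAMPLE = (
    "[Tue Sep 24 12:05:01 2024] [notice] [pid 66265] sapi_apache2.c(349): "
    "[client 182.48.91.196:38982] WordPress database error Query execution was "
    "interrupted for query SELECT option_value FROM wp_34426_options WHERE "
    "option_name = 'uninstall_plugins' LIMIT 1 made by require('wp-blog-header.php'), "
    "require_once('wp-load.php'), require_once('wp-config.php'), "
    "require_once('wp-settings.php'), "
    "include_once('/plugins/editoria11y-accessibility-checker/editoria11y.php'), "
    "register_uninstall_hook, get_option"
)

print(tally([SAMPLE]))
```

A plugin that dominates the tally is only a lead, not proof: once the database is struggling, unrelated queries get interrupted too.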
Updated by Boone Gorges about 2 months ago
Unreal. I've got to run to a meeting, but can you please short-circuit this in production while we figure out what to do longer term?
Updated by Raymond Hoh about 2 months ago
I've limited Editoria11y's uninstall routine to just the current site in https://github.com/cuny-academic-commons/cac/commit/fc928dcff406cfb135cf882d5f070527ab2b2552. I've also pushed the fix to production.
Updated by Matt Gold about 2 months ago
FYI, the Commons status blog says that the site is back up, but I still see an under-maintenance message.
Updated by Raymond Hoh about 2 months ago
When I updated the status blog, the site was back online, but it went down again shortly afterwards and has since come back up. So it appears that something else is the cause of the downtime.
I'm currently away from my PC and won't be able to look at this for at least another hour or so. If Boone is around, hopefully he can jump in.
Updated by Boone Gorges about 2 months ago
- File slow_query.log-20240922 added
- File slow_query.log added
Ray, thanks for reviewing so far.
Yiu Ming sent along the attached logs. There's no single query that's triggering anything in them. What I see is a handful of standard-issue slow queries (10-12 seconds), and then a series of ~60 second queries. The latter are clustered together in such a way that I believe they are a symptom of overall database slowness, and don't directly indicate any problems.
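The clustering described above can be checked with a short script. This is a sketch under stated assumptions, not the analysis actually run on the attached logs: it assumes each slow-log entry has a `# Query_time:` header followed by a `SET timestamp=...;` line, the 30-second threshold and 120-second gap are arbitrary choices, and the sample entries are hypothetical values made up for illustration.

```python
import re

QUERY_TIME_RE = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")
TIMESTAMP_RE = re.compile(r"^SET timestamp=(\d+);")


def parse_entries(lines):
    """Yield (unix_timestamp, query_seconds) for each slow-log entry."""
    query_time = None
    for line in lines:
        m = QUERY_TIME_RE.match(line)
        if m:
            query_time = float(m.group(1))
            continue
        m = TIMESTAMP_RE.match(line)
        if m and query_time is not None:
            yield int(m.group(1)), query_time
            query_time = None


def cluster_long_queries(entries, min_seconds=30.0, max_gap=120):
    """Group long queries whose timestamps fall within max_gap seconds of the previous one."""
    clusters = []
    for ts, secs in sorted(e for e in entries if e[1] >= min_seconds):
        if clusters and ts - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((ts, secs))
        else:
            clusters.append([(ts, secs)])
    return clusters


# Hypothetical excerpt: one ordinary slow query, then a burst of ~60 second queries.
SAMPLE_LOG = """\
# Query_time: 11.204310  Lock_time: 0.000051 Rows_sent: 1  Rows_examined: 90210
SET timestamp=1727190000;
SELECT 1;
# Query_time: 60.013377  Lock_time: 0.000040 Rows_sent: 0  Rows_examined: 0
SET timestamp=1727193900;
SELECT 2;
# Query_time: 60.020001  Lock_time: 0.000038 Rows_sent: 0  Rows_examined: 0
SET timestamp=1727193905;
SELECT 3;
# Query_time: 59.998000  Lock_time: 0.000042 Rows_sent: 0  Rows_examined: 0
SET timestamp=1727193930;
SELECT 4;
""".splitlines()

clusters = cluster_long_queries(parse_entries(SAMPLE_LOG))
for group in clusters:
    print(len(group), "long queries starting at", group[0][0])
```

Many unrelated long queries landing in the same short window would be consistent with general database slowness rather than a single bad query.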
The site appears to be running OK, and has been for the last 30 minutes or so. From this I'm guessing that Ray correctly identified the Editoria11y bug as the source of the problem, but that when the database server restarted, there were still some pending queries that needed to be flushed out of the system, triggering a second reboot. (This tends to happen when it's only the database server that restarts, since certain Apache processes may remain open.)
I'm going to work up a ticket for the Editoria11y plugin and I'll post here when I have a link.
Updated by Boone Gorges about 2 months ago
I've opened a PR for the Editoria11y repo: https://github.com/itmaybejj/editoria11y-wp/pull/35
Updated by Boone Gorges about 2 months ago
The maintainer accepted my PR: https://github.com/itmaybejj/editoria11y-wp/pull/35#issuecomment-2374014411
Updated by Boone Gorges 9 days ago
- Status changed from In Progress to Resolved
- Target version set to Not tracked