Bug #21034

closed

Commons database server is down

Added by Raymond Hoh 2 months ago. Updated 10 days ago.

Status: Resolved
Priority name: Normal
Assignee: -
Category name: Server
Target version:
Start date: 2024-09-20
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

According to Hetrixtools, it's been down since 4:55pm ET: https://hetrixtools.com/report/uptime/65b1dc51c2df4bf4b5cf78ba6e7bb4b6/.

The DB server is probably rebooting and should be back up in about an hour.


Files

slow_query.log (7.84 KB), Boone Gorges, 2024-09-24 04:59 PM
slow_query.log-20240922 (37.7 KB), Boone Gorges, 2024-09-24 04:59 PM
Actions #1

Updated by Colin McDonald 2 months ago

Hi Ray, thanks for noticing and posting. Do you think we should just wait an hour and see if this is resolved on its own, rather than contacting CUNY IT right now?

Actions #2

Updated by Raymond Hoh 2 months ago

The database server usually takes between 1.5 and 2 hours to reboot, judging from some of the recent downtimes in July: https://hetrixtools.com/report/uptime/2024-07/65b1dc51c2df4bf4b5cf78ba6e7bb4b6/#downtime_when. I'll check back in about half an hour to see if the Commons is back up.

Actions #3

Updated by Raymond Hoh 2 months ago

  • Status changed from New to Resolved

The Commons is back online.

Actions #4

Updated by Raymond Hoh 2 months ago

Boone, I've added a few robots.txt changes to wp-content/mu-plugins/cac-functions.php on production but haven't committed them yet:

User-agent: AhrefsBot
User-agent: ClaudeBot
User-agent: DataForSeoBot
User-agent: DotBot
User-agent: GPTBot
User-agent: SeekportBot
User-agent: SemrushBot
User-agent: MJ12bot
Crawl-delay: 10

This limits each of these bots to one page request per domain every 10 seconds. If we want to be more drastic, we can block them entirely.
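As a rough illustration of the rule group above, the following Python sketch renders the same User-agent list with a Crawl-delay and checks it with the standard library's robots.txt parser. The helper name render_crawl_delay_group is hypothetical, and this is not the code in cac-functions.php; it only shows what the resulting rules mean to a crawler that honors them.

```python
from urllib.robotparser import RobotFileParser

# User agents taken from the list in this comment.
THROTTLED_BOTS = [
    "AhrefsBot", "ClaudeBot", "DataForSeoBot", "DotBot",
    "GPTBot", "SeekportBot", "SemrushBot", "MJ12bot",
]


def render_crawl_delay_group(user_agents, delay_seconds):
    """Return one robots.txt group that applies a Crawl-delay to every agent.

    Hypothetical helper for illustration only.
    """
    lines = [f"User-agent: {agent}" for agent in user_agents]
    lines.append(f"Crawl-delay: {delay_seconds}")
    return "\n".join(lines) + "\n"


robots_txt = render_crawl_delay_group(THROTTLED_BOTS, 10)

# Parse the generated rules the way a compliant crawler might.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.crawl_delay("GPTBot"))     # delay for a listed bot
print(parser.crawl_delay("Googlebot"))  # unlisted bot: no delay configured
print(parser.can_fetch("GPTBot", "/"))  # listed bots are throttled, not blocked
```

Because the group contains no Disallow line, listed bots may still fetch every page; only the request rate is affected.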

The list is based on looking at some of the recent access logs with 'Bot' in the user agent.
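One way such a list could be derived is sketched below in Python. It assumes the access logs use the common "combined" format, where the user agent is the last double-quoted field, and it matches "bot" case-insensitively so that agents like MJ12bot are counted too. The sample lines and the function name count_bot_agents are hypothetical and are not taken from the Commons logs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')
# A product token containing "bot", e.g. "GPTBot" or "MJ12bot".
BOT_TOKEN_RE = re.compile(r"[A-Za-z0-9_-]*bot[A-Za-z0-9_-]*", re.IGNORECASE)


def count_bot_agents(log_lines):
    """Count requests per bot-like token found in the user agent field."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue  # line does not end with a quoted field
        token = BOT_TOKEN_RE.search(match.group(1))
        if token:
            counts[token.group(0)] += 1
    return counts


# Hypothetical sample lines for illustration.
sample = [
    '203.0.113.5 - - [24/Sep/2024:12:00:01 -0400] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
    '203.0.113.6 - - [24/Sep/2024:12:00:02 -0400] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
    '203.0.113.7 - - [24/Sep/2024:12:00:03 -0400] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; SemrushBot/7)"',
    '203.0.113.8 - - [24/Sep/2024:12:00:04 -0400] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for agent, hits in count_bot_agents(sample).most_common():
    print(agent, hits)
```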

Actions #5

Updated by Boone Gorges about 2 months ago

Thanks, Ray. I think it's OK to make these robots.txt changes permanent. I would be more wary of total blocks on any user agent, unless we had a pretty clear indication that those specific agents were causing specific performance problems.

Actions #6

Updated by Raymond Hoh about 2 months ago

I've committed the robots change in https://github.com/cuny-academic-commons/cac/commit/acdb50bc1c2087f57ef5d0b81d4ec5b764429e38.

> I would be more wary of total blocks on any user agent, unless we had a pretty clear indication that those specific agents were causing specific performance problems.

That's sensible, though I'd personally like to block all SEO and marketing-related bots :)

Actions #7

Updated by Raymond Hoh about 2 months ago

  • Status changed from Resolved to In Progress

Reopening this one.

Looking through the PHP error log, it looks like once again Editoria11y is the culprit:

[Tue Sep 24 12:05:01 2024] [notice] [pid 66265] sapi_apache2.c(349): [client 182.48.91.196:38982] WordPress database error Query execution was interrupted for query SELECT option_value FROM wp_34426_options WHERE option_name = 'uninstall_plugins' LIMIT 1 made by require('wp-blog-header.php'), require_once('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), include_once('/plugins/editoria11y-accessibility-checker/editoria11y.php'), register_uninstall_hook, get_option

Editoria11y's uninstall routine loops through the first 10,000 sites on the network to try to remove all Editoria11y tables:

https://github.com/itmaybejj/editoria11y-wp/blob/bc05516f7e582159da6cfc28728ca594a1112f5d/editoria11y.php#L309-L315

When the site is back up, we'll need to adjust this uninstall routine.
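For triaging entries like the one quoted above, a small Python sketch can split a "WordPress database error" line into the error, the query, and the caller chain, and pull out the plugin directory from the stack. The regular expressions are assumptions based only on the shape of that single log entry, and parse_db_error is a hypothetical helper name.

```python
import re

# Matches the "WordPress database error ... for query ... made by ..." shape
# of the log entry quoted in this comment (an assumption about the format).
DB_ERROR_RE = re.compile(
    r"WordPress database error (?P<error>.*?) for query (?P<query>.*?)"
    r" made by (?P<stack>.*)$"
)
PLUGIN_RE = re.compile(r"/plugins/([^/]+)/")


def parse_db_error(line):
    """Split a WordPress database error line into error, query, and callers."""
    match = DB_ERROR_RE.search(line)
    if match is None:
        return None
    stack = match.group("stack")
    plugin = PLUGIN_RE.search(stack)
    return {
        "error": match.group("error"),
        "query": match.group("query"),
        "callers": stack.split(", "),
        "plugin": plugin.group(1) if plugin else None,
    }


log_line = (
    "[Tue Sep 24 12:05:01 2024] [notice] [pid 66265] sapi_apache2.c(349): "
    "[client 182.48.91.196:38982] WordPress database error Query execution "
    "was interrupted for query SELECT option_value FROM wp_34426_options "
    "WHERE option_name = 'uninstall_plugins' LIMIT 1 made by "
    "require('wp-blog-header.php'), require_once('wp-load.php'), "
    "require_once('wp-config.php'), require_once('wp-settings.php'), "
    "include_once('/plugins/editoria11y-accessibility-checker/editoria11y.php'), "
    "register_uninstall_hook, get_option"
)

parsed = parse_db_error(log_line)
print(parsed["plugin"])
print(parsed["callers"][-2:])
```

Counting the plugin field across a whole error log would show whether one plugin dominates the interrupted queries.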

Actions #8

Updated by Boone Gorges about 2 months ago

Unreal. I've got to run to a meeting, but can you please short-circuit this in production while we figure out what to do longer term?

Actions #9

Updated by Raymond Hoh about 2 months ago

I've limited Editoria11y's uninstall routine to just the current site in https://github.com/cuny-academic-commons/cac/commit/fc928dcff406cfb135cf882d5f070527ab2b2552. I've also pushed the fix to production.

Actions #10

Updated by Matt Gold about 2 months ago

FYI, the Commons status blog says that the site is back up, but I still see an under-maintenance message.

Actions #11

Updated by Marilyn Weber about 2 months ago

Still down, alas.

Actions #12

Updated by Raymond Hoh about 2 months ago

When I updated the status blog, the site was back online, but it went down again shortly afterwards and is now back up again. So it appears that something else is causing the downtime.

I'm currently away from my PC and won't be able to look at this for at least another hour or so. If Boone is around, hopefully he can jump in.

Actions #13

Updated by Boone Gorges about 2 months ago

Ray, thanks for reviewing so far.

Yiu Ming sent along the attached logs. There's no single query that's triggering anything in them. What I see is a handful of standard-issue slow queries (10-12 seconds), and then a series of ~60 second queries. The latter are clustered together in such a way that I believe they are a symptom of overall database slowness, and don't directly indicate any problems.
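The kind of reading described here, separating ordinary slow queries from clustered long ones, could be sketched in Python as follows. It assumes the log uses the usual MySQL/MariaDB slow query log layout, with a "# Query_time:" header line followed by a "SET timestamp=...;" line for each entry. The 30-second threshold, the 120-second gap, the function names, and the sample data are all assumptions for illustration, not values taken from the attached logs.

```python
import re

QUERY_TIME_RE = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")
TIMESTAMP_RE = re.compile(r"^SET timestamp=(\d+);")


def read_slow_queries(lines):
    """Yield (unix_timestamp, query_seconds) pairs from slow query log lines."""
    pending = None
    for line in lines:
        time_match = QUERY_TIME_RE.match(line)
        if time_match:
            pending = float(time_match.group(1))
            continue
        stamp_match = TIMESTAMP_RE.match(line)
        if stamp_match and pending is not None:
            yield int(stamp_match.group(1)), pending
            pending = None


def cluster_long_queries(entries, min_seconds=30.0, max_gap=120):
    """Group queries of at least min_seconds whose timestamps are close together."""
    long_entries = sorted(e for e in entries if e[1] >= min_seconds)
    clusters = []
    for stamp, seconds in long_entries:
        # Extend the current cluster if this entry follows the previous one closely.
        if clusters and stamp - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((stamp, seconds))
        else:
            clusters.append([(stamp, seconds)])
    return clusters


# Hypothetical sample entries for illustration.
sample_log = """\
# Query_time: 11.204  Lock_time: 0.000 Rows_sent: 1  Rows_examined: 90000
SET timestamp=1727190000;
SELECT 1;
# Query_time: 60.012  Lock_time: 0.000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1727193900;
SELECT 2;
# Query_time: 59.871  Lock_time: 0.000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1727193930;
SELECT 3;
# Query_time: 60.105  Lock_time: 0.000 Rows_sent: 0  Rows_examined: 0
SET timestamp=1727197600;
SELECT 4;
""".splitlines()

entries = list(read_slow_queries(sample_log))
clusters = cluster_long_queries(entries)
print([len(cluster) for cluster in clusters])
```

Tight clusters of long queries with unrelated SQL would be consistent with general database slowness rather than one problematic query.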

The site appears to be running OK, and has been for the last 30 minutes or so. From this I'm guessing that Ray correctly identified the Editoria11y bug as the source of the problem, but that when the database server restarted, there were still some pending queries that needed to be flushed out of the system, triggering a second reboot. (This tends to happen when it's only the database server that restarts, since certain Apache processes may remain open.)

I'm going to work up a ticket for the Editoria11y plugin and I'll post here when I have a link.

Actions #14

Updated by Boone Gorges about 2 months ago

I've opened a PR for the Editoria11y repo: https://github.com/itmaybejj/editoria11y-wp/pull/35

Actions #15

Updated by Raymond Hoh about 2 months ago

Thanks for opening the PR, Boone!

Actions #17

Updated by Boone Gorges 10 days ago

  • Status changed from In Progress to Resolved
  • Target version set to Not tracked