Nightly system downtime
I'm opening this ticket to track recent outages on the Commons. A number of us receive automated notices when the database server is forced to reboot. Beginning roughly one month ago, we started getting these notices several times per week. Because they are sent only at the point of the forced reboot, they mark at best the end of a problematic period. I also receive notifications when requests begin to take an inordinately long time, which indicate when the incidents begin.
I've been keeping track of specifics over the last few weeks, and I've discerned the following patterns:
- The most common downtime is just after 05:00 UTC (midnight EST, UTC-5), with my incident reports rolling in sometimes around 12:03am and sometimes around 12:07am.
- Occasionally, the incidents have begun an hour or two earlier, shortly after 03:00 or 04:00 UTC.
- Incidents always seem to begin several minutes after the hour.
- Reboots usually take place between 3 and 6 minutes after the beginning of the incident.
- On some occasions, the reboots don't seem to fix the underlying issue, and another cycle of slow requests and DB reboots immediately follows.
- Sometimes this'll happen a few nights in a row, while sometimes the site will go a few days without any notifications.
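For anyone who wants to reproduce this kind of tally, here is a minimal sketch of how the two notification streams could be paired up. It assumes the slow-request alerts and reboot notices have already been collected as UTC timestamps. The function name, the 30-minute pairing window, and the sample timestamps are all made up for illustration; they are not taken from our actual notifications.

```python
from datetime import datetime, timedelta, timezone


def summarize_incidents(slow_alerts, reboots, max_gap=timedelta(minutes=30)):
    """Pair each slow-request alert with the first reboot that follows it.

    max_gap is an arbitrary cutoff (an assumption) so that an alert is not
    matched to a reboot from an unrelated, much later incident.
    """
    reboots = sorted(reboots)
    rows = []
    for start in sorted(slow_alerts):
        # First reboot at or after the alert, within the pairing window.
        reboot = next(
            (r for r in reboots if start <= r <= start + max_gap), None
        )
        rows.append({
            "start": start,
            "minutes_past_hour": start.minute,
            "minutes_to_reboot": (
                None if reboot is None
                else (reboot - start).total_seconds() / 60
            ),
        })
    return rows


# Illustrative timestamps only, not real incident data.
utc = timezone.utc
slow_alerts = [
    datetime(2024, 1, 10, 5, 3, tzinfo=utc),
    datetime(2024, 1, 12, 5, 7, tzinfo=utc),
]
reboots = [
    datetime(2024, 1, 10, 5, 8, tzinfo=utc),
    datetime(2024, 1, 12, 5, 11, tzinfo=utc),
]

rows = summarize_incidents(slow_alerts, reboots)
for row in rows:
    print(row["start"].isoformat(), row["minutes_past_hour"], row["minutes_to_reboot"])
```

An alert with no reboot inside the window gets `minutes_to_reboot` of `None`, which makes it easy to spot the cases where a slowdown resolved without a forced reboot.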
All of this strongly suggests that the problem is with an automated cron task, specifically one that takes place around midnight. I've begun to do an analysis of the tasks scheduled for around this time, cross-referencing with the logs. Ideally, we'd be able to narrow down the culprit by identifying the last cron task that begins just before each outage. Unfortunately, this is not possible, for a couple of reasons. First, the performance issues may only kick in a minute or two after the task begins running (as the system's resources are gradually used up). Second, the Cavalcade logs don't natively keep track of when a task begins running, but only when it finishes (see https://github.com/humanmade/Cavalcade-Runner/blob/master/inc/class-runner.php#L377).
So the best we can do is to make some educated guesses. I'll follow up in a comment with initial thoughts.
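As a rough illustration of how such guesses could be structured, the sketch below ranks tasks by how many incidents began shortly after one of their scheduled runs. It works from scheduled times rather than actual start times, since the latter are not logged. The input format, the 10-minute lookback window, and the task names are all hypothetical; nothing here reflects Cavalcade's real schema or our real job list.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def candidate_tasks(incident_starts, scheduled_runs, lookback=timedelta(minutes=10)):
    """Rank tasks by the number of incidents that began soon after a scheduled run.

    scheduled_runs is a list of (task_name, scheduled_time) pairs. The
    lookback window is a guess (an assumption): it allows for the delay
    between a task starting and resources actually running out.
    """
    counts = Counter()
    for start in incident_starts:
        suspects = set()
        for task, run_at in scheduled_runs:
            # The task was due at or before the incident, within the window.
            if timedelta(0) <= start - run_at <= lookback:
                suspects.add(task)
        # Count each task at most once per incident.
        counts.update(suspects)
    return counts.most_common()


# Illustrative data only: task names and times are invented.
utc = timezone.utc
incident_starts = [
    datetime(2024, 1, 10, 5, 3, tzinfo=utc),
    datetime(2024, 1, 12, 5, 7, tzinfo=utc),
]
scheduled_runs = [
    ("task_a", datetime(2024, 1, 10, 5, 0, tzinfo=utc)),
    ("task_a", datetime(2024, 1, 11, 5, 0, tzinfo=utc)),
    ("task_a", datetime(2024, 1, 12, 5, 0, tzinfo=utc)),
    ("task_b", datetime(2024, 1, 10, 5, 0, tzinfo=utc)),
    ("task_c", datetime(2024, 1, 10, 4, 0, tzinfo=utc)),
]

ranked = candidate_tasks(incident_starts, scheduled_runs)
print(ranked)
```

A high count only marks a task as worth a closer look. A task that also runs on quiet nights, as `task_a` does in the sample data, is not ruled out by that, since the problem may depend on how much work the task has to do on a given night.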