Feature #8987

Migrate away from wp-cron

Added by Boone Gorges almost 4 years ago. Updated about 3 years ago.

Status: Resolved
Priority name: High
Assignee:
Category name: -
Target version:
Start date: 2017-12-07
Due date:
% Done: 0%
Estimated time:

Description

I suspect that our persistent performance issues are somehow linked to cron jobs. A few pieces of evidence:

1. Outages seem to take place on the hour (1pm, 3pm, etc), when WP cron jobs are more likely to be run.
2. When I put monitoring tools in place to track long-running queries, wp-cron.php requests are some of the most significant culprits
3. The wp-cron system is known to scale quite poorly on multisite, where it's likely to trigger locking and concurrency issues

Even if it turns out that this is not the culprit, it would not hurt to move to something more robust.

Ray, as a starting point, I was wondering whether you had experience with alternative systems. Something like https://engineering.hmn.md/projects/cavalcade/ seems ideal but will obviously take some config and testing, which I don't want to devote if you think there might be a simpler fix.

cavalcade-migrate-all.sh (111 Bytes), Boone Gorges, 2018-06-05 05:22 PM
cavalcade-migrate.php (845 Bytes), Boone Gorges, 2018-06-05 05:23 PM

Related issues

Related to CUNY Academic Commons - Bug #9865: Broken Link Checker cron jobs running long (Resolved, 2018-05-31)

Related to CUNY Academic Commons - Bug #6737: wp-rss-multi-importer cron job requires huge number of DB queries (Resolved, 2016-11-15)

Related to CUNY Academic Commons - Bug #9926: twitter-mentions-as-comments cron jobs can run long (New, 2018-06-13)

Related to CUNY Academic Commons - Bug #9929: External Group Blogs cron review (Resolved, 2018-06-14)

Related to CUNY Academic Commons - Bug #9930: wp_privacy_delete_old_export_files runs a bazillion times (Resolved, 2018-06-14)

History

#1 Updated by Boone Gorges over 3 years ago

  • Priority name changed from Normal to High

After some performance issues today, I'm more convinced that cron may be at the root of many of our recent issues. I'm going to prioritize this issue for January.

#2 Updated by Boone Gorges over 3 years ago

  • Target version changed from Not tracked to 1.12.6

Placing in a milestone so it doesn't get lost - I don't think we can get anything in place by 1.12.6, but maybe we can get a testing rig set up by then.

#3 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.6 to 1.12.7

#4 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.7 to 1.12.8

#5 Updated by Raymond Hoh over 3 years ago

Cavalcade looks promising, but it does require some server setup. On the call I mentioned nginx, but I meant to say PHP's process control extension! :)

The easier fix is disabling WP cron and running a specialized WP-CLI script to trigger WP cron ourselves like this one:
https://wordpress.stackexchange.com/a/239257

However, we have a rather large multisite install. If we decide to run that WP-CLI script, how often would we run it? That SE answer recommends every minute.

#6 Updated by Boone Gorges over 3 years ago

Thanks for looking, Ray!

During my research I saw lots of scripts like the one you referenced, but it seems to me to solve the wrong problem. The problem that many multisite installations have with wp-cron is that subsites are not visited frequently enough to trigger cron jobs reliably; firing a request at each one every minute helps solve that. But I think our problem is that certain sites have so many cron jobs (or cron jobs of such a nature) that the cron-triggering requests are held open for too long a time. A one-per-minute cron run of all sites wouldn't help with that - and might make it worse.

The thing I like about Cavalcade (as I understand it) is that it moves jobs out into a central repository, with a single daemon that monitors for jobs across an entire network. There would be no concurrency issues with, say, 100 sites attempting to pull up the cron array at a time.
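A minimal sketch of the difference described above, using made-up job durations. It assumes the wp-cron model can be approximated as "one request per site runs every due job back to back" and the queue model as "each job is its own unit of work"; the site names and timings are hypothetical.

```python
from typing import Dict, List


def batch_request_seconds(jobs_by_site: Dict[str, List[float]]) -> Dict[str, float]:
    """wp-cron style: a single request per site runs all due jobs in sequence,
    so the request stays open for the sum of the job durations."""
    return {site: sum(durations) for site, durations in jobs_by_site.items()}


def longest_queued_job_seconds(jobs_by_site: Dict[str, List[float]]) -> float:
    """Queue style: every job is picked up separately, so the longest single
    unit of work is just the slowest individual job."""
    return max(d for durations in jobs_by_site.values() for d in durations)


# Hypothetical durations in seconds for jobs that are due right now.
due_jobs = {
    "site-a": [2.0, 45.0, 30.0],
    "site-b": [1.0, 0.5],
}

print(batch_request_seconds(due_jobs))       # per-site request time when batched
print(longest_queued_job_seconds(due_jobs))  # slowest single job when queued
```

In this toy model the batched request for site-a is held open for 77 seconds, while no individual queued job runs longer than 45.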

I'm going to spend time next week setting up Cavalcade locally to try to understand its setup process better. Then hopefully we can convince Lihua to set it up in cdev for us, so that we can run some tests.

#7 Updated by Raymond Hoh over 3 years ago

Yeah, I wasn't sure about the overall benefits of the WP-CLI script.

I also found another cron alternative by Automattic that is used on their VIP cluster:
https://github.com/Automattic/Cron-Control

Similar to Cavalcade in that it uses a queue, but requires PHP 7.

It uses a Golang daemon or REST API endpoints. Here's a link to the runner:
https://github.com/Automattic/Cron-Control/tree/master/runner

At a glance, if we just used their REST API version, I wouldn't really think REST API endpoints would increase the reliability of running cron events, but the Golang daemon is interesting.

I also don't really see signs of wide multisite support.

#8 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.8 to 1.12.9

I've done some initial research on this and pinged Lihua about it. I'll continue to punt this ticket while we work out the details.

#9 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.9 to 1.12.10

#10 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.10 to 1.12.11

#11 Updated by Boone Gorges over 3 years ago

An update. Lihua has kindly installed Cavalcade on LDV1, where I've been running some testing.

The first problem I've hit is related to MySQL time zones. MySQL's time_zone setting on the GC cluster is -05:00 (or whatever local time is). This causes Cavalcade's NOW checks for next events to be off by a couple of hours, since the dates themselves are stored in UTC. This can be fixed by setting the `time_zone` setting on connection in the Cavalcade runner. See https://github.com/humanmade/Cavalcade/issues/55. I can't figure out how to create a proper Cavalcade plugin, so for the moment I've just put the change directly in the runner.
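To illustrate the time zone mismatch, here is a small sketch. It assumes the "is this job due?" check boils down to comparing the stored UTC run time against the database's idea of "now"; the dates and the `is_due` helper are hypothetical and are not taken from the Cavalcade source.

```python
from datetime import datetime, timedelta


def is_due(nextrun_utc: datetime, db_now: datetime) -> bool:
    """A job is due once its stored run time is not later than 'now'."""
    return nextrun_utc <= db_now


# Hypothetical example: the job was scheduled for 15:00 UTC and the real
# time is 15:30 UTC, so it should already be due.
actual_utc_now = datetime(2018, 1, 15, 15, 30)
nextrun_utc = datetime(2018, 1, 15, 15, 0)

# With a session time zone of -05:00, the database reports "now" as 10:30.
local_now = actual_utc_now + timedelta(hours=-5)

print(is_due(nextrun_utc, local_now))       # False: the job looks hours away
print(is_due(nextrun_utc, actual_utc_now))  # True: comparing UTC against UTC
```

Forcing the connection to UTC (in MySQL terms, something like `SET time_zone = '+00:00'`) makes the second comparison the one that actually happens.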

Second problem is that the runner eats up all the server memory. After some debugging, it looks like the problem is the Jetpack Sync module, which tries to do non-stop syncing for all enabled sites when Cavalcade runs. It turns out the problem has already been reported: https://github.com/humanmade/Cavalcade/issues/29 https://github.com/Automattic/jetpack/issues/5513. My next step will likely be to patch our local fork to see if this fixes the issue.

More to come as I find more time to work on this. I wanted to leave these notes to help myself remember where I left off.

#12 Updated by Raymond Hoh over 3 years ago

  • Target version changed from 1.12.11 to 1.12.12

#13 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.12 to 1.12.13

#14 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.12.13 to 1.13.2

#15 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.13.2 to 1.13.3

#16 Updated by Boone Gorges over 3 years ago

After a nudge from Lihua, I have started to look at this again.

> The first problem I've hit is related to MySQL time zones. MySQL's time_zone setting on the GC cluster is -05:00 (or whatever local time is). This causes Cavalcade's NOW checks for next events to be off by a couple of hours, since the dates themselves are stored in UTC. This can be fixed by setting the `time_zone` setting on connection in the Cavalcade runner. See https://github.com/humanmade/Cavalcade/issues/55. I can't figure out how to create a proper Cavalcade plugin, so for the moment I've just put the change directly in the runner.

The problem has been fixed upstream, so I updated our cavalcade-runner. It looks to be working properly.

> Second problem is that the runner eats up all the server memory. After some debugging, it looks like the problem is the Jetpack Sync module, which tries to do non-stop syncing for all enabled sites when Cavalcade runs. It turns out the problem has already been reported: https://github.com/humanmade/Cavalcade/issues/29 https://github.com/Automattic/jetpack/issues/5513. My next step will likely be to patch our local fork to see if this fixes the issue.

I've applied dd32's patch and rebuilt the jobs database table, and it does seem to address the problem.

I'm doing some light testing on the dev environment and things are generally looking good. Cron jobs are being fired on schedule, and wp-cron.php is not being called at all.

Ray, do you have any ideas about how to stress-test this before dropping it into production? Would you mind playing with it on LDV1 a bit to make sure it's working as you'd expect? Note that you'll need to run the cavalcade utility in an SSH session, and leave it running, while you do your testing in the web browser. Once you've done a sanity check on my own tests, we can ask Lihua to flip the daemon back on, run another round of quick tests, and then come up with a plan for migrating production cron jobs (or not??) and testing it there.

#17 Updated by Raymond Hoh over 3 years ago

I've tested Cavalcade for a little bit and it appears to be working.

I have some questions about the runner though. How do we know if there is an existing Cavalcade runner already running? There doesn't appear to be a wp-cli command to determine its current status.

To test, I ran the nohup cavalcade command twice and they both show up as separate processes - ps aux | grep cavalcade. (I guess I answered my own question -- by checking ps aux!)

The other thing is if the runner gets killed for whatever reason, we would have to re-initialize the runner. Could probably write a bash script tied to system cron to see if the Cavalcade runner is running and if not, we run it again?
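A rough sketch of the check such a watchdog could perform, assuming it is fed the text output of `ps aux` and that the runner binary lives at `/usr/local/bin/cavalcade` (an assumed path for illustration). The helper name `find_runner_pids` is hypothetical.

```python
from typing import List


def find_runner_pids(ps_output: str, needle: str = "/usr/local/bin/cavalcade") -> List[int]:
    """Return the PIDs of processes whose command line contains `needle`.

    Expects `ps aux` style output: ten whitespace-separated columns
    followed by the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # blank or malformed line
        command = fields[10]
        # Skip the grep process itself if the output came through a pipe.
        if needle in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Hypothetical sample of `ps aux` output.
sample = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
someuser 23059  0.2  0.0 417416 19092 ?        S    23:01   0:03 php /usr/local/bin/cavalcade .
someuser 24000  0.0  0.0 112708   976 pts/0    S+   23:05   0:00 grep cavalca
"""

print(find_runner_pids(sample))
```

An empty result would mean no runner is alive and one should be started; more than one PID would flag the duplicate-runner situation described above.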

Similarly, if the runner gets killed when a job is running and it is not completed, the jobs database would need to be altered. See https://github.com/humanmade/Cavalcade/issues/31

About stress-testing, I guess my concern is the initial addition of all the jobs across the network. After that would be the performance of the runner, but I think the runner looks to be well-coded.

#18 Updated by Boone Gorges over 3 years ago

Thanks for looking, Ray!

The idea is that we'll package the runner as a systemd service, at which point systemd will be responsible for daemon integrity. We will be using a modified version of https://github.com/humanmade/Cavalcade-Runner/blob/master/systemd.service. Notice Restart=always and https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

> Similarly, if the runner gets killed when a job is running and it is not completed, the jobs database would need to be altered.

Yes, that's interesting. We don't currently have any monitoring of failed jobs on wp-cron, so for all we know something similar could be happening now, but it's something to be aware of. Perhaps we could mitigate the potential fallout by having a pretty small number of workers - say, 5 - so that, at worst, you have 5 orphaned jobs.
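A small sketch of why a low worker count bounds the damage, using an in-memory list instead of the real jobs table. The `waiting` and `running` status names and both helper functions are assumptions made for illustration, not a description of how Cavalcade handles this internally.

```python
from typing import Dict, List


def claim_jobs(jobs: List[Dict], max_workers: int) -> List[Dict]:
    """Mark at most `max_workers` waiting jobs as running and return them."""
    claimed = []
    for job in jobs:
        if len(claimed) >= max_workers:
            break
        if job["status"] == "waiting":
            job["status"] = "running"
            claimed.append(job)
    return claimed


def reset_orphans(jobs: List[Dict]) -> int:
    """After a runner crash, put jobs stuck in 'running' back to 'waiting'.

    Returns the number of jobs that were reset.
    """
    orphans = [job for job in jobs if job["status"] == "running"]
    for job in orphans:
        job["status"] = "waiting"
    return len(orphans)


# Twelve hypothetical jobs, five workers.
jobs = [{"id": i, "status": "waiting"} for i in range(12)]
claim_jobs(jobs, max_workers=5)

# If the runner dies at this point, at most five jobs are left orphaned.
print(reset_orphans(jobs))
```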

> About stress-testing, I guess my concern is the initial addition of all the jobs across the network.

You mean the migration itself? I guess I'd probably shut the site down for the migration process, less for performance reasons than for concurrency - a task that gets scheduled after migration begins might not be picked up properly by the migration routine.

Next steps:
- I'll contact Lihua to have him reenable the Cavalcade-Runner daemon on LDV1 so that we can be sure that my patches resolve outstanding issues.
- I'll write a migration routine that moves wp-cron jobs to Cavalcade. This'll have to be in the form of a wp-cli command that runs on each individual site, since many cron jobs (and especially schedules) are defined by plugins that will only be available when launching a specific subsite.

#19 Updated by Boone Gorges over 3 years ago

I've contacted Lihua.

I've also written migration scripts, which I've attached here. cavalcade-migrate.php is meant to be run with eval-file on a single site. cavalcade-migrate-all will run the script on all sites in the network.
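The attached scripts are not reproduced here. As a sketch of the general shape only, the following builds one `wp eval-file` invocation per site, assuming the site URLs have already been collected (for example from `wp site list --field=url`). The function name and the example URLs are hypothetical.

```python
from typing import List


def build_migration_commands(
    site_urls: List[str], script: str = "cavalcade-migrate.php"
) -> List[List[str]]:
    """Build one WP-CLI command per site.

    Each command loads WordPress in the context of that site (--url) and
    runs the migration script with eval-file, so the plugins active on
    that site are loaded.
    """
    return [
        ["wp", "eval-file", script, f"--url={url.strip()}"]
        for url in site_urls
        if url.strip()
    ]


# Hypothetical site URLs.
commands = build_migration_commands(["https://example.org/", "https://blog.example.org/"])
for command in commands:
    print(" ".join(command))
```

Each argument list could be handed to `subprocess.run` one at a time, which keeps the migration sequential.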

#20 Updated by Boone Gorges over 3 years ago

  • Target version changed from 1.13.3 to 1.13.4

This is very close to being ready to go. Status:

- I've made some modifications to the migration scripts (as stored on ldv1 /home/bgorges/wp-cli-scripts) to exclude certain problematic legacy cron jobs from the migration, and also to ensure that no cron jobs are set in the past. (The fact that there are pending past jobs is an indication of bugs in the current system. I've written the script so that pending old jobs are scheduled for a random time a few hours after the migration, to avoid resource problems.)
- I'm going to add another bit of logic to the migration script that prevents a pending task from being migrated if there's already a task scheduled on that site by the same name.

The latter item is imperfect, but it will allow me to avoid shutting down for several hours during the Cavalcade migration. Some plugins (annoyingly) check for scheduled jobs on every single pageload. This means that, after switching out native cron with Cavalcade (but before the migrator has reached that specific site), the plugin will schedule its cron jobs. And when the migrator reaches that blog, it will schedule another version of the jobs. In some cases, this can cause duplicates. There are probably more elegant ways around this, but this is the fastest and least disruptive way forward.
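The two rules above (push overdue jobs to a random later time, and skip hooks that are already scheduled on the site) can be sketched roughly as follows. The one-to-four-hour window, the dictionary layout, and the function name are assumptions chosen for illustration; the real script may differ.

```python
import random
from typing import Dict, List, Set


def plan_migration(
    pending: List[Dict],
    already_scheduled: Set[str],
    now: int,
    rng: random.Random,
    min_delay: int = 3600,
    max_delay: int = 4 * 3600,
) -> List[Dict]:
    """Decide which pending wp-cron jobs to migrate and when they should run.

    - A job whose hook is already scheduled on the site is skipped.
    - A job whose run time is in the past is moved to a random time
      between `min_delay` and `max_delay` seconds from `now`.
    """
    planned = []
    for job in pending:
        if job["hook"] in already_scheduled:
            continue  # the plugin re-registered this hook before we got here
        nextrun = job["nextrun"]
        if nextrun < now:
            nextrun = now + rng.randint(min_delay, max_delay)
        planned.append({"hook": job["hook"], "nextrun": nextrun})
    return planned


now = 1_000_000  # hypothetical Unix timestamp
pending = [
    {"hook": "overdue_job", "nextrun": now - 500},
    {"hook": "future_job", "nextrun": now + 60},
    {"hook": "duplicate_job", "nextrun": now - 10},
]
plan = plan_migration(pending, {"duplicate_job"}, now, random.Random(0))
print([job["hook"] for job in plan])
```

Spreading the overdue jobs across a window avoids having them all fire at the same moment right after the switch.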

I'll likely schedule an evening this week to do the migration.

#21 Updated by Boone Gorges over 3 years ago

  • Related to Bug #9865: Broken Link Checker cron jobs running long added

#22 Updated by Boone Gorges over 3 years ago

  • Related to Bug #6737: wp-rss-multi-importer cron job requires huge number of DB queries added

#23 Updated by Boone Gorges over 3 years ago

Update: I'm planning to do the switch in production tonight at 10pm EDT. There should be no downtime, but just in case, I've posted a status update to our blog: https://wordpress.com/post/commonsstatus.wordpress.com/87

#24 Updated by Boone Gorges over 3 years ago

  • Related to Bug #9926: twitter-mentions-as-comments cron jobs can run long added

#25 Updated by Boone Gorges over 3 years ago

Update: I did the migration tonight and things seem to be running well. There was a resource spike for a few minutes after the switch as Cavalcade tried to keep up with all the old (failed) scheduled tasks. But once it caught up, it's only firing a handful of jobs per minute. And note that this is a handful of individual jobs, rather than the wp-cron system of global locks.

The migration task for existing jobs takes a long time to run, so I've put it into a background process and will check in the morning to ensure it's finished.

For the time being, the cavalcade process is running in the background on lwb1 and lwb2:

[CAC lw3b] ~ [1000]$ ps aux | grep cavalca
bgorges  23059  0.2  0.0 417416 19092 ?        S    23:01   0:03 php /usr/local/bin/cavalcade .

[CAC lw3a] ~ [1000]$ ps aux | grep cavalca
bgorges   12454  0.1  0.0 417040 18464 pts/0    S    22:59   0:02 php /usr/local/bin/cavalcade /var/www/html/commons/www

I'm running them via nohup at the moment so that I'm able to kill/restart if necessary. After monitoring for a few days, I'll have Lihua turn them on via systemd.

#26 Updated by Matt Gold over 3 years ago

This sounds promising. Thanks so much, Boone

#27 Updated by Boone Gorges over 3 years ago

  • Related to Bug #9929: External Group Blogs cron review added

#28 Updated by Boone Gorges over 3 years ago

  • Related to Bug #9930: wp_privacy_delete_old_export_files runs a bazillion times added

#29 Updated by Boone Gorges over 3 years ago

I've spent the morning cleaning up many out-of-control cron jobs. Some relevant changesets:

https://github.com/cuny-academic-commons/cac/commit/0fa94e53bedf9f590b000f713bb29001c5beb10a
https://github.com/cuny-academic-commons/cac/commit/66ada4a1de8179b7f923ab363a29d76e348b7e91
https://github.com/cuny-academic-commons/cac/commit/65fa1fd9ee9ffe2cf1b5157247f3f3fc2cdbe07a
https://github.com/cuny-academic-commons/cac/commit/5a5d466950a4d2df4fdab9ba6407333e5480fc03
https://github.com/cuny-academic-commons/cac/commit/472a782752e94b2ae6cd76ff42fdf9cc06403f22
https://github.com/cuny-academic-commons/cac/commit/942e9a3db7770075319aa90612b25efaa621a486
https://github.com/cuny-academic-commons/cac/commit/8d204f230058d70208d62beaa022f2e2e3f27a9b

Most of these are accompanied by some manual mods to existing cron jobs in the wp_cavalcade_jobs table.

I've identified a BuddyPress-related job that would benefit from an upstream fix: https://buddypress.trac.wordpress.org/ticket/7904#ticket. There's a temporary fix on the Commons in https://github.com/cuny-academic-commons/cac/commit/e1d28c57988a1e40fa24c6f285869fbbbcc53e76

Cavalcade jobs are way way down now - between 20 and 30 per minute - and resource utilization (load average, etc) is down at a reasonable level. I'm going to stop working on this for today and monitor behavior for the rest of the day.

#30 Updated by Raymond Hoh over 3 years ago

I've been taking a look at the recent commits and I have to say thanks for all your detective work in hunting these cron jobs down, Boone!

#31 Updated by Boone Gorges over 3 years ago

Thanks, Ray!

It's an interesting problem. With wp-cron, cron jobs are liable to run very long and lock up the system, because so many of them get crowded together into a single request. Once the jobs are split into smaller units, this problem no longer happens, but now we suddenly have more specific instances of jobs running. And the Cavalcade model is that one request gets fired for each job, which is to say that WP gets loaded separately for each job, and as such, many small cron jobs can add up to an increase in resources if you're not careful.

The flip side of this is that many plugins err on the side of having their jobs run too frequently because wp-cron is so terrible, and because adding another cron job on top of a zillion others doesn't have much of a perceptible effect. But once the jobs are broken up, they run much more reliably, and as a result it's not necessary to run them so frequently.
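A back-of-the-envelope sketch of the trade-off, assuming each job run costs one full WordPress bootstrap. All numbers (site count, bootstrap time, frequencies) are hypothetical and chosen only to show how the cost scales with frequency.

```python
def bootstrap_seconds_per_hour(
    site_count: int, runs_per_hour: int, bootstrap_seconds: float
) -> float:
    """Total time spent loading WordPress per hour for one recurring job
    that exists on every site, when each run is its own request."""
    return site_count * runs_per_hour * bootstrap_seconds


# Hypothetical: 1000 sites, 0.5 s to load WordPress.
every_five_minutes = bootstrap_seconds_per_hour(1000, 12, 0.5)
hourly = bootstrap_seconds_per_hour(1000, 1, 0.5)

print(every_five_minutes)  # bootstrap cost when the job runs 12 times an hour
print(hourly)              # bootstrap cost when the job runs once an hour
```

Under these assumptions, dropping a job from every five minutes to hourly cuts its bootstrap overhead by a factor of twelve.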

Anyway, I'm going to continue to work on this, and I may write something up when I'm done.

#33 Updated by Boone Gorges about 3 years ago

  • Tracker changed from Bug to Feature
  • Status changed from New to Resolved

As of last week, Cavalcade is running as a daemon on the production sites. Yesterday, we had our first real test of daemon restarts, and while it took about 12 hours, systemd did manage to restart the service. I'll work with Lihua to identify why it takes so long.

This ticket has uncovered a bunch of imperfections in cron routines throughout the Commons, but these are independent bugs that have always existed and have just been brought to light by the current change.

I'll do a writeup on the Cavalcade migration process - tips and pitfalls, etc - sometime in July.

As for this ticket, I'm going to mark Resolved as the initial implementation is complete.
