Project

General

Profile

Bug #14199

Email replies delayed

Added by Matt Gold 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority name:
Normal
Assignee:
Category name:
Email Notifications
Target version:
-
Start date:
2021-03-18
Due date:
% Done:

0%

Estimated time:

Description

Hi Ray, can you please check the email reply queue? I think it's stuck


Related issues

Related to CUNY Academic Commons - Feature #12440: Upgrade CavalcadeResolved2020-02-18

History

#1 Updated by Raymond Hoh 4 months ago

Hi Matt,

Just got your email. You're correct that the email queue was stuck. I usually receive emails whenever the IMAP connection is down, but did not receive one because our scheduling system, Cavalcade, did not fire RBE's bi-hourly task to check if the queue was down for the past week.

I have restarted RBE's IMAP connection.


Boone, there are a lot of tasks with the waiting status in Cavalcade. It looks like this might have started around March 11th if you check "wp cavalcade jobs --site=1 --status=waiting --limit=2000".

I just checked the Cavalcade log to see what might be causing this and there are many "Out of workers" log messages. So next, I checked which tasks are currently running with "wp cavalcade jobs --status=running --limit=2000" and there are quite a few tasks there.

Does Cavalcade need to be restarted?

#2 Updated by Boone Gorges 4 months ago

There's a lot of weird stuff happening in Cavalcade and I'm not sure how to interpret it.

First, I don't think that the 'out of workers' log entries are a concern. I tailed the log for a while, and it looks like plenty of jobs are able to run. You see the 'out of workers' error only sporadically, when some jobs take longer than others and gum up the works. But I think this is normal behavior.

In general, I have a vague sense that Cavalcade is kicking off certain jobs, which marks the jobs as 'running' in the db, but then for one reason or another those jobs are never switched to 'completed' (or 'waiting' for repeating tasks). So you end up with lots of items that are marked 'running', which probably messes to some extent with the cavalcade-runner's "show me the next job to run" query.

In addition, and perhaps in relation to the previous point, certain duplicate jobs have a tendency to be created in 'waiting' status. So you end up with many versions of the same job, going back weeks or months, which never run and are piled up. This may also interfere with the "show me what to run next" query, since the runner tries to get the oldest pending job; but if that job won't run (or won't finish running) for whatever reason, then Cavalcade will try and fail over and over again, and never get to the next item on the list.

I don't have a clear sense of how to test this hypothesis. I want to be conservative about clearing stuff out manually. I took one small step that I think is pretty conservative, where I identified jobs that appear multiple times in the jobs table (where "duplicates" are those that share the same site+hook+args), and I deleted all but the newest:

 12 $dupes = $wpdb->get_results( "select *, count(site), count(hook), count(args) fr    om wp_cavalcade_jobs group by site, hook, args having count(site) > 1 and count(    hook) > 1 and count(args) > 1" );
 13 foreach ( $dupes as $dupe ) {
 14     $items = $wpdb->get_results( $wpdb->prepare( "SELECT * FROM wp_cavalcade_job    s WHERE site = %d AND hook = %s AND args = %s ORDER BY id DESC", $dupe->site, $d    upe->hook, $dupe->args ) );
 15 
 16     $to_keep_id = $items[0]->id;
 17 
 18     $wpdb->query( $wpdb->prepare( "DELETE FROM wp_cavalcade_jobs WHERE site = %d     AND hook = %s AND args = %s AND id != %s", $dupe->site, $dupe->hook, $dupe->arg    s, $to_keep_id ) );
 19 
 20
 21 }                                                                               
 22 
 23 wp_cache_delete( 'jobs', 'cavalcade-jobs' );

Maybe this will clear some of the unrunnable backlog.

#3 Updated by Raymond Hoh 4 months ago

In addition, and perhaps in relation to the previous point, certain duplicate jobs have a tendency to be created in 'waiting' status.

See #13466.


I tried removing RBE's bp_rbe_schedule event and re-added it, but after the scheduled time, the task is still marked as waiting with the nextrun timestamp being the same as the start timestamp.

It just occurred to me that daylight savings time happened over the weekend, which might be related to this problem. I believe this is a reschedule bug with the Cavalcade Runner, which is described here: https://github.com/humanmade/Cavalcade/issues/74#issuecomment-549148047

And addressed with this PR: https://github.com/humanmade/Cavalcade-Runner/pull/64/files

Also related: https://github.com/humanmade/Cavalcade-Runner/issues/51

#4 Updated by Boone Gorges 4 months ago

Ah, good find, Ray. Sounds like the DST issue is probably the one that we're experiencing.

Lihua would have to be responsible for upgrading our instance of cavalcade-runner. My feeling is that we should probably upgrade Cavalcade (mu-plugins) at the same time. I've looked briefly at the changelogs and I don't see any obvious reason why we shouldn't be able to do this. Could you also have a look and let me know what you think? https://github.com/humanmade/Cavalcade-Runner/commits/master https://github.com/humanmade/Cavalcade/commits/master

It looks like Cavalcade has made a number of database query changes since we initially installed. They introduced an upgrade routine https://github.com/humanmade/Cavalcade/blob/master/inc/upgrade/namespace.php but it depends on the cavalcade_db_version database option, which we don't have - I think that our installation predates this. I think this will be fine because the (int) get_site_option() call will result in a 0.

The prudent path forward is probably to do an upgrade of both packages on LDV2 and do some brief testing there. I'll get in touch with Lihua to start this process. Let's move this discussion over to #12440.

For this specific ticket, is there anything else that needs to be done? Ray, I assume you've restarted the RBE queue?

#5 Updated by Raymond Hoh 4 months ago

#6 Updated by Raymond Hoh 4 months ago

  • Status changed from Assigned to Resolved

Could you also have a look and let me know what you think?

In the latest updates for Cavalcade, they added an index for nextrun, but then removed the nextrun index in the most recent version due to performance issues. So when we upgrade, Cavalcade's upgrade routine will add and drop the same nextrun index. rmmcue notes that this could take awhile depending on the size of the tables: https://github.com/humanmade/Cavalcade/pull/107#discussion_r538441971.

Maybe we can just wipe out the upgrade_to_database_4() function so the DROP INDEX command does not run since we're on an older version that predates a database version: https://github.com/humanmade/Cavalcade/blob/e8b1e9a08d242559f82fd9a0eb59e3ea2ef968f0/inc/upgrade/namespace.php#L96-L101

For this specific ticket, is there anything else that needs to be done? Ray, I assume you've restarted the RBE queue?

I've already restarted the RBE queue. Yes, let's handle the Cavalcade specifics in #12440.

#7 Updated by Raymond Hoh 4 months ago

I updated the plugin portion of Cavalcade on production last night. We had a few additional indexes on our wp_cavalcade_jobs table that are not part of Cavalcade's database schema, so I removed them.

I just pinged Lihua so the runner portion will be updated sometime later today.

Also available in: Atom PDF