Project

General

Profile

Actions

Bug #21886

open

Diagnose problems with action_scheduler tables

Added by Boone Gorges 12 days ago. Updated 1 day ago.

Status:
New
Priority name:
Normal
Assignee:
Category name:
-
Target version:
Start date:
2025-01-27
Due date:
% Done:

0%

Estimated time:
Deployment actions:

Description

During the Reclaim migration, we had some problems with database replication. These problems seem to stem from tables related to the action_scheduler library that's used by a number of WP plugins. Log output from Reclaim:

Jan 09 18:08:20 node15377-cunyacprod-1.us.reclaim.cloud mariadbd[110663]: 2025-01-09 18:08:20 1131900 [ERROR] InnoDB: Refusing to load './wp_1153337/wp_site.ibd' (id=10514, flags=0x15); dictionary contains id=10504, flags=0x15
Jan 09 18:08:20 node15377-cunyacprod-1.us.reclaim.cloud mariadbd[110663]: 2025-01-09 18:08:20 1131900 [ERROR] InnoDB: Could not find a valid tablespace file for wp_1153337/wp_site. Please refer to https://mariadb.com/kb/en/innodb-data-dictionary-troubleshooting/ for how to resolve the issue.
Jan 09 18:08:55 node15377-cunyacprod-1.us.reclaim.cloud mariadbd[110663]: 2025-01-09 18:08:55 1131941 [ERROR] InnoDB: Refusing to load './wp_1153337/wp_7971_actionscheduler_groups.ibd' (id=488990, flags=0x15); dictionary contains id=488979, flags=0x15
Jan 09 18:08:55 node15377-cunyacprod-1.us.reclaim.cloud mariadbd[110663]: 2025-01-09 18:08:55 1131941 [ERROR] InnoDB: Could not find a valid tablespace file for wp_1153337/wp_7971_actionscheduler_groups. Please refer to https://mariadb.com/kb/en/innodb-data-dictionary-troubleshooting/ for how to resolve the issue.

Examining some problematic tables, you can see that MySQL thinks that they exist, but it's not possible to access them due to "tablespace is missing" errors:

MariaDB [wp_1153337]> check table wp_10360_actionscheduler_groups;
+--------------------------------------------+-------+----------+------------------------------------------------------------------------------+
| Table                                      | Op    | Msg_type | Msg_text                                                                     |
+--------------------------------------------+-------+----------+------------------------------------------------------------------------------+
| wp_1153337.wp_10360_actionscheduler_groups | check | Warning  | Tablespace is missing for table 'wp_1153337/wp_10360_actionscheduler_groups' |
| wp_1153337.wp_10360_actionscheduler_groups | check | Error    | Got error 194 "Tablespace is missing for a table" from storage engine InnoDB |
| wp_1153337.wp_10360_actionscheduler_groups | check | error    | Corrupt                                                                      |
+--------------------------------------------+-------+----------+------------------------------------------------------------------------------+
3 rows in set (0.000 sec)

MariaDB [wp_1153337]> select count(*) from wp_10360_actionscheduler_groups;
ERROR 1030 (HY000): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
MariaDB [wp_1153337]> show tables like 'wp_10360_%';
+-----------------------------------+
| Tables_in_wp_1153337 (wp_10360_%) |
+-----------------------------------+
| wp_10360_actionscheduler_actions  |
| wp_10360_actionscheduler_claims   |
| wp_10360_actionscheduler_groups   |
| wp_10360_actionscheduler_logs     |
| wp_10360_commentmeta              |
| wp_10360_comments                 |
| wp_10360_links                    |
| wp_10360_options                  |
| wp_10360_postmeta                 |
| wp_10360_posts                    |
| wp_10360_term_relationships       |
| wp_10360_term_taxonomy            |
| wp_10360_termmeta                 |
| wp_10360_terms                    |
+-----------------------------------+
14 rows in set (1.034 sec)

We need to track down the issue and figure out a way to patch the plugins using actionscheduler so that this does not happen in the future, because currently it's breaking the entire replication process.

Actions #1

Updated by Boone Gorges 9 days ago

I've spent most of the day on this today and I still have no idea what's happening. Going to record my thoughts, but then I have to leave it behind for the time being.

- I ran a script to identify affected sites. I've attached the script, as well as a list of confirmed-bad tables.
- The numerically highest table here is 35766, a site that was created before the migration to Reclaim. This is not a very reliable data point, since the date of table creation doesn't necessarily match the date of blog creation. (A very old blog might have activated an actionscheduler-using plugin just yesterday.) But it's a data point.
- Affected blog IDs are low and high. Doesn't seem to be any real distribution.
- 150 "groups" tables are bad, while another 20961 are good
- Why do we have so many of these tables? Something is clearly going on, because only a small number of sites have the-events-calendar or a related plugin activated.
- Interestingly, none of the 150 "bad" sites have the-events-calendar or any of the other action-scheduler-using plugins activated. I don't have evidence that the suspiciously high number of AS tables is related to the suspicously corrupted tables, but it feels like there might be some connection.
- I checked to see whether there might be a bad template site, but I confirmed in cac-site-templates and we copy tables from a whitelist. Cloning would not be so selective, but this would not explain 20000 sites.
- I set up a logger to catch the creation of actionscheduler tables. the-events-calendar and action-scheduler are both labrynthine and I find it very confusing to see how they're created.
- I had an extensive discussion with ChatGPT about potential causes. Corrupted .ibd files are almost always the source of this kind of 'tablespace' error, and it seems highly unlikely that a normal WP operation could result in this kind of corruption - it's something that generally happens because the physical file is deleted or broken. One possibility is that some sort of actionscheduler operation, either in the course of normal WP operation or as part of some system routine (like a backup) was taking place during one of our many database crashes on the legacy server. In fact, one theory I had is that some sort of bad actionscheduler query was the cause of our database breakage: the database would lock up due to something weird happening with one of these tables; the automated reboot sequence would kick in; and the table would be corrupted as a result. ChatGPT had a hard time thinking of a specific set of circumstances that would make this happen, and I don't have any data to be able to confirm or deny it.
- I reviewed uninstall and deactivate routines to see if some plugin might be doing something odd that would result in a partially-dropped table. I didn't find anything, but I also find these plugins very hard to understand.

My next step is to wait a day or two to see if AS tables are dropped/created, which should add some diagnostic info to my logs. Based on this, we can think about where to go next.

Actions #2

Updated by Raymond Hoh 6 days ago

Four plugins are currently using Action Scheduler:

-pdf-embedder
-the-events-calendar
-widget-for-eventbrite-api
-wp-mail-smtp

Why do we have so many of these tables? Something is clearly going on, because only a small number of sites have the-events-calendar or a related plugin activated.

but this would not explain 20000 sites

20,000 is way too many! I checked for older plugins that had action-scheduler in their code with git log --follow -- *action-scheduler* and I can only find one: popup-maker. It's not a very popular plugin though as only 45 sites have it currently activated.

I think we could consider deleting the action-scheduler tables for sites that do not have any of the four plugins listed above activated.

Actions #3

Updated by Boone Gorges 6 days ago

I checked my log today and found just one set of actionscheduler tables being created since I put the logger in place. It was the clone of a site that had the tables. IMO this is a situation we shouldn't be touching for now.

I think we could consider deleting the action-scheduler tables for sites that do not have any of the four plugins listed above.

Yeah, this would be a helpful starting point. I just didn't want to take that step without having a sense of how the tables got created in the first place.

I decided to look at sites created since the migration that have actionscheduler tables. There are 29. I then cross-referenced against the list of plugins containing the actionscheduler libraries:

MariaDB [wp_1153337]> select * from wp_network_plugins where site_id in ( 36745, 36753, 36763, 36767, 36769, 36917, 36919, 36941, 36973, 36977, 36979, 36981, 36987, 36989, 37005, 37007, 37015, 37017, 37023, 37025, 37027, 37029, 37031, 37033, 37043, 37049, 37069, 37115, 37119 ) and ( plugin like '%pdf-embedder%'  OR plugin like '%the-events-calendar%' OR plugin like '%widget-for-eventbrite-api%' OR plugin like '%wp-mail-smtp%' );
+--------+---------+---------------------------------------------+
| id     | site_id | plugin                                      |
+--------+---------+---------------------------------------------+
| 121311 |   36763 | pdf-embedder/pdf_embedder.php               |
| 122757 |   36973 | pdf-embedder/pdf_embedder.php               |
| 122815 |   36977 | pdf-embedder/pdf_embedder.php               |
| 122887 |   36979 | pdf-embedder/pdf_embedder.php               |
| 122909 |   36979 | the-events-calendar/the-events-calendar.php |
| 122973 |   36981 | pdf-embedder/pdf_embedder.php               |
| 122995 |   36981 | the-events-calendar/the-events-calendar.php |
| 123371 |   37023 | pdf-embedder/pdf_embedder.php               |
| 123447 |   37027 | pdf-embedder/pdf_embedder.php               |
| 123521 |   37029 | pdf-embedder/pdf_embedder.php               |
| 123595 |   37031 | pdf-embedder/pdf_embedder.php               |
| 123669 |   37033 | pdf-embedder/pdf_embedder.php               |
| 123901 |   37069 | the-events-calendar/the-events-calendar.php |
| 124021 |   37119 | pdf-embedder/pdf_embedder.php               |
+--------+---------+---------------------------------------------+
14 rows in set (0.002 sec)

14/29 is not really definitive, but it does suggest that the tables are coming from somewhere unwanted. It seems unlikely that 15 sites would have activated one of the plugins (thus creating the tables) and then deactivated.

Ray, unless you have a better idea, let's wait another few days to see if more instances appear in the logs.

Actions #4

Updated by Raymond Hoh 6 days ago

pdf-embedder shouldn't be using action-scheduler unless telemetry is opted-in by the site admin (see #20088). According to their changelog, v4.8.0 should have ensured that pdf-embedder is only loading the action-scheduler library when necessary, but it looks like this isn't working correctly based on the latest action-scheduler site DB tables. Will have to look into this.

I checked my log today and found just one set of actionscheduler tables being created since I put the logger in place. It was the clone of a site that had the tables.

Interesting about the clone. Like you said, let's wait for more instances of this occurring, but the site cloning would explain the unaccounted numbers (although it would be a huge number!).

Actions #5

Updated by Raymond Hoh 4 days ago

pdf-embedder shouldn't be using action-scheduler unless telemetry is opted-in by the site admin (see #20088). According to their changelog, v4.8.0 should have ensured that pdf-embedder is only loading the action-scheduler library when necessary, but it looks like this isn't working correctly based on the latest action-scheduler site DB tables. Will have to look into this.

I can confirm that pdf-embedder is working as intended; Action Scheduler DB tables are only installed if the usage tracking option is turned on. In 2.5.x branch, I've forced pdf-embedder's usage tracking option to always be off in https://github.com/cuny-academic-commons/cac/commit/36da93e3adbd4a8fc66db0b31555639ae2191035. This is to prevent future pdf-embedder installations from triggering the creation of the Action Scheduler DB tables.

Just a note for Boone when we decide to delete all Action Scheduler DB tables from sites that do not have an AS plugin enabled, the schema-ActionScheduler_StoreSchema and schema-ActionScheduler_LoggerSchema options will also need to be deleted. These are markers that Action Scheduler uses to determine whether or not to install their DB tables. So if a site decides to reinstall one of the AS plugins, this will allow AS's tables to be created.

Actions #6

Updated by Raymond Hoh 4 days ago

I checked my log today and found just one set of actionscheduler tables being created since I put the logger in place. It was the clone of a site that had the tables.

I was trying to determine the number of cloned sites we have, but I noticed that we do not keep track of whether a site is a clone in the cac_handle_site_clone() function. We should keep track of cloned sites like we do for groups in the cac_handle_group_clone() function and use either the wp_blogmeta or wp_bp_user_blogs_blogmeta DB table to record the site ID of the parent site.

Actions #7

Updated by Jeremy Felt 4 days ago

Some additional, likely not consequential data...

  • The current release of Action Scheduler is 3.9.2
  • The Events Calendar uses 3.8.1 (3.7.4, 3.7.1, 3.6.4, 3.6.0, ..... previously)
  • PDF Embedder uses 3.7.4 (3.9.0 in latest release, 3.7.2 originally)
  • WP Mail SMTP uses 3.7.4 (3.7.1, 3.6.1, 3.5.4, 3.4.2, 3.4.0, 3.2.1, and 3.1.6 previously)
  • Widget for Eventbrite API uses 3.4.2

There seem to have been schema changes in Action Scheduler 3.3.0, 3.4.0, and 3.6.0. I could imagine a series of events where conflicting schema updates ended up corrupting something, but that's a stretch here - at first glance, the schema updates weren't complicated.

The many thousands of tables on sites without any of these plugins activated makes me suspicious of a switch_to_blog issue somewhere. Related: https://redmine.gc.cuny.edu/issues/15194

This code is all very complicated, but it seems like action_scheduler_before_schema_update is an action that fires whenever tables are created or the schema is updated. We could use that to backtrace the chain of code that led to that action as well as the status of the _wp_switched_stack global to see if/when it happens in a switched context.

Actions #8

Updated by Boone Gorges 4 days ago

Thanks to both of you for your thoughts.

In 2.5.x branch, I've forced pdf-embedder's usage tracking option to always be off

Good call. This is deployed.

Just a note for Boone when we decide to delete all Action Scheduler DB tables from sites that do not have an AS plugin enabled, the schema-ActionScheduler_StoreSchema and schema-ActionScheduler_LoggerSchema options will also need to be deleted. These are markers that Action Scheduler uses to determine whether or not to install their DB tables. So if a site decides to reinstall one of the AS plugins, this will allow AS's tables to be created.

Thanks - I'll circle back to this when we're ready to make that move.

I was trying to determine the number of cloned sites we have, but I noticed that we do not keep track of whether a site is a clone in the cac_handle_site_clone() function. We should keep track of cloned sites like we do for groups in the cac_handle_group_clone() function and use either the wp_blogmeta or wp_bp_user_blogs_blogmeta DB table to record the site ID of the parent site.

I noticed this myself last week but didn't take the time to fix it. Added in https://github.com/cuny-academic-commons/cac/commit/70f7bb4a79d66ba7c9b9b727e96176cccf80b75f

The many thousands of tables on sites without any of these plugins activated makes me suspicious of a switch_to_blog issue somewhere. Related: https://redmine.gc.cuny.edu/issues/15194

Yeah, it's a possibility. Here's a very small data point, though. It appears that 9 sites have had actionscheduler tables created since 2025-01-30 (the first one that was caught by my logger). One of them was one that I manually triggered using a CLI command. The other 8 were clones. I can tell because those are created using "CREATE TABLE ... LIKE ..." https://github.com/cuny-academic-commons/cac/blob/70f7bb4a79d66ba7c9b9b727e96176cccf80b75f/wp-content/plugins/bp-custom.php#L3107 So I don't yet have a single instance of actionscheduler tables being created by actionscheduler itself. We only introduced this type of site cloning in the last couple years, so I don't think it explains the many thousands of tables, but perhaps it's a bigger source that we've previously appreciated.

On that note, perhaps we should add some logic to the cloning process that skips the cloning of database tables related to plugins that are no longer active. This would have to be a manually-created whitelist:

if ( gravityforms is not active ) {
  exclude [ gf_entry, gf_entry_meta, gf_revisions ...]
}

if ( none of the actionscheduler plugins is active ) {
  exclude [ actionscheduler_groups, actionscheduler_actions ... ]
}

etc

This would get ugly, and it would take maintenance (plugins occasionally change their schemas), and it could cause very rare issues where the admin of a cloned site expects to be able to reactivate a plugin and see old data. On the flip side, it would cut down meaningfully on the creation of garbage tables. What do you think?

Actions #9

Updated by Jeremy Felt 3 days ago

perhaps we should add some logic to the cloning process that skips the cloning of database tables related to plugins that are no longer active

I think this makes sense.

Actions #10

Updated by Boone Gorges 1 day ago

wp-smtp is no longer used on the Commons, so I removed it and we don't need to check for it https://github.com/cuny-academic-commons/cac/commit/171058a2d42a4e88292541d46d7b082e38660304

While investigating the issue, I found a bug that was causing too many tables to be queried for clone in some cases. See https://github.com/cuny-academic-commons/cac/commit/8c849bd7561052fa398809e969ba4a3f0ad15857. If you were trying to clone a site with a 4-digit blog ID, you'd get 10x too many tables, and God help you if you were trying to clone an even-older site.

Regarding the idea about skipping tables during clone, here's a framework: https://github.com/cuny-academic-commons/cac/commit/4f734a38f9d9a4a090b1861b0ba86a453ff34999 What do others think of the approach?

Actions

Also available in: Atom PDF