Project

General

Profile

Actions

Bug #19842

closed

commons is loading very slowly

Added by Marilyn Weber 9 months ago. Updated 9 months ago.

Status:
Resolved
Priority name:
Immediate
Assignee:
-
Category name:
-
Target version:
Start date:
2024-02-27
Due date:
% Done:

0%

Estimated time:
Deployment actions:

Description

noticed this on my PC using Chrome and Firefox, confirmed by a student on a MAC with Safari


Files

clipboard-202402271257-sgocj.png (48.4 KB) clipboard-202402271257-sgocj.png Marilyn Weber, 2024-02-27 12:57 PM

Related issues

Has duplicate CUNY Academic Commons - Bug #19853: new problem commons.gc.cuny.edu took too long to respondDuplicate2024-02-28

Actions
Actions #1

Updated by Marilyn Weber 9 months ago

Now down -

Actions #2

Updated by Colin McDonald 9 months ago

Thanks Marilyn, IT is aware. The site database actually restarted this morning, and then the Commons came up briefly about ten minutes ago, and then the database restarted again. A restart can take 1.5 to 2 hours, though I'm not sure about a restart AFTER a restart. We're all monitoring too and can use this ticket for updates.

Actions #3

Updated by Raymond Hoh 9 months ago

Boone, I'm seeing some fatal errors in the PHP log related to nextgen-gallery:

[Tue Feb 27 12:57:21 2024] [error] [pid 40024] sapi_apache2.c(349): [client 54.232.116.4:54048] PHP Fatal error:  Uncaught Error: [] operator not supported for strings in /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/Util/Transient.php:142

Stack trace:
#0 /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/Util/Transient.php(162): Imagely\\NGG\\Util\\Transient->_track_key()
#1 /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/Util/Transient.php(250): Imagely\\NGG\\Util\\Transient->set()
#2 /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/DataMapper/DriverBase.php(89): Imagely\\NGG\\Util\\Transient::update()
#3 /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/DataMapper/DriverBase.php(63): Imagely\\NGG\\DataMapper\\DriverBase->update_columns_cache()
#4 /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/DataMapper/WPPostDriver.php(31): Imagely\\NGG\\DataMapper\\DriverBase->lookup_columns()
#5 /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/DataMapper/DriverBase.php(23): Imagely\\NGG\\DataMapper\\WPPostDriver->lookup_columns()
#6 /va in /var/www/html/commons/www/wp-content/plugins/nextgen-gallery/src/Util/Transient.php on line 142

The 3rd stack trace does a call to "SHOW COLUMNS", while the last stack trace references a change we made to address nextgen's database problems in #18098: https://github.com/cuny-academic-commons/cac/commit/a75eafb4b19308fe2a728804ac7aff3d9556cf2c . I'm not sure whether this part of the code is related to today's performance issues, however we did update nextgen-gallery in today's maintenance release.

In the latest nextgen-gallery update, I do see an added call for opcache_reset(): https://github.com/cuny-academic-commons/cac/commit/02cc1d7a6ddada002f6b92044df654070546bbbe#diff-9c567b0dbdd917a47a33bd931d78905012a1a47ba0297cabb98ba60aad50dff2 . This looks like it would affect the performance of our install. Perhaps we can try reverting the update to nextgen-gallery? https://github.com/cuny-academic-commons/cac/commit/02cc1d7a6ddada002f6b92044df654070546bbbe

Actions #4

Updated by Boone Gorges 9 months ago

Wow, I don't know how you managed to stumble on that opcode_reset() call, but it absolutely looks problematic. I've reverted to the old version in https://github.com/cuny-academic-commons/cac/commit/79b711c71285f3187717deee6ba225abb965c934, a change which has been deployed to the production site for whenever it decides to come back to life.

For good measure, I've also disabled the nextgen-gallery cache completely, using define( 'PHOTOCRATI_CACHE', false ); in cac-env-config.php. This avoids the whole issue of the cache class calling SHOW COLUMNS.

Actions #5

Updated by Boone Gorges 9 months ago

Ray, the site briefly came online again, but things are loading very slowly and it appears that the db server is going to crash again.

I'm considering rolling this morning's release back altogether to see if things stabilize the next time the db comes back online. What do you think?

Actions #6

Updated by Raymond Hoh 9 months ago

I think at this point, we have to consider rolling back to see if the problem stems from a code update.

Actions #7

Updated by Boone Gorges 9 months ago

Agreed. I've done this by checking out the last tag (2.3.2) directly on the server:

git checkout 2.3.2

rather than rolling back on the master branch.

If this solves the problem on the next reboot, then we can talk about next steps.

Actions #8

Updated by Boone Gorges 9 months ago

The database finally restarted about 15 minutes ago. The site was continuing to run extremely slowly at first, but now seems to be stabilizing somewhat. Let's keep an eye out over the next hour or two.

Actions #9

Updated by Boone Gorges 9 months ago

This evening I've been picking through the changelog to narrow down the potential problems.

I created a new branch 2.3.x-debug, created from the 2.3.2 tag. I switched the production site over to this new branch. Then, I got a list of the changesets to be processed:

git log --pretty=oneline 2.3.2..master

I skipped the BP-12-related commits, and began with the plugin/theme updates. I went through most of the minor ones (which came first) and got about halfway through the major ones. I'm pausing for the night and will continue tomorrow when I have the time. Here's my status log (items in the top section are yet to be processed; I'm doing them from bottom up, then moving them to the bottom of the OK list):

a59814fbe7bf40a16abd2afacc1d9532054f2c94 Update theme: twentytwentyone.
a284d127e47bb8c492d91814795960b17c257556 Update theme: twentytwenty.
570ab3c2c2896fdcd99a36f4c935c31cd0e7bea6 Update theme: twentyseventeen.
d8540de2b1c2b4b7e2a3b4acdb907de01d5d2b92 Update theme: sydney.
6466e6409943f6f69b062af571d35e16195654d6 Update theme: neve.
cce1129f9e09019a206e8078c783e819458afec6 Update plugin: wordpress-seo.
76f9781ebfeed6a3f1b84d6134039bd5a5eb4e02 Update plugin: wp-to-twitter.
2a0d6db63bbedbbfebfb506949fc662e8d2da4e4 Update plugin: twentig.
213c7607ba4a1811b15380e1980a21b78e9b78b4 Update plugin: the-events-calendar.
24a40ff4bd8c2b3472229d6ea65a8a2c10e83df4 Update plugin: so-widgets-bundle.
e827d8cb820aaafb12c25b30aee58431ab5814eb Update plugin: google-calendar-events.
d0691bbc5cdeca5376bd5401cc524a53daa22c54 Update plugin: share-this-image.
02cc1d7a6ddada002f6b92044df654070546bbbe Update plugin: nextgen-gallery.
9b957764ec7b38062d4732694f569be2eeddd77d Update plugin: link-library.

= OK =
b2f719dba15bddfe88b9ee1c7361e02e814a87c7 Update plugin: interactive-3d-flipbook-powered-physics-engine.
8c0e378f6364a0db004f04ba2dbb3ae20f975f06 Update plugin: download-manager.
f4f523049a4eafcfcbae5167a5f1e1b517f7aeaa Update plugin: draw-attention.
9e6f6556aae1ccae9c873259b55c5b211c07af35 Update plugin: easy-appointments.
6a632111b2a7ffe1f31409219c7044e69615c4df Update plugin: embedpress.
c58dcd19e1af76a18ae65a931d23bf6306644337 Update plugin: event-organiser.
037f67d479c9e3222742e9ca2b6ad41ce219c015 Update plugin: otter-blocks.
6da18fb77046429e72b7877ce83be55ccd323d1f Update plugin: content-protector.
adc84e38da7f78f45171577bd8e3238563ad4fd7 Update plugin: password-protected.
fcd10d7568d3a12bfff70d3792428e079b85603e Update plugin: shortcodes-ultimate.
f9a0deabefb42c7ec169b404ad3724d01f32a07d Update plugin: custom-facebook-feed.
a33ad14c1c92633e8e7f1771990b79bd6d667fe4 Update plugin: team-members.
673c15e713eecffc7303f048af2cd9ebeba31253 Update plugin: user-switching.
1deb032cb778dccb0990e0664b0e79a623ac0b67 Update plugin: yet-another-related-posts-plugin.
911ae965e46a7cb5000c46cd68a4602bedd37782 Update theme: colormag.
a5f23d16662c0d927b65a1eca550939e803f9be7 Update theme: raft.
57e12d95ca80b1f25d9c78e9f3c0a5d367bcada4 Update plugin: better-search.
c168efaa525211f97c7ae56e0760d56f219b3e84 Update plugin: block-visibility.
46327987c40d4ebb532172ebd73a81ff50a8ab5b Update plugin: business-directory-plugin.
426e19d5e85d2a258f83c25a10840f0bee1948a2 Update plugin: event-tickets.
97623403dc5a42cdc81bd40f5bd9832a369ac0d5 Update plugin: filter-everything.
e1d5523b3eb9d6663b3964a7466ca55d668ff826 Update plugin: gutenberg.
1c4d226cb3dbfad387843b388932630614eafb44 Update plugin: hypothesis.
d8d73d2d74484d674f66de65334641c10a147bb9 Update plugin: jetpack.
d232fc1fa840ef9700c92b379a7db62409134995 Update plugin: jetpack-boost.
fa1a3880605f13e3fe123cb8665397aa780ae497 Update plugin: kadence-blocks.

I suspect that the problematic commit is one of the plugin updates I've yet to process. If it's not, it suggests that BP 12 and related commits are in fact the problem. I'm hopeful we don't get to that point, because it's hard to pick those commits apart from each other.

Actions #10

Updated by Raymond Hoh 9 months ago

Thanks for going through the commits one-by-one, Boone.

Here are a few more commits in 2.3.x that will also need to be processed:

b95270cda7539b81be202e69c6288cea6eb17cab Update plugin: events-calendar-pro
40de3e82b2f44b088819ac0f7246a91066f007ce Update plugin: bridge-core
e6ba9827914cf15cd9fe8f5ddf6538c8a4f77b43 Update plugin: bridge
Actions #11

Updated by Boone Gorges 9 months ago

It looks tentatively like the wordpress-seo update e54cca498f6d7c907a24edacef1de8d61f236c7f is the culprit. I put it on the site momentarily and pulled it as soon as I noticed a performance problem, but it might not have been fast enough to prevent the site from being unusable. We'll see in the next few minutes.

Actions #12

Updated by Boone Gorges 9 months ago

  • Has duplicate Bug #19853: new problem commons.gc.cuny.edu took too long to respond added
Actions #13

Updated by Marilyn Weber 9 months ago

It's back up, thanks!

Actions #14

Updated by Raymond Hoh 9 months ago

wordpress-seo is currently at v22.1; the version that broke our install is at v21.9.1.

I took a look at their Github repo and one noticeable issue when upgrading from v21.7 to v21.9 involves a form of migration that potentially adds a ton of database rows: https://github.com/Yoast/wordpress-seo/issues/21076 . Unfortunately, the issue was closed without any further investigation.

Also I see that wordpress-seo does an opcache_reset() call much like nextgen-gallery, but this isn't a new change in wordpress-seo 21.9.1. There is a ticket on Github about this call here: https://github.com/Yoast/wordpress-seo/issues/20654 .

Actions #15

Updated by Boone Gorges 9 months ago

Thanks for researching this, Ray.

I read over the wordpress-seo changeset and I couldn't see anything obvious. But the kinds of updates or migrations that could trigger this might happen in many different ways, which would not necessarily be obvious just by reading a changeset.

The GitHub ticket you linked is interesting and could be related, but the specific code path from the stack trace indicates that it was triggered by an admin visit. I doubt that this specific route is responsible for our crashing problem, since our crashing problem happens almost immediately after deployment. So it must be something that's ultimately hooked to 'init' or something similar.

Anyway, in my test branch I rolled back the wordpress-seo update, and then added some theme updates and deployed them. The database server went down almost immediately after I did so. I'm pretty skeptical that theme updates would cause this sort of problem (at least not with the specific themes that were updated, which I know to be fairly non-invasive) but in an abundance of caution I've rolled back these changes too. The production site now stands like this: on the 2.3.x-debug branch, with the following changesets applied:

b2f719dba15bddfe88b9ee1c7361e02e814a87c7 Update plugin: interactive-3d-flipbook-powered-physics-engine.
8c0e378f6364a0db004f04ba2dbb3ae20f975f06 Update plugin: download-manager.
f4f523049a4eafcfcbae5167a5f1e1b517f7aeaa Update plugin: draw-attention.
9e6f6556aae1ccae9c873259b55c5b211c07af35 Update plugin: easy-appointments.
6a632111b2a7ffe1f31409219c7044e69615c4df Update plugin: embedpress.
c58dcd19e1af76a18ae65a931d23bf6306644337 Update plugin: event-organiser.
037f67d479c9e3222742e9ca2b6ad41ce219c015 Update plugin: otter-blocks.
6da18fb77046429e72b7877ce83be55ccd323d1f Update plugin: content-protector.
adc84e38da7f78f45171577bd8e3238563ad4fd7 Update plugin: password-protected.
fcd10d7568d3a12bfff70d3792428e079b85603e Update plugin: shortcodes-ultimate.
f9a0deabefb42c7ec169b404ad3724d01f32a07d Update plugin: custom-facebook-feed.
a33ad14c1c92633e8e7f1771990b79bd6d667fe4 Update plugin: team-members.
673c15e713eecffc7303f048af2cd9ebeba31253 Update plugin: user-switching.
1deb032cb778dccb0990e0664b0e79a623ac0b67 Update plugin: yet-another-related-posts-plugin.
911ae965e46a7cb5000c46cd68a4602bedd37782 Update theme: colormag.
a5f23d16662c0d927b65a1eca550939e803f9be7 Update theme: raft.
57e12d95ca80b1f25d9c78e9f3c0a5d367bcada4 Update plugin: better-search.
c168efaa525211f97c7ae56e0760d56f219b3e84 Update plugin: block-visibility.
46327987c40d4ebb532172ebd73a81ff50a8ab5b Update plugin: business-directory-plugin.
426e19d5e85d2a258f83c25a10840f0bee1948a2 Update plugin: event-tickets.
97623403dc5a42cdc81bd40f5bd9832a369ac0d5 Update plugin: filter-everything.
e1d5523b3eb9d6663b3964a7466ca55d668ff826 Update plugin: gutenberg.
1c4d226cb3dbfad387843b388932630614eafb44 Update plugin: hypothesis.
d8d73d2d74484d674f66de65334641c10a147bb9 Update plugin: jetpack.
d232fc1fa840ef9700c92b379a7db62409134995 Update plugin: jetpack-boost.
fa1a3880605f13e3fe123cb8665397aa780ae497 Update plugin: kadence-blocks.
9b957764ec7b38062d4732694f569be2eeddd77d Update plugin: link-library.
d0691bbc5cdeca5376bd5401cc524a53daa22c54 Update plugin: share-this-image.
e827d8cb820aaafb12c25b30aee58431ab5814eb Update plugin: google-calendar-events.
24a40ff4bd8c2b3472229d6ea65a8a2c10e83df4 Update plugin: so-widgets-bundle.
213c7607ba4a1811b15380e1980a21b78e9b78b4 Update plugin: the-events-calendar.
2a0d6db63bbedbbfebfb506949fc662e8d2da4e4 Update plugin: twentig.
76f9781ebfeed6a3f1b84d6134039bd5a5eb4e02 Update plugin: wp-to-twitter.

It seems like my attempts to isolate the issue are themselves error-prone, since I thought I'd identified wordpress-seo as the problem, yet the site still went down after rolling that update back. I'm guessing this is because the wordpress-seo effects took some time to take hold, but I'm unsure how to debug this any further.

I'm unsure what to do at this point. For now, I can simply pause, and refrain from updating anything. But this is not sustainable in the long or even the medium term. And without the ability to debug the issue further, or reproduce it in an environment other than production, I don't know how I can narrow down the problem any further. Ray, what do you think?

Actions #16

Updated by Jeremy Felt 9 months ago

This may be absurd, but: can we do something hacky with `SAVEQUERIES` to generate a log of all queries leading up to the issue? Maybe dump the log on something hooked very late to `plugins_loaded` and `init`? That could help identify a flood of expensive requests.

Actions #17

Updated by Raymond Hoh 9 months ago

I remember an email thread about bots hitting the events-manager plugin that caused performance issues. Boone, email subject is titled "INC0080094 (Commons Performance Alerts + site slowness) has additional information" from August 2023.

So I did an access log grep of 'ajaxCalendar' and I'm seeing a ton of GET hits from an Alibaba Cloud IP (47.76.35.19) pinging event-manager URL paths with the "ajaxCalendar=1" query parameter multiple times per second. Can we try blocking this IP?

Update - I've just added an entry in our .htaccess file for the 47.76.35.19 IP for now.

Actions #18

Updated by Boone Gorges 9 months ago

Ray, good call. My previous fix was to discard traffic that was trying to hit /events/calendar/...?ajaxCalendar without a Referer, but I think this approach can be safely expanded to /events/...?ajaxCalendar without a Referer. I've made the change and removed your IP-specific clause, and those requests are now being 403ed as expected.

It's not clear to me that this specific behavior would be responsible for the site crashing after the plugin updates, but certainly it's a good thing to stop it.

This may be absurd, but: can we do something hacky with `SAVEQUERIES` to generate a log of all queries leading up to the issue? Maybe dump the log on something hooked very late to `plugins_loaded` and `init`? That could help identify a flood of expensive requests.

No, this is not absurd. Jeremy, would you draft a small mu-plugin that does this? Then I can determine a time when it will be reasonably safe to deploy it, along with a switch back to the master branch (ie to the 2.3.3 release). I'd let it run just for a few seconds before switching back to the debug branch. The site might crash, but it might give us useful diagnostic info.

Actions #19

Updated by Jeremy Felt 9 months ago

Here are two versions of an mu-plugin: https://gist.github.com/jeremyfelt/307a3eb497443a74c7d411428433fb05

The first logs data as JSON (which I find easily siftable via VisiData). The other logs it as a more raw pipe delimited text file.

This includes the query, the time it took, the calling stack, the domain, and the request URI.

I picked a few actions where I thought something may get hung up, but it's more of a scattered guess. :)

Actions #20

Updated by Boone Gorges 9 months ago

  • Status changed from New to Resolved
  • Target version set to 2.3.3

Jeremy, thanks for this.

This morning I went to run a test using the plugin. I set it up and made sure it was logging properly. Then I enabled it and switched to the master branch. I waited, but the site never slowed down. Whatever had previously been causing the problem doesn't appear to be doing it anymore.

While I'm pleased that the update is complete and the site is available, I don't have a good explanation for what's happened. A few ideas:

1. The 2.3.3 release included a plugin update that initiated a costly database routine on many sites at once. But all of those routines successfully completed, so that when I switched over today, there was no more undue strain on the database.
2. Ray noted earlier that a specific endpoint was receiving huge amounts of traffic, and we put a block in place to prevent the traffic. It's possible that this traffic, combined with some change in the 2.3.3 release, was triggering the outage. With the traffic blocked, the outage did not occur.
3. There was an environmental issue yesterday that isn't in place today. For example, if there was a database backup taking place yesterday at the same time I did the update, that backup, combined with some plugin update, may have overloaded the system somehow. That environmental issue may since have passed. (Note that I don't think that a database backup specifically was the issue, but it could've been something like that.)

I guess I'm going to close this ticket and we can refer back to it in case of future similar issues.

Actions

Also available in: Atom PDF