Bug #20515

closed

Is the Commons Down?

Added by scott voth 21 days ago. Updated 8 days ago.

Status: Resolved
Priority name: Normal
Assignee:
Category name: Server
Target version:
Start date: 2024-06-26
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

I am getting "The CUNY Academic Commons is experiencing technical problems."


Files

Commons load time June 27 2024.mp4 (4.1 MB), Laurie Hurson, 2024-06-27 01:54 PM
Screenshot 2024-06-27 at 1.56.56 PM.png (784 KB), Laurie Hurson, 2024-06-27 01:57 PM
2024-07-02_074445.png (63 KB), Raymond Hoh, 2024-07-02 10:49 AM
CAC load times July 2 2024.mp4 (4.46 MB), Laurie Hurson, 2024-07-02 12:18 PM

Related issues

Related to CUNY Academic Commons - Bug #20520: CAC resposinveness/performance (Duplicate, 2024-06-29)

Actions #1

Updated by scott voth 21 days ago

Site is back up.

Actions #2

Updated by Colin McDonald 21 days ago

I just noticed that it is back down.

Actions #3

Updated by Laurie Hurson 20 days ago

I am seeing that it is down now -- Thursday morning.

Actions #4

Updated by Colin McDonald 20 days ago

Seems to be back up now.

Actions #5

Updated by Laurie Hurson 20 days ago

Just an update: I am still experiencing significant slowness, and have had a few users reach out to me to ask about stability.

Actions #6

Updated by Laurie Hurson 20 days ago

I think something might be going on with the TLS handshake? See the loading info at the bottom of the video: the data transfer just keeps running and TLS is called multiple times.

The info at the bottom also mentioned a data transfer from newgcstudents.commons.gc.cuny.edu once. I don't know why, or whether that's related.

Sorry for the long video

Actions #8

Updated by Boone Gorges 20 days ago

Thanks, Laurie. I've seen this sort of behavior before too, and hypothesized that it had something to do with the TLS handshake. But in those instances, it ended up looking like the TLS stuff was just a symptom of the underlying problem, which is a problem with the database connection.

I am at a loss for how to proceed.
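
One way to check whether the time is going into the TLS handshake itself or into waiting on the backend is to time the phases of a single request separately. The following is only a minimal sketch, not something that was run for this ticket: the function names `time_request_phases` and `slowest_phase` are made up for illustration, and it assumes the site is reachable over HTTPS on port 443.

```python
import socket
import ssl
import time


def time_request_phases(host, path="/", port=443, timeout=30.0):
    """Time one HTTPS request in three phases (all values in seconds).

    Hypothetical helper for illustration; it issues a single bare GET.
    """
    start = time.perf_counter()
    # create_connection resolves the hostname and opens the TCP connection,
    # so DNS lookup time is included in this first phase.
    raw = socket.create_connection((host, port), timeout=timeout)
    connected = time.perf_counter()
    try:
        context = ssl.create_default_context()
        # The TLS handshake is performed inside wrap_socket by default.
        tls = context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
    handshaken = time.perf_counter()
    try:
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        tls.sendall(request.encode("ascii"))
        tls.recv(1)  # block until the first byte of the response arrives
        first_byte = time.perf_counter()
    finally:
        tls.close()
    return {
        "dns_and_tcp_connect": connected - start,
        "tls_handshake": handshaken - connected,
        "first_byte": first_byte - handshaken,
    }


def slowest_phase(timings):
    """Return the name of the phase that took the longest."""
    return max(timings, key=timings.get)
```

Calling `time_request_phases("commons.gc.cuny.edu")` (the hostname is an assumption here) a few times during a slow period would show which phase dominates. If `first_byte` is the slow one while `tls_handshake` stays small, that would be consistent with the TLS activity being a symptom rather than the cause.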

Actions #9

Updated by Raymond Hoh 18 days ago

  • Related to Bug #20520: CAC resposinveness/performance added

Actions #10

Updated by Raymond Hoh 16 days ago

In our last maintenance update (v2.4.1) on June 25th, we updated the Editoria11y Accessibility Checker plugin to v1.0.17. This plugin update included a database upgrade routine that can degrade performance, especially on a multisite install. See https://github.com/itmaybejj/editoria11y-wp/issues/32 for a detailed report.

The plugin author has updated the plugin to v1.0.18 to address this issue; I've committed it here -- https://github.com/cuny-academic-commons/cac/commit/bcc4595d1b0a37f4514e6c85ed715a37b40ee445 -- and pushed the update to production in hopes that the routine is fixed. Hopefully this helps with database responsiveness, but I will keep checking the plugin update list to see what else could have caused the database slowness.

Actions #11

Updated by Laurie Hurson 16 days ago

Hi Ray,

Thanks for this insight. I believe Editoria11y is currently installed on every site on the Commons at the network level (similar to Imsanity, etc.). We had decided to add this plugin on all sites as an effort towards making sites more accessible and alerting users to accessibility tips. I have actually gotten feedback that this plugin confuses users because it adds a large number of flags to the front end that are only visible (I think) to site admins.

Would it be worth deactivating it on sites at network level (if possible?) to see if this helps with the current loading and database issues?

Actions #12

Updated by Colin McDonald 16 days ago

Thanks for your work on this, Ray. Coincidentally or not, I am seeing the "technical problems" downtime message on the Commons right now.

Actions #13

Updated by Raymond Hoh 16 days ago

> Coincidentally or not, I am seeing the "technical problems" downtime message on the Commons right now.

Yeah, the Commons is down again unfortunately.

> Would it be worth deactivating it on sites at network level (if possible?) to see if this helps with the current loading and database issues?

I've written a filter to disable Editoria11y temporarily on production in /wp-content/mu-plugins/20515-disable-editoria11y.php. We'll see if that helps once the Commons is back up. I'm still going through the plugin and theme list to see if anything else jumps out.

Actions #14

Updated by Raymond Hoh 16 days ago

The Commons is back online. We'll see if disabling the Editoria11y plugin will stabilize the database responsiveness issues. I can confirm in the slow query log Yiu Ming provided that Editoria11y did generate many slow queries, which could have caused a cascading delay with other database queries. Will check in a bit later to see how the Commons is doing.

Actions #15

Updated by Raymond Hoh 15 days ago

I can confirm that disabling the Editoria11y plugin has stabilized load times on the Commons:


(By the way, this graph was generated with HetrixTools, which I mentioned in #18841.)

Can anyone else confirm that the Commons is running about the same as before?

Actions #16

Updated by Raymond Hoh 15 days ago

Adding Scott and Marilyn as watchers.

Actions #17

Updated by Laurie Hurson 15 days ago

Hi Ray,

The Commons does seem to be working and loading better!

I am still seeing that "[Groups/Sites/Members] from your campus" does not load, and that the home page keeps loading and transferring data continually, even 45+ seconds after the initial page load. But perhaps this is related to pulling the "from your campus" info. See video.

Actions #18

Updated by Raffi Khatchadourian 15 days ago

I haven't received a Jetpack alert since 6 pm last night, but I'd say that my site still seems extremely slow.

Actions #19

Updated by scott voth 15 days ago

I can confirm it is running a lot faster now.

Actions #20

Updated by Colin McDonald 15 days ago

Seems to be working ok for me. I just posted an update to the team forum, which took forever a few days ago to register/send, and it was pretty instant.

Actions #21

Updated by Boone Gorges 8 days ago

  • Status changed from New to Resolved
  • Target version set to Not tracked

Actions #22

Updated by Raymond Hoh 8 days ago

I got an email notification that there was some downtime today between 4:35am ET and 6:11am ET.

I looked at the error log to see what could be a cause and saw these entries:

[Tue Jul 09 04:35:01 2024] [notice] [pid 108823] sapi_apache2.c(349): [client 116.202.254.214:54230] WordPress database error WSREP has not yet prepared node for application use for query SELECT * FROM `wp_cavalcade_jobs` WHERE site = 1120 AND hook = 'jetpack_v2_heartbeat' AND args = 'a:0:{}' AND status IN('waiting','running') ORDER BY nextrun ASC LIMIT 1 made by require('wp-blog-header.php'), require_once('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), do_action('plugins_loaded'), WP_Hook->do_action, WP_Hook->apply_filters, Jetpack->configure, Jetpack_Heartbeat::init, Automattic\\Jetpack\\Heartbeat::init, Automattic\\Jetpack\\Heartbeat->__construct, wp_next_scheduled, wp_get_scheduled_event, apply_filters('pre_get_scheduled_event'), WP_Hook->apply_filters, HM\\Cavalcade\\Plugin\\Connector\\pre_get_scheduled_event, HM\\Cavalcade\\Plugin\\Job::get_jobs_by_query
[Tue Jul 09 04:35:02 2024] [notice] [pid 109152] sapi_apache2.c(349): [client 84.51.29.236:4803] WordPress database error WSREP has not yet prepared node for application use for query SELECT display_meta, notifications FROM wp_1859_gf_form_meta WHERE form_id=1 made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/twentyfourteen/single.php'), get_sidebar, locate_template, load_template, require_once('/themes/twentyfourteen/sidebar.php'), dynamic_sidebar, WP_Widget->display_callback, GFWidget->widget, GFFormsModel::get_form_meta

I looked up this error -- https://galeracluster.com/library/documentation/crash-recovery.html -- and it looks like one of the database cluster nodes went down, which might have triggered a database restart from Lihua's scripts. I haven't contacted IT about this yet.
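
To compare the window of WSREP errors against the reported downtime, the error log can be scanned for the first and last occurrence. This is a minimal sketch that assumes the log lines keep the bracketed timestamp format shown above; the function name `wsrep_error_window` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp at the start of a line and the WSREP
# message text, both as they appear in the log entries quoted above.
ENTRY = re.compile(
    r"^\[(?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4})\].*"
    r"WordPress database error WSREP has not yet prepared node"
)


def wsrep_error_window(lines):
    """Return (first_seen, last_seen, count) for WSREP errors, or None."""
    stamps = []
    for line in lines:
        match = ENTRY.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
            )
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)
```

Passing an open log file works, since the function only iterates over lines, for example `wsrep_error_window(open(path))` where `path` is wherever the error log lives. If the first and last timestamps line up with the 4:35am to 6:11am window from the notification, that would support the node outage as the cause. Galera also exposes a `wsrep_ready` status variable that could be checked on each node, though that has not been done here.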


Sorry for the double-post for those that already received it. I previously posted this message in the wrong ticket (#18841).
