Feature #18841
Downtime planning
Status: open
Added by Colin McDonald about 1 year ago. Updated 5 months ago.
Description
In light of recent issues with our server, we'd like to have a more concrete plan in place should the Commons experience significant downtime. Let's gather ideas, next steps, and preparatory materials here.
As a starting point, during the last few meetings we discussed establishing a Commons archive or backup. We need to hash out what that would mean:
- What content would we save and what wouldn't we?
- Would we save full site/group configurations (members, preferences, etc.), prioritize Library and Media uploads, etc.?
- Where would this archive live, and how would it be updated/tested/maintained?
- What would be the plan for using this in the event of an outage?
- Could we maintain an external list of site/group admin emails for emergency outreach?
Updated by Boone Gorges about 1 year ago
The Graduate Center maintains daily backups. I believe that these backups are kept on-site for something like 30-40 days (as space allows) and that they're afterward moved into cold storage. These are full-fidelity backups that can be restored for data recovery, etc. In an emergency, it would be theoretically possible to use one of these backups to launch a new instance of the Commons. But this assumes that the backups are accessible in this emergency, and assumes that the GC team is going to be available to stand up the backup.
So, to put this in terms of Colin's questions: We already take full-fidelity backups. But their utility during emergency downtime is not clear. It's likely they'd only be accessed if the site was down for many days, and that this downtime forced us to make an emergency (permanent) hosting change.
In a true emergency, it's probably going to be largely out of our team's hands how fast we get the site back up and running. And I think that we don't have anything approaching the necessary resources - money for servers, dev staff, IT support - that are required to have true "failover" capacity - ie, a version of the site that we could switch over to in an emergency.
What we can control, and what is thus something that we should prepare for, is a communication plan for our users. Here's what's already in place:
1. If our WordPress application goes down, but Apache is still up and running, we show a custom 500 error page. This is https://commons.gc.cuny.edu/wp-content/error500.html. In certain kinds of emergencies, this page would remain available, and we'd be able to customize the message shown here.
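For reference, serving a static fallback page like this is typically handled with Apache's ErrorDocument directive. The snippet below is only a sketch of what such a configuration might look like, using the path mentioned above; the Commons' actual server configuration may differ.

```apache
# Sketch (assumed, not the Commons' actual config): when the PHP/WordPress
# layer returns a 500, Apache serves the static custom error page instead
# of the default error message. The page must be plain HTML so it stays
# available even when the application is down.
ErrorDocument 500 /wp-content/error500.html
```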
2. We have the Commons status blog on wordpress.com https://commonsstatus.wordpress.com/. Critically, this blog is hosted off the Commons, so we're able to share information even if the Commons is completely unavailable. This blog is linked from the 500 error page, so in the kind of emergency where users are able to see the 500 page, they can follow the link. In other scenarios, it's likely that users will not have preexisting knowledge of the status blog, so we'd want to include it in any communication. For example, in an emergency where the Commons is completely inaccessible for several days, I can imagine posting updates to the status blog several times daily; in contrast, we might send a single email to a certain subset of Commons users, informing them about the situation and letting them know that they can watch the status blog for more information.
Perhaps the critical missing piece in our communication plan is that we have no way of sending emails to Commons users without, at the very least, having access to the MySQL database where email addresses are stored. Having an email list is probably going to be a central piece of our communication plan in almost any emergency, so it's likely our first priority to develop a plan for creating and maintaining an email list. Some considerations:
a. The list of emails belonging to Commons users is not static. New members sign up, while existing members change their addresses or even delete their accounts.
b. A list of 45k+ email addresses includes a great deal of sensitive contact information, a fact that we have to keep in mind as we decide how to store it, who has access, etc.
c. Most email addresses in the system belong to Commons members who have not logged into the site in a long time. They are no longer "active" users in any sense. Some may no longer be part of the CUNY community. Some may have passed away.
To mitigate some of these concerns, we might come up with heuristics that help us to narrow down a subset of users who it would be critical to contact in an emergency. Then, perhaps we can either (a) only store those email addresses, or (b) store all email addresses in a way that indicates whether a user is in one or more "important" categories. A few ideas about such heuristics:
- Users who have logged into the site in the last 6 or 12 months
- Users who are administrators of one or more sites (sites that have been active in the past x months?)
- Users who are admins of one or more groups (...that have been active in the past x months?)
- Members of groups or sites that have been active in the past x months
- Owners/members of sites/groups with the Teaching label
I imagine it's likely that, in a case where the Commons is unavailable, the first thing users will ask for is a list of email addresses for their groups/sites. I'm thinking primarily of an instructor who has to inform their students that the Commons is unavailable. If others agree, and if this is a service we want to be able to provide, then we need to have a more complicated mechanism for collecting and storing email addresses.
Because the data changes over time, we probably need a tool that can be run on a regular basis, either in an automated fashion or manually. We'd have to decide on an appropriate interval - say, once per month. Then, we have to decide where to store this information. Is, for example, a shared folder on Google Drive secure enough for this purpose?
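A regularly run export tool along these lines could look something like the sketch below. This is only an illustration: the table and meta-key names (`wp_users`, `wp_usermeta`, BuddyPress's `last_activity` key) and the `export_active_emails` helper are assumptions based on a standard WordPress/BuddyPress install, not the Commons' actual schema or tooling.

```python
import csv
import datetime


def export_active_emails(conn, outfile, months=12):
    """Export login names and email addresses of users active in the
    last `months` months to a CSV file, and return the row count.

    `conn` is assumed to be a DB-API connection to the WordPress MySQL
    database. The 'last_activity' usermeta key is the heuristic for
    "active" discussed above; other heuristics (site/group admins,
    Teaching label, etc.) would need additional joins.
    """
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=30 * months)
    sql = """
        SELECT u.user_login, u.user_email
        FROM wp_users u
        JOIN wp_usermeta m ON m.user_id = u.ID
        WHERE m.meta_key = 'last_activity'
          AND m.meta_value >= %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (cutoff.strftime('%Y-%m-%d %H:%M:%S'),))
        rows = cur.fetchall()

    with open(outfile, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['user_login', 'user_email'])
        writer.writerows(rows)
    return len(rows)
```

A script like this could run monthly from cron, with the resulting CSV copied to whatever storage location we settle on; the security question about where that file lives is the same one raised above.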
These are some initial thoughts - hopefully they can help to get a conversation started.
Updated by Boone Gorges 9 months ago
- Assignee changed from Boone Gorges to Colin McDonald
- Target version set to Not tracked
Updated by Raymond Hoh 5 months ago
Due to recent downtime, I looked into creating a new status page that has automated uptime checking. See https://cunycommons.instatus.com . This uses Instatus for the status page and uses a third-party integration with HetrixTools for the actual uptime monitoring and automated posting to Instatus. When the Commons is down, HetrixTools will send a webhook to Instatus and an automated post will be made about the downtime.
Users can subscribe to individual downtime incidents, which is handy if one wants to know when the Commons is back online after a specific downtime event. We can also import a list of subscribers via CSV in the admin dashboard. As for the design, we can add custom CSS to style the page; for example, I added the Poppins font that we use for our site. If anyone is interested, I can add you as a team member to Instatus and/or HetrixTools so you can check it out.
Updated by Colin McDonald 5 months ago
Thanks for this Ray, it looks like a really useful and clear tool. I'd be curious to take a look as a team member.
One thing that comes to mind is whether the Recent Notices section is all automated and tied to the regular checks for outages, or whether we could post other updates into that feed (like say, an RSS of our status blog). It might be nice to have the auto-updates mixed in with our manual ones. Then perhaps we could even replace our default "technical problems" link with this one?
Updated by Raymond Hoh 5 months ago
Colin, I've added you to Instatus as an Owner.
I also wanted to add you to HetrixTools, but it looks like that's a premium feature and I'm just testing with a free account at the moment. However, there is a public uptime page available to take a look at: https://hetrixtools.com/report/uptime/65b1dc51c2df4bf4b5cf78ba6e7bb4b6/ .
As for:
One thing that comes to mind is whether the Recent Notices section is all automated and tied to the regular checks for outages, or whether we could post other updates into that feed (like say, an RSS of our status blog).
Yes, we can edit automated notices after they've been posted, and we can post our own entries through the dashboard as well. We can also post new notices by email, for anyone who doesn't want to log in to the dashboard. Unfortunately, it doesn't look like you can add an RSS feed to pull into the stream.
Then perhaps we could even replace our default "technical problems" link with this one?
I think that would be the goal. I do like the automated downtime posting as sometimes I forget to update the wordpress.com site until it's a bit too far into the downtime.
Updated by Raymond Hoh 5 months ago
I got an email notification that there was some downtime today between 4:35am ET and 6:11am ET.
I looked at the error log to see what could be a cause and saw these entries:
[Tue Jul 09 04:35:01 2024] [notice] [pid 108823] sapi_apache2.c(349): [client 116.202.254.214:54230] WordPress database error WSREP has not yet prepared node for application use for query SELECT * FROM `wp_cavalcade_jobs` WHERE site = 1120 AND hook = 'jetpack_v2_heartbeat' AND args = 'a:0:{}' AND status IN('waiting','running') ORDER BY nextrun ASC LIMIT 1 made by require('wp-blog-header.php'), require_once('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), do_action('plugins_loaded'), WP_Hook->do_action, WP_Hook->apply_filters, Jetpack->configure, Jetpack_Heartbeat::init, Automattic\\Jetpack\\Heartbeat::init, Automattic\\Jetpack\\Heartbeat->__construct, wp_next_scheduled, wp_get_scheduled_event, apply_filters('pre_get_scheduled_event'), WP_Hook->apply_filters, HM\\Cavalcade\\Plugin\\Connector\\pre_get_scheduled_event, HM\\Cavalcade\\Plugin\\Job::get_jobs_by_query

[Tue Jul 09 04:35:02 2024] [notice] [pid 109152] sapi_apache2.c(349): [client 84.51.29.236:4803] WordPress database error WSREP has not yet prepared node for application use for query SELECT display_meta, notifications FROM wp_1859_gf_form_meta WHERE form_id=1 made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/twentyfourteen/single.php'), get_sidebar, locate_template, load_template, require_once('/themes/twentyfourteen/sidebar.php'), dynamic_sidebar, WP_Widget->display_callback, GFWidget->widget, GFFormsModel::get_form_meta
I looked up this error -- https://galeracluster.com/library/documentation/crash-recovery.html -- and it looks like one of the database cluster nodes went down, which might have triggered a database restart from Lihua's scripts. I haven't contacted IT about this yet.
Oops, posted this in the wrong ticket! Was meant to go to #20515.