Feature #18841


Downtime planning

Added by Colin McDonald 3 months ago. Updated 3 months ago.

In light of recent issues with our server, we'd like to have a more concrete plan in place should the Commons experience significant downtime. Let's gather ideas, next steps, and preparatory materials here.

As a starting point, during the last few meetings we discussed establishing a Commons archive or backup. We need to hash out what that would mean:

- What content would we save and what wouldn't we?
- Would we save full site/group configurations (members, preferences, etc.), or prioritize Library and Media uploads?
- Where would this archive live, and how would it be updated/tested/maintained?
- What would be the plan for using this in the event of an outage?
- Could we maintain an external list of site/group admin emails for emergency outreach?

Actions #1

Updated by Boone Gorges 3 months ago

The Graduate Center maintains daily backups. I believe that these backups are kept on-site for something like 30-40 days (as space allows) and that they're afterward moved into cold storage. These are full-fidelity backups that can be restored for data recovery, etc. In an emergency, it would be theoretically possible to use one of these backups to launch a new instance of the Commons. But this assumes that the backups are accessible in this emergency, and assumes that the GC team is going to be available to stand up the backup.

So, to put this in terms of Colin's questions: We already take full-fidelity backups. But their utility during emergency downtime is not clear. It's likely they'd only be accessed if the site was down for many days, and that this downtime forced us to make an emergency (permanent) hosting change.

In a true emergency, how fast we get the site back up and running is probably going to be largely out of our team's hands. And I think we don't have anything approaching the resources (money for servers, dev staff, IT support) required for true "failover" capacity, i.e., a version of the site that we could switch over to in an emergency.

What we can control, and what is thus something that we should prepare for, is a communication plan for our users. Here's what's already in place:

1. If our WordPress application goes down, but Apache is still up and running, we show a custom 500 error page. In certain kinds of emergencies, this page would remain available, and we'd be able to customize the message shown here.
2. We have the Commons status blog. Critically, this blog is hosted off the Commons, so we're able to share information even if the Commons is completely unavailable. This blog is linked from the 500 error page, so in the kind of emergency where users are able to see the 500 page, they can follow the link. In other scenarios, it's likely that users will not have preexisting knowledge of the status blog, so we'd likely want to include it in any communication. For example, in an emergency where the Commons is completely inaccessible for several days, I can imagine posting to the status blog several times daily to give regular updates; in contrast, we might send a single email to a certain subset of Commons users, informing them about the situation and letting them know that they can watch the status blog for more information.

Perhaps the critical missing piece in our communication plan is that we have no way of sending emails to Commons users without, at the very least, having access to the MySQL database where email addresses are stored. Having an email list is probably going to be a central piece of our communication plan in almost any emergency, so our first priority is likely to develop a plan for creating and maintaining an email list. Some considerations:

a. The list of emails belonging to Commons users is not static. New members sign up, while existing members change their addresses or even delete their accounts.
b. A list of 45k+ email addresses includes a great deal of sensitive contact information, a fact that we have to keep in mind as we decide how to store it, who has access, etc.
c. Most email addresses in the system belong to Commons members who have not logged into the site in a long time. They are no longer "active" users in any sense. Some may no longer be part of the CUNY community. Some may have passed away.

To mitigate some of these concerns, we might come up with heuristics that help us narrow down a subset of users who would be critical to contact in an emergency. Then, perhaps we can either (a) store only those email addresses, or (b) store all email addresses in a way that indicates whether a user is in one or more "important" categories. A few ideas about such heuristics:
- Users who have logged into the site in the last 6 or 12 months
- Users who are administrators of one or more sites (sites that have been active in the past x months?)
- Users who are admins of one or more groups (...that have been active in the past x months?)
- Members of groups or sites that have been active in the past x months
- Owners/members of sites/groups with the Teaching label
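
To make option (b) a bit more concrete, here is a rough sketch of how the heuristics above could be turned into per-user category tags. Everything in it is an assumption for illustration: the `Member` record shape, the field names, the category labels, and the 365-day default window are all hypothetical, and the real data would have to come out of the Commons database.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional, Set


@dataclass
class Member:
    # Hypothetical record shape, not the actual Commons schema.
    email: str
    last_login: Optional[date] = None
    # Last-activity dates of sites / groups this member administers.
    admin_site_activity: List[date] = field(default_factory=list)
    admin_group_activity: List[date] = field(default_factory=list)
    # Last-activity dates of sites / groups this member merely belongs to.
    member_space_activity: List[date] = field(default_factory=list)
    # True if the member owns or belongs to a site/group with the Teaching label.
    in_teaching_space: bool = False


def categories(member: Member, today: date, window_days: int = 365) -> Set[str]:
    """Return the set of "important" categories a member falls into."""
    cutoff = today - timedelta(days=window_days)
    found: Set[str] = set()
    if member.last_login is not None and member.last_login >= cutoff:
        found.add("recent_login")
    if any(d >= cutoff for d in member.admin_site_activity):
        found.add("site_admin")
    if any(d >= cutoff for d in member.admin_group_activity):
        found.add("group_admin")
    if any(d >= cutoff for d in member.member_space_activity):
        found.add("active_member")
    if member.in_teaching_space:
        found.add("teaching")
    return found
```

A member with an empty result would be a candidate for leaving off the list under option (a); under option (b) we'd store everyone along with their tags and filter at send time.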

I imagine it's likely that, in a case where the Commons is unavailable, the first thing users will ask for is a list of email addresses for their groups/sites. I'm thinking primarily of an instructor who has to inform their students that the Commons is unavailable. If others agree, and if this is a service we want to be able to provide, then we need to have a more complicated mechanism for collecting and storing email addresses.
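
As a sketch of what the per-group part might look like, assuming we can export (group or site name, member email) pairs from the database, the lookup an instructor would need is just a grouping step. The function name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def emails_by_group(memberships: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (group_name, email) pairs into a sorted, de-duplicated list per group."""
    grouped = defaultdict(set)
    for group_name, email in memberships:
        # Normalize so the same address is not listed twice with different casing.
        grouped[group_name].add(email.strip().lower())
    return {name: sorted(emails) for name, emails in grouped.items()}
```

Storing membership this way multiplies the amount of sensitive data we hold outside the Commons, which bears on the storage question as well.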

Because the data changes over time, we probably need a tool that can be run on a regular basis, either in an automated fashion or manually. We'd have to decide on an appropriate interval - say, once per month. Then, we have to decide where to store this information. Is, for example, a shared folder on Google Drive secure enough for this purpose?
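
For the sake of discussion, the export half of such a tool could be as small as the sketch below: it writes a dated CSV snapshot that a cron job or a person could produce on whatever interval we choose. The file naming scheme, the column names, and the `write_snapshot` function are assumptions, and the permission change only has its full effect on POSIX systems. It says nothing about where the file should ultimately live.

```python
import csv
from datetime import date
from pathlib import Path
from typing import Iterable, Tuple


def write_snapshot(rows: Iterable[Tuple[str, str]], out_dir: Path, today: date) -> Path:
    """Write (email, categories) rows to a dated CSV file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per run, so older snapshots can be pruned.
    path = out_dir / f"commons-emails-{today.isoformat()}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["email", "categories"])
        writer.writerows(rows)
    # Restrict the file to its owner, since it holds sensitive contact data.
    path.chmod(0o600)
    return path
```

Keeping one file per run makes it easy to see when the list was last refreshed, but it also means old snapshots with stale or deleted addresses accumulate unless we prune them.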

These are some initial thoughts - hopefully they can help to get a conversation started.

