Feature #3231

Feature #3230: Scripts for quicker provisioning/updating of development environments

"Clean" versions of production data

Added by Boone Gorges over 4 years ago. Updated about 3 years ago.

Priority name:
Category name: Internal Tools and Workflow
Target version:
Start date:
Due date:
% Done:
Estimated time: 15.00 h


Some version of the Commons database is necessary to provision a new instance of the site for development purposes. Using a raw dump of the entire Commons database(s) is impractical for a number of reasons:

- It's extremely large
- It's divided between two databases (WP and MediaWiki)
- It contains a great deal of non-public and potentially sensitive data

I'd like to write a tool that will create a cleaned-up version of the production database, which will be safe and fairly easy to pass around for people to provision or update their local environments. Here's a sketch of what I have in mind:

- Ignore the wiki for the time being. Let's get it set up with WP and then worry about what to do with MW. (If you need a copy of the MW db in the meantime, we can figure something out.)
- Off the top of my head, the following tasks need to be performed:
  * For all user accounts, reset passwords to something neutral (so it's easy to log in as a different user for testing)
  * For all user accounts, reset user_email to something fake (to avoid improper emails going out to actual addresses)
  * For all users, delete any xprofile data that is not set to be visible to the public (or maybe public + logged-in)
  * Delete all non-public groups, along with associated activity, forums, docs, files
  * Delete all blogs that are not set to be visible by the public (or maybe public + logged in)
  * For all blogs, delete all password-protected posts
  * For all blogs, delete all post drafts and post revisions (this is both for privacy reasons and to reduce the size of the database)
  * For all blogs, delete all non-published comments
  * Delete all private messages
- Obviously, since many of these checks will be WP-based, the "cleaner" will have to be a WordPress plugin of some sort. I'm thinking this is a good use for wp-cli.
- At the same time, we cannot perform any of these actions on the production database. So part of the cleaner will also have to be a tool that will export a raw version of the db and set up a parallel instance of WP (see #3230 for some ways forward with this). Much of this aspect could be manual for now.
- Most of this stuff is not specific to the Commons, so we should build this in a way that is fairly abstracted, with the thought of making it available as a plugin.
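
To make a few of the tasks above concrete, here is a minimal sketch of the core-WordPress steps (passwords, emails, protected posts, drafts/revisions, unapproved comments). It is not the cleaner itself: the ticket proposes a wp-cli based WordPress plugin, while this sketch uses Python and an in-memory SQLite database purely for illustration. The table and column names follow the standard WordPress schema, but the tables are heavily simplified, and `clean_copy`, `build_demo`, the placeholder password, and the `example.invalid` addresses are all hypothetical. The BuddyPress-specific steps (xprofile, groups, private messages) and the per-blog tables of a multisite install are not covered.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for three core WordPress tables.
SCHEMA = """
CREATE TABLE wp_users (ID INTEGER PRIMARY KEY, user_login TEXT,
                       user_pass TEXT, user_email TEXT);
CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_type TEXT,
                       post_status TEXT, post_password TEXT);
CREATE TABLE wp_comments (comment_ID INTEGER PRIMARY KEY, comment_approved TEXT);
"""


def build_demo():
    """Create an in-memory database with a little fake 'production' data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO wp_users VALUES (?, ?, ?, ?)", [
        (1, "alice", "hash1", "alice@real.example"),
        (2, "bob", "hash2", "bob@real.example"),
    ])
    conn.executemany("INSERT INTO wp_posts VALUES (?, ?, ?, ?)", [
        (1, "post", "publish", ""),        # public post: kept
        (2, "post", "draft", ""),          # draft: removed
        (3, "revision", "inherit", ""),    # revision: removed
        (4, "post", "publish", "secret"),  # password-protected: removed
    ])
    conn.executemany("INSERT INTO wp_comments VALUES (?, ?)", [
        (1, "1"),     # approved: kept
        (2, "0"),     # pending: removed
        (3, "spam"),  # spam: removed
    ])
    conn.commit()
    return conn


def clean_copy(conn, placeholder_pass="PLACEHOLDER"):
    """Apply some cleanup steps. Run this on a copy, never on production."""
    cur = conn.cursor()
    # Reset every password to one neutral value. A real install stores a
    # password hash in user_pass, so a real cleaner would write a hash of the
    # neutral password (for example via WordPress's wp_set_password()).
    cur.execute("UPDATE wp_users SET user_pass = ?", (placeholder_pass,))
    # Replace real addresses with fake ones so no mail reaches actual users.
    cur.execute(
        "UPDATE wp_users SET user_email = 'user' || ID || '@example.invalid'"
    )
    # Delete password-protected posts.
    cur.execute("DELETE FROM wp_posts WHERE post_password <> ''")
    # Delete drafts and revisions (privacy, and a smaller database).
    cur.execute(
        "DELETE FROM wp_posts "
        "WHERE post_status IN ('draft', 'auto-draft') OR post_type = 'revision'"
    )
    # Delete comments that are not approved (pending, spam, trash).
    cur.execute("DELETE FROM wp_comments WHERE comment_approved <> '1'")
    conn.commit()


if __name__ == "__main__":
    demo = build_demo()
    clean_copy(demo)
    for row in demo.execute("SELECT ID, user_login, user_email FROM wp_users"):
        print(row)
```

On a multisite install the post and comment steps would need to be repeated for each blog's own tables, which is one reason doing this through WordPress itself, as proposed above, is probably easier than raw SQL.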

I'm soliciting feedback from the team on this, especially Ray. What items have I missed above, both in terms of privacy and in terms of paring down a huge database? Do you have any implementation ideas beyond (or in lieu of) what I've spelled out above?

(As a bit of background, we're in the process of bringing on one or two folks to work on the Commons in the fairly near future, so getting at least a semi-manual version of this cleaner up and running is somewhat of a priority.)


#1 Updated by Boone Gorges over 4 years ago

A first pass at this functionality is in I'm going to follow up privately with Dan to make sure the generated database dump is working correctly, and I'll post back here with details.

#2 Updated by Boone Gorges about 3 years ago

  • Status changed from Assigned to Resolved

Dan and I have run through a couple sets of exports over the past year or so, and the cac-database-cleaner tool is working pretty well. I'm going to mark this ticket as resolved.
