Feature #3231
Feature #3230 (closed): Scripts for quicker provisioning/updating of development environments
"Clean" versions of production data
Description
Some version of the Commons database is necessary to provision a new instance of the site for development purposes. Using a raw dump of the entire Commons database(s) is impractical for a number of reasons:
- It's extremely large
- It's divided between two databases (WP and MediaWiki)
- It contains a great deal of non-public and potentially sensitive data
I'd like to write a tool that will create a cleaned-up version of the production database, which will be safe and fairly easy to pass around for people to provision or update their local environments. Here's a sketch of what I have in mind:
- Ignore the wiki for the time being. Let's get it set up with WP and then worry about what to do with MW. (If you need a copy of the MW db in the meantime, we can figure something out.)
- Off the top of my head, the following tasks need to be performed:
* For all user accounts, reset passwords to something neutral (so it's easy to log in as a different user for testing)
* For all user accounts, reset user_email to something fake (to avoid improper emails going out to actual addresses)
* For all users, delete any xprofile data that is not set to be visible to the public (or maybe public + logged-in)
* Delete all non-public groups, along with associated activity, forums, docs, files
* Delete all blogs that are not set to be visible to the public (or maybe public + logged-in)
* For all blogs, delete all password-protected posts
* For all blogs, delete all post drafts and post revisions (this is both for privacy reasons and to reduce the size of the database)
* For all blogs, delete all non-published comments
* Delete all private messages
- Obviously, since many of these checks will be WP-based, the "cleaner" will have to be a WordPress plugin of some sort. I'm thinking this is a good use for wp-cli.
- At the same time, we cannot perform any of these actions on the production database. So part of the cleaner will also have to be a tool that will export a raw version of the db and set up a parallel instance of WP (see #3230 for some ways forward with this). Much of this aspect could be manual for now.
- Most of this is not specific to the Commons, so we should build it in a fairly abstracted way, with the thought of making it available as a plugin.
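To make a few of the cleanup tasks above concrete, here is a minimal sketch in Python. It is not the cac-database-cleaner implementation, and it uses an in-memory SQLite database as a stand-in for a copy of the WP database. The table and column names (`wp_users.user_pass`, `wp_users.user_email`, `wp_posts.post_status`, `wp_posts.post_type`, `wp_posts.post_password`, `wp_comments.comment_approved`) follow the stock WordPress schema with the default `wp_` prefix. The function names, the placeholder password value, and the `example.invalid` address pattern are assumptions made for illustration. BuddyPress data (groups, xprofile, private messages) and per-blog tables are left out.

```python
import sqlite3

# Hypothetical placeholder. A real cleaner would store a valid WordPress
# password hash here so that every account shares one known password.
NEUTRAL_PASSWORD_HASH = "neutral-password-hash"


def clean_copy(conn: sqlite3.Connection) -> None:
    """Scrub a *copy* of the database in place. Never run on production."""
    cur = conn.cursor()
    # Reset every password to the same neutral value.
    cur.execute("UPDATE wp_users SET user_pass = ?", (NEUTRAL_PASSWORD_HASH,))
    # Replace real addresses with fake, per-user ones so no mail reaches real people.
    cur.execute(
        "UPDATE wp_users SET user_email = 'user' || ID || '@example.invalid'"
    )
    # Drop drafts, revisions, and password-protected posts.
    cur.execute(
        "DELETE FROM wp_posts "
        "WHERE post_status IN ('draft', 'auto-draft') "
        "OR post_type = 'revision' "
        "OR post_password <> ''"
    )
    # Keep only approved comments ('0' = pending, 'spam', 'trash' are removed).
    cur.execute("DELETE FROM wp_comments WHERE comment_approved <> '1'")
    conn.commit()


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a subset of the WordPress columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wp_users (ID INTEGER PRIMARY KEY, user_login TEXT,
                               user_pass TEXT, user_email TEXT);
        CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_title TEXT,
                               post_status TEXT, post_type TEXT, post_password TEXT);
        CREATE TABLE wp_comments (comment_ID INTEGER PRIMARY KEY,
                                  comment_content TEXT, comment_approved TEXT);

        INSERT INTO wp_users VALUES (1, 'alice', 'hashA', 'alice@real.example');
        INSERT INTO wp_users VALUES (2, 'bob', 'hashB', 'bob@real.example');

        INSERT INTO wp_posts VALUES (1, 'Public post', 'publish', 'post', '');
        INSERT INTO wp_posts VALUES (2, 'Draft', 'draft', 'post', '');
        INSERT INTO wp_posts VALUES (3, 'Old revision', 'inherit', 'revision', '');
        INSERT INTO wp_posts VALUES (4, 'Protected', 'publish', 'post', 'secret');

        INSERT INTO wp_comments VALUES (1, 'approved', '1');
        INSERT INTO wp_comments VALUES (2, 'pending', '0');
        INSERT INTO wp_comments VALUES (3, 'spam', 'spam');
        """
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    clean_copy(demo)
    for row in demo.execute("SELECT ID, user_login, user_email FROM wp_users"):
        print(row)
```

On a multisite install the post and comment statements would need to be repeated for each blog's own tables, and comments attached to deleted posts would need a follow-up pass. Both are omitted here to keep the sketch short.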
I'm soliciting feedback from the team on this, especially Ray. What items have I missed above, both in terms of privacy and in terms of paring down a huge database? Do you have any implementation ideas beyond (or in lieu of) what I've spelled out above?
(As a bit of background, we're in the process of bringing on one or two folks to work on the Commons in the fairly near future, so getting at least a semi-manual version of this cleaner up and running is somewhat of a priority.)
Updated by Boone Gorges over 10 years ago
A first pass at this functionality is in https://github.com/cuny-academic-commons/cac-database-cleaner. I'm going to follow up privately with Dan to make sure the generated database dump is working correctly, and I'll post back here with details.
Updated by Boone Gorges almost 9 years ago
- Status changed from Assigned to Resolved
Dan and I have run through a couple sets of exports over the past year or so, and the cac-database-cleaner tool is working pretty well. I'm going to mark this ticket as resolved.