Support #15176
Archiving Q Writing & Old Wordpress Sites on the Commons
Description
Hi Boone, Ray, All,
I am meeting with librarians and staff from Queens College later this week to talk about the use of Q-Writing and the Commons on their campus. I spoke with them last semester and they had asked about some form of migration or archiving of Q writing on the Commons.
Some background: Q Writing is a WordPress multi-site platform that the Queens College Center for Teaching and Learning set up years ago. As far as I can tell, it is not as consistently used or maintained as the Commons or other campus-specific WordPress platforms throughout CUNY. The Queens College folks have asked about the "possibility of incorporating QC’s current WordPress MU site into the Commons." Qwriting currently has 8,195 sites and 22,538 users.
I told them migrating or archiving Q writing in the Commons was probably not possible but when I spoke to Matt briefly about it, he mentioned I should ask about HTML flattening or other possibilities. The Queens College folks had thought about setting up a CBOX to replace Q writing but I told them, and they likely already knew, that setting up CBOX would present the same maintenance and sustainability issues they were already experiencing.
Any insight about possibilities for archiving, flattening, etc would be appreciated. Even if we can only do some of the sites (or none), that would be helpful for me to know and share with them when we meet on Thursday.
Thanks!
Updated by Boone Gorges almost 3 years ago
- Status changed from New to Reporter Feedback
Hi Laurie - Thanks for your patience as I spent some time thinking about this issue.
For future sites at QC, I think we can simply recommend that QC users create Commons accounts and create their new content on the Commons. It's only existing/legacy content that we're worried about.
As you suggest, a straight migration of all of their 8000+ existing sites into the Commons is off the table:
- It would doubtless cause storage problems. It would represent a nearly 40% increase in the number of sites on the Commons, with associated increase in database and file storage space.
- The vast majority of the sites on Qwriting are "dormant" sites, which are no longer updated. So it would not be of much benefit to port them over as WP sites.
- Qwriting sites were created by Qwriting users, who don't exist on the Commons. In order for the migrated sites to be maintained, they'd need to have new WP users on the Commons. The logistics of this would be extremely complicated - think about the difficulty that users have accessing their CUNY email addresses, times 20,000+.
- We don't offer the same plugins and themes as Qwriting, which means that some functionality/design will inevitably be lost in the migration.
The benefit of HTML flattening:
- Storage problems are mitigated. We need no database storage. We would need to store many tens of thousands of HTML pages; each would be fairly small, but it'd be large in the aggregate. We'd need to copy over most static assets (images, PDF uploads, etc), so there wouldn't be any savings there. So the storage savings are modest, but worth noting.
- We don't need to worry about spam, hacking, etc for dormant sites.
- There's no need to migrate accounts.
- It doesn't matter whether we share plugins/themes, since the sites will not be powered by WP anymore.
Flattening every Qwriting site into static HTML would have some drawbacks:
- It would no longer be possible to update the sites, including leaving comments. This would be a problem for the minority of sites that are actively used after the first semester they're created (ongoing projects, personal sites, etc)
- We'd have to figure out a way to either handle redirects, or reconfigure the Qwriting DNS so that the URLs point to the new content. This would involve some negotiation with the QC IT department.
Setting aside the Commons for a moment, my qualified recommendation for the Qwriting team is that HTML flattening is a good idea for sites that need to be archived. Maybe make a semester-long roadmap: disable the creation of new sites on Qwriting beginning immediately; use the next few months to identify a small number of sites that could be manually migrated to a live WP installation like the Commons; then run the flattening sometime over the summer, after notifying users about the change and giving them an option to get WP-based exports. I might be able to help a bit with some of the manual migrations, depending on the details.
The question of where to store the static HTML is where the Commons comes in, and in my mind this is a separate question. On balance, I do not recommend that the Commons host a large number of static files like this.
- We have limited file storage capacity, which will be taxed by hosting a static flattening of Qwriting. This would also affect our backups.
- Our filesystem and server environment are optimized for serving dynamic WordPress sites, not pure HTML sites.
- It feels like the incorrect assignment of responsibility. QC should be responsible for maintaining its legacy content, not the Commons (which is hosted at the GC).
A better option, IMO, would be for the maintainers of Qwriting to use cheaper web hosting to serve these files. I know that Qwriting currently uses Pagely, which is quite expensive. For 5-10% of the monthly cost, it should be possible to get webhosting where you simply upload a bunch of HTML files and serve them from there. Better still, it's possible that QC would be willing to host this content internally, since it would require no database, would pose no spam/hacking threat, etc. I'm estimating it'd be 80-100GB of data, with limited traffic (I have access to the Qwriting server and I'm looking at their current storage).
As for the details of static site generation: I don't have a lot of experience with this. This plugin https://github.com/leonstafford/wp2static seems like it's pretty well-regarded (though no longer available in the WP repo for some reason). It comes with a CLI tool, so it should be possible to write a script that gets a list of all Qwriting sites, then runs the wp2static command for each of them. Most steps in the migration process, such as evaluating hosting, can be done by someone without a technical background. But the actual generation of static sites - given the complexities of Multisite, the very large size of Qwriting, the limitations of storage space in the Pagely environment, etc - means that they'll probably need to work with a developer for at least this step. I imagine it would take a day or two of developer time to do it. I'm copying Ray on this ticket in case he has experience with other tools for static site generation, or insight into how to scale it to a large Multisite installation.
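A rough sketch of the "list all sites, then run a command for each" idea, for whoever picks this up. `wp site list --field=url` is a standard WP-CLI call, but the wp2static subcommand is deliberately left as an argument, since its CLI syntax depends on the installed version; the function name `flatten_each_site` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: run one WP-CLI command per site in a Multisite network.
# The command to run (for example a wp2static subcommand) is passed in as
# arguments; which subcommand applies is an assumption left to the operator.

# Reads site URLs on stdin, one per line, and runs
#   wp --url=<site> <args...>
# for each of them, reporting failures without stopping the loop.
flatten_each_site() (
    while IFS= read -r site_url; do
        [ -n "$site_url" ] || continue          # skip blank lines
        # </dev/null keeps wp from swallowing the rest of the URL list
        wp --url="$site_url" "$@" </dev/null \
            || echo "FAILED: $site_url" >&2
    done
)
```

It would be fed from the site list, something like `wp site list --field=url | flatten_each_site wp2static <subcommand>`, and on a network this size it would probably need to run in batches because of the storage limits mentioned above.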
Updated by Laurie Hurson almost 3 years ago
Hi Boone,
Thanks so much for this insight. I am passing along these ideas and suggestions to the QWriting team.
I think the Queens team is aware of and on board with not moving Qwriting or flattened files over onto the Commons. They recognize the storage issues and plan to continue to host Qwriting outside the Commons. I will suggest possibly changing their host to reduce costs.
That said, they do want to essentially shut down new use of Qwriting and start to transition Queens users over to the Commons. I am going to work with them on support and onboarding for Queens users once this process begins. I will keep you posted on timeline, but I think sometime this summer.
A few follow up questions:
- Is it technically possible to flatten many of the dormant QWriting sites while leaving the active sites active on the platform? For example, to save space they flatten 90% of the sites on QWriting but a few active sites want to stay on the platform and continue as usual. All users of dormant sites and new users would be directed to join the Commons to continue using a CUNY WordPress platform. What do you think about this?
- Would it be possible to port QWriting users over into the Commons - so batch create Commons accounts based on Qwriting credentials? I know you mentioned this but wanted to get more insight on possibilities. The Queens team thought it might be helpful if people could use their Qwriting account to log into the Commons, which means there would need to be a Commons username/password that is the same as the Qwriting account.
Thanks again for your insight on this.
Updated by Boone Gorges almost 3 years ago
> - Is it technically possible to flatten many of the dormant QWriting sites while leaving the active sites active on the platform? For example, to save space they flatten 90% of the sites on QWriting but a few active sites want to stay on the platform and continue as usual. All users of dormant sites and new users would be directed to join the Commons to continue using a CUNY WordPress platform. What do you think about this?
Yes, but it would take some custom technical work to make it happen. You'd have to identify the 10% to keep active; write a tool that flattens the other 90% and deletes them from the WP instance (including their uploaded files, which is what takes up space); move those flattened files to a new host; and get the URLs of the 90% redirected/rewritten to the new location. The latter could be done in conjunction with QC IT, or through configuration of the Pagely server. I wouldn't hold my breath that QC IT would work through this, so it'd likely fall on the QC team to handle the redirects from Pagely. This redirect setup is inherently more complicated than if you were to redirect all Qwriting traffic. In this way, the staged movement is more complex than doing the whole thing at once. If the goal is to save a few months' worth of storage costs from Pagely, I'd suggest it's probably not worth it, since they'll end up paying more to a developer to do this extra work.
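For the "identify the 10% to keep active" step, a first pass could be made from the `last_updated` column that WP-CLI reports for each site. This is only a sketch: the cutoff date is arbitrary, the function name is hypothetical, and `last_updated` is a coarse signal, so the resulting list would still need a human review.

```shell
#!/bin/sh
# Sketch: print the URLs of sites not updated since a cutoff date.
# Input is the CSV produced by
#   wp site list --fields=blog_id,url,last_updated --format=csv
# (header row first; date fields may or may not be quoted).
list_dormant_sites() (
    cutoff=$1    # YYYY-MM-DD, compared as a string
    awk -F',' -v cutoff="$cutoff" '
        NR == 1 { next }                                 # skip the header row
        { gsub(/"/, "", $2); gsub(/"/, "", $3) }         # drop CSV quoting
        ($3 "") < (cutoff "") { print $2 }               # force string comparison
    '
)
```

For example, `wp site list --fields=blog_id,url,last_updated --format=csv | list_dormant_sites 2020-01-01` would list candidates for flattening; everything else would be a candidate to keep live.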
> - Would it be possible to port QWriting users over into the Commons - so batch create Commons accounts based on Qwriting credentials? I know you mentioned this but wanted to get more insight on possibilities. The Queens team thought it might be helpful if people could use their Qwriting account to log into the Commons, which means there would need to be a Commons username/password that is the same as the Qwriting account.
It's technically possible to batch-create accounts using the same user names/email addresses. It would be complex, because we'd need to identify duplicates, map metadata (like display names) onto the Commons, etc. It's not technically possible to keep the credentials, because passwords are encrypted using a one-way algorithm.
Updated by Raymond Hoh almost 3 years ago
After HTML static generation, the Q Writing team might want to consider hosting their archived sites on Github Pages. It's free, allows for custom domains and should scale well. The one problem could be storage. Each Github Pages site allows up to 1GB of storage and has a soft bandwidth limit of 100GB per month: https://docs.github.com/en/pages/getting-started-with-github-pages/about-github-pages#usage-limits.
They could potentially use a subdirectory setup if the cumulative storage size of the sites is < 1GB. Then the main domain could be archive.qwriting.gc.cuny.edu, and a site like http://qcvoices.qwriting.qc.cuny.edu/ could reside at archive.qwriting.gc.cuny.edu/qcvoices.
I also took a quick look on the QWriting Social site and there is a fair bit of spam: https://social.qwriting.qc.cuny.edu/activity/
I would first do an audit to try and remove as many spam users and sites as possible before looking at archiving. Also look at removing any sites that are test sites or do not contain any content. I would also highly suggest turning off new user registrations and new blog signups as soon as possible.
Updated by Boone Gorges almost 3 years ago
Thanks, Ray! I hadn't thought of GitHub Pages.
The space issue is likely to be the pain point. In my experience with Qwriting, the vast majority of storage space is taken up by faculty who have uploaded PDFs, images, videos, and other media to share with students. This is all legitimate data that wouldn't be eliminated by deleting spam sites. And these uploads are likely to drive the cumulative space required into the tens of gigabytes.
Updated by Raymond Hoh almost 3 years ago
I think it would be good to determine the top five sites that have the largest storage sizes and to see if the uploaded files are actually referenced on the site or not.
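One way to find the largest sites, sketched with standard `du`/`sort`/`head`. The uploads path is an assumption: newer Multisite installs keep per-site uploads under `wp-content/uploads/sites/<id>`, while older WPMU-era installs use `wp-content/blogs.dir/<id>`, so the root has to be passed in. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list the N largest per-site upload directories (sizes in KB).
# $1 = directory holding one subdirectory per site (path is an assumption,
#      e.g. wp-content/uploads/sites or wp-content/blogs.dir)
# $2 = how many to show (defaults to 5)
largest_upload_dirs() (
    uploads_root=$1
    count=${2:-5}
    du -sk "$uploads_root"/*/ 2>/dev/null | sort -rn | head -n "$count"
)
```

Running `largest_upload_dirs wp-content/uploads/sites 5` would give the five site IDs to look at first. Checking whether each uploaded file is actually referenced in posts is a separate step and is not covered here.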
I'm guessing of the 8000 sites, only some of them are really large whereas some are small, which might offset the total cumulative storage.
Could also run the largest sites through some optimizer plugins as described here: https://wpengine.com/resources/wordpress-media-library-clean-up/#Plugins_For_Image_Cleanup
Updated by Laurie Hurson almost 3 years ago
Hi Boone and Ray
Thanks so much for sharing your insight on this.
The most recent update on QWriting is that Queens IT pulled it offline citing it as a cyber security threat. But, as Boone mentioned on the dev call this Tuesday, it is likely still accessible through the host, Pagely.
I have several folks reaching out to me about moving sites from Qwriting onto the Commons and I have asked them if it is possible to get into the platform to export the site content as an xml file, in the same way one would usually move a wordpress site.
They have two questions that I am not sure about:
1. "If we were able to secure a WP XML backup from a particular site or sites, would we be able to import that into the Commons, including all assets (jpgs, docs, pdfs etc) that may or may not now have new URL redirects?"
-- I believe the answer to this is yes? Would this be any different than moving any other wordpress site? Does this process change if the DNS has been disabled?
2. "In a hypothetical situation where there was a backup of a multi-site install in SQL format, is it difficult to extract specific, individual sites?"
- No idea on this one. Thoughts?
Updated by Boone Gorges almost 3 years ago
> I have several folks reaching out to me about moving sites from Qwriting onto the Commons and I have asked them if it is possible to get into the platform to export the site content as an xml file, in the same way one would usually move a wordpress site.
It's not possible to get an XML export of the entire network, and even if you could, it would be so large as to be unworkable. It must be done on a per-site basis.
> 1. "If we were able to secure a WP XML backup from a particular site or sites, would we be able to import that into the Commons, including all assets (jpgs, docs, pdfs etc) that may or may not now have new URL redirects?"
The basics of importing content would work the same, but uploaded files/assets would not be pulled over. The importer tries to fetch them from their public URL, but they no longer have those public URLs because of the DNS changes.
> 2. "In a hypothetical situation where there was a backup of a multi-site install in SQL format, is it difficult to extract specific, individual sites?"
It's moderately difficult, and tedious enough that it wouldn't scale beyond a small number of sites. Part of the difficulty is the sheer size of a SQL export of the entire multisite installation. If I were doing this, I would extract only those tables belonging to the sites that I actually wanted to move, a mitigation that would help with the size issues. (This would take the form of mysqldump calls that pass in specific table names corresponding to the site in question; I'd then script it so that you could do it for multiple sites.) Once we had the site-specific export, we'd then need to do something like the following:
a. Create a new site on the Commons through the normal user flow. This will help ensure proper entries in wp_blogs and also that there's a properly sequential ID assigned to the site.
b. Note the ID of the new site, and its corresponding database prefix (such as wp_10399_ for a site with ID 10399)
c. Run a search and replace on the SQL export to swap the original table prefix for the new one
d. Import into the Commons database, overwriting the tables created in step (a)
e. Use the wp search-replace utility to swap out the old URL for the new URL in the site tables
f. Use rsync or some other mechanism to pull over the upload directory and put it on the Commons
g. At this point you would have a mostly functioning site, but user IDs (as in post_author) would be incorrect. You'd need to map Qwriting users onto Commons users and then have a utility that does the swap-out.
h. There would probably continue to be some issues, because the plugins and themes for the original site may not be present on the Commons. This would require manual cleanup.
Something like this would need to be done for each site. It's possible to build tools that automate some parts of it, but a great deal of manual intervention is going to be required no matter what. Most of the steps outlined here would need to be performed by the Commons team (ie me), and we have very limited resources for helping with large projects like this, so it'd have to take place with a very limited number of sites.
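To make the table-level part of these steps more concrete, here is a minimal sketch of the two pieces that are pure text manipulation: listing one site's tables for mysqldump, and rewriting the table prefix in the resulting dump. The base prefix `wp_`, the source site ID 123, the database name and the function names are all placeholders. Only the standard WordPress per-site core tables are listed, so any plugin tables would have to be added by hand.

```shell
#!/bin/sh
# Sketch only; IDs, prefixes and names below are illustrative assumptions.

# Print the standard per-site core table names for one Multisite site.
site_tables() (
    base_prefix=$1   # e.g. wp_
    site_id=$2       # e.g. 123
    for t in posts postmeta comments commentmeta terms termmeta \
             term_taxonomy term_relationships options links; do
        printf '%s%s_%s\n' "$base_prefix" "$site_id" "$t"
    done
)

# Rewrite backtick-quoted table names in a SQL dump read from stdin.
# Row data is left alone, so prefix-dependent option names such as
# <prefix>user_roles would still need separate handling.
rewrite_prefix() (
    old=$1
    new=$2
    sed "s/\`${old}/\`${new}/g"
)

# Hypothetical usage (database name and paths are placeholders):
#   mysqldump qwriting_db $(site_tables wp_ 123) \
#       | rewrite_prefix wp_123_ wp_10399_ > site-123.sql
#   wp --url=<new Commons site URL> search-replace '<old URL>' '<new URL>'
#   rsync -a <source uploads dir>/ <Commons uploads dir>/
```

The user-ID remapping in step (g) and the plugin/theme cleanup in step (h) are not covered by this sketch and would remain manual.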
A more feasible route might be to ask QC IT to reenable the DNS for a specified window of time, and then during that window, to manually run the XML export-import process. This would allow WP to pull over the attachments via their old URLs.
Updated by Laurie Hurson almost 3 years ago
Hi Boone,
Thanks so much for working through this with me.
Understood re: sql migration being too cumbersome.
QC folks have just learned that "ITS will restore a read only version of our QWriting sites tonight that will be available until February 6."
With a read-only version of the platform, is xml export/import of sites possible?
Updated by Boone Gorges almost 3 years ago
> With a read-only version of the platform, is xml export/import of sites possible?
Depends what "read only" means. If they're blocking access to the WP Dashboard somehow, then it probably won't work, because they won't have the ability to log in to the Dashboard where the Export tool lives. (It could be done via the command line, but it sounds like this is probably not possible for the Qwriting team.)
I'm skeptical that QC ITS knows enough about the internals of WP to block logins, or to block content from being created on the site. So I don't know. I guess the best they can do is wait for it to be restored, and then try logging in to get an export. If you can get the XML file, then importing into a different WP site should work fine, as long as the content (images, etc) is available under its old URL.
Updated by Laurie Hurson almost 3 years ago
Yes, good point re: what the read-only format will mean for exporting. I am meeting with one group next week and will know more then.
In the meantime, another question: I have another group asking about using the servmask plugin. "We’d like to use the ServMask Plug in (https://servmask.com/) with the multi-site extension. We’re willing to purchase the plugin extension, however before we do, we need to know if you would be willing to install the free Plug in and activate it for the new sites so that they can import the packages that we create."
I am not sure why they would use this instead of the built-in import/export if that is available.
I told them we cannot move a large amount of sites into the Commons (i.e. we cannot move all of QWriting into the Commons) and that we cannot set up a separate multisite since the commons is already a multisite itself. I also told them that the Commons cannot import packages of code.
I asked them to clarify their rationale for using servmask. I am betting there are issues with this and that we likely cannot support the plugin but wanted to ask for your insight.
Updated by Boone Gorges almost 3 years ago
For my own reference, it looks like ServMask is the new name (or parent company?) for the plugin All-in-One WP Migration.
When you say that "another group" is asking for this, do you mean another group from QC?
Your response is correct. We can't set up a tool for migrating sites en masse to the Commons. We're not set up to be the destination for mass migrations. If we decide to do more than a small handful of sites, then I can help to work with the users to determine the best technical path forward. But this will be a task and a decision for the Commons technical team; we won't be installing a self-serve plugin for users to handle their own migrations.
I used All-in-One years ago and found that there were problems in Multisite environments. These sorts of tools often make assumptions about available server libraries, available disk space, file permissions, and other architectural issues that run up against idiosyncrasies of the Commons.
Updated by Laurie Hurson almost 3 years ago
I am currently working with two groups from QC - the English department, who I met with today, and a second group who I believe is part of Queens ITS. I am meeting the second group on Wednesday.
The English dept faculty are using the WordPress XML import/export function to move their sites onto the Commons. From my latest email with the ITS people, I believe they are also going to manually move sites over using the XML export/import feature.
I think we are going to have various QC faculty moving sites onto the Commons this week using the xml import/export function. As far as I know, QWriting is getting taken offline on Feb 6th.
Sorry to keep asking questions about this, I just want to make sure I understand what is possible--
On 1/25 you noted:
> The basics of importing content would work the same, but uploaded files/assets would not be pulled over. The importer tries to fetch them from their public URL, but they no longer have those public URLs because of the DNS changes.
Will the XML export/import tool only work as long as Qwriting is accessible and available online? Or does this mean that the assets like PDF and jpegs will only transfer if Qwriting is online?
If the xml files need to be imported before qwriting goes offline, I definitely want to let the qwriting folks know.
I have added two QC folks, Leila Walker (Digital Scholarship Librarian and former QWriting admin) and Kevin Ferguson (English dept and former QWriting admin) to this ticket so they can follow along.
Updated by Boone Gorges almost 3 years ago
> Will the XML export/import tool only work as long as Qwriting is accessible and available online? Or does this mean that the assets like PDF and jpegs will only transfer if Qwriting is online?
There are two steps: export, and then import.
In order to export an XML file, you must be able to access the Qwriting site. It's possible to generate an XML export via the SSH command line, so it's possible that this step could be done after the DNS is switched off. But to do it through the web interface, DNS must be hooked up.
In order to import an XML file, it is not strictly necessary for the Qwriting site to be available. All post content and metadata is contained in the XML file. However, when importing an XML, if you've checked the 'Download attachments' box, the WP importer will attempt to fetch all attached items (images, pdfs, etc) from the source site. In order for this to work, the source site must be available. For this reason, I strongly suggest that all necessary imports happen before QC disconnects the DNS again.
Updated by Laurie Hurson almost 3 years ago
Thanks for this Boone. I have advised the QC folks that importing their xml while QWriting is still online will likely allow them to retain the media from their sites.
Just an update on this whole situation - I expect we will see faculty and staff moving sites over to the Commons over the next week or so. From what I understand, QC ITS is setting up a process where QC faculty and staff can request XML exports from QWriting. It sounds like ITS was open to leaving Qwriting online for a bit longer to allow faculty to move sites+media over, but I don't know if this will pan out.
I am running an intro to the CAC and support session for QC people tomorrow, Friday at 3pm and next Tuesday at 4pm, so there may be more traffic at that time, but I am not going to have everyone import at the same time since I can see this causing issues.
Is there anything else I should be aware of as I continue to provide support for QC? I can provide more of an update tomorrow in the team meeting if that would be helpful.
Updated by Matt Gold almost 3 years ago
You are the best, Laurie. Thank you for all of your amazing work!!!!
Updated by Boone Gorges almost 3 years ago
Thanks for the update, Laurie. I don't think I've got any other advice at the moment. Please do provide an update during tomorrow's call.
Updated by Laurie Hurson almost 3 years ago
Thanks for talking all of this through today.
Kevin Ferguson just sent me this question and I wanted to pass it on here (I think it is somewhat related to what we were discussing earlier today):
Kevin: "Hypothetically, if one had a Pagely backup of the whole multisite (Pagely provides SQL and files), would there be a way to automate exporting individual site XML? Or would it require setting up the whole site locally and individually exporting individual XML?"
Updated by Boone Gorges almost 3 years ago
> Kevin: "Hypothetically, if one had a Pagely backup of the whole multisite (Pagely provides SQL and files), would there be a way to automate exporting individual site XML? Or would it require setting up the whole site locally and individually exporting individual XML?"
No, it can't be automated. The XMLs are built dynamically from info in the database, so you have to have a working WP installation. However, if you have a local installation, it's possible to use the WP-CLI command-line utility to generate the XML files very quickly:
$ wp --url=example.qwriting.qc.cuny.edu export --stdout > ~/example.xml
which could be scripted for a larger number of sites. In fact, this could be done on Pagely's server, with no need for a local installation. It might help to speed up the process somewhat, rather than logging in and using the Dashboard UI.
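A sketch of what scripting this for a list of sites might look like. It uses `wp export --dir=...`, which writes the export file(s) into a directory; the output layout (one directory per host name) and the function name are assumptions for illustration, and the host-name scheme only works for a subdomain-style network such as Qwriting's.

```shell
#!/bin/sh
# Sketch: write one XML (WXR) export per site URL read from stdin.
# $1 = directory under which per-site export directories are created
export_each_site() (
    out_root=$1
    while IFS= read -r site_url; do
        [ -n "$site_url" ] || continue
        # Use the host part of the URL as the directory name.
        host=$(printf '%s\n' "$site_url" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')
        mkdir -p "$out_root/$host"
        # </dev/null keeps wp from reading the remaining URLs on stdin
        wp --url="$site_url" export --dir="$out_root/$host" </dev/null \
            || echo "FAILED: $site_url" >&2
    done
)
```

Given a file with one site URL per line, `export_each_site ~/exports < sites-to-move.txt` would produce the XML files to hand to each site owner. Attachments are still fetched from the source site at import time, so the timing constraint around DNS still applies.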
Updated by Laurie Hurson almost 3 years ago
Thanks for this info Boone, I passed it on to Kevin. Running my second workshop with the QC folks this afternoon and will share any relevant updates/info.