Feature #21383

open

Feature #21380: Hosting migration

Offload media using S3-Uploads

Added by Boone Gorges 20 days ago. Updated 9 days ago.

Status: New
Priority name: Normal
Assignee:
Category name: Server
Target version:
Start date: 2024-11-01
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

As part of our migration to Reclaim, we will be offloading our media files to Amazon S3. This is necessary for cost reasons, as well as for compatibility with load-balancing and other high-availability infrastructure at Reclaim.

Reclaim has requested that we use the following tool from Human Made: https://github.com/humanmade/S3-Uploads

Our first task is to gauge compatibility between this tool and the various parts of the Commons. As a starting place, here's a list of concerns:

1. We currently have a custom tool that uses a dynamically-generated .htaccess file to protect files uploaded to a private site. See https://github.com/cuny-academic-commons/cac/blob/2.4.x/wp-content/mu-plugins/cac-file-protection.php. We've got to determine whether this will continue to be compatible with S3-Uploads. My initial guess is that it won't, since S3-Uploads filters attachment URLs. Relatedly, S3-Uploads allows uploaded files to be "private" (see https://github.com/humanmade/S3-Uploads?tab=readme-ov-file#private-uploads). I don't really understand what this does, so we'll have to research whether it accomplishes something similar, and if so, how we would migrate to it.

2. We allow file uploads of several types that aren't related to the WP Media Library. On the primary site, this includes user avatars, group avatars, forum attachments, buddypress-group-documents. On secondary sites, it might include various plugins that use a non-standard technique for accepting uploads (see eg Gravity Forms). We need to figure out what S3-Uploads means for all of these. It's possible that S3-Uploads won't interfere with them at all - ie, files will continue to be uploaded to and served from the web server. If so, we'll have to determine whether this is OK (in terms of performance, backups, cost, etc). The answer may differ depending on file type: I can imagine, for example, that it'd be OK to keep avatars on the web server, but that we'd be more motivated to move (potentially much larger) buddypress-group-documents to S3.

3. Reclaim has suggested that our team may want to roll out S3-Uploads integration before we do our final migration. There are a couple of reasons to like this idea: it reduces the number of moving parts on migration day, and it gives us plenty of lead time to upload existing files (1+ TB worth) to S3 well in advance of the launch date. Our team needs to decide whether this is feasible, and if so, when it will happen. Reclaim is serving as our vendor for AWS (ie we're paying Reclaim, and they're paying AWS), so we would need Reclaim to help us configure our bucket(s) in order for us to move forward with this.
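
For point 2, one way to judge which upload types are worth moving is to measure how much disk space each one occupies. The following is a minimal Python sketch, assuming each upload type lives in its own top-level directory under a single uploads root. The function name and that layout are illustrative assumptions, not taken from the Commons codebase or from S3-Uploads.

```python
import os
from collections import defaultdict


def size_by_top_level_dir(uploads_root):
    """Return {top-level subdirectory name: total bytes} under uploads_root.

    Files that sit directly in uploads_root are grouped under ".".
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(uploads_root):
        rel = os.path.relpath(dirpath, uploads_root)
        top = rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so nothing is counted twice
            totals[top] += os.path.getsize(path)
    return dict(totals)
```

Sorting the result by size would show at a glance whether, say, avatars are small enough to leave on the web server while larger document directories are better candidates for S3.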

Ray and Jeremy, I've never run a large site with S3-offloaded content, and I've definitely never run a migration of an existing site. Have you? It would be great to get your impression of the project, and your warnings about potential problems that I haven't discussed above.


Related issues

Related to CUNY Academic Commons - Bug #21483: CV cover/profile image URL generation not compatible with S3-Uploads | New | Jeremy Felt | 2024-11-13

Actions #1

Updated by Raymond Hoh 17 days ago

> Ray and Jeremy, I've never run a large site with S3-offloaded content, and I've definitely never run a migration of an existing site. Have you?

I've worked on a multisite site that uses S3 Uploads. We maintain a fork that addresses a few issues. See
https://github.com/humanmade/S3-Uploads/compare/master...hwdsb:S3-Uploads:hwdsb-mods.

For the Commons, this would namely include the following:
- Fixes an issue with some older multisite URLs that use /blogs.dir/ in their uploads directory: https://github.com/WordPress/WordPress/blob/master/wp-includes/ms-default-constants.php#L31. The /blogs.dir/ uploads directory would apply for sites that existed before WordPress MU was merged into WordPress Core. S3 Uploads does not take this into account. We probably have older sites that use /blogs.dir/ for their uploads directory as well so this issue would apply to us as well. I forget the particulars of this issue, but I note this just so we are aware.
- Incompatibility with Gravity Forms. As you mentioned in point 2, there is an issue here with some non-standard plugins. I didn't look too far into the actual issue with Gravity Forms. I'm doing a dirty bail fix in the s3-uploads fork.
- Image URLs also needed to be rewritten via filters for themes using a custom header image or background image, since these URLs are stored in the DB as theme mods and were referencing the local URLs instead of the S3 URLs.
- Also for this site, BuddyPress avatar uploads continue to be served locally.
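
The theme-mod rewrite mentioned in the list above amounts to a prefix swap on stored URLs. The sketch below illustrates the idea in Python; it is not the fork's actual PHP filter. The base URLs in any usage are hypothetical, and the function names are made up for illustration. `header_image` and `background_image` are the WordPress theme mod names for those two settings.

```python
def rewrite_upload_url(url, local_base, remote_base):
    """Swap the local uploads prefix for the remote one; leave other URLs alone."""
    local_base = local_base.rstrip("/") + "/"
    remote_base = remote_base.rstrip("/") + "/"
    if url.startswith(local_base):
        return remote_base + url[len(local_base):]
    return url


def rewrite_theme_mods(mods, local_base, remote_base,
                       keys=("header_image", "background_image")):
    """Return a copy of mods with the listed image URL entries rewritten."""
    rewritten = dict(mods)
    for key in keys:
        value = rewritten.get(key)
        if isinstance(value, str):
            rewritten[key] = rewrite_upload_url(value, local_base, remote_base)
    return rewritten
```

Rewriting at read time, as a filter would, leaves the stored values untouched, so the change is easy to back out.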

> Reclaim is serving as our vendor for AWS (ie we're paying Reclaim, and they're paying AWS), so we would need Reclaim to help us configure our bucket(s) in order for us to move forward with this.

Do we have an estimate on the potential cost of using AWS? This could be in the thousands of dollars per year.

Actions #2

Updated by Boone Gorges 17 days ago

Ray, thanks so much for this!

> - Fixes an issue with some older multisite URLs that use /blogs.dir/ in their uploads directory

Do you think Human Made would accept a PR for this? Seems like a general problem.

> Do we have an estimate on the potential cost of using AWS? This could be in the thousands of dollars per year.

It's rolled into the top-line number that Reclaim is charging us. This is by design: I didn't want our team to be responsible for covering these variable costs, not to mention the overhead associated with configuring, maintaining, troubleshooting, etc.

Actions #3

Updated by Boone Gorges 10 days ago

Reclaim has asked what we'd like to use as the rewrite domain for S3-stored uploads. By default, S3 URLs are long and unwieldy, but we can rewrite them as something like files.commons.gc.cuny.edu/sites/1234/2024/01/foo.jpg. Are we OK with using files.commons.gc.cuny.edu for this purpose, or is there a better idea floating out there? We should decide this soon, because it'll require a DNS change at the Graduate Center, and I would like to be able to include this ask in our initial round of communication with GC IT.
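
To show the difference a rewrite domain makes, here is a minimal Python sketch that builds both forms of URL for the same object. The bucket name, region, and the assumption that object keys begin with `uploads/` are hypothetical, since the real bucket configuration is up to Reclaim. The default form follows Amazon's virtual-hosted-style pattern.

```python
from urllib.parse import quote


def default_s3_url(bucket, region, key):
    """Virtual-hosted-style S3 URL for an object key."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def rewritten_url(domain, key, strip_prefix="uploads/"):
    """URL on the rewrite domain, dropping an assumed key prefix if present."""
    path = key[len(strip_prefix):] if key.startswith(strip_prefix) else key
    return f"https://{domain}/{quote(path)}"


key = "uploads/sites/1234/2024/01/foo.jpg"  # hypothetical object key
print(default_s3_url("example-bucket", "us-east-1", key))
print(rewritten_url("files.commons.gc.cuny.edu", key))
```

With these placeholder values, the second line printed matches the files.commons.gc.cuny.edu/sites/1234/2024/01/foo.jpg form proposed above.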

Actions #4

Updated by Boone Gorges 9 days ago

S3-Uploads is now running in the Reclaim dev environment.

At first, I tried running Ray's fork. But this caused a couple problems. First, it used an old copy of the AWS SDK, which made it difficult to debug. Second, the blogs.dir fixes didn't work right; from my reading, they assume that the webserver kept the upload files in blogs.dir, but that on S3 they'd be in the /uploads/sites/ bucket. I don't think it's necessary to do this: we can simply upload the blogs.dir and the uploads buckets separately, and everything appears to work correctly with the latest S3-Uploads, out of the box.
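
One reading of "upload them separately" is that each file's S3 key mirrors its path relative to wp-content, so legacy blogs.dir files and newer uploads/sites files keep their own trees. The Python sketch below illustrates that reading only. The key layout, the wp-content location, and the function name are assumptions, not something confirmed by S3-Uploads or by the dev environment setup.

```python
from pathlib import PurePosixPath

# Top-level trees under wp-content that hold uploaded files.
UPLOAD_TREES = ("uploads", "blogs.dir")


def s3_key_for(local_path, wp_content="/var/www/wp-content"):
    """Map a local upload path to an S3 key that mirrors the local layout.

    Raises ValueError for paths outside wp_content or outside the upload trees.
    """
    rel = PurePosixPath(local_path).relative_to(wp_content)
    if not rel.parts or rel.parts[0] not in UPLOAD_TREES:
        raise ValueError(f"not an upload path: {local_path}")
    return str(rel)
```

Under this assumption no path translation between the two layouts is needed, which fits the observation that the unmodified plugin appears to handle both.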

Basic media uploads, as well as BP avatars, appear to be working properly. I'll be assembling a list in the upcoming days of all upload types I can think of, and then I'll run some preliminary tests on each. Based on those tests, we can decide whether it's necessary and/or desirable to filter non-standard upload types to S3.

As a reminder, in today's dev call we discussed point 3 above and tentatively decided to do the following:
a. Sometime in the upcoming weeks, get the production S3 bucket set up
b. Modify the S3-Uploads plugin so that it can upload existing content to the bucket, and also sync new uploads. Activate that modified plugin on the legacy production site.
c. Run the upload-directory tool in the s3-uploads plugin. Start with small amounts of content to ensure it doesn't crash the production site.
d. When the Reclaim production environment is created and we're near switchover time, hook it up to the pre-filled S3 bucket.
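
For step c, "small amounts of content" could mean grouping files into size-capped batches and uploading one batch at a time while watching server load. The sketch below shows only the batching logic in Python. It is independent of the s3-uploads upload-directory tool, and the function name and the size cap are hypothetical.

```python
def batch_by_size(files, max_bytes):
    """Group (path, size) pairs into lists of paths totalling at most max_bytes.

    A single file larger than max_bytes is placed in a batch of its own.
    """
    batch, total = [], 0
    for path, size in files:
        if batch and total + size > max_bytes:
            yield batch
            batch, total = [], 0
        batch.append(path)
        total += size
    if batch:
        yield batch
```

Starting with a low cap and raising it once the first few batches complete without trouble would follow the cautious approach described in step c.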

I'll begin working on the necessary mods to the S3-Uploads plugin, and I'll share them here for review before we deploy them to the legacy site.

Actions #5

Updated by Raymond Hoh 9 days ago

> At first, I tried running Ray's fork. But this caused a couple problems. First, it used an old copy of the AWS SDK, which made it difficult to debug. Second, the blogs.dir fixes didn't work right; from my reading, they assume that the webserver kept the upload files in blogs.dir, but that on S3 they'd be in the /uploads/sites/ bucket. I don't think it's necessary to do this: we can simply upload the blogs.dir and the uploads buckets separately, and everything appears to work correctly with the latest S3-Uploads, out of the box.

I think the other issue with the site I was working on is they were migrating from another S3 storage plugin, WP Offload Media, over to Human Made's S3 Uploads and they were using an existing S3 bucket rather than starting from scratch. I'm glad that this isn't a problem with our install!

> Basic media uploads, as well as BP avatars, appear to be working properly.

Glad that this is working as well!

Actions #6

Updated by Boone Gorges 8 days ago

  • Related to Bug #21483: CV cover/profile image URL generation not compatible with S3-Uploads added