Feature #21383


Feature #21380: Hosting migration

Offload media using S3-Uploads

Added by Boone Gorges about 2 months ago. Updated 4 days ago.

Status: New
Priority: Normal
Assignee:
Category: Server
Target version:
Start date: 2024-11-01
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

As part of our migration to Reclaim, we will be offloading our media files to Amazon S3. This is necessary for cost reasons, as well as for compatibility with load-balancing and other high-availability infrastructure at Reclaim.

Reclaim has requested that we use the following tool from Human Made: https://github.com/humanmade/S3-Uploads

Our first task is to gauge compatibility between this tool and the various parts of the Commons. As a starting place, here's a list of concerns:

1. We currently have a custom tool that uses a dynamically generated .htaccess file to protect files uploaded to a private site. See https://github.com/cuny-academic-commons/cac/blob/2.4.x/wp-content/mu-plugins/cac-file-protection.php. We've got to determine whether this will continue to be compatible with S3-Uploads. My initial guess is that it won't, since S3-Uploads filters attachment URLs. Relatedly, S3-Uploads allows uploaded files to be "private": https://github.com/humanmade/S3-Uploads?tab=readme-ov-file#private-uploads. I don't fully understand what this does, so we'll have to research whether it accomplishes something similar, and if so, how we'd migrate to it.

2. We allow file uploads of several types that aren't related to the WP Media Library. On the primary site, this includes user avatars, group avatars, forum attachments, and buddypress-group-documents. On secondary sites, it might include various plugins that use a non-standard technique for accepting uploads (see eg Gravity Forms). We need to figure out what S3-Uploads means for all of these. It's possible that S3-Uploads won't interfere with them at all - ie, files will continue to be uploaded to and served from the web server. If so, we'll have to determine whether this is OK (in terms of performance, backups, cost, etc). The answer may differ depending on file type: I can imagine, for example, that it'd be OK to keep avatars on the web server, but that we'd be more motivated to move (potentially much larger) buddypress-group-documents to S3.

3. Reclaim has suggested that our team may want to roll out S3-Uploads integration before we do our final migration. There are a couple of reasons to like this idea: it reduces the number of moving parts on migration day, and it gives us plenty of lead time to upload existing files (1+ TB worth) to S3 well in advance of the launch date. Our team needs to decide whether this is feasible, and if so, when it will happen. Reclaim is serving as our vendor for AWS (ie we're paying Reclaim, and they're paying AWS), so we would need Reclaim to help us configure our bucket(s) in order for us to move forward with this.

Ray and Jeremy, I've never run a large site with S3-offloaded content, and I've definitely never run a migration of an existing site. Have you? It would be great to get your impression of the project, and your warnings about potential problems that I haven't discussed above.


Related issues

Related to CUNY Academic Commons - Bug #21483: CV cover/profile image URL generation not compatible with S3-Uploads (New; Jeremy Felt; 2024-11-13)

Related to CUNY Academic Commons - Bug #21666: Forum attachment URLs in group library are not filtered by s3-uploads (Rejected; Boone Gorges; 2024-12-18)

#1

Updated by Raymond Hoh about 2 months ago

Ray and Jeremy, I've never run a large site with S3-offloaded content, and I've definitely never run a migration of an existing site. Have you?

I've worked on a multisite network that uses S3 Uploads. We maintain a fork that addresses a few issues. See
https://github.com/humanmade/S3-Uploads/compare/master...hwdsb:S3-Uploads:hwdsb-mods.

For the Commons, the relevant items are the following:
- Fixes an issue with some older multisite URLs that use /blogs.dir/ in their uploads directory: https://github.com/WordPress/WordPress/blob/master/wp-includes/ms-default-constants.php#L31. The /blogs.dir/ uploads directory applies to sites that existed before WordPress MU was merged into WordPress core, and S3 Uploads does not take it into account. We probably have older sites that use /blogs.dir/ for their uploads directory, so this issue would apply to us too. I forget the particulars, but I'm noting it so we're aware of it.
- Incompatibility with Gravity Forms. As you mentioned in point 2, there is an issue here with some non-standard plugins. I didn't look too far into the actual issue with Gravity Forms; I'm doing a quick-and-dirty bail fix in the s3-uploads fork.
- Image URLs for themes that use a custom header image or background image also needed to be rewritten via filters, since those URLs are written into the DB as theme mods and were referencing the local URLs instead of the S3 URLs. (A rough sketch of this kind of filter follows this list.)
- On that site, BuddyPress avatar uploads are still served locally.
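
To make the theme-mod rewriting concrete, here's a minimal sketch of what such a filter can look like. This is not the fork's actual code, and both base URLs are hypothetical placeholders:

    // A minimal sketch, not the fork's actual code. Both base URLs below
    // are hypothetical placeholders.
    $cac_rewrite_theme_mod_url = function ( $url ) {
        if ( is_string( $url ) && $url ) {
            $url = str_replace(
                'https://commons.gc.cuny.edu/wp-content/uploads/', // local base URL (placeholder)
                'https://files.commons.gc.cuny.edu/uploads/',      // offloaded base URL (placeholder)
                $url
            );
        }
        return $url;
    };

    // 'theme_mod_{$name}' is a standard WordPress dynamic filter.
    add_filter( 'theme_mod_header_image', $cac_rewrite_theme_mod_url );
    add_filter( 'theme_mod_background_image', $cac_rewrite_theme_mod_url );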

Reclaim is serving as our vendor for AWS (ie we're paying Reclaim, and they're paying AWS), so we would need Reclaim to help us configure our bucket(s) in order for us to move forward with this.

Do we have an estimate on the potential cost of using AWS? This could be in the thousands of dollars per year.

#2

Updated by Boone Gorges about 2 months ago

Ray, thanks so much for this!

- Fixes an issue with some older multisite URLs that use /blogs.dir/ in their uploads directory

Do you think Human Made would accept a PR for this? Seems like a general problem.

Do we have an estimate on the potential cost of using AWS? This could be in the thousands of dollars per year.

It's rolled into the top-line number that Reclaim is charging us. This is by design: I didn't want our team to be responsible for covering these variable costs, not to mention the overhead associated with configuring, maintaining, troubleshooting, etc.

#3

Updated by Boone Gorges about 1 month ago

Reclaim has asked what we'd like to use as the rewrite domain for S3-stored uploads. By default, S3 URLs are long and unwieldy, but we can rewrite them as something like files.commons.gc.cuny.edu/sites/1234/2024/01/foo.jpg. Are we OK with using files.commons.gc.cuny.edu for this purpose, or is there a better idea floating out there? We should decide this soon, because it'll require a DNS change at the Graduate Center, and I would like to be able to include this ask in our initial round of communication with GC IT.
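
If I'm reading S3-Uploads correctly, the rewrite domain gets wired up via its S3_UPLOADS_BUCKET_URL constant, so the config would look something like the following sketch (the bucket name and region are placeholders):

    // In wp-config.php. Bucket name and region are placeholders; the last
    // constant is what controls the rewritten URL base.
    define( 'S3_UPLOADS_BUCKET', 'example-commons-bucket' );
    define( 'S3_UPLOADS_REGION', 'us-east-1' );
    define( 'S3_UPLOADS_BUCKET_URL', 'https://files.commons.gc.cuny.edu' );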

#4

Updated by Boone Gorges about 1 month ago

S3-Uploads is now running in the Reclaim dev environment.

At first, I tried running Ray's fork. But this caused a couple of problems. First, it used an old copy of the AWS SDK, which made it difficult to debug. Second, the blogs.dir fixes didn't work right; from my reading, they assume that the webserver kept the upload files in blogs.dir, but that on S3 they'd live under the /uploads/sites/ prefix. I don't think this is necessary: we can simply upload the blogs.dir and uploads directories separately, and everything appears to work correctly with the latest S3-Uploads, out of the box.

Basic media uploads, as well as BP avatars, appear to be working properly. I'll be assembling a list in the upcoming days of all upload types I can think of, and then I'll run some preliminary tests on each. Based on those tests, we can decide whether it's necessary and/or desirable to filter non-standard upload types to S3.

As a reminder, in today's dev call we discussed point 3 above and tentatively decided to do the following:
a. Sometime in the upcoming weeks, get the production S3 bucket set up
b. Modify the S3-Uploads plugin so that it can upload existing content to the bucket, and also sync new uploads. Activate that modified plugin on the legacy production site.
c. Run the upload-directory tool in the s3-uploads plugin. Start with small amounts of content to ensure it doesn't crash the production site.
d. When the Reclaim production environment is created and we're near switchover time, hook it up to the pre-filled S3 bucket.

I'll begin working on the necessary mods to the S3-Uploads plugin, and I'll share them here for review before we deploy them to the legacy site.

#5

Updated by Raymond Hoh about 1 month ago

At first, I tried running Ray's fork. But this caused a couple of problems. First, it used an old copy of the AWS SDK, which made it difficult to debug. Second, the blogs.dir fixes didn't work right; from my reading, they assume that the webserver kept the upload files in blogs.dir, but that on S3 they'd live under the /uploads/sites/ prefix. I don't think this is necessary: we can simply upload the blogs.dir and uploads directories separately, and everything appears to work correctly with the latest S3-Uploads, out of the box.

I think the other complication with the site I was working on was that it was migrating from another S3 storage plugin, WP Offload Media, over to Human Made's S3 Uploads, and it was using an existing S3 bucket rather than starting from scratch. I'm glad that this isn't a problem with our install!

Basic media uploads, as well as BP avatars, appear to be working properly.

Glad that this is working as well!

#6

Updated by Boone Gorges about 1 month ago

  • Related to Bug #21483: CV cover/profile image URL generation not compatible with S3-Uploads added
#7

Updated by Boone Gorges 18 days ago

I've got the ACL tooling working on the Reclaim dev server. Here's the s3-uploads modification plugin I built: https://github.com/cuny-academic-commons/cac/blob/reclaim-migration/wp-content/mu-plugins/s3-uploads.php

For the sake of posterity, I'm going to outline some of the work I had to do.

1. TIL that S3 bucket policies are read before object-level ACLs. Reclaim had originally set up the bucket using a policy that allowed read access to everything in it; as a result, my ACL rules were being ignored. I changed the policy to https://gist.github.com/boonebgorges/bad405c0f41bf3c6dc62d2003437fef3, which means that (a) items with a public ACL are visible to anyone, and (b) items with a private ACL are not visible to anyone. Signed S3 URLs are essentially "authenticated", which means that the bucket policy is bypassed. This is the mechanism that allows private files to be served: S3-Uploads provides signed URLs for private objects when WP renders the page.
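
For anyone unfamiliar with the mechanism, here's a rough sketch of how a signed URL is generated with the AWS SDK for PHP (S3-Uploads does the equivalent internally; the bucket, key, and region are placeholders):

    // Sketch: generating a time-limited signed URL with the AWS SDK for PHP v3.
    $s3 = new Aws\S3\S3Client( [
        'region'  => 'us-east-1', // placeholder
        'version' => 'latest',
    ] );

    $command = $s3->getCommand( 'GetObject', [
        'Bucket' => 'example-commons-bucket',             // placeholder
        'Key'    => 'uploads/sites/1234/2024/01/foo.jpg', // placeholder
    ] );

    // The signature authenticates the request, so the object is readable
    // even though its ACL is private.
    $signed_url = (string) $s3->createPresignedRequest( $command, '+15 minutes' )->getUri();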

2. The first method https://github.com/cuny-academic-commons/cac/blob/206b834879405d6f62e65761fc5b9ea4c61c4ade/wp-content/mu-plugins/s3-uploads.php#L7-L25 forces attachments to be "private" (from the POV of s3-uploads) whenever the current site has blog_public < 0. In the future, we may decide to introduce other situations where items are made private, using this mechanism.
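
In rough paraphrase, the logic looks like the following. The filter name below is illustrative rather than definitive; see the linked file for the hook that's actually used:

    // Rough paraphrase of the linked method. The filter name is an
    // assumption; see the mu-plugin linked above for the real hook.
    add_filter( 's3_uploads_is_attachment_private', function ( $is_private ) {
        // blog_public < 0 covers our non-public site visibility levels.
        if ( (int) get_option( 'blog_public' ) < 0 ) {
            $is_private = true;
        }
        return $is_private;
    } );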

3. The plugin natively handles ACL setting at the time of upload. My secondary plugin handles switching a site's attachments' ACL when blog_public is changed. See https://github.com/cuny-academic-commons/cac/blob/206b834879405d6f62e65761fc5b9ea4c61c4ade/wp-content/mu-plugins/s3-uploads.php#L127-L229. ACL changes must be performed not just for each attachment, but also for each image size. This can result in a huge number of requests and a lot of latency. So I put it into an async process, which runs 10 attachments in a batch on a self-refreshing cron job. It's pretty rudimentary but I think it's probably good enough for now.
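
The async process has roughly the following shape. This is a simplified sketch: the hook name, the queue option, and the helper function are stand-ins for what's in the linked code:

    // Simplified sketch of the self-refreshing batch. The hook name, queue
    // option, and helper function are stand-ins for the real implementation.
    add_action( 'cac_s3_acl_batch', function () {
        $queue = get_option( 'cac_s3_acl_queue', [] );
        $batch = array_splice( $queue, 0, 10 ); // 10 attachments per run.

        foreach ( $batch as $attachment_id ) {
            // Must touch the original file plus every generated image size.
            cac_set_attachment_acl( $attachment_id );
        }

        update_option( 'cac_s3_acl_queue', $queue );

        // Keep rescheduling until the queue is drained.
        if ( ! empty( $queue ) ) {
            wp_schedule_single_event( time() + 30, 'cac_s3_acl_batch' );
        }
    } );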

4. I ran into a bunch of problems related to SSL certs. The root issue is that Reclaim created our dev bucket with the name files.cunyac.reclaimhosting.dev, which contains periods. When S3 assembles its default URL structure using the [bucket].s3.amazonaws.com format, the result is files.cunyac.reclaimhosting.dev.s3.amazonaws.com, and those extra dots mean the hostname isn't covered by AWS's wildcard *.s3.amazonaws.com SSL cert, so requests fail in a bunch of different ways. I worked around this using a few filters https://github.com/cuny-academic-commons/cac/blob/206b834879405d6f62e65761fc5b9ea4c61c4ade/wp-content/mu-plugins/s3-uploads.php#L27-L125, and also with a modification in the plugin itself (in get_s3_location_for_url()):

        // CAC mod: We are using "path" mode rather than virtual-hosted-style URLs.
        if ( strpos( $url, 'https://s3.amazonaws.com/' ) === 0 ) {
            $parsed = wp_parse_url( $url );

            // Strip the leading "/{bucket}/" from the path to recover the object key.
            $key = str_replace( '/' . $this->get_s3_bucket() . '/', '', $parsed['path'] );

            return [
                'bucket' => $this->get_s3_bucket(),
                'key'    => $key,
                'query'  => $parsed['query'] ?? null,
            ];
        }

It's likely that at least some of these SSL issues will go away when we move to the production URL, which is masked by files.commons.gc.cuny.edu and has a valid cert issued by Reclaim. We'll see once everything is in place. IMO some of this could be handled more gracefully by the S3-Uploads plugin, and I might work up a PR or two for it, but it could also be the case that our current dev setup is so wonky and non-standard that upstream fixes aren't worth making. Again, this will probably be clearer after we've moved to the production environment.

I've done some light testing on the Reclaim dev site and things are working as expected. My next step is to move forward with the stub version of the plugin that we can run on the legacy site to (a) push all existing media to the production bucket, and (b) sync media to S3 that's uploaded before the switchover. As I build this tool, I'll need to verify these ACL issues. I have a feeling that the bulk mechanism built into the plugin (wp s3-uploads upload-directory) doesn't pay any attention to ACL/private attachments. So I'll need to modify that tool, or build my own batch processor that runs after the fact (by looping through all non-public sites on the network, or something like that).
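
For what it's worth, the site loop for that after-the-fact processor would be straightforward, something like the sketch below. The visibility levels and the helper function are placeholders:

    // Sketch of the after-the-fact batch idea. get_sites() only supports
    // exact matches on the 'public' column, so each non-public level is
    // queried separately. Levels and helper are placeholders.
    foreach ( [ -1, -2, -3 ] as $level ) {
        $sites = get_sites( [ 'public' => $level, 'number' => 0 ] );
        foreach ( $sites as $site ) {
            cac_mark_site_attachments_private( (int) $site->blog_id );
        }
    }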

#8

Updated by Boone Gorges 16 days ago

Another update.

After some research and consideration, I've decided to modify the approach to handling existing content on the legacy site prior to migration. Recall that I'd originally suggested running a stub plugin on the legacy site, which would (a) upload existing content to S3, and (b) sync newly uploaded content to S3 until launch. I ran into a couple of problems with this:

1. The S3-Uploads plugin has some internal requirements that need PHP 8.1+. Working around this seemed like a big hassle.
2. S3-Uploads works by swapping out basedir/baseurl in wp_upload_dir, which means that any plugin that handles uploaded content using these paths automatically has its objects pushed to S3. In contrast, I was hoping to build a mirror tool: one that allows files to be uploaded to their regular place in the filesystem and also synced to S3. The mirror technique would have required hooking into WP somewhere, maybe on the 'wp_handle_upload' filter or something similar (a rough sketch of this rejected approach follows this list). That's less general than the default wp_upload_dir behavior, since plugins are likely building paths using wp_upload_dir and then putting files there using WP_Filesystem or other methods that never fire wp_handle_upload or related hooks. Any uploads of that kind would have been missed by the mirroring tool.
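
For reference, the rejected mirror approach would have looked something like this (wp_handle_upload is a core WordPress filter; the push helper is hypothetical):

    // Sketch of the rejected mirror approach. 'wp_handle_upload' is a core
    // WordPress filter; cac_push_file_to_s3() is a hypothetical helper.
    add_filter( 'wp_handle_upload', function ( $upload, $context ) {
        // $upload['file'] is the absolute local path of the file just written;
        // the file stays on disk, and a copy is pushed to S3.
        cac_push_file_to_s3( $upload['file'] );
        return $upload;
    }, 10, 2 );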

My original goal was to avoid doing two entire uploads to S3, which is wasteful and very time-consuming. As a compromise, I've built a differential upload tool. It uses the S3 SDK directly, so it doesn't face the same PHP requirements as S3-Uploads. And it works around the wastefulness/slowness of a double upload by checking whether each file already exists before uploading it, something that S3-Uploads's tools don't do. (A rough sketch of the check appears after the list below.) With that in mind, here's what I propose:

1. Install my tool on the legacy Commons site. Here's the repo https://github.com/cuny-academic-commons/s3-diff-sync and here's the specific tool https://github.com/cuny-academic-commons/s3-diff-sync/blob/main/s3-diff-sync.php
2. Run s3-diff-sync.php to put existing content into the production S3 bucket
3. When the site is shut down for migration, run s3-diff-sync.php again to upload missing files. It will still take a while, but much less time than a full re-upload.
4. Then, run the handle-acl-for-all-sites.php tool https://github.com/cuny-academic-commons/s3-diff-sync/blob/main/handle-acl-for-all-sites.php, which uses S3-Uploads (OK since we'll be in Reclaim and PHP 8.1+) to set 'private' ACL on all sites that need it
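
For posterity, the heart of the differential check is roughly the following. This is a sketch rather than the actual tool; the bucket, prefix, and local directory are placeholders:

    // Sketch of the differential check using the AWS SDK for PHP v3.
    // $bucket, $prefix, and $local_dir are placeholders.
    $s3 = new Aws\S3\S3Client( [ 'region' => 'us-east-1', 'version' => 'latest' ] );

    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $local_dir, FilesystemIterator::SKIP_DOTS )
    );

    foreach ( $files as $file ) {
        $key = $prefix . '/' . $files->getSubPathname();

        // Skip anything already in the bucket; this is what makes re-runs fast.
        if ( $s3->doesObjectExist( $bucket, $key ) ) {
            continue;
        }

        $s3->putObject( [
            'Bucket'     => $bucket,
            'Key'        => $key,
            'SourceFile' => $file->getPathname(),
        ] );
    }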

I think this is a fair compromise and should be relatively error-proof. Ray and Jeremy, I'd be glad if you could have a brief look and let me know whether the plan seems OK.

NB that the s3-diff-sync.php tool is not yet complete. It handles blogs.dir, but I still need to add support for wp-content/uploads/ (this is trivial, just not done yet). We also have to handle bp-group-documents, which will take slightly more finagling, since (a) all of those files are "private", in the sense that they're always served via PHP rather than by the webserver, and (b) they currently live outside the web root, which won't work with the way S3-Uploads builds S3 prefixes/object keys. So that will have to change - see https://github.com/cuny-academic-commons/cac/blob/c68628ef8d632abfcf5346bfbd184498aa54169d/wp-content/plugins/cac-bp-custom-includes/group-documents.php#L11. More on this in the upcoming days.

#9

Updated by Boone Gorges 4 days ago

As noted on yesterday's dev call, the initial population of S3 from production took a very long time. I started it on the afternoon of Friday, Dec 13. By the afternoon of Tuesday, Dec 17, it had reached roughly wp-content/blogs.dir/840 (alphabetical order, so it still had to get through many of the sites beginning with 8 and all of the sites beginning with 9). I had to kill the upload routine due to other debugging on the site, but this gave me an opportunity to test the differential uploader on restart. The good news is that it's quite fast: I started this run at about 7am CST on 2024-12-18, and about two hours later it has already caught up to where it left off yesterday, having uploaded about 4,000 files and skipped hundreds of thousands of existing ones. This means the differential upload won't be a bottleneck during our migration shutdown.

#10

Updated by Matt Gold 4 days ago

So glad to hear this!

#11

Updated by Boone Gorges 4 days ago

  • Related to Bug #21666: Forum attachment URLs in group library are not filtered by s3-uploads added
