Feature #21383


Feature #21380: Hosting migration

Offload media using S3-Uploads

Added by Boone Gorges 2 months ago. Updated 10 days ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Server
Target version:
Start date:
2024-11-01
Due date:
% Done:

0%

Estimated time:
Deployment actions:

Description

As part of our migration to Reclaim, we will be offloading our media files to Amazon S3. This is necessary for cost reasons, as well as for compatibility with load-balancing and other high-availability infrastructure at Reclaim.

Reclaim has requested that we use the following tool from Human Made: https://github.com/humanmade/S3-Uploads

Our first task is to gauge compatibility between this tool and the various parts of the Commons. As a starting place, here's a list of concerns:

1. We currently have a custom tool that uses a dynamically-generated .htaccess file to protect files uploaded to a private site. See https://github.com/cuny-academic-commons/cac/blob/2.4.x/wp-content/mu-plugins/cac-file-protection.php. We've got to determine whether this will continue to be compatible with S3-Uploads. My initial guess is that it won't, since S3-Uploads filters attachment URLs. Relatedly, S3-Uploads allows uploaded files to be "private": https://github.com/humanmade/S3-Uploads?tab=readme-ov-file#private-uploads. I don't really understand what this does, so we'll have to research whether it accomplishes something similar, and if so, how we migrate to it.

2. We allow file uploads of several types that aren't related to the WP Media Library. On the primary site, this includes user avatars, group avatars, forum attachments, and buddypress-group-documents. On secondary sites, it might include various plugins that use a non-standard technique for accepting uploads (see, e.g., Gravity Forms). We need to figure out what S3-Uploads means for all of these. It's possible that S3-Uploads won't interfere with them at all - ie, files will continue to be uploaded to and served from the web server. If so, we'll have to determine whether this is OK (in terms of performance, backups, cost, etc). The answer may differ depending on file type: I can imagine, for example, that it'd be OK to keep avatars on the web server, but that we'd be more motivated to move (potentially much larger) buddypress-group-documents to S3.

3. Reclaim has suggested that our team may want to roll out S3-Uploads integration before we do our final migration. There are a couple of reasons to like this idea: it reduces the number of moving parts on migration day, and it gives us plenty of lead time to upload existing files (1+ TB worth) to S3 well in advance of the launch date. Our team needs to decide whether this is feasible, and if so, when it will happen. Reclaim is serving as our vendor for AWS (ie we're paying Reclaim, and they're paying AWS), so we would need Reclaim to help us configure our bucket(s) in order for us to move forward with this.

Ray and Jeremy, I've never run a large site with S3-offloaded content, and I've definitely never run a migration of an existing site. Have you? It would be great to get your impression of the project, and your warnings about potential problems that I haven't discussed above.


Files

s3-uploads.diff (1.13 KB), Boone Gorges, 2025-01-03 01:34 PM

Related issues

Related to CUNY Academic Commons - Bug #21483: CV cover/profile image URL generation not compatible with S3-Uploads (Resolved, Jeremy Felt, 2024-11-13)

Related to CUNY Academic Commons - Bug #21666: Forum attachment URLs in group library are not filtered by s3-uploads (Rejected, Boone Gorges, 2024-12-18)

Actions #1

Updated by Raymond Hoh 2 months ago

Ray and Jeremy, I've never run a large site with S3-offloaded content, and I've definitely never run a migration of an existing site. Have you?

I've worked on a multisite site that uses S3 Uploads. We maintain a fork that addresses a few issues. See
https://github.com/humanmade/S3-Uploads/compare/master...hwdsb:S3-Uploads:hwdsb-mods.

For the Commons, the most relevant changes are:
- A fix for older multisite URLs that use /blogs.dir/ in their uploads directory: https://github.com/WordPress/WordPress/blob/master/wp-includes/ms-default-constants.php#L31. The /blogs.dir/ uploads directory applies to sites that existed before WordPress MU was merged into WordPress Core, and S3 Uploads does not take it into account. We probably have older sites that use /blogs.dir/ for their uploads directory, so this issue would apply to us. I forget the particulars, but I note it here so we are aware.
- Incompatibility with Gravity Forms. As you mentioned in point 2, there is an issue with some non-standard plugins. I didn't look too far into the actual problem with Gravity Forms; I'm doing a dirty bail fix in the s3-uploads fork.
- Image URL rewriting via filters was also needed for themes that use a custom header image or background image, since those URLs are written into the DB as theme mods and were referencing the local URLs instead of the S3 URLs.
- On this site, BuddyPress avatar uploads continue to be served locally.

Reclaim is serving as our vendor for AWS (ie we're paying Reclaim, and they're paying AWS), so we would need Reclaim to help us configure our bucket(s) in order for us to move forward with this.

Do we have an estimate on the potential cost of using AWS? This could be in the thousands of dollars per year.

Actions #2

Updated by Boone Gorges 2 months ago

Ray, thanks so much for this!

- Fixes an issue with some older multisite URLs that use /blogs.dir/ in their uploads directory

Do you think Human Made would accept a PR for this? Seems like a general problem.

Do we have an estimate on the potential cost of using AWS? This could be in the thousands of dollars per year.

It's rolled into the top-line number that Reclaim is charging us. This is by design: I didn't want our team to be responsible for covering these variable costs, not to mention the overhead associated with configuring, maintaining, troubleshooting, etc.

Actions #3

Updated by Boone Gorges 2 months ago

Reclaim has asked what we'd like to use as the rewrite domain for S3-stored uploads. By default, S3 URLs are long and unwieldy, but we can rewrite them as something like files.commons.gc.cuny.edu/sites/1234/2024/01/foo.jpg. Are we OK with using files.commons.gc.cuny.edu for this purpose, or is there a better idea floating out there? We should decide this soon, because it'll require a DNS change at the Graduate Center, and I would like to be able to include this ask in our initial round of communication with GC IT.

Actions #4

Updated by Boone Gorges 2 months ago

S3-Uploads is now running in the Reclaim dev environment.

At first, I tried running Ray's fork. But this caused a couple of problems. First, it used an old copy of the AWS SDK, which made it difficult to debug. Second, the blogs.dir fixes didn't work right; from my reading, they assume that the webserver kept the upload files in blogs.dir, but that on S3 they'd be in the /uploads/sites/ bucket. I don't think it's necessary to do this: we can simply upload the blogs.dir and the uploads buckets separately, and everything appears to work correctly with the latest S3-Uploads, out of the box.

Basic media uploads, as well as BP avatars, appear to be working properly. I'll be assembling a list in the upcoming days of all upload types I can think of, and then I'll run some preliminary tests on each. Based on those tests, we can decide whether it's necessary and/or desirable to filter non-standard upload types to S3.

As a reminder, in today's dev call we discussed point 3 above and tentatively decided to do the following:
a. Sometime in the upcoming weeks, get the production S3 bucket set up
b. Modify the S3-Uploads plugin so that it can upload existing content to the bucket, and also sync new uploads. Activate that modified plugin on the legacy production site.
c. Run the upload-directory tool in the s3-uploads plugin. Start with small amounts of content to ensure it doesn't crash the production site.
d. When the Reclaim production environment is created and we're near switchover time, hook it up to the pre-filled S3 bucket.

I'll begin working on the necessary mods to the S3-Uploads plugin, and I'll share them here for review before we deploy them to the legacy site.

Actions #5

Updated by Raymond Hoh 2 months ago

At first, I tried running Ray's fork. But this caused a couple of problems. First, it used an old copy of the AWS SDK, which made it difficult to debug. Second, the blogs.dir fixes didn't work right; from my reading, they assume that the webserver kept the upload files in blogs.dir, but that on S3 they'd be in the /uploads/sites/ bucket. I don't think it's necessary to do this: we can simply upload the blogs.dir and the uploads buckets separately, and everything appears to work correctly with the latest S3-Uploads, out of the box.

I think the other issue with the site I was working on is they were migrating from another S3 storage plugin, WP Offload Media, over to Human Made's S3 Uploads and they were using an existing S3 bucket rather than starting from scratch. I'm glad that this isn't a problem with our install!

Basic media uploads, as well as BP avatars, appear to be working properly.

Glad that this is working as well!

Actions #6

Updated by Boone Gorges 2 months ago

  • Related to Bug #21483: CV cover/profile image URL generation not compatible with S3-Uploads added
Actions #7

Updated by Boone Gorges about 1 month ago

I've got the ACL tooling working on the Reclaim dev server. Here's the s3-uploads modification plugin I built: https://github.com/cuny-academic-commons/cac/blob/reclaim-migration/wp-content/mu-plugins/s3-uploads.php

For the sake of posterity, I'm going to outline some of the work I had to do.

1. TIL that S3 bucket policies are read before object-level ACL. Reclaim had originally set up the bucket using a policy that allowed read access to everything in it. As such, my ACL rules were being ignored. I changed the policy to https://gist.github.com/boonebgorges/bad405c0f41bf3c6dc62d2003437fef3, which means that: (a) items with public ACL are visible to anyone, (b) items with private ACL are not visible to anyone. S3 URLs that are signed are essentially "authenticated", which means that the bucket policy is bypassed. This is the mechanism that allows private files to be served, since S3-Uploads provides signed URLs for private objects when WP renders the page.

2. The first method https://github.com/cuny-academic-commons/cac/blob/206b834879405d6f62e65761fc5b9ea4c61c4ade/wp-content/mu-plugins/s3-uploads.php#L7-L25 forces attachments to be "private" (from the POV of s3-uploads) whenever the current site has blog_public < 0. In the future, we may decide to introduce other situations where items are made private, using this mechanism.

3. The plugin natively handles ACL setting at the time of upload. My secondary plugin handles switching a site's attachments' ACL when blog_public is changed. See https://github.com/cuny-academic-commons/cac/blob/206b834879405d6f62e65761fc5b9ea4c61c4ade/wp-content/mu-plugins/s3-uploads.php#L127-L229. ACL changes must be performed not just for each attachment, but also for each image size. This can result in a huge number of requests and a lot of latency. So I put it into an async process, which runs 10 attachments in a batch on a self-refreshing cron job. It's pretty rudimentary but I think it's probably good enough for now.
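The batching approach described above can be sketched roughly as follows. This is a hypothetical Python illustration, not the actual PHP implementation; the batch size of 10 matches the one mentioned above, and the size-suffix naming is the standard WordPress `{name}-{width}x{height}.{ext}` convention:

```python
def chunk_attachment_ids(attachment_ids, batch_size=10):
    """Split a list of attachment IDs into batches for async (cron) processing."""
    return [
        attachment_ids[i:i + batch_size]
        for i in range(0, len(attachment_ids), batch_size)
    ]

def keys_for_attachment(base_key, size_suffixes):
    """An ACL change must cover the original object plus every generated image size."""
    stem, dot, ext = base_key.rpartition(".")
    return [base_key] + [f"{stem}-{suffix}{dot}{ext}" for suffix in size_suffixes]
```

Each cron tick would then process one batch, issuing one PutObjectAcl request per key, which is why the per-size fan-out matters for latency.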

4. I ran into a bunch of problems related to SSL certs. The root issue is that Reclaim created our dev bucket with the name files.cunyac.reclaimhosting.dev. This bucket name has periods in it. As such, when S3 assembles its default URL structure using the [bucket].s3.amazonaws.com format, the result is files.cunyac.reclaimhosting.dev.s3.amazonaws.com. These dots mean that we're not subsumed under AWS's wildcard *.s3.amazonaws.com SSL cert, which means that requests fail in a bunch of different ways. I worked around this using a few filters https://github.com/cuny-academic-commons/cac/blob/206b834879405d6f62e65761fc5b9ea4c61c4ade/wp-content/mu-plugins/s3-uploads.php#L27-L125, and also with a modification in the plugin itself (in get_s3_location_for_url()):

        // CAC mod: We are using "path" mode rather than virtual-hosted-style URLs.
        if ( strpos( $url, 'https://s3.amazonaws.com/' ) === 0 ) {
            $parsed = wp_parse_url( $url );

            $key = str_replace( '/' . $this->get_s3_bucket() . '/', '', $parsed['path'] );

            return [
                'bucket' => $this->get_s3_bucket(),
                'key'    => $key,
                'query'  => $parsed['query'] ?? null,
            ];
        }

It's likely that at least some of these SSL issues go away when we move to the production URL, which is masked by files.commons.gc.cuny.edu, which has a valid cert issued by Reclaim. We'll see, once everything is in place. IMO some of this could be handled more gracefully by the S3-Uploads plugin, and I might work up a PR or two for it, but it could also be the case that our current dev setup is so wonky and non-standard that it's not worth making upstream fixes. Again, this will probably be a bit clearer after we've moved to the production environment.

I've done some light testing on the Reclaim dev site and things are working as expected. My next step is to move forward with the stub version of the plugin that we can run on the legacy site to (a) push all existing media to the production bucket, and (b) sync media to S3 that's uploaded before the switchover. As I build this tool, I'll need to verify these ACL issues. I have a feeling that the bulk mechanism built into the plugin (wp s3-uploads upload-directory) doesn't pay any attention to ACL/private attachments. So I'll need to modify that tool, or build my own batch processor that runs after the fact (by looping through all non-public sites on the network, or something like that).

Actions #8

Updated by Boone Gorges about 1 month ago

Another update.

After some research and consideration, I've decided to modify the approach to handling existing content on the legacy site prior to migration. Recall that I'd originally suggested running a stub plugin on the legacy site, which would (a) upload existing content to S3, and (b) sync newly uploaded content to S3 until launch. I ran into a couple problems with this:

1. The S3-Uploads plugin has some internal requirements that need PHP 8.1+. Working around this seemed like a big hassle.
2. S3-Uploads works by swapping out basedir/baseurl in wp_upload_dir, which means that any plugin handling uploaded content via these paths will automatically have its objects pushed to S3. In contrast, I was hoping to build a mirror tool: one that allows files to be uploaded into their regular place in the filesystem and also synced to S3. The mirror technique would have required hooking into WP somewhere, maybe on the 'wp_handle_upload' action. This is less general than the default wp_upload_dir behavior, since plugins are likely building paths using wp_upload_dir and then putting files there using WP_Filesystem or other methods that never fire wp_handle_upload or related hooks. Any such uploads would have been missed by the mirroring tool.

My original goal was to avoid doing two entire uploads to S3, which is wasteful and also very time-consuming. As a compromise, I've built a differential upload tool. It uses the S3 SDK directly, so it doesn't face the same PHP requirements as S3-Uploads. And it works around the wastefulness/slowness of a double-upload by checking for files before uploading, something that S3-Uploads's tools don't do. With that in mind, here's what I propose:

1. Install my tool on the legacy Commons site. Here's the repo https://github.com/cuny-academic-commons/s3-diff-sync and here's the specific tool https://github.com/cuny-academic-commons/s3-diff-sync/blob/main/s3-diff-sync.php
2. Run s3-diff-sync.php to put existing content into the production S3 bucket
3. When the site is shut down for migration, run s3-diff-sync.php again to upload missing files. It will still take a while, but much less time than a full re-upload.
4. Then, run the handle-acl-for-all-sites.php tool https://github.com/cuny-academic-commons/s3-diff-sync/blob/main/handle-acl-for-all-sites.php, which uses S3-Uploads (OK since we'll be in Reclaim and PHP 8.1+) to set 'private' ACL on all sites that need it

I think this is a fair compromise and should be relatively error-proof. Ray and Jeremy, I'd be glad if you could have a brief look and let me know whether the plan seems OK.

NB that the s3-diff-sync.php tool is not yet complete. It handles blogs.dir, but I need to add support for wp-content/uploads/ as well (this is trivial but I haven't done it yet). We also have to handle bp-group-documents, which will take slightly more finagling since (a) all of those files are "private", in the sense that they're always served via PHP rather than the webserver, and (b) they currently live outside the web root, which won't work right with the way S3-Uploads builds S3 prefixes/object keys. So this'll have to be changed - see https://github.com/cuny-academic-commons/cac/blob/c68628ef8d632abfcf5346bfbd184498aa54169d/wp-content/plugins/cac-bp-custom-includes/group-documents.php#L11. More on this to come in the upcoming days.
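The differential check at the heart of the plan above can be sketched in a few lines. This is an illustrative Python stand-in, not the actual s3-diff-sync.php logic; the assumption that a file is re-uploaded when it is missing remotely or its size differs is my guess at a reasonable comparison criterion:

```python
def files_to_upload(local_files, remote_objects):
    """Return the keys that need uploading.

    local_files:    {key: size_in_bytes} from walking the local uploads tree.
    remote_objects: {key: size_in_bytes} from listing the bucket (e.g. ListObjectsV2).
    A file is uploaded if it's absent remotely or its size differs.
    """
    return sorted(
        key for key, size in local_files.items()
        if remote_objects.get(key) != size
    )
```

Because the second pass only touches the keys this function returns, the post-shutdown sync is fast even when the bucket already holds hundreds of thousands of objects.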

Actions #9

Updated by Boone Gorges 27 days ago

As noted on yesterday's dev call, the initial population of S3 from production took a very long time. I started it on the afternoon of Friday Dec 13. By the afternoon of Tuesday Dec 17, it was into roughly wp-content/blogs.dir/840 (alphabetical order, so it still had to go through many of the sites beginning with 8 and all the sites beginning with 9). I had to kill this upload routine due to other debugging on the site, but this gave me an opportunity to test the differential uploader when restarting. The good news is that it's quite fast: I started this run at about 7am CST on 2024-12-18, and now, about 2 hours later, it's already caught up to where it left off yesterday. It successfully uploaded about 4000 files and skipped hundreds of thousands of existing files. This is good news because it means that the differential upload is going to be quite speedy, and it won't be a bottleneck during our migration shutdown.

Actions #10

Updated by Matt Gold 26 days ago

so glad to hear this!

Actions #11

Updated by Boone Gorges 26 days ago

  • Related to Bug #21666: Forum attachment URLs in group library are not filtered by s3-uploads added
Actions #12

Updated by Boone Gorges 17 days ago

I've done another round of work on S3 integration, this time to account for legacy URLs that are stored in the database. I'm going to spell out the details here for posterity.

An upload path may be saved in the database in a number of ways. Most frequently, it's as the href attribute of an anchor link, or as the src attribute of an img tag. These URLs fall into one of three categories:
1. ms-files style: These are URLs from the days of WPMU and early Multisite. They have the format {blog_url}/files/{relative_path}. In a default WPMS installation, these URLs are redirected by an .htaccess rule to ms-files.php, which fills in the missing {blog_id} and finds the file. We use a custom cac-files.php to do a similar job, along with some permission checks. These URLs are generally found only in old sites.
2. blogs.dir style: This is the explicit URL of the format {blog_url}/wp-content/blogs.dir/{blog_id}/files/{relative_path}. These are true filesystem paths, so can be served directly by Apache without any help from WP. For this reason, we put fixes in place long ago to ensure that new URLs were written to the database in this format. See #19055. We use blogs.dir because we are a legacy WPMu conversion.
3. uploads style: A new installation of WPMS will put uploaded files in wp-content/uploads/{blog_id}/files. Some plugins don't account for the possibility of blogs.dir-style paths, and so they manually build paths using the 'uploads' format even if 'wp-content/blogs.dir' === UPLOADBLOGSDIR.
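The three styles above are easy to tell apart mechanically. A hypothetical classifier (illustrative Python, not code from the actual implementation):

```python
import re

def classify_upload_url(path):
    """Classify a WP upload URL path into one of the three legacy styles.

    Returns 'blogs.dir', 'uploads', or 'ms-files'; None if it's not an upload URL.
    """
    if re.search(r"wp-content/blogs\.dir/\d+/", path):
        return "blogs.dir"
    if re.search(r"wp-content/uploads/\d+/", path):
        return "uploads"
    if re.match(r"/files/", path):
        return "ms-files"  # blog_id is implicit; needs WP to resolve
    return None
```

The first two styles carry the blog_id in the path, which is why they can be redirected at the webserver level; the ms-files style cannot.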

For most cases of blogs.dir- and uploads-style URLs, the S3 redirect can be handled at the webserver level. Here are the new rewrite rules in .htaccess:

# Uploaded files with blogs.dir can be redirected right to S3.
RewriteCond %{REQUEST_URI} wp-content/blogs\.dir/([0-9]+) [NC]
RewriteRule ^(wp-content/blogs\.dir/.*)$ https://s3.amazonaws.com/files.commons.gc.cuny.edu/$1 [R=301,L]

# Uploaded files with uploads can be redirected right to S3.
RewriteCond %{REQUEST_URI} wp-content/uploads/([0-9]+) [NC]
RewriteRule ^(wp-content/uploads/.*)$ https://s3.amazonaws.com/files.commons.gc.cuny.edu/$1 [R=301,L]

This technique will not work for ms-files URLs. In this case, we need to bootstrap WP in order to look up the blog_id and convert to a blogs.dir-style URL. For this purpose, I've introduced the following rewrite rule (which appears just before the ones described above):

# Uploaded files without blog ID should go to handler.
RewriteCond %{HTTP_HOST} ^([^.]+)\.commons\.gc\.cuny\.edu$ [NC]
RewriteRule ^files/(.*) wp-content/cac-blog-upload-handler.php?file=$1 [L]

In other words, anything from {subdomain}.commons.gc.cuny.edu/files will be sent to the cac-blog-upload-handler.php script. It looks like this: https://github.com/cuny-academic-commons/cac/blob/reclaim-migration/wp-content/cac-blog-upload-handler.php Basically, it takes the 'file' param and uses it to build the proper URL using wp_upload_dir(). Since wp_upload_dir() will be filtered by s3-uploads, the resulting URL will be the correct S3 URL.
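The handler's core operation is simple: join the 'file' param onto whatever base URL wp_upload_dir() reports after s3-uploads has filtered it. A hypothetical sketch (Python stand-in for the PHP; the example base URL below is illustrative):

```python
def resolve_ms_files_url(file_param, upload_baseurl):
    """Rebuild the canonical upload URL for an ms-files-style request.

    upload_baseurl plays the role of wp_upload_dir()['baseurl'] after the
    s3-uploads filter has run, so the result is already an S3 URL.
    """
    return f"{upload_baseurl.rstrip('/')}/{file_param.lstrip('/')}"
```

The expensive part is not this join but the WP bootstrap needed to look up the blog_id behind the requesting domain.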

Finally, none of these techniques will properly handle signed (ie private) S3 URLs. On a non-public site, simply swapping out {blog_url}/blogs.dir/{blog_id}/files/{relative_path} with {s3_url}/blogs.dir/{blog_id}/files/{relative_path} will give you an unsigned URL, which will not be accessible. For this case, I've introduced a post_content filter that looks for WP upload URLs (of any of the three styles described above), then converts them to S3 URLs in PHP, then asks the s3-uploads plugin to get a signed version of the URL. They're then swapped out at the time of render. Here's the commit: https://github.com/cuny-academic-commons/cac/commit/0ff28352a33ca16e53c3345256495c4da0d23e8f
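The shape of that post_content filter can be sketched as follows. This is an illustrative Python version covering the blogs.dir and uploads styles only; `to_s3_key` and `sign` are hypothetical stand-ins for the key-mapping and URL-signing plumbing that s3-uploads provides:

```python
import re

# Matches blogs.dir- and uploads-style upload URLs embedded in HTML attributes.
UPLOAD_URL_RE = re.compile(
    r"https?://[^\s\"']+/wp-content/(?:blogs\.dir|uploads)/\d+/[^\s\"']+"
)

def sign_upload_urls(html, to_s3_key, sign):
    """Swap local upload URLs for signed S3 URLs at render time.

    to_s3_key(url) -> S3 object key; sign(key) -> signed URL.
    """
    return UPLOAD_URL_RE.sub(lambda m: sign(to_s3_key(m.group(0))), html)
```

Because the swap happens at render time, the database keeps the original URLs and nothing destructive occurs; the signed URLs simply expire in the visitor's copy of the page.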

I believe that this combination provides backward compatibility for the vast majority of situations:

a. You visit a public Commons site that contains old uploads-style or blogs.dir-style URLs. The .htaccess rules redirect these directly to the S3 URL, and the files are served as expected.
b. You visit a public Commons site that contains old ms-files-style URLs. The .htaccess rule directs them to cac-blog-upload-handler.php, which bootstraps WP and determines the S3 URL for that image. You're then redirected to that URL, and the file is served as expected. This requires loading WP, which is not fast, but we already did this with cac-files.php, so it's no worse than before.
c. You follow a URL from outside the Commons that points to a legacy WP upload URL on a public Commons site. The redirects take place as described above, and you're served the file.
d. You visit a private Commons site that contains old WP upload URLs in the post content. Our filter on the_content detects them and swaps them out before rendering the content. You are then served HTML that contains signed S3 URLs, and the files are served as expected.

There are a few cases that will not be handled properly:
e. On a private Commons site, a URL exists in the database but outside of post_content. My filter will not catch these. Some of these will be filtered by s3-uploads by other means, such as custom header URLs, which I believe WP rebuilds at runtime using wp_upload_dir(). But if you have, say, pasted an upload URL into a sidebar widget, that URL will not be signed, and the image/file will not be served. If and when we come across such instances, especially if they're being generated by plugins that are in broad use across the Commons, we may be able to write filters that swap out the URLs at runtime. It may also be possible to write a tool that searches the database for such instances and replaces them in the database. I'm trying to avoid this because (a) it'll take forever, and (b) it's destructive.
f. You follow a URL from outside the Commons that points to a legacy WP upload URL on a non-public Commons site. Say, for instance, someone emailed you a URL like https://my-class-site.commons.gc.cuny.edu/wp-content/blogs.dir/12345/files/2024/12/syllabus.pdf. If you visit this URL, you'll be redirected by an .htaccess rule to the S3 version of the URL. But that URL will not be signed, so you won't be served the file.

Neither of these is specifically a problem with backward compatibility. Going forward, if I email someone the URL of a private Commons asset, the emailed URL will generally not work (signed URLs expire after a fairly short amount of time). I think we should see this as a feature rather than a bug, since it decreases the likelihood that the files will be improperly downloaded. But it's a change from previous behavior.

Actions #13

Updated by Boone Gorges 10 days ago

Another update. Reclaim got far enough on their Cloudflare configuration so that they could set up an SSL cert for files.commons.gc.cuny.edu. With that in place, I tried rolling back the changes described above, so that instead of using the 'path-style' bucket URL https://s3.amazonaws.com/files.commons.gc.cuny.edu we would instead use https://files.commons.gc.cuny.edu. Specifically, I changed the value of the S3_UPLOADS_BUCKET_URL constant so that it pointed to 'https://files.commons.gc.cuny.edu/wp-content'. It mostly worked: uploads worked properly, and most objects were served properly.

However, signed URLs were not being properly generated. Signatures were being added to the URLs, but when trying to fetch the item, S3 served a SignatureDoesNotMatch error. I spent some time trying to debug this, including:
- Ensuring that the bucket name used when requesting a signed URL was correct. It was - the bucket name does not change when you change this constant, it's always 'files.commons.gc.cuny.edu'
- Passing an 'endpoint' parameter when building the S3 client. This is hinted at in the plugin's README.md and was suggested by ChatGPT. It didn't work - the signature still wasn't correct, and it broke some other aspects of our implementation of s3-uploads (bucket names were getting 'wp-content' double-appended - I stopped debugging to see whether it was a plugin bug or something in my customizations)

Having files.commons.gc.cuny.edu as the public-facing domain is not critical for launch, so I'm going to set this aside. We will ship with path-style URLs of the form https://s3.amazonaws.com/files.commons.gc.cuny.edu. In the future, perhaps we can spend some more time figuring out why signatures aren't properly generated.

Actions #14

Updated by Boone Gorges 10 days ago

I should note that this decision means that we must continue to run a modified version of the s3-uploads plugin. For launch, I'm just going to leave a modified version on the server, and I'm attaching a diff in case it's lost for some reason. In the future, I'll explore creating a proper GitHub fork that contains the mod, or perhaps suggesting an upstream fix or filter that makes the mod unnecessary.
