Feature #5822

closed

RBE should work properly behind a load balancer

Added by Boone Gorges over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority name:
High
Assignee:
Category name:
Reply By Email
Target version:
Start date:
2016-07-21
Due date:
% Done:
0%

Estimated time:
Deployment actions:

Description

RBE currently keeps track of open IMAP connections via the filesystem. This won't work when we're behind a load balancer, being served from two different machines. Maybe use the persistent cache?

(Ray, I know you're already working on this - just wanted to have a way to track it centrally.)

Actions #1

Updated by Raymond Hoh over 8 years ago

I've created a new plugin so RBE uses the object cache instead of the filesystem for the IMAP locking technique. See commit 196e012 in the 1.9.x branch.
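For readers following along, here is a minimal sketch of why a shared cache can act as a lock where per-machine files cannot. It is an illustrative Python model, not the plugin's code: the `SharedCache` class, the `imap_lock` key, and the `web-a`/`web-b` host names are all hypothetical, and the in-memory `add()` only imitates the "store if absent" operation that a real cache server performs atomically.

```python
import time


class SharedCache:
    """Tiny in-memory stand-in for a cache; NOT a real Memcached client."""

    def __init__(self):
        self._data = {}

    def add(self, key, value, ttl):
        """Store value only if key is absent or expired; return True on success."""
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # another holder already owns the lock
        self._data[key] = (value, now + ttl)
        return True

    def delete(self, key):
        """Release the lock."""
        self._data.pop(key, None)


def try_open_imap(cache, server_name, ttl=300):
    """Return True if this web server may open the IMAP connection."""
    return cache.add("imap_lock", server_name, ttl)  # hypothetical key name


# Per-machine state (the filesystem case): each server sees only its own lock,
# so both believe they are free to connect.
local_a, local_b = SharedCache(), SharedCache()
both_open = try_open_imap(local_a, "web-a") and try_open_imap(local_b, "web-b")

# Shared state (the object cache case): the second server is refused.
shared = SharedCache()
first = try_open_imap(shared, "web-a")
second = try_open_imap(shared, "web-b")

print(both_open, first, second)  # True True False
```

The time-to-live on the lock is a design choice in this sketch: it lets the lock expire on its own if the server holding it goes away without releasing it.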

The migration steps will be as follows:
1. Deactivate RBE on both the new server and the current server.
2. Activate the "BP Reply By Email - IMAP Object Cache extension" plugin (use ?show_all_plugins=1 on the plugins.php page)
3. Activate RBE.

Boone, to test on the new server for now, should I cherry-pick this commit (as well as the object cache wp-config.php commit) to master branch? Or should I just make modifications to master branch directly and we'll worry about tidying up the index before migration?

Actions #2

Updated by Boone Gorges over 8 years ago

Thanks, Ray!

Go ahead and cherry-pick to master for testing.

Actions #3

Updated by Raymond Hoh over 8 years ago

So, I've been doing some debugging on the new server and have discovered that wp_remote_post() and wp_remote_get() return an "http_request_failed" error if the timeout is <= 1 second when posting to local URLs.

Since RBE's IMAP inbox check and wp-cron use 0.01 for the timeout, these tasks currently fail. I've found that if you set the timeout to 1.01 seconds though, wp_remote_post() will work again.

However, according to the RBE logs, this was working yesterday, which makes me think that there was a server configuration change. NYCDH.org doesn't appear to suffer from this problem.

I'll relay this info to Lihua.

Actions #4

Updated by Raymond Hoh over 8 years ago

After doing some more testing, I think it's something wrong with the server not performing loopback requests correctly.

I've checked the access log and POST requests to trigger cron (/wp-cron.php?doing_wp_cron=XXX) and RBE (/?bp-rbe-ping) appear to resolve correctly with HTTP 200.

However, both RBE and wp-cron do not run consistently. I installed WP Crontrol on the new server and it looks like wp-cron hasn't run in three days. RBE has run, but sporadically and not at the rate we expect it to run.

The interesting thing is if I disable WP cron and run it manually with /wp-cron.php?doing_wp_cron, WP cron will run. The problem with disabling WP cron and relying on a real cron job is triggering wp-cron.php across all our sub-sites. (If you're checking WP Crontrol, I just manually triggered wp-cron by disabling WP cron.)

If I manually ping RBE with a CURL POST request to trigger RBE, it will also run:
curl -X POST -d "_bp_rbe_check=1" http://commons.gc.cuny.edu

Perhaps we should consider using a real cronjob.

Actions #5

Updated by Boone Gorges over 8 years ago

Thanks for continuing to debug, Ray.

Using a real cronjob to trigger wp-cron seems fine to me. But if there's an underlying issue with loopback requests on the new server, it should also be addressed. There are likely other places in the Commons codebase where such requests are made. They're probably not as mission-critical as wp-cron, but are worth addressing nonetheless.

> If I manually ping RBE with a CURL POST request to trigger RBE, it will also run:

Are you running this from your local machine, or from the server?

Actions #6

Updated by Raymond Hoh over 8 years ago

Here's an interesting wrinkle: if I disable the PECL Memcached object cache drop-in, wp-cron and RBE work again.

Maybe we should look into using a better WP Memcached object cache alternative plugin?


> Are you running this from your local machine, or from the server?

I tested from my local machine with HOSTS file applied.

Actions #7

Updated by Boone Gorges over 8 years ago

> Maybe we should look into using a better WP Memcached object cache alternative plugin?

I'm not sure how the drop-in would be responsible for such a thing - it's a pretty transparent interface to the PHP extension. But by all means, test other drop-ins - I have no allegiance to this one (other than the support for multi-get and multi-set, which may be implemented in WP in the next version or two).

Actions #8

Updated by Raymond Hoh over 8 years ago

Sorry for not replying to Lihua yesterday.

I'm still not convinced of my theories, which is why I haven't replied yet. I've made some headway though.

One issue I encountered is with PHP sessions. The WP Ajaxify Comments plugin, which is activated on the Commons, creates a PHP session with session_start(). PHP sessions interfere with caching (this might have caused many of our caching-related problems on our current production site!). In the meantime, I've disabled WP Ajaxify Comments.

Right now, I believe something is wrong with our Memcached caching configuration.

Check out my RBE debug log at /wp-content/uploads/ray.log. The "connected time" value shouldn't be false when it is connected. If you look at the log, the cached value only appears when on the ?bp-rbe-ping URI. Not sure why the value is not the same across all pages.

For WP Cron, when Memcached is on, I've debugged wp-cron.php and found that the timestamp returned by the get_transient( 'doing_cron' ) call does not match the $_GET['doing_wp_cron'] query parameter. This is why wp-cron doesn't run:
https://github.com/WordPress/WordPress/blob/4.5-branch/wp-cron.php#L84
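To make the mismatch concrete, here is a simplified Python model of the lock check near the linked line. It is a sketch, not WordPress code: plain dicts stand in for the object cache behind each web server, `LOCK_TIMEOUT` is an assumed 60 seconds, the function names are made up for illustration, and the timestamps are fixed so the result is deterministic.

```python
LOCK_TIMEOUT = 60  # seconds; assumed value for this model


def spawn_cron(cache, now):
    """Spawning side: store a lock timestamp and return it for the query string."""
    lock = "%.22f" % now
    cache["doing_cron"] = lock
    return lock


def wp_cron(cache, doing_wp_cron, now):
    """Model of the lock check in wp-cron.php. Returns True if cron would run."""
    stored = cache.get("doing_cron")
    if not doing_wp_cron:
        # No timestamp in the query string: try to take a fresh lock.
        if stored and float(stored) + LOCK_TIMEOUT > now:
            return False
        stored = doing_wp_cron = "%.22f" % now
        cache["doing_cron"] = stored
    # Cron only proceeds when the stored lock matches the one in the request.
    return stored == doing_wp_cron


# Each web server has its own cache: the lock is written on one machine,
# but the loopback request is answered by the other.
cache_a, cache_b = {}, {}
lock = spawn_cron(cache_a, 1000.0)
split = wp_cron(cache_b, lock, 1000.5)

# One cache shared by both web servers: the stored lock matches.
shared = {}
lock = spawn_cron(shared, 1000.0)
same = wp_cron(shared, lock, 1000.5)

# Manual hit with an empty doing_wp_cron value: takes its own lock and runs.
manual = wp_cron(cache_b, "", 1000.5)

print(split, same, manual)  # False True True
```

In this model the manual request succeeds even with separate caches, because it writes and reads the lock within a single request. That is consistent with the earlier observation that /wp-cron.php?doing_wp_cron runs when triggered by hand.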

Also, I've installed phpMemcachedAdmin to see what is happening with the cache. Check out /_mem/index.php?server=127.0.0.1:11211&show=slabs. Are there supposed to be so many slabs?

Actions #9

Updated by Boone Gorges over 8 years ago

> One issue I encountered is with PHP sessions. The WP Ajaxify Comments plugin, which is activated on the Commons, creates a PHP session with session_start(). PHP sessions interfere with caching (this might have caused many of our caching-related problems on our current production site!). In the meantime, I've disabled WP Ajaxify Comments.

Is the problem with sessions and Memcached, or sessions and the load balancer? I wonder if we should tell PHP to store sessions in Memcached. (I wonder if it's even possible, given that we share a PHP config with all other PHP applications in the cluster.) https://www.dotdeb.org/2008/08/25/storing-your-php-sessions-using-memcached/

> Are there supposed to be so many slabs?

Yes. My understanding of memcached is that each slab class holds objects up to a given size. So Slab 1 can store objects up to 96 B, Slab 2 can store objects between 97 and 120 B, Slab 3 can store objects between 121 and 152 B, etc. (Each slab class is roughly 1.25 times the size of the previous one.) I believe that the reason it's set up like this is so that memory slots can be allocated without resulting in too much fragmentation and wasted space. So, I think it's normal behavior.
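The slab sizes quoted above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 96-byte starting size and the 1.25 growth factor come from the comment above, while the 8-byte alignment and the truncation to a whole number of bytes are assumptions about memcached's defaults. The function name is hypothetical.

```python
def slab_sizes(first=96, factor=1.25, align=8, count=6):
    """Return the first `count` slab chunk sizes in bytes."""
    sizes = []
    size = first
    for _ in range(count):
        # Round up to the next multiple of `align` (assumed 8 bytes).
        if size % align:
            size += align - size % align
        sizes.append(size)
        # Grow by the factor and drop any fractional byte.
        size = int(size * factor)
    return sizes


print(slab_sizes())  # [96, 120, 152, 192, 240, 304]
```

Because each class is only about 25% larger than the previous one, it takes many classes to cover everything from small keys to large objects, which is why a long slab list is expected.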

> Right now, I believe something is wrong with our Memcached caching configuration.

Soooooo... I'm looking at the setup, and it appears that we're using 127.0.0.1:11211 for the memcached server. This is not correct, since that address will point to a different instance of memcached depending on which server we're on. When I first set up Memcached on the new cluster, I defined a custom $memcached_servers array, but that configuration got wiped out when we did another sync from the production server. However, I'm now trying to reconfigure things, and I'm finding that the Memcached servers aren't available at the address where I thought they should be. I'm going to follow up with Lihua about it.
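A small sketch of the addressing problem described above, as an illustrative Python model. Everything named here is hypothetical: the `web-a`/`web-b` hosts, the 10.0.0.x pool addresses (the real pool IPs are not given in this ticket), and the CRC32-modulo hash, which is only a stand-in and not necessarily what the PECL Memcached extension uses to pick a server.

```python
import zlib

WEB_HOSTS = ["web-a", "web-b"]                        # hypothetical web servers
LOCAL_ONLY = ["127.0.0.1:11211"]                      # the misconfiguration
SHARED_POOL = ["10.0.0.11:11211", "10.0.0.12:11211"]  # hypothetical pool IPs


def pick_server(key, servers):
    """Map a cache key to one entry in the server list (illustrative hash only)."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def physical_instance(web_host, address):
    """A loopback address means 'the memcached on whichever machine I am on'."""
    if address.startswith("127."):
        return web_host + ":" + address
    return address


def instances_for(key, servers):
    """Set of distinct memcached instances the web hosts would use for one key."""
    return {physical_instance(web, pick_server(key, servers)) for web in WEB_HOSTS}


print(len(instances_for("doing_cron", LOCAL_ONLY)))   # 2
print(len(instances_for("doing_cron", SHARED_POOL)))  # 1
```

With the loopback address, the same key lives in a different instance on each web server. With an identical pool list on every web server, the key resolves to a single instance no matter which machine handles the request.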

Actions #10

Updated by Matt Gold over 8 years ago

> I wonder if it's even possible, given that we share a PHP config with all other PHP applications in the cluster.

Boone, if you could keep a log of issues that may make the case for a separate cluster for the Commons alone, I would appreciate it. As discussed, I don't think that is or should be our first path forward, but I don't want (as I'm sure you don't want) the workarounds we're trying to set up to match the server conditions to diminish the features of the Commons. If it turns out that we feel we need a separate cluster, it would be good to have a list of the issues that make the argument.

Actions #11

Updated by Boone Gorges over 8 years ago

Ray - Lihua has provided the correct IPs for the Memcached server pool, and it appears (based on ray.log, at least) that things are working properly. Can you have a look?

> Boone, if you could keep a log of issues that may make the case for a separate cluster for the Commons alone, I would appreciate it. As discussed, I don't think that is or should be our first path forward, but I don't want (as I'm sure you don't want) the workarounds we're trying to set up to match the server conditions to diminish the features of the Commons. If it turns out that we feel we need a separate cluster, it would be good to have a list of the issues that make the argument.

Matt - Yes, I'll make a note.

Actions #12

Updated by Raymond Hoh over 8 years ago

> Lihua has provided the correct IPs for the Memcached server pool, and it appears (based on ray.log, at least) that things are working properly. Can you have a look?

Can confirm that using the correct memcached servers fixed the RBE and wp-cron issues we were having.

I committed a change to the RBE IMAP object cache plugin to fix issues with fetching the newer cache.

Going to mark this as resolved.

Actions #13

Updated by Raymond Hoh over 8 years ago

  • Status changed from Assigned to Resolved

Actions #14

Updated by Boone Gorges over 8 years ago

Excellent! Thanks, Ray.
