Continued debugging of runaway MySQL connections
FYI -- there were MySQL restart alerts at 4:11pm today. I wanted to let you know in case you are better able to track info related to such outages following our recent conversations with Lihua
#1 Updated by Boone Gorges 6 months ago
- Subject changed from Commons reboot to Continued debugging of runaway MySQL connections
Some progress has been made on the debugging front. I'll summarize so that we have a record of what's happened.
I'll place the URLs of the log files in a follow-up comment, which will be private, because they are public logs and they may contain sensitive information in some cases.
Lihua has set up a "commstat.txt" log that dumps a netstat record every couple of minutes. When available, the process ID of the initiating application is listed in the right-hand column. When it says 'httpd', it's a web request; when it says 'php', it's a Cavalcade process. At the same time, commstat.txt contains a port number that allows cross-referencing with the MySQL processlist log.
So it's now possible to draw a line between latent MySQL connections and WordPress processes. Here's how it works in one direction; just reverse the steps to go the other way:
1. In ld1a_processlist.txt, identify the problematic database connections. These are generally the ones that are put in 'Sleep' status for a long time. The main symptom of our ongoing performance problem is when these Sleep processes pile up rapidly and refuse to close, causing all available connections to be used up.
2. Each database connection has a unique port ID, in the Host column after the database IP address.
3. In commstat.txt, find instances of the port number (left-hand column) that correspond to the timestamp of ld1a_processlist.txt.
4. Relevant commstat.txt records should have a client (php or httpd) along with a pid in the right-hand column. Use this pid to cross-reference with the relevant log file on ldv2, in /scratch/weblog/lw2b/commons (or lw2a, depending on the active node). For me, this is symlinked to ~/logs/httpd/commons-production-lw2b. The relevant httpd log is ssl_access.log, and the relevant php log is cavalcade/cavalcade-[todaysdate].log
#3 Updated by Boone Gorges 6 months ago
With these new logging systems in place, I was able to identify the beginning of a pattern in this weekend's outages. During that incident, most of the Sleep database connections were linked to entries in the Apache access log that looked like this. I've annotated the columns to make clear what's happening:
Requesting IP Timestamp Method Path Protocol Status code Referer User-agent Apache PID Requested Host 220.127.116.11 - - [06/Feb/2021:15:32:50 -0500] "GET / HTTP/1.0" 200 36565 "-" "-" 89419 ewhteam6.commons.gc.cuny.edu
All requests were coming ewhteam6.commons.gc.cuny.edu. 18.104.22.168 (and .222) are the IP addresses of our own Apache nodes. So what this looks like to me is:
1. Something is triggering a server-side HTTP request to the current URL. The request is blocking, so that it keeps open the process and the db connection until the request is complete.
2. When the URL is loaded by this self-request, the same thing happens, causing infinitely nested requests that never close, because they never complete.
The odd thing is that these requests have no referer and no user-agent. So they are not normal request, and it appears that they probably don't come through WP's HTTP API. That being said, I've set up a small utility that logs all wp_remote_request() self-GETs, and I'll monitor it in the upcoming days for anything strange.
Ray and Jeremy, if you have any insight into this pattern, it'd be welcome.
#4 Updated by Raymond Hoh 6 months ago
Boone, I took a look at some of the cross-referenced requests and most of them involve 408 responses.
[09/Feb/2021:03:43:05 -0500] "-" 408 - "-" "-" 99334 commons.gc.cuny.edu
So I googled
408 response wordpress and an old BuddyPress thread popped up involving the sidebar login form:
The issue is the HTML ordering of a few of the login fields (
<input type="hidden"> and
<input type="submit">) can cause a 408 response.
It just so happens that we use a version of the BuddyPress sidebar login form in the admin bar login. I've just reordered the submit and hidden redirect field on production in
wp-content/plugins/cac-bp-admin-bar-mods. I haven't committed it yet.
I'm not sure if the issue is really that simple, but it's worth a try.
#6 Updated by Raymond Hoh 6 months ago
Spoke too soon :( Looks like there are still 408 responses coming in on lw2a. We seem to be getting at least a few of these every minute.
Here's a thread talking about blank 408 responses: https://serverfault.com/questions/473466/increased-number-of-408-request-timeouts-for-null-on-a-webserver
They theorize that it's the browser's predictive browsing feature that is causing this.
#7 Updated by Boone Gorges 6 months ago
You might be right that browser behavior is to blame for the handful of Sleep connections that we see on a regular basis in the logs.
But I think that these are a separate issue from the Sleeps that pile up during an outage. Those are not tied to 408s, they're tied to 200s. And they don't come from user IPs, they come from the Commons itself. (The call is coming from inside the house!) In both cases, the problem is manifested in Sleep connections, but I don't think they're the very same thing.