Bug #13949: Continued debugging of runaway MySQL connections - CUNY Academic Commons - CUNY Graduate Center - Project Tracking System

Custom queries

. 2.6.5 open
2.7.0 open
2.7.0 open (by category)
2.7.0 open (by tracker)
High Priority Issues by User
Inbox (no milestone)
My Assigned Issues (by milestone)
My Watched Issues (by milestone)
Not Tracked - grouped by assignee
Not tracked (milestone-independent)

Actions

Copy link

Bug #13949

open

Continued debugging of runaway MySQL connections

Added by Matt Gold almost 5 years ago. Updated over 4 years ago.

Status:

New

Priority name:

Normal

Assignee:

Boone Gorges

Category name:

Target version:

Not tracked

Start date:

2021-02-06

Due date:

% Done:

Estimated time:

Deployment actions:

Description

Hi Boone,

FYI -- there were MySQL restart alerts at 4:11pm today. I wanted to let you know in case you are better able to track info related to such outages following our recent conversations with Lihua

History
Notes
Property changes

Actions

Copy link

Updated by Boone Gorges almost 5 years ago

Subject changed from Commons reboot to Continued debugging of runaway MySQL connections

Some progress has been made on the debugging front. I'll summarize so that we have a record of what's happened.

I'll place the URLs of the log files in a follow-up comment, which will be private, because they are public logs and they may contain sensitive information in some cases.

Lihua has set up a "commstat.txt" log that dumps a netstat record every couple of minutes. When available, the process ID of the initiating application is listed in the right-hand column. When it says 'httpd', it's a web request; when it says 'php', it's a Cavalcade process. At the same time, commstat.txt contains a port number that allows cross-referencing with the MySQL processlist log.

So it's now possible to draw a line between latent MySQL connections and WordPress processes. Here's how it works in one direction; just reverse the steps to go the other way:
1. In ld1a_processlist.txt, identify the problematic database connections. These are generally the ones that are put in 'Sleep' status for a long time. The main symptom of our ongoing performance problem is when these Sleep processes pile up rapidly and refuse to close, causing all available connections to be used up.
2. Each database connection has a unique port ID, in the Host column after the database IP address.
3. In commstat.txt, find instances of the port number (left-hand column) that correspond to the timestamp of ld1a_processlist.txt.
4. Relevant commstat.txt records should have a client (php or httpd) along with a pid in the right-hand column. Use this pid to cross-reference with the relevant log file on ldv2, in /scratch/weblog/lw2b/commons (or lw2a, depending on the active node). For me, this is symlinked to ~/logs/httpd/commons-production-lw2b. The relevant httpd log is ssl_access.log, and the relevant php log is cavalcade/cavalcade-[todaysdate].log

Actions

Copy link

Updated by Boone Gorges almost 5 years ago

With these new logging systems in place, I was able to identify the beginning of a pattern in this weekend's outages. During that incident, most of the Sleep database connections were linked to entries in the Apache access log that looked like this. I've annotated the columns to make clear what's happening:

Requesting IP      Timestamp                    Method Path  Protocol  Status code         Referer  User-agent  Apache PID    Requested Host
146.96.128.221 - - [06/Feb/2021:15:32:50 -0500] "GET   /     HTTP/1.0" 200          36565  "-"      "-"         89419         ewhteam6.commons.gc.cuny.edu

All requests were coming ewhteam6.commons.gc.cuny.edu. 146.96.128.221 (and .222) are the IP addresses of our own Apache nodes. So what this looks like to me is:

1. Something is triggering a server-side HTTP request to the current URL. The request is blocking, so that it keeps open the process and the db connection until the request is complete.
2. When the URL is loaded by this self-request, the same thing happens, causing infinitely nested requests that never close, because they never complete.

The odd thing is that these requests have no referer and no user-agent. So they are not normal request, and it appears that they probably don't come through WP's HTTP API. That being said, I've set up a small utility that logs all wp_remote_request() self-GETs, and I'll monitor it in the upcoming days for anything strange.

Ray and Jeremy, if you have any insight into this pattern, it'd be welcome.

Actions

Copy link

Updated by Raymond Hoh almost 5 years ago

Boone, I took a look at some of the cross-referenced requests and most of them involve 408 responses.

An example:

[09/Feb/2021:03:43:05 -0500] "-" 408 - "-" "-" 99334 commons.gc.cuny.edu

So I googled 408 response wordpress and an old BuddyPress thread popped up involving the sidebar login form:
https://buddypress.org/support/topic/bp-default-sidebar-login-not-working/#post-166903

The issue is the HTML ordering of a few of the login fields (<input type="hidden"> and <input type="submit">) can cause a 408 response.

It just so happens that we use a version of the BuddyPress sidebar login form in the admin bar login. I've just reordered the submit and hidden redirect field on production in wp-content/plugins/cac-bp-admin-bar-mods. I haven't committed it yet.

I'm not sure if the issue is really that simple, but it's worth a try.

Actions

Copy link

Updated by Boone Gorges almost 5 years ago

Whoa, could it really be that simple? I don't really understand the underlying 408 issue either :) But let's see what happens.

Actions

Copy link

Updated by Raymond Hoh almost 5 years ago

Spoke too soon :( Looks like there are still 408 responses coming in on lw2a. We seem to be getting at least a few of these every minute.

Here's a thread talking about blank 408 responses: https://serverfault.com/questions/473466/increased-number-of-408-request-timeouts-for-null-on-a-webserver

They theorize that it's the browser's predictive browsing feature that is causing this.

Actions

Copy link

Updated by Boone Gorges almost 5 years ago

You might be right that browser behavior is to blame for the handful of Sleep connections that we see on a regular basis in the logs.

But I think that these are a separate issue from the Sleeps that pile up during an outage. Those are not tied to 408s, they're tied to 200s. And they don't come from user IPs, they come from the Commons itself. (The call is coming from inside the house!) In both cases, the problem is manifested in Sleep connections, but I don't think they're the very same thing.

Actions

Copy link

Updated by Boone Gorges over 4 years ago

Target version set to Not tracked

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

CUNY Academic Commons

Custom queries

Bug #13949

Continued debugging of runaway MySQL connections

Updated by Boone Gorges almost 5 years ago

Updated by Boone Gorges almost 5 years ago

Updated by Raymond Hoh almost 5 years ago

Updated by Boone Gorges almost 5 years ago

Updated by Raymond Hoh almost 5 years ago

Updated by Boone Gorges almost 5 years ago

Updated by Boone Gorges over 4 years ago