Project

General

Profile

Actions

Support #20473

open

Broken Link Checker plugin wreaking havoc

Added by Marilyn Weber 18 days ago. Updated 5 days ago.

Status:
Assigned
Priority name:
Normal
Assignee:
-
Category name:
-
Target version:
Start date:
2024-06-12
Due date:
% Done:

0%

Estimated time:
Deployment actions:

Description

Via Keeping:

Hi Commons Team,

You're possibly already familiar with most of this issue:

Thanks to Ming in GC IT working with someone on your team, the Library has learned that there’s a Commons site

https://khatchad.commons.gc.cuny.edu/research/publications which is causing one of our publishers to intermittently shut down access. On this page, the faculty member has a list of his publications, with download links. The Broken Link Checker plugin pings those destination URLs, some include doi.org links, on a regular basis (every 72 hours, by default, hence these notifications from the publisher have more or less always have occurred at the same time of the day from the same IP). Although a legitimate use of the publicly available dpi.org URLs, the publisher is concerned about bulk content-harvesting, although no content is being harvested; they think content is being harvested because all requests occur within 3-4 seconds of each other. We’ve shared this with the publisher, but they continue to shut down access every 3 days.

I’m sharing this because maybe the Commons Team could:

Discover an alternative plugin that isn’t so persistent?

or

Somehow rewrite the plugin so the gap between pings is not 3-4 seconds, so the false positive appearances of harvesting do not continue to persist.

or

Remove the plugin from the Commons

Because the Library is concerned that access to this particular resource and possibly others might be permanently shut down.

Thank you for considering and investigating.

Best,
Stephen

Stephen Klein
Digital Services Librarian
CUNY Graduate Center "


Files

audit_log (1).txt (5.74 KB) audit_log (1).txt Stephen Klein, 2024-06-25 09:55 AM
Actions #1

Updated by Boone Gorges 17 days ago

  • Status changed from New to Reporter Feedback

Thanks for laying out the details, Stephen.

Some of the language you've shared here ("legitimate use of public URLs", etc) comes from me, and is the result of previous investigations I made a result of the publisher's complaints. I was hopeful that they'd read the explanation and say "Oh, this is actually not an issue", and then configure their system so that it didn't ban based on this behavior. But I guess my hope was misplaced.

That being said, I'm extremely reluctant to take dramatic steps on the Commons just because their team is unable or unwilling to address what is, IMO, a clear false positive in their scanning tools. On my view, banning the plugin would count as a "dramatic step". I also looked into request throttling - in essence, stretching out the time between requests. But this is not obviously possible in the plugin as written, and even if it were, it would cause new problems because the broken-link-checker processes would remain open for such a long time due to the pauses.

I think that the best option for the time being may be to tell the plugin not to check links from the problematic domains. The plugin has a tool built in for this purpose; see the 'Exclusion list' field at Dashboard > Broken Links > Exclusion list. As a test, I'm going to add to the 'Exclusion list' on Raffi's site the domains that were originally reported to me, which were: acm.org, computingreviews.com

Stephen, your report mentions doi.org (and dpi.org, though perhaps this was a typo). Have we gotten reports from the DOI Foundation itself? That seems much more problematic, and also unlikely, since AFAIK they simply redirect traffic to the publishers.

If my 'Exclusion list' approach works to placate the arbitrary rate limits of these specific publishers, I can put a tool in place that will enforce the restriction across the entire Commons (so that any plugin running broken-link-checker will have the domains added to their own exclusion list).

Actions #2

Updated by Marilyn Weber 17 days ago

Thanks, Boone, I let him know.

Actions #3

Updated by Marilyn Weber 16 days ago

And he replies "Thanks for all your work on this, Boone. Btw, long time and I hope that you are doing well.

Thanks for adding the domains to the exclusion list of Raffi's site. Yes, dpi.org was a typo and no we have not heard from the DOI Foudation only ACM. I love the idea of a tool to enforce the restriction across the entire Commons, so let's see if this test works. I will report back if the publisher notifies us, which would mean a failed test."

Actions #4

Updated by Marilyn Weber 13 days ago

Another reply- "So far, so good, so maybe the solution you put into place worked.
I just want to point out that the Computing Reviews URL is computingreviews.com, not computerreviews.org; I shared the incorrect URL."

I told him I'm closing the ticket but to be in touch if needed.

Actions #5

Updated by Boone Gorges 12 days ago

  • Target version set to 2.4.1

Thanks for the follow-up. I'd like to hold this ticket open for maybe another week or two. If we don't hear anything, I'll circle back and create a more thorough version of my domain blacklist.

Actions #6

Updated by Marilyn Weber 7 days ago

Stephen also asks "
Since your 'Exclusion list' approach appears to have resolved the issue of arbitrary rate limits of this publisher, I am wondering if you still intend to put a tool in place that will enforce restrictions across the entire Commons?"

Actions #7

Updated by Boone Gorges 6 days ago

  • Status changed from Reporter Feedback to Staged for Production Release

Yes, as noted last week, my plan was to wait a week or two to see if the problem was mitigated. Since that seems to be the case, I've put a global block in place against the two domains in question https://github.com/cuny-academic-commons/cac/commit/c6495170d5c36d0b570133836c9a8e2f44efecd6. This will be part of tomorrow's maintenance release.

Actions #8

Updated by Marilyn Weber 6 days ago

Can we add Stephen to Redmine?

Here's his latest:
"I’ve got bad news:

Our e-resources librarian has reported that we received another "excessive session" notification from ACM on 6/21.

Best,
Stephen"

Actions #9

Updated by Boone Gorges 6 days ago

I've reactivated Stephen's dormant Redmine account and added him as a watcher.

Did ACM provide any more log files? Before taking next steps, I need to ascertain whether this is the very same set of URLs (which would indicate that this is a bug in Broken Link Checker's "exclusion_list" tool) or whether it's coming from somewhere else on the Commons.

Actions #10

Updated by Boone Gorges 6 days ago

A quick follow-up, having added Matt and Colin as watchers:

As obnoxious as it may be, we should consider the possibility that we'll simply need to disable Broken Link Checker across the entire network. It's not acceptable that this nice-to-have but not critical functionality on the Commons would cause the library to be banned from important services.

Actions #11

Updated by Raffi Khatchadourian 6 days ago

FWIW, I think the blacklist functionality is working. Last Friday, I asked the link checker to recheck individual ACM links and it seemed to refuse to do so.

Actions #12

Updated by Stephen Klein 5 days ago

The activity on Fri Jun 21 09:02:48 PDT 2024 is coming from:
146.96.128.221

Attaching the detailed log.

Actions #13

Updated by Raffi Khatchadourian 5 days ago

I wonder if they are somehow counting redirects. I've appended "doi.org" to the blacklist.

Actions #14

Updated by Stephen Klein 5 days ago

Thanks, Raffi.

Actions #15

Updated by Boone Gorges 5 days ago

Stephen, thanks for passing along the log. You can see by the timestamps that the plugin made 35 requests in the course of 4 minutes. As I noted in my initial analysis, this level of resource usage is extremely minor, and underscores the unreasonably high sensitivity of their filters.

Raffi, you're almost certainly correct that ACM is counting redirects from doi.org. Adding doi.org to the block list is probably going to work, for that reason. However, blocking doi.org is like blocking t.co or Bitly or other URL shorteners, or indeed like removing the plugin altogether. In all these cases, we're blocking access to a tool whose intended use is quite broad, because a very small subset of affected sites are unhappy with the way the tool works.

The CAC dev team will talk about this issue during today's standing call to see if we can come up with a path forward.

Actions #16

Updated by Raffi Khatchadourian 5 days ago

Boone Gorges wrote in #note-15:

...

Raffi, you're almost certainly correct that ACM is counting redirects from doi.org. Adding doi.org to the block list is probably going to work, for that reason. However, blocking doi.org is like blocking t.co or Bitly or other URL shorteners, or indeed like removing the plugin altogether. In all these cases, we're blocking access to a tool whose intended use is quite broad, because a very small subset of affected sites are unhappy with the way the tool works.

I'm OK with blacklisting "doi.org." I'm pretty confident that DOI links will persist.

Actions #17

Updated by Boone Gorges 5 days ago

  • Status changed from Staged for Production Release to Assigned
  • Target version changed from 2.4.1 to 2.4.2

Thanks, Raffi. I don't think that doi.org redirects will break, but the destination URL might, which is something that broken-link-checker would presumably catch.

For the time being, the global acm.org and computingreviews.com blocks have gone into the 2.4.1 release. I'm going to hold this ticket open while we consider follow-up steps.

Actions #18

Updated by Stephen Klein 5 days ago

In the old days, I was only received notifications for the tickets that I created and still want that setting, because I just received over 20 notifications. Is there a way to remove me from all notifications other than the ones I care about?

Actions #19

Updated by Stephen Klein 5 days ago

Thanks, Boone.

Actions #20

Updated by Boone Gorges 5 days ago

Stephen, that's a setting at https://redmine.gc.cuny.edu/my/account. I've made the change for you.

Actions #21

Updated by Raffi Khatchadourian 5 days ago

Boone Gorges wrote in #note-17:

Thanks, Raffi. I don't think that doi.org redirects will break, but the destination URL might, which is something that broken-link-checker would presumably catch.

Sorry, Boone. I meant to say that I'm also pretty confident that the destination URLs coming from doi.org will be persistent.

Actions #22

Updated by Raymond Hoh 5 days ago

Stephen Klein wrote in #note-12:

The activity on Fri Jun 21 09:02:48 PDT 2024 is coming from:
146.96.128.221

Attaching the detailed log.

This isn't going to add much to the discussion, but I just took a look at this log and can confirm that Broken Link Checker is the cause of the external pings.

The user agent in the log is from an older version of Chrome from close to two years ago. Since this Chrome version is quite old, I checked this user agent against our codebase and I can confirm that this user agent is used by Broken Link Checker -- https://plugins.trac.wordpress.org/browser/broken-link-checker/tags/2.2.4/legacy/modules/checkers/http.php#L179 .

I think adding the content publisher's domains to BLC's exclusion list should resolve this issue.

Actions #23

Updated by Stephen Klein 5 days ago

Thank you, Raymond. Appreciate your work on this.

Actions #24

Updated by Boone Gorges 5 days ago

Let's roll with Raffi's blocking of doi.org for a week or two. Stephen, please let us know if you get more hate mail from the publisher. If not, the Commons team agrees that adding doi.org to the global exclusion list for broken-link-checker is the logical next step.

Actions #25

Updated by Raffi Khatchadourian 5 days ago

BTW, I don't think that ACM is tracking the redirects on second thought. What I feel is happening is that the plug-in is resolving the redirect and then checking the resolved destination URL.

If that's true, perhaps the bug in the plug-in is that it doesn't consult the blacklist when resolving redirects.

Actions #26

Updated by Raymond Hoh 5 days ago

According to the BLC codebase, if a link is excluded, then it is omitted from link checks entirely: https://plugins.trac.wordpress.org/browser/broken-link-checker/tags/2.2.4/legacy/core/core.php#L3086

If a link isn't excluded, then it looks like the BLC codebase can do up to two checks for a link:

This lines up with Stephen's provided audit_log(1).txt as each link is hit twice. It looks like we can omit this second check by using the following snippet -- add_filter( 'blc_use_get_checker', '__return_true' );

Actions #27

Updated by Stephen Klein 5 days ago

Boone Gorges wrote in #note-24:

Let's roll with Raffi's blocking of doi.org for a week or two. Stephen, please let us know if you get more hate mail from the publisher. If not, the Commons team agrees that adding doi.org to the global exclusion list for broken-link-checker is the logical next step.

Thanks, Boone. I will.

Actions #28

Updated by Raffi Khatchadourian 5 days ago

Raymond Hoh wrote in #note-26:

According to the BLC codebase, if a link is excluded, then it is omitted from link checks entirely: https://plugins.trac.wordpress.org/browser/broken-link-checker/tags/2.2.4/legacy/core/core.php#L3086

If a link isn't excluded, then it looks like the BLC codebase can do up to two checks for a link:

Raymond, I believe you have confirmed my suspicion! The core module checks if it is "allowed" to check the link. For example, before I added "doi.org" to the blacklist today, the core module would then proceed to check a doi.org link. Now, the doi.org link resolves to an ACM link because it is a redirecting service. What actually happens is that, despite ACM being on the blacklist, the http module never checks whether it is allowed to check the link or not. Thus, we have a situation where a link is checked despite being on the blacklist, which is what I think was happening last Friday.

This is all happening because of this line in the http module: https://plugins.trac.wordpress.org/browser/broken-link-checker/tags/2.2.4/legacy/modules/checkers/http.php#L192:

curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );

Here, curl is set follow redirects (see https://curl.se/libcurl/c/CURLOPT_FOLLOWLOCATION.html). What it should do instead is not follow redirects and instead:

1. Resolve the redirect.
2. Check if the destination URL is on the blacklist before following it.

In other words, following redirects should be done "manually" since there is a blacklist to be consulted. The http module doesn't have the blacklist.

Actions #29

Updated by Boone Gorges 5 days ago

In fairness to the plugin authors, I don't consider this a "bug". The purpose of the plugin is to identify broken links. A link that contains one or more 30x redirects is "broken" if one or more of the links in the redirect chain is broken. So it's perfectly natural for the plugin to follow redirects.

It would be a nifty feature if the plugin instead did not follow redirects blindly, instead going to link 0 in the chain, then 1, then 2, and doing allowed/disallowed checks on each one of these. But building this is non-trivial, if for no other reason than that it'll increase execution time considerably, which could cause resource utilization issues. So other mitigations would be necessary.

More generally, doing this kind of improvement to a third-party plugin that is not central to our work on the Commons is not viable. We try to be good stewards of the tools we use by volunteering fixes and improvements, but we also have to be good stewards of our limited development resources. So I think that we'll probably have to be satisfied with other solutions, like the proposed block on doi.org URLs (ie, any redirect chain that has doi.org as item 0)

Actions

Also available in: Atom PDF