Bug #25065
Cloudflare captcha gating anonymous HTML fetches of *.commons.gc.cuny.edu
Status: open
Description
Quick follow-up related to #24987 (thanks again for the WAF exception that fixed the wp-json case).
Noticed a related behavior on the public-facing side worth a look. Anonymous HTML fetches of my site's home page get captcha-walled by Cloudflare, even from non-suspicious user agents.
Reproducer
curl -A "Mozilla/5.0 (compatible; LinkPreview)" https://khatchad.commons.gc.cuny.edu/
Response is a "Captcha Required" interstitial (HTML body containing the challenge, no actual page content).
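For anyone who wants to check this without reading the HTML body, here is a minimal Python sketch. It relies on the `cf-mitigated: challenge` response header that Cloudflare documents for challenge pages; the helper names `is_cloudflare_challenge` and `probe` are made up for illustration and are not part of any Commons or Cloudflare tooling.

```python
import urllib.error
import urllib.request


def is_cloudflare_challenge(headers):
    """Return True if the headers carry Cloudflare's challenge marker.

    Assumption: the challenge response includes `cf-mitigated: challenge`.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("cf-mitigated", "").strip().lower() == "challenge"


def probe(url, user_agent):
    """Fetch `url` with the given User-Agent; return (status, challenged)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            status = response.status
            headers = dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # Challenge pages usually arrive as a 4xx, which urllib raises.
        status = err.code
        headers = dict(err.headers.items())
    return status, is_cloudflare_challenge(headers)
```

Calling `probe("https://khatchad.commons.gc.cuny.edu/", "Mozilla/5.0 (compatible; LinkPreview)")` mirrors the curl command above; what it returns depends on the WAF rules in effect at the time.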
Two questions
- Is this anonymous-fetch challenge intentional, or is the WAF being aggressive with user agents Cloudflare doesn't recognize?
- Are there explicit exemptions configured for social-media preview bots (LinkedIn, Twitter/X, Slack, Mastodon, Facebook)? My main concern is link previews for shared posts coming up blank.
Major search-engine crawlers (Googlebot, Bingbot) typically have verified-IP allowlists that bypass these, so I'm less worried about indexing impact -- but worth confirming.
Thanks,
Raffi
Updated by Raymond Hoh 6 days ago
Raffi, I've edited the ticket description so it uses the ticket's "Description" field rather than the "Deployment Actions" field meant for internal usage.
Just going to quickly reply to your points.
> Is this anonymous-fetch challenge intentional, or is the WAF being aggressive with user agents Cloudflare doesn't recognize?
Probably the latter, which would explain the curl command not working.
> Are there explicit exemptions configured for social-media preview bots (LinkedIn, Twitter/X, Slack, Mastodon, Facebook)? My main concern is link previews for shared posts coming up blank.
We've had issues where the WAF has been a bit more aggressive. See #24092 for Facebook and LinkedIn. If you are running into issues with other social media sites, let us know.
Updated by Boone Gorges 6 days ago
I agree with Ray that this is very likely part of Cloudflare's general algorithms and heuristics for identifying malicious bot traffic. I've sent off an inquiry to our host, and perhaps they'll provide us with some more context. In the meantime, let us know if you discover any actual functional problems that arise from the behavior.
Updated by Raffi Khatchadourian 5 days ago
Yup, I'm seeing it on LinkedIn; no link preview, which used to come up.
Updated by Raffi Khatchadourian 2 days ago
Following up with concrete details and a possible remediation.
Confirmed user-facing impact: LinkedIn link previews for shared *.commons.gc.cuny.edu URLs now come up blank (per note #5), so this breaks real functionality, not just curl tests.
Suggested allowlist (link-preview bots): a WAF/Cloudflare rule that skips the Managed Challenge for known preview crawlers would fix previews directly. User agents to allow:
- facebookexternalhit (Facebook/Meta)
- LinkedInBot (LinkedIn)
- Twitterbot (X/Twitter)
- Slackbot-LinkExpanding (Slack)
- Discordbot (Discord)
- WhatsApp (WhatsApp)
- TelegramBot (Telegram)
- Mastodon/* (Mastodon instances)
- Embedly/Iframely (generic preview services)
Cloudflare's built-in "Verified Bots" allow, plus a custom rule (e.g. skip challenge when cf.client.bot is true OR the UA matches the list above), should cover it.
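To make the suggested rule concrete, here is a rough Python sketch that assembles a Cloudflare rule expression from the user-agent list. `cf.client.bot`, `http.user_agent`, and the `contains` operator are Cloudflare Rules language names; the function name `build_skip_expression` is hypothetical. Treating `Mastodon/*` as the substring `Mastodon/` and splitting Embedly and Iframely into two tokens are my assumptions. The resulting string would still need to be attached to a skip action by whoever manages the WAF.

```python
# User-agent substrings taken from the list above (wildcards simplified).
PREVIEW_BOT_TOKENS = [
    "facebookexternalhit",
    "LinkedInBot",
    "Twitterbot",
    "Slackbot-LinkExpanding",
    "Discordbot",
    "WhatsApp",
    "TelegramBot",
    "Mastodon/",
    "Embedly",
    "Iframely",
]


def build_skip_expression(tokens):
    """Build an expression that is true for verified bots or listed UAs."""
    clauses = ["(cf.client.bot)"]
    for token in tokens:
        # Escape backslashes and double quotes for the quoted string literal.
        escaped = token.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'(http.user_agent contains "{escaped}")')
    return " or ".join(clauses)


expression = build_skip_expression(PREVIEW_BOT_TOKENS)
```

Matching on user agent alone trusts whatever the client claims to be, so this is best read as a starting point for discussion with the host rather than a finished rule.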
Updated by Boone Gorges 2 days ago
Hi Raffi - Thanks for the additional info.
I've been communicating with the host about this issue. They confirmed that, as your previous update suggests, Cloudflare has general rules that use various heuristics to block suspicious traffic. As you might guess, it's not possible for us to turn this off across the board, but they'll work with us to loosen rules as necessary to allow legitimate functionality. (User agent strings like the ones that you have assembled are not ideal for this purpose because they can be, and in fact are, easily and frequently spoofed, though they can of course be part of a heuristic.)
It's interesting that you say that LinkedIn links in particular are not working. Our host indicated that they made WAF rule changes on Friday to update the IP ranges used by Microsoft for LinkedIn, and that the LinkedIn Post Inspector was correctly fetching your site. See https://www.linkedin.com/post-inspector/inspect/https:%2F%2Fkhatchad.commons.gc.cuny.edu If you're seeing otherwise, can you please share specific steps to reproduce, such as a link to a page on LinkedIn that ought to be pulling in content from the Commons but is not?
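The point about spoofable user agents versus IP ranges can be sketched as follows: a request that claims to be a preview bot is trusted only when its source address also falls inside ranges known for that bot. This is an illustration only. The function name `is_trusted_preview_bot` is hypothetical, and the range shown is an RFC 5737 documentation block, not a real LinkedIn or Microsoft range.

```python
import ipaddress

# Placeholder data: 192.0.2.0/24 is reserved for documentation (RFC 5737).
# Real ranges would have to come from the bot operator or the host.
TRUSTED_RANGES = {
    "LinkedInBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def is_trusted_preview_bot(user_agent, client_ip, trusted=TRUSTED_RANGES):
    """Trust a claimed bot only if its IP is inside that bot's ranges."""
    address = ipaddress.ip_address(client_ip)
    for token, networks in trusted.items():
        if token.lower() in user_agent.lower():
            return any(address in network for network in networks)
    # Unknown user agents are never trusted by this check.
    return False
```

With this shape, a spoofed `LinkedInBot` user agent sent from an unrelated address is rejected, which is the weakness of a rule based on user agent alone.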