Cloudflare blocking AI crawlers
from 3dcadmin@lemmy.relayeasy.com to selfhosted@lemmy.world on 01 Jul 12:29
https://lemmy.relayeasy.com/post/3097

Cloudflare trying to stop AI crawling somehow!

https://arstechnica.com/tech-policy/2025/07/pay-up-or-stop-scraping-cloudflare-program-charges-bots-for-each-crawl/

#selfhosted

teft@lemmy.world on 01 Jul 13:20

Seeing as how they can’t reliably detect that I’m human or not, I don’t have much confidence in this.

i_am_not_a_robot@discuss.tchncs.de on 01 Jul 13:41

It’s relatively easy for Cloudflare to profile clients as being web scrapers. A concerning amount of internet traffic goes through their servers in plain text.

FundMECFSResearch@lemmy.blahaj.zone on 02 Jul 01:32

Yeah. Me choosing to use a vpn and a privacy respecting browser has earnt me a constant captcha

Tywele@lemmy.dbzer0.com on 02 Jul 06:33

For me just using Firefox on Linux seems to be enough to trigger them.

anomnom@sh.itjust.works on 02 Jul 13:58

Apple’s private relay does this too. And so does auto-login.

lambalicious@lemmy.sdf.org on 01 Jul 14:10

Is that why I can no longer go from a web search (e.g. DDG, Ecosia) or forum link to StackOverflow without going through three CF captchas? If AI had not killed SO for me before, this does.

grue@lemmy.world on 01 Jul 14:30

Yeah, it’s only anecdotal but I feel like hobbyists like us, who do slightly unusual things without nefarious intent, are the ones who get hit with these sorts of issues the most. For example, I’ve noticed that some websites start throwing captchas at me or even just straight-up refuse to load with 403 Forbidden errors because I have my router set up to load-balance across two Internet connections. (At least, that’s my guess as to why it’s happening.)

Buelldozer@lemmy.today on 01 Jul 14:52

For example, I’ve noticed that some websites start throwing captchas at me or even just straight-up refuse to load with 403 Forbidden errors because I have my router set up to load-balance across two Internet connections. (At least, that’s my guess as to why it’s happening.)

I maintain several multi-WAN commercial setups and they don’t have this problem. I obviously don’t know what your setup is, but I’d guess something is wrong with how it’s handling flows / connections. Once a connection is established between your edge and an internet resource, that flow should remain “stuck” to whatever WAN port it started with, and it sounds like that isn’t happening.
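
To make the “stuck” idea concrete, here is a minimal sketch of sticky flow assignment. It is only an illustration, not how OpenWRT or any particular router implements it; the class name `StickyBalancer`, the `FlowKey` tuple, and the link names are all made up for the example.

```python
from typing import Dict, Tuple

# Hypothetical types and names for illustration; not taken from any router firmware.
# A flow is identified by (source IP, source port, destination IP, destination port, protocol).
FlowKey = Tuple[str, int, str, int, str]


class StickyBalancer:
    """Assigns each new flow to a WAN link, then keeps it on that link."""

    def __init__(self, wan_links):
        if not wan_links:
            raise ValueError("need at least one WAN link")
        self.wan_links = list(wan_links)
        self.flow_table: Dict[FlowKey, str] = {}
        self._next = 0

    def pick_wan(self, flow: FlowKey) -> str:
        # Known flow: reuse the link it started on, so the remote site
        # keeps seeing the same source address for this connection.
        if flow in self.flow_table:
            return self.flow_table[flow]
        # New flow: round-robin across the links and remember the choice.
        wan = self.wan_links[self._next % len(self.wan_links)]
        self._next += 1
        self.flow_table[flow] = wan
        return wan
```

If packets from one connection alternate between links instead, the remote site sees a single session coming from two different addresses, which could plausibly look suspicious to bot detection. A real setup would also need to expire old entries from the table, which this sketch leaves out.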

grue@lemmy.world on 01 Jul 16:42

Could very well be. I’m using OpenWRT and basically did the bare minimum to get it to work.

Scrollone@feddit.it on 02 Jul 05:54

Ahh yes. Imgur simply doesn’t work anymore at my place, it always errors out with 403.

irmadlad@lemmy.world on 01 Jul 18:06

I’ve seen captchas for years before the recent influx of AI. It’s the way I go about obfuscating network activities that means the site security cannot determine if I am a bot or not. There is a Captcha Buster extension for Firefox. If the captcha is ‘Pick the three busses from this blurry, pixelated set of pictures’ then I can solve those easily. It’s when the captcha is a full page of a motorcycle and you have to check all the relevant pieces, then on to the next full picture, that chaps me. So you click Captcha Buster and it ‘listens’ to the audio portion of the captcha, then solves it. It’s not 100% on all types of captchas, but 90% of the time it works every time. It’s interesting to me that after a while, you start to notice patterns in the captcha images. For instance if the directions are ‘Pick the fire hydrants’, there will be at least 5 you have to pick. Crosswalks are the same way too.

I’d much rather have to do captchas than have my jimmy out in the ether traffic. Anecdotal, but Stack Overflow doesn’t trigger a captcha for me. All I get is the cookie popup.

Andres4NY@social.ridetrans.it on 01 Jul 18:11

@irmadlad @lambalicious I just manually do the audio captcha. Every time. Because the picture captchas often don't work correctly for me.

It does bug me a little that I don't know what the audio captcha is being used for - am I helping an amazon echo transcribe whatever it is surreptitiously listening to?

irmadlad@lemmy.world on 01 Jul 18:20

am I helping an amazon echo transcribe whatever it is surreptitiously listening to?

I’ve always wondered where the hell they scrape all that audio from. I mean, it’s random shit.

lambalicious@lemmy.sdf.org on 02 Jul 04:13

Gotta be physicists or fanfic writers. I can not imagine other better options.

irmadlad@lemmy.world on 02 Jul 18:08

idk… Some of the stuff I’ve heard sounds like they eavesdropped on a board room roundtable. Other stuff sounds like instructions on how to install something. They probably are siphoning data off YT.

lambalicious@lemmy.sdf.org on 03 Jul 03:33

off YT

so… physicists and fanfic writers, yeah :p

TuxEnthusiast@sopuli.xyz on 01 Jul 16:47

Anubis! github.com/TecharoHQ/anubis

cmnybo@discuss.tchncs.de on 02 Jul 06:57

That uses proof of work rather than just detecting and blocking the bots.

3dcadmin@lemmy.relayeasy.com on 01 Jul 17:10

Seen plenty of people who think this is a bad thing. Do they just want everything to be crawled? I mean, I don’t think this is the saviour, but it has got to be better than wholesale theft.

GenderNeutralBro@lemmy.sdf.org on 01 Jul 18:07

do they just want everything to be crawled

Yes. Web crawling has been a normal and vital part of the web from day 1. We’d have no search engines without crawlers.

The web is user-centric by design. I’m sick of tech companies trying to flip the script and hoard information, most of which is not theirs to begin with (e.g. Google, Reddit, Twitter, Facebook, etc.).

Prontomomo@lemmy.world on 01 Jul 22:19

I don’t think this blocks crawlers. About 1 in 5 websites uses Cloudflare; the significant thing here is that AI scraping is now blocked by default on most of those sites, NOT crawling.

daniskarma@lemmy.dbzer0.com on 01 Jul 18:24

How does it differentiate an “AI crawler”, from any other crawler? Search engine crawler? Someone monitoring data to offer statistics? Archiving?

This is not good. They are most likely doing the crawling themselves and then selling the data to the highest bidder. That bidder could obviously be OpenAI for all we know.

They just know that if they introduce the sentence “this is anti AI”, a lot of people are not going to question anything.

_cryptagion@lemmy.dbzer0.com on 02 Jul 00:31

Well, they have access to logs showing who connects to 24 million websites, how they use those websites, and for how long. So if there’s anyone who knows what traffic is crawlers, and which crawlers are AI, it’s Cloudflare. There’s no way they wouldn’t know, they have all the data they would ever need to figure it out. In fact, there’s nobody on the internet who is better positioned to be able to identify AI crawlers than Cloudflare.
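
As a very rough illustration of the simplest signal available, many crawlers announce themselves in the User-Agent header. The sketch below only does substring matching on that header; it is not Cloudflare’s method, the token lists are short and incomplete examples, and the function name `classify_user_agent` is made up.

```python
# Example tokens that well-known crawlers put in their User-Agent strings.
# Both lists are illustrative and far from complete.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")
SEARCH_CRAWLER_TOKENS = ("Googlebot", "bingbot", "DuckDuckBot")


def classify_user_agent(user_agent: str) -> str:
    """Label a request by its self-declared User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in AI_CRAWLER_TOKENS):
        return "ai-crawler"
    if any(token.lower() in ua for token in SEARCH_CRAWLER_TOKENS):
        return "search-crawler"
    # Anything else is unknown: it may be a browser, or a bot that lies.
    return "unknown"
```

The header is self-declared and trivially spoofed, so this only catches crawlers that identify themselves honestly. Spotting the ones that don’t would take the kind of traffic-level data described above.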

x00z@lemmy.world on 02 Jul 14:55

This.

They also have a form to submit AI crawlers.

Cloudflare can also easily maintain an anti-AI-crawler service completely by itself if it takes a fee on top of its pay-per-crawl functionality. And considering Cloudflare already has all the tools and infrastructure to do this cheaply, providing a good service wouldn’t be too hard.

electric_nan@lemmy.ml on 01 Jul 18:29

All this discussion about captchas raises a question for me: if fingerprinting is so accurate and easy, that ublock, no cookies and a VPN don’t help… then why the fuck do I have to keep doing captchas?

DoucheBagMcSwag@lemmy.dbzer0.com on 01 Jul 18:46

Because it never was about security. You’re training LLMs for free.

I’m pretty sure some auto drive company is getting the advantage since a lot of captchas are spotting crosswalks, traffic lights, stairs, busses, mountains, motorcycles etc. Wonder if it’s fucking tesla

irmadlad@lemmy.world on 01 Jul 23:25

I’m pretty sure some auto drive company is getting the advantage

I’d reckon that a lot of that is spliced from pictures captured by Google Maps vehicles.

W3dd1e@lemmy.zip on 02 Jul 05:25

Both you and @DoucheBagMcSwag@lemmy.dbzer0.com are correct. Google bought reCAPTCHA in 2009.

Here’s an article about it from 2018.

(╯°□°)╯︵ ┻━┻

Captcha if you can: how you’ve been training AI for years without realising it

And another from 2019! Captchas got harder for us because the AI had learned from our training.

Why CAPTCHAs have gotten so difficult

DoucheBagMcSwag@lemmy.dbzer0.com on 02 Jul 05:31

Fucking hell

irmadlad@lemmy.world on 02 Jul 17:49

A few years ago I picked up an online gig with a company that trained AI. You’d log in to your dashboard and be presented with questions you had to answer in the best way, such as ‘Is the earth round?’. Well, it’s round in nature but is not perfectly round. So you’d have to pick the best solution from the answer list. It was interesting, but tedious. It put taters on the table, so I got that going for me…which is nice.

interdimensionalmeme@lemmy.ml on 03 Jul 03:57

To punish you for trying to protect yourself.
To extract micro-labour out of you (AI training).
To discourage you from privacy best practices.

Oh BTW, the captcha will eventually contain unblockable ads.

Deebster@infosec.pub on 01 Jul 18:33

FYI, you’ve added a link where the label is the URL and the actual link is empty. You can fix this by removing the [ and ]() around the link. If the link is there as plain text, it gets a hyperlink automatically: arstechnica.com/…/pay-up-or-stop-scraping-cloudfl…

3dcadmin@lemmy.relayeasy.com on 02 Jul 16:14

It was minding its own business and adding them… lol 😀