Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.
I didn't see this mentioned: for all the old non-cached stuff, would it help to throttle it by like 5-10 seconds per request?
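To make the throttling idea concrete, here is a rough sketch of what I have in mind, not how any real wiki engine does it. The `serve` function, the dict-based `cache`, and the `render` callback are all made-up names for illustration, and storing the page after rendering is my own assumption. Cached pages return right away; anything that misses the cache waits a fixed delay before the expensive render runs.

```python
import time
from typing import Callable, Dict


def serve(
    path: str,
    cache: Dict[str, str],
    render: Callable[[str], str],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the page body, delaying only requests that miss the cache.

    All names here are hypothetical; a real site would do this in its
    reverse proxy or application middleware.
    """
    # Cache hit: popular pages stay fast for normal readers.
    if path in cache:
        return cache[path]

    # Cache miss: old or rarely visited page, so slow the request down
    # before doing the expensive render.
    sleep(delay_seconds)
    body = render(path)

    # Assumption: keep the result so a repeat request is not delayed again.
    cache[path] = body
    return body
```

A blocking sleep like this ties up a worker for the whole delay, so in practice the wait would probably need to happen somewhere cheap (async handler or the proxy layer), otherwise the throttle itself could become the bottleneck.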
What we need is a system that detects scrapers and feeds them an alternative, poisoned data set.