lemmydividebyzero@reddthat.com to Technology@beehaw.org · English · 10 hours ago

Aggressive AI scrapers are making it kinda suck to run wikis

weirdgloop.org

Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.
  • OptimusPrimeDownfall@discuss.tchncs.de · English · 4 upvotes · 7 hours ago

    The problem is that you can’t block all scraping. The scrapers make their bots look like regular traffic, so even if you block all known scrapers, there will be tons that just look like humans visiting your site.

    • delmain@beehaw.org · 1 upvote · 5 hours ago

      Collaborative list of basic IP blocks of known scraper hosts.

      • Steve@startrek.website · 2 upvotes · 4 hours ago

        They spoof their IP now.

        • Hazelnoot [she/her]@beehaw.org · English · 1 upvote · 2 hours ago

          IP addresses can’t really be spoofed, but there are other issues that make IP-based filtering impractical (VPNs, IPv6, malicious reporting, shared IPs, NAT, etc.).

      • OptimusPrimeDownfall@discuss.tchncs.de · English · 1 upvote · 4 hours ago

        Changing IPs is very easy.
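
For illustration, here is a minimal Python sketch of the kind of IP-range blocklist check discussed in this thread. It is only a sketch under stated assumptions: the ranges are reserved documentation prefixes standing in for real entries, and `BLOCKED_RANGES` and `is_blocked` are hypothetical names, not part of any existing list or tool.

```python
import ipaddress

# Placeholder entries: these are reserved documentation ranges,
# NOT real scraper hosts. A shared list would supply the real ones.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any listed range."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when the address and network differ in
    # IP version, so IPv4 and IPv6 entries can share one list.
    return any(addr in net for net in BLOCKED_RANGES)
```

A match only says that the address sits in a listed range. It cannot tell a scraper from a human behind the same NAT or VPN exit, and a scraper that moves to an unlisted address passes straight through, which is the limitation raised in the replies above.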
