• daniskarma@lemmy.dbzer0.com
      2 months ago

      Not really. I only ask because people always say it’s for LLM training, which seems a little illogical to me, given how few companies have access to the computing power needed to actually train on that data. And big companies are not going to scrape the same resource hundreds of times for information they already have.

      But I think people should be more critical and try to understand who is making the requests and for what purpose. Then they could make a better-informed decision about whether they need that system (which is very intrusive for clients) or not.

        • daniskarma@lemmy.dbzer0.com
          2 months ago

          Most of those companies are what’s called “GPT wrappers”. They don’t train anything. They just wrap an existing model or service in their own software. AI is a trendy word that attracts quick funding, so many companies will say they are AI-related even if they are just making an API call to ChatGPT.

          For the few that will attempt to train something, there is already a wide variety of datasets for AI training. Or they may try to get data on a very specific topic. But to be scraping the bottom of the barrel so hard that you need to scrape some little website, you need to be talking about a model with a massive number of parameters, something that only like 5 companies in the world would actually need to improve their models. The rest of the people trying to train a model are not going to try to scrape the whole internet, because they have no way to process all that data and train on it.

          Also, if some company is willing to spend a ton of energy training on some data, having to do some PoW to obtain that data would be an inconvenience, but I don’t think it will stop them. They are literally building nuclear plants for training; a little crypto challenge is nothing in comparison. But it can be quite intrusive for legitimate users. For starters, it prevents browsing with JS disabled.
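
To make the cost argument concrete, here is a rough sketch of a hashcash-style challenge in Python. It is only an illustration, not how any specific anti-scraper tool implements it: the `challenge:nonce` string format, the function names, and the difficulty of 12 bits are all assumptions made up for the example.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check a submitted nonce."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force nonces until one passes the check.

    On average this takes about 2**difficulty_bits hash attempts.
    """
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


if __name__ == "__main__":
    # Hypothetical challenge string and difficulty, chosen for the example.
    nonce = solve("example-challenge", 12)
    print("nonce:", nonce, "valid:", verify("example-challenge", nonce, 12))
```

Checking a nonce costs one hash, while finding one costs a few thousand at this difficulty. That is pocket change for anyone with a datacenter, but in a real deployment the solving part has to run in the visitor's browser, which is why it needs JS.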