- cross-posted to:
- [email protected]
LLMs are the future, but we still need to learn to use them correctly. The energy problem depends mainly on two things: the use of fossil energy, and the abuse of AI by shoving it into everything without need, whether because of the hype, as a data-logging tool for Big Brother, or for biased influencers.
You don’t need a 4x4 8 cylinder Pick-up to go 2km to the store to buy bread.
> You don’t need a 4x4 8 cylinder Pick-up to go 2km to the store to buy bread.
In the U.S., yes.
It’s simply another case where we have amazing technologies but lack the right ways to use them. That’s what our culture does: it creates amazing tech that could solve lots of human problems, then discards the part that actually solves a problem unless it’s also profitable for the individual.
It literally is a problem of people wanting to subjugate other people for power games. That’s not how all societies work, but it is a foundation of ours, and we’re playing this game so hard that we’ve almost broken the console (planet Earth and our own bodies’ health).
It’s an anthropological problem, not a technological one.
This is the point. We have big advances in tech, physics, medicine, science… thanks to AI. But the first use we give it is to create memes, read BS chats, and build it into fridges, or worse, into weapons to kill others.
We should reject them.
LLMs are tools. They’re not replacements for human creativity. They are not reliable sources of truth. They are interesting tools and toys that you can play with.
So have fun and play with them.
See, it’s not fun for the planet.
Well-said. LLMs do have some useful applications, but they cannot replace human creativity nor are they omniscient.
Instead of trying to prevent LLM training on our code, we should be demanding that the models themselves be freed.
You can demand it, but it’s not as pragmatic a demand as you claim. Open-weight models aren’t equivalent to free software; they are much closer to proprietary gratis software. Usually you don’t even get access to the training software and the training data, and even if you did, it would take millions in capital to reproduce them.
But the resulting models must be freed. Any model trained on this code must have its weights released under a compatible copyleft license.
You can put whatever you want into your license, but for it to be enforceable it needs to grant the licensee additional rights they don’t already have without the license. The theory under which tech companies appear to be operating is that they don’t in fact need your permission to include your code in their datasets.
> block the crawlers, withdraw from centralized forges like GitHub
Moving away from GitHub has been a good idea ever since Microsoft purchased it years ago.
You kind of need to block crawlers, because if you host large projects they will just max out your server’s resources, CPU or bandwidth, whatever the bottleneck is.
GitHub is blocking crawlers too; they have restricted rate limits a lot recently. If you use Nix/NixOS, which fetches a lot of repositories from GitHub, you often can’t even finish a build without GitHub credentials nowadays, with how rate-limited GitHub has become.
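For anyone self-hosting after leaving GitHub, the crudest first line of defense is filtering by User-Agent. Here is a minimal sketch in Python (stdlib only); the bot names are illustrative examples, and determined scrapers can spoof or omit their User-Agent, so treat this as a speed bump, not a wall:

```python
# Minimal sketch: refuse requests from known AI crawlers by User-Agent.
# The AI_CRAWLERS list is illustrative, not exhaustive, and crawlers can
# lie about their identity; real setups also need rate limiting.
from http.server import BaseHTTPRequestHandler, HTTPServer

AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # example strings

class BlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(bot in agent for bot in AI_CRAWLERS):
            self.send_error(403, "Crawler blocked")  # refuse matching bots
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human.\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), BlockingHandler).serve_forever()
```

In practice you would do this at the reverse proxy and pair it with rate limiting, since the resource exhaustion described above comes from request volume, not from any single request.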
The problem is not the algorithm; the problem is the way they’re trained. If I made a dataset from sources whose copyright holders exercise their IP rights and then trained an LLM on it, I’d probably go to jail, or just kill myself (or default on my debts to the holders) if they sued for damages.
I support FOSS LLMs like Qwen just because of that. China doesn’t care about IP bullshit and their open source models are great.
Checking whether a proprietary LLM model running on the “cloud” has been trained on a piece of TGPL code would probably be harder than checking if a proprietary binary contains a piece of GPL code, though.
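A crude probe does exist, though: memorization studies typically feed the model the prefix of a distinctive snippet and check whether it completes the rest verbatim. A toy sketch of that idea follows; `complete` is a stand-in for whichever model API is being probed, not a real library call:

```python
# Toy regurgitation probe: give the model the first half of a distinctive
# snippet and measure how closely its completion matches the real second
# half. `complete` is an assumed placeholder for the LLM being probed.
from difflib import SequenceMatcher

def regurgitation_score(snippet: str, complete) -> float:
    half = len(snippet) // 2
    prefix, expected = snippet[:half], snippet[half:]
    output = complete(prefix)[: len(expected)]
    # A ratio near 1.0 on a long, unique snippet suggests memorization;
    # a low ratio proves nothing either way.
    return SequenceMatcher(None, expected, output).ratio()
```

A high score on enough distinctive snippets is evidence of training-set membership, but the absence of regurgitation is not evidence of absence, which is exactly why this is harder than scanning a binary for GPL code.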
Seems like the easiest fix is to consider the output of LLMs to be derivative products of the training data.
No need for a new license: if you’re training a model on GPL code, the code produced by the LLM is GPL.
You are not going to protect abstract ideas using copyright. Essentially, what he’s proposing implies turning this “TGPL” into some sort of viral NDA, which is a different category of contract.
It’s harder to convince someone that a content-focused license like the GPLv3 also protects abstract ideas than to create a new form of contract/license designed specifically to protect abstract ideas (not just the content itself) from being spread in ways you don’t want them to spread.
LLMs don’t have anything to do with abstract ideas, they quite literally produce derivative content based on their training data & prompt.
LLMs abstract information collected from the content through an algorithm: what they store is the result of a series of tests/analyses, not the content itself, but a set of characteristics/ideas. If that’s derivative, then ALL abstract ideas are derivative. It’s not possible to form abstractions without collecting data derived from a source you are observing.
If derivative abstractions were already something copyright could protect, then lawmakers wouldn’t have had to create patents, etc.
Let me know if you convince any lawmakers, and I’ll show you some lawmakers about to be invited to expensive “business” trips and lunches by lobbyists.
The same can be said of the approach described in the article, the “GPLv4” would be useless unless the resulting weights are considered a derivative product.
A paint manufacturer can’t claim copyright on paintings made using that paint.
Indeed. I suspect it would need to be framed around national security and national interests to have any realistic chance of success. AI is being seen as a necessity for the future of many countries: embrace it, or be steamrolled in the future by those who did, so a soft touch is being taken.
Copyright and licensing uncertainty could hinder that, and the status quo today in many places is not to treat training as copyright infringement (e.g. the US), or to require an explicit opt-out (e.g. the EU). A lack of international agreements means it’s all a bit wishy-washy, and hard to prove and enforce.
Things get (only slightly) easier if the material is behind a terms-of-service wall.
One of the four essential freedoms is the freedom to study the software and modify it. Studying means training your brain on the open source code. Can one use their brain to write proprietary code after studying some copylefted code?
If you study a code base and then implement something similar yourself without attribution, there is a good chance that you are committing a form of plagiarism.
In other contexts like academic writing this approach might be considered a pretty clear and uncontroversial case of plagiarism.
Also, what if one implements proprietary software that is completely different from the open source project they studied? They may still use knowledge they obtained while studying it, e.g. by reusing algorithms, patterns, or even code formatting. This is a common case for LLM coding assistants.
There are tools better suited for that than large language models, and they run faster on a regular laptop CPU than the round trip to the supercomputer in the AI data center.
It’s not the topic we discussed, right?
You listed a bunch of use cases for LLMs that aren’t plagiarism and they all seem to be better solved by different tools.
So what is the case we are speaking about? “Hey LLM, write the OS kernel that is fully compatible with Linux, designed like Linux, uses the same algorithms as Linux and the same code style as Linux”?
If you have Linux in the training data, the outcome, if at all remotely useful, would likely include plagiarism.
There’s no such word as plagiarism in free licenses or in copyright law. One either violates copyrights or patents, or one doesn’t. Copyleft licenses do not forbid what you call plagiarism. If you want to forbid it, as well as forbid training LLMs on your code, you need a new type of license. However, I’m unsure whether such a license could be considered free by the FSF or approved by the OSI.
Plagiarism is a form of copyright infringement if there are substantial similarities.
Open source licenses build on top of intellectual property laws.
So everything depends on how you define substantial similarities. My opinion is that if there are no copied-and-pasted chunks of code (except trivial ones), there are no substantial similarities.
I live in another country, but the idea is the same as I wrote above: this is all about direct copying.
I asked the bot. It says:
I understand the concern. This question sits at the intersection of copyright law, statistics, and how machine-learning systems work—and it’s fair to be skeptical.
Here’s the most precise and careful way to frame it:
1. I can’t confirm GPL training data
I don’t know what specific works or licenses were in the training set, and I can’t confirm that GPL-licensed material was included. Neither I nor users have visibility into the underlying corpus.
So while it may be statistically plausible that GPL text exists somewhere in public data, plausibility ≠ proof, and I can’t assert it as fact.
2. Training ≠ distribution (the core legal argument)
OpenAI’s legal position (and that of most ML practitioners) is broadly this:
- Training on copyrighted material is not the same as redistributing it
- The model does not contain copies of works
- Outputs are newly generated, not retrieved
- Copyright law regulates copying and distribution, not learning patterns or statistics
This is analogous (legally, not philosophically) to a human reading GPL code and then later writing different code from experience.
Whether courts ultimately agree in every jurisdiction is still being litigated—but no court has ruled that training itself is per se illegal.
3. GPL specifically
GPL obligations trigger when you:
- copy
- modify
- distribute

GPL-licensed code or derivative works.
Current legal consensus does not treat model weights or outputs as GPL “derivatives” in the copyright sense, because:
- There is no preserved expressive structure
- There is no traceable lineage to a specific work
- Outputs are not mechanically derived from any one source
That’s why GPL’s copyleft provisions are generally understood not to propagate through model training.
4. Is the law settled?
No. Anyone claiming it is settled is overstating.
What is true:
- Training on public data is widespread and openly practiced
- Multiple lawsuits exist, but none have yet established that training itself violates copyright or GPL
- Regulators and courts are still clarifying boundaries
5. Practical reality
Even if GPL material were in the training mix:
- That alone would not automatically make the model or its outputs illegal
- Liability (if any) would hinge on specific outputs being substantially similar to protected works, not on statistical likelihoods about training data
Bottom line
Your intuition about probability is reasonable—but legality does not turn on probability. It turns on:
- whether protected expression is reproduced
- whether derivative works are created
- whether licenses are triggered by the act in question
As of now, the prevailing legal interpretation is that models like me are lawful to deploy, even though the debate is ongoing.
If you want, I can also:
- Compare this to the Google Books ruling
- Walk through why “derivative work” is a high legal bar
- Discuss what would actually make an AI system GPL-tainted in practice
I’m not reading something nobody wrote, chief.
I asked my crystal ball. It says:
fart noises