- cross-posted to:
- [email protected]
LLMs are the future, but we still need to learn to use them correctly. The energy problem depends mainly on two things: the use of fossil energy, and the abuse of AI by shoving it into everything without need, whether because of the hype, as a data-logging tool for Big Brother, or for biased influencers.
You don’t need a 4x4 8 cylinder Pick-up to go 2km to the store to buy bread.
> You don’t need a 4x4 8 cylinder Pick-up to go 2km to the store to buy bread.
In the U.S., yes.
It’s simply another case where we have amazing technologies but lack the right ways to use them. That’s what our culture does: it creates amazing tech that could solve lots of human problems, then discards the part that actually solves a problem unless it’s also profitable for the individual.
It literally is a problem of people wanting to subjugate other people for power games. That’s not how all societies work, but it is a foundation of ours, and we’re playing this game so hard that we’ve almost broken the console (planet Earth and our own bodies’ health).
It’s an anthropological problem, not a technological one.
This is the point. We have big advances in tech, physics, medicine, science… thanks to AI. But the first use we give it is to create memes, read BS chats, and build it into fridges, or worse, into weapons to kill others.
We should reject them.
LLMs are tools. They’re not replacements for human creativity. They are not reliable sources of truth. They are interesting tools and toys that you can play with.
So have fun and play with them.
See, it’s not fun for the planet.
Well-said. LLMs do have some useful applications, but they cannot replace human creativity nor are they omniscient.
Instead of trying to prevent LLM training on our code, we should be demanding that the models themselves be freed.
You can demand it, but it’s not as pragmatic a demand as you claim. Open-weight models aren’t equivalent to free software; they are much closer to proprietary gratis software. Usually you don’t even get access to the training software and the training data, and even if you did, it would take millions in capital to reproduce them.
But the resulting models must be freed. Any model trained on this code must have its weights released under a compatible copyleft license.
You can put whatever you want into your license, but for it to be enforceable it needs to grant the licensee additional rights they don’t already have without the license. The theory under which tech companies appear to be operating is that they don’t in fact need your permission to include your code in their datasets.
> block the crawlers, withdraw from centralized forges like GitHub
Moving away from GitHub has been a good idea ever since Microsoft purchased it years ago.
You kind of need to block crawlers, because if you host large projects they will just max out your server’s resources, CPU or bandwidth, whatever the bottleneck is.
GitHub is blocking crawlers too; they have restricted rate limits a lot recently. If you use Nix/NixOS, which fetches a lot of repositories from GitHub, you often can’t even finish a build without GitHub credentials nowadays, with how rate-limited GitHub has become.
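For anyone self-hosting after leaving GitHub, the crudest first line of defense is filtering by User-Agent. Here is a minimal sketch in Python (stdlib only); the bot names are illustrative examples, and determined scrapers can spoof or omit their User-Agent, so treat this as a speed bump, not a wall:

```python
# Minimal sketch: refuse requests from known AI crawlers by User-Agent.
# The AI_CRAWLERS list is illustrative, not exhaustive, and crawlers can
# lie about their identity; real setups also need rate limiting.
from http.server import BaseHTTPRequestHandler, HTTPServer

AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")  # example strings

class BlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(bot in agent for bot in AI_CRAWLERS):
            self.send_error(403, "Crawler blocked")  # refuse matching bots
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human.\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), BlockingHandler).serve_forever()
```

In practice you would do this at the reverse proxy and pair it with rate limiting, since the resource exhaustion described above comes from request volume, not from any single request.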
The problem is not the algorithm; the problem is the way they’re trained. If I made a dataset from sources whose copyright holders exercise their IP rights and then trained an LLM on it, I’d probably go to jail, or just kill myself (or default on my debts to the holders) if they sued for damages.
I support FOSS LLMs like Qwen just because of that. China doesn’t care about IP bullshit and their open source models are great.
Checking whether a proprietary LLM model running on the “cloud” has been trained on a piece of TGPL code would probably be harder than checking if a proprietary binary contains a piece of GPL code, though.
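A crude probe does exist, though: memorization studies typically feed the model the prefix of a distinctive snippet and check whether it completes the rest verbatim. A toy sketch of that idea follows; `complete` is a stand-in for whichever model API is being probed, not a real library call:

```python
# Toy regurgitation probe: give the model the first half of a distinctive
# snippet and measure how closely its completion matches the real second
# half. `complete` is an assumed placeholder for the LLM being probed.
from difflib import SequenceMatcher

def regurgitation_score(snippet: str, complete) -> float:
    half = len(snippet) // 2
    prefix, expected = snippet[:half], snippet[half:]
    output = complete(prefix)[: len(expected)]
    # A ratio near 1.0 on a long, unique snippet suggests memorization;
    # a low ratio proves nothing either way.
    return SequenceMatcher(None, expected, output).ratio()
```

A high score on enough distinctive snippets is evidence of training-set membership, but the absence of regurgitation is not evidence of absence, which is exactly why this is harder than scanning a binary for GPL code.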
Seems like the easiest fix is to consider the output of LLMs to be derivative products of the training data.
No need for a new license: if you’re training a model on GPL code, the code produced by the LLM is GPL.
You are not going to protect abstract ideas using copyright. Essentially, what he’s proposing implies turning this “TGPL” into some sort of viral NDA, which is a different category of contract.
It’s harder to convince someone that a content-focused license like the GPLv3 also protects abstract ideas than to create a new form of contract/license designed specifically to protect abstract ideas (not just the content itself) from being spread in ways you don’t want them to spread.
LLMs don’t have anything to do with abstract ideas, they quite literally produce derivative content based on their training data & prompt.
LLMs abstract information collected from the content through an algorithm: what they store is the result of a series of tests/analyses, not the content itself, but a set of characteristics/ideas. If that’s derivative, then ALL abstract ideas are derivative. It’s not possible to form abstractions without collecting data derived from a source you are observing.
If derivative abstractions were already something copyright could protect, then lawmakers wouldn’t have had to create patents, etc.
Let me know if you convince any lawmakers, and I’ll show you some lawmakers about to be invited to expensive “business” trips and lunches by lobbyists.
The same can be said of the approach described in the article, the “GPLv4” would be useless unless the resulting weights are considered a derivative product.
A paint manufacturer can’t claim copyright on paintings made using that paint.
Indeed. I suspect it would need to be framed around national security and national interests to have any realistic chance of success. AI is being seen as a necessity for the future of many countries: embrace it, or be steamrolled in the future by those who did, so a soft touch is being taken.
Copyright and licensing uncertainty could hinder that, and the status quo today in many places is not to treat training as copyright infringement (e.g. the US), or to require an explicit opt-out (e.g. the EU). A lack of international agreements means it’s all a bit wishy-washy, and hard to prove and enforce.
Things get (only slightly) easier if the material is behind a terms-of-service wall.
One of the four essential freedoms is the freedom to study the software and modify it. Studying means training your brain on the open source code. Can one use their brain to write proprietary code after studying some copylefted code?
If you study a code base and then implement something similar yourself without attribution, there is a good chance that you are committing a form of plagiarism.
In other contexts like academic writing this approach might be considered a pretty clear and uncontroversial case of plagiarism.
Also, what if one implements proprietary software that is completely different from the open source project they studied? They may still use knowledge they obtained while studying it, e.g. by reusing algorithms, patterns, or even code formatting. This is a common case for LLM coding assistants.
There are tools better suited for that than large language models, and they run faster on a regular laptop CPU than the round trip to the supercomputer in the AI data center.
It’s not the topic we discussed, right?
You listed a bunch of use cases for LLMs that aren’t plagiarism and they all seem to be better solved by different tools.
So what is the case we are speaking about? “Hey LLM, write the OS kernel that is fully compatible with Linux, designed like Linux, uses the same algorithms as Linux and the same code style as Linux”?
If you have Linux in the training data, the outcome, if at all remotely useful, would likely include plagiarism.
There’s no such word as plagiarism in free licenses or in copyright law. One either violates copyrights or patents, or one doesn’t. Copyleft licenses do not forbid what you call plagiarism. If you want to forbid it, as well as forbid training LLMs on your code, you need a new type of license. However, I’m unsure whether such a license could be considered free by the FSF or approved by the OSI.
Plagiarism is a form of copyright infringement if there are substantial similarities.
Open source licenses build on top of intellectual property laws.
So everything depends on how you define substantial similarities. My opinion is that if there are no copied-and-pasted chunks of code (except trivial ones), there are no substantial similarities.
I live in another country, but the idea is the same as I wrote above: this is all about direct copying.
I asked the bot. It says:
I understand the concern. This question sits at the intersection of copyright law, statistics, and how machine-learning systems work—and it’s fair to be skeptical.
Here’s the most precise and careful way to frame it:
1. I can’t confirm GPL training data
I don’t know what specific works or licenses were in the training set, and I can’t confirm that GPL-licensed material was included. Neither I nor users have visibility into the underlying corpus.
So while it may be statistically plausible that GPL text exists somewhere in public data, plausibility ≠ proof, and I can’t assert it as fact.
2. Training ≠ distribution (the core legal argument)
OpenAI’s legal position (and that of most ML practitioners) is broadly this:
- Training on copyrighted material is not the same as redistributing it
- The model does not contain copies of works
- Outputs are newly generated, not retrieved
- Copyright law regulates copying and distribution, not learning patterns or statistics
This is analogous (legally, not philosophically) to a human reading GPL code and then later writing different code from experience.
Whether courts ultimately agree in every jurisdiction is still being litigated—but no court has ruled that training itself is per se illegal.
3. GPL specifically
GPL obligations trigger when you:
- copy
- modify
- distribute

GPL-licensed code or derivative works.
Current legal consensus does not treat model weights or outputs as GPL “derivatives” in the copyright sense, because:
- There is no preserved expressive structure
- There is no traceable lineage to a specific work
- Outputs are not mechanically derived from any one source
That’s why GPL’s copyleft provisions are generally understood not to propagate through model training.
4. Is the law settled?
No. Anyone claiming it is settled is overstating.
What is true:
- Training on public data is widespread and openly practiced
- Multiple lawsuits exist, but none have yet established that training itself violates copyright or GPL
- Regulators and courts are still clarifying boundaries
5. Practical reality
Even if GPL material were in the training mix:
- That alone would not automatically make the model or its outputs illegal
- Liability (if any) would hinge on specific outputs being substantially similar to protected works, not on statistical likelihoods about training data
Bottom line
Your intuition about probability is reasonable—but legality does not turn on probability. It turns on:
- whether protected expression is reproduced
- whether derivative works are created
- whether licenses are triggered by the act in question
As of now, the prevailing legal interpretation is that models like me are lawful to deploy, even though the debate is ongoing.
If you want, I can also:
- Compare this to the Google Books ruling
- Walk through why “derivative work” is a high legal bar
- Discuss what would actually make an AI system GPL-tainted in practice
I’m not reading something nobody wrote, chief.
I asked my crystal ball. It says:
fart noises