One of the four essential freedoms is the freedom to study the software and modify it. Studying means training your brain on the open source code. Can one use their brain to write proprietary code after they have studied some copylefted code?
If you study a code base and then implement something similar yourself without attribution, there is a good chance that you are committing a form of plagiarism.
In other contexts, such as academic writing, this approach might be considered a pretty clear and uncontroversial case of plagiarism.
Also, what if one implements proprietary software that is completely different from the open source project they studied? They may still use knowledge they obtained while studying it, e.g. by reusing algorithms, patterns, or even code formatting. This is a common case for LLM coding assistants.
There are tools better suited for that than large language models, and they run faster on a regular laptop CPU than the round trip to a supercomputer in an AI data center.
That’s not the topic we were discussing, right?
You listed a bunch of use cases for LLMs that aren’t plagiarism, and they all seem to be better solved by different tools.
So what is the case we are speaking about? “Hey LLM, write an OS kernel that is fully compatible with Linux, designed like Linux, using the same algorithms as Linux and the same code style as Linux”?
If you have Linux in the training data, the outcome, if it is at all remotely useful, would likely include plagiarism.
Are there similar cases in the wild?
There is no such word as “plagiarism” in free licenses or in copyright law. One either violates copyrights or patents, or one does not. Copyleft licenses do not forbid what you call plagiarism. If you want to forbid that, as well as training LLMs on your code, you need a new type of license. However, I’m unsure whether such a license could be considered free by the FSF or approved by the OSI.
Plagiarism is a form of copyright infringement if there is substantial similarity.
Open source licenses build on top of intellectual property laws.
So, everything depends on how you define substantial similarity. My opinion is that if there are no copy-and-pasted chunks of code (except for trivial ones), there is no substantial similarity.
https://en.wikipedia.org/wiki/Substantial_similarity
I live in another country; however, the idea is the same as what I wrote above: it is all about direct copying.
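To make the “direct copying” reading above concrete, here is a rough, purely illustrative sketch in Python (my own addition, not anything taken from the licenses or the law): it treats “copy-and-pasted chunks” as long runs of tokens shared verbatim between two sources, which is roughly what simple code-clone detectors measure.

```python
# Purely illustrative sketch (not a legal test): approximate "direct copying"
# as the longest run of tokens that two sources share verbatim.
import difflib
import re


def tokenize(source):
    # Naive tokenizer: identifiers, numbers, and single non-space characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def longest_shared_run(a, b):
    """Length, in tokens, of the longest chunk that appears verbatim in both sources."""
    ta, tb = tokenize(a), tokenize(b)
    matcher = difflib.SequenceMatcher(None, ta, tb, autojunk=False)
    return matcher.find_longest_match(0, len(ta), 0, len(tb)).size


original = "for (i = 0; i < n; i++) { sum += a[i]; }"
rewrite = "for (j = 0; j < len; j++) { total += v[j]; }"

# A short shared run like this is the "trivial" case from the discussion;
# where trivial ends and substantial begins is a judgment call, not a constant.
print(longest_shared_run(original, rewrite))
```

The sketch only measures literal overlap, not derivation or studied knowledge; where the threshold between a trivial idiom and a substantial chunk lies is exactly the judgment call the thread is arguing about.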