If you commit code that was not written by yourself, double check that the license on that code permits import into the NetBSD source repository, and permits free distribution. Check with the author(s) of the code, make sure that they were the sole author of the code and verify with them that they did not copy any other code.
Code generated by a large language model or similar technology, such as GitHub/Microsoft's Copilot, OpenAI's ChatGPT, or Facebook/Meta's Code Llama, is presumed to be tainted code, and must not be committed without prior written approval by core.
No, it is not reasonable to presume code generated by any large language model is "tainted code." What does that even mean? It sounds like a Weird Al parody of the song "Tainted Love."
“Taint” has been a term of art in Open Source for decades. That you don’t know this reveals your ignorance, not any sort of cleverness.
LLMs regurgitate their training data. If they’re generating code, they’re not modeling the syntax of a language to solve a problem, they’re reproducing code they ingested, code that is covered by copyright. Just regurgitating that code via an LLM rather than directly from your editor’s clipboard does not somehow remove that copyright.
It’s clear you think you should be allowed to use LLMs to do whatever you want. Fortunately there are smarter people than you out there who recognize that there are situations where their use is not advised.
You've unfortunately been breaking the site guidelines quite a lot already, and not only in this thread. Would you mind reviewing them and using HN as intended? We'd be grateful:
The big difference between people reading code and LLMs reading code is that people have legal liability and LLMs do not. You can't sue an LLM for copyright infringement, and it's almost impossible for users to tell when it happens.
BTW in 2023 I watched ChatGPT spit out hundreds of lines of F# verbatim from my own GitHub. A lot of people had this experience with GitHub Copilot. "98.7% unique" is still a lot of infringement.
If you commission art from an artist who paints a modified copy of Warhol's work, the artist is liable (even if you keep that work private, for personal use).
If you commission it from OpenAI (by sending a query to their ChatGPT API), by your argument, you are the person liable — and OpenAI is off the hook even if that work is distributed further.
I'm not going to argue about the merits of creativity here, or that someone putting a prompt into ChatGPT considers themselves an artist.
That's irrelevant. The work is created on OpenAI servers, by the LLMs hosted there, and is then distributed to whoever wrote the prompt.
Models run locally are distributed by whoever trained them.
If you train a model on whatever data you legally have access to, and produce something for yourself, it's one thing.
Distribution is where things start to get different.
> If you commission it from OpenAI (by sending a query to their ChatGPT API), by your argument, you are the person liable — and OpenAI is off the hook even if that work is distributed further.
Let's distinguish two different scenarios here:
1) Your prompt is copyright-free, but the LLM produces a significant amount of copyrighted content verbatim. Then the LLM is liable, and you too are liable if you redistribute it.
2) Your prompt contains copyrighted data, and the LLM transforms it, and you distribute it. Then if the transformation is not sufficient, you are liable for redistributing it.
The second example is what I'm referring to, since the commercial LLMs are now very good about not reproducing copyrighted content verbatim. And yes, from everything I understand, OpenAI is legally off the hook.
Your example of commissioning an artist is different from LLMs, because the artist is legally responsible for the product and is selling the result to you as a creative human work, whereas an LLM is a software tool and the company is selling access to it. So the better analogy is renting a Xerox copier to copy something by Warhol. Xerox is not liable if you try to redistribute that copy. But you are. So here, Xerox=OpenAI. They are not liable for your copyrighted inputs turning into copyrighted outputs.
>So the better analogy is if you rent a Xerox copier to copy something by Warhol
It isn't.
One analogy in that case would be going to a FedEx copy center and asking the technician to produce a bunch of copies of something.
They absolve themselves of liability by having you sign a waiver certifying that you have complete rights to the data that serves as input to the machine.
In case of LLMs, that includes the entire training set.
The most salient difference is that it's impossible to tell if an LLM is plagiarizing, whereas Xeroxing something implies specific intent to copy. It makes no sense to push liability onto LLM users.
Are you following the distinction between my scenarios (1) and (2)?
In scenario (1) the LLM is plagiarizing. But that's not the scenario we're discussing. And I already said, this is where the LLM is liable. Whether a user should be too is a different question.
But scenario (2) is what I'm discussing, as I already explained, and it's very possible to tell, because you yourself submitted the copyrighted content. All you need to do is look at whether the output is too similar to the input.
If there's some scenario where you input copyrighted material and it transforms it into different material that is also copyrighted by someone else... that is a pretty unlikely edge case.
One problem with this is that there isn't really a "current prompt" that completely describes the current source code; each source file is accompanied by a full chat log, including false starts and misunderstandings. It's sort of like reading a git history instead of the actual file.
> each source file is accompanied by a full chat log, including false starts and misunderstandings. It's sort of like reading a git history instead of the actual file.
My Git history contains links between the false starts and misunderstandings and the corrections, which then also include a paragraph on why this was a misunderstanding or false start. It is a lot better than just a single linear log from LLMs.
true, but that just means that's the problem to solve. probably the ideal architecture isn't possible right now. But I sorta imagine that you could later on take the full transcript of that conversation and expect any LLM to implement more or less the same thing based on it, so that eventually it becomes a full 'spec'.
And maybe there is a way to trim the parts out of it that are not needed... like to automatically produce an initial prompt which is equivalent to the results of a longer session, but is precise enough so as to not need clarification upon reprocessing it. Something like that? I'm not sure if that's something that already exists.
> But I sorta imagine that you could later on take the full transcript of that conversation and expect any LLM to implement more or less the same thing based on it
Why would you think this though? There are an infinite number of programs that can satisfy any non-trivial spec.
We have theoretical solutions to LLM non-determinism, but we have no theoretical solutions to prompt instability, especially when we can’t even measure what correct is.
yeah but all of the infinite programs are valid if they satisfy the spec (well, within reason). That's kinda the point. Implementation details like how the code is structured or what language it's in are swept under the rug, akin to how today you don't really care what register layout the compiler chooses for some code.
There has never been a non-trivial program in the history of the world that could just “sweep all the implementation details under the rug”.
Compilers use rigorous modeling to guarantee semantic equivalence, and that is only possible because they are translating between formal languages.
A natural language spec can never be precise enough to specify all possible observable behaviors, so your bot swarm trying to satisfy the spec is guaranteed to constantly change observable behaviors.
This gets exposed to users as churn, jank, and workflow-breaking bugs.
I am a jazz guitarist and am sympathetic to this comment: the way I tune my guitar these days is hitting an E tuning fork, playing a particular E7 chord, and deciding if it sounds good:
e --0--
B --0--
G --7--
D --6--
A --7--
E --0--
Learned it from Jimmy Bruno. I despise digital tuners. However it is worth noting: a properly-tuned guitar will never be able to play a “barbershop seventh,” which hits the natural harmonic dominant 7th and is so flat compared to TET that it’s really almost a 6th. The chord itself sounds more bittersweet and less “funky” than a TET dominant 7th. OTOH the TET chord is an essential part of modern blues-influenced music: being “out of tune” makes the chord sharp and strong, almost like a blue cheese being “moldy.” So I’m not beaten up about the limitations, it’s just worth keeping in mind: no instrument can beat a group of human voices.
In general your ears do not hear these little arithmetical games around mismatched harmonies. They hear things like “this chord sounds warm and a little sad, this one is bright and fun.”
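For the curious, the gap between the "barbershop" seventh and the equal-tempered one is easy to quantify in cents (1200 per octave). A quick sketch in Python (the variable names are mine, not from the thread):

```python
import math

def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents per octave)."""
    return 1200 * math.log2(ratio)

harmonic_7th = cents(7 / 4)  # the 7:4 "barbershop" seventh from the harmonic series
tet_minor_7th = 1000         # 10 equal-tempered semitones
tet_major_6th = 900          # 9 equal-tempered semitones

print(f"harmonic 7th: {harmonic_7th:.1f} cents")                           # ~968.8
print(f"flat of the TET minor 7th by {tet_minor_7th - harmonic_7th:.1f}")  # ~31.2
print(f"sharp of the TET major 6th by {harmonic_7th - tet_major_6th:.1f}") # ~68.8
```

The 7:4 partial comes in roughly 31 cents under the equal-tempered minor seventh, which is the flatness being described; a fixed-fret (or properly tuned) guitar simply can't reach it.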
Twelve of the strings on a sitar have equal (thin) diameter but different lengths, so they can be tuned to the 12 notes of the scale. These are unplayed strings that contribute to the sound by resonating underneath the main course of strings, which are the ones fretted and played.
That's so endearing I guess that's why they call them sympathetic strings ;)
I think the most salient point about Factorio here is that its CPU-side native core was largely hammered out by 2018, most of the development since then has been in Lua or GPU-side. The devs could be quite confident their code didn't have any unhandled null pointers. That's not really the case for Chromium or (God help us) WebKit.
I imagine making a buggy and unmaintainable version could be done quickly, sure, if you don't mind your documents being killed by a thousand small typesetting cuts. TeX is incredibly complicated for good reasons, people should read Knuth's book.
The reason TeX is written in a 1984 dialect of Pascal is that the typesetting bugs have been solved in a completely specified language; it is much easier to write a transpiler for Pascal->C than to rewrite TeX. Asking an LLM to rewrite it in the language-du-jour is a huge cost for very little benefit.
BTW it has been so depressing in the last few months to see LLM-generated projects make claims about performance/accuracy when there is no benchmarking code on GitHub and the "thousands of tests" are all useless happy paths. I am sure we will see some grifter claim that Claude rewrote TeX, and I am sure dozens of credulous HN users will take it seriously. But we won't see a useful rewrite. It'll be resume-oriented slop like that dishonest Mathematica-in-Rust project we saw last week.
> it is much easier to write a transpiler for Pascal->C than to rewrite TeX. Asking an LLM to rewrite it in the language-du-jour ...
I thought that the combination of the Pascal and Java versions[1] of TeX would be sufficient guidance to produce another language/implementation.
> is a huge cost for very little benefit
A greenfield Java implementation with an MIT license would have been useful[2] for rendering TeX inside of my desktop Markdown editor[3]. Instead, I had to rename all the Java source files to abide by the NTSPL license terms (or GPLv2, which is viral).
> A greenfield Java implementation with an MIT license would have been useful[2] for rendering TeX inside of my desktop Markdown editor[3]. Instead, I had to rename all the Java source files to abide by the NTSPL license terms (or GPLv2, which is viral).
The source files make it look like DANTE owns the copyright, so you could try asking them to relicence it. Both Philip Taylor and Hans Hagen were involved in the leadership of NTS, and both are still active, so if they are okay with it, then DANTE would hopefully agree to relicence it.
> then DANTE would hopefully agree to relicence it.
In Feb 2023, when I emailed Hans about changing licenses, he wrote back:
> We decided to stick with the GNU (GLP) license. It's not like anyone is going to check in detail what happens with NTS after all these years. We just wanted to add the option for GPLv2. We're not going into endless debates about licences, which are always a sensitive topic in the tex community.
As another commenter points out, this sort of exists in the broader sense of "interactive fiction." There's lots of options for scaffolding an LLM into something like a storytelling toy.
But I truly believe an AI-generated equivalent of Zork or Lost Pig is decades away. The "knowing what it is talking about" problem is not even close to being solved, no matter how impressive coding agents have become. It is simply too easy to sabotage clever game design with accidentally adversarial prompting.
More generally: LLMs are still unfunny except in cases of clear plagiarism. This also extends to making fun games.
FWIW it's not just about money, it's about controlling creative work. E.g. Radiohead really does not want ICE to use their music for fascist propaganda, at any cost.
I really don't like how the discussion on HN always ignores the ways copyright protects individual expression as a fundamental right. Instead we're STEM dorks, focusing on how getting rid of copyright protection lets us increase content volume at the entertainment factory.