The second part here is problematic, but fascinating: "I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code." The problem: Claude was almost certainly trained on the original LGPL/GPL code. It knows that code is how to solve the problem. It's dubious whether Claude can ignore whatever imprints that original code made on its weights. If it COULD do that, that would be a pretty cool innovation in explainable AI. But AFAIK LLMs can't even reliably trace what data influenced the output for a query (see https://iftenney.github.io/projects/tda/), or even fully unlearn a piece of training data.
Is anyone working on this? I'd be very interested to discuss.
Some background - I'm a developer & IP lawyer - my undergrad thesis was "Copyright in the Digital Age" and discussed copyleft & FOSS. Been litigating in federal court since 2010 and training AI models since 2019, and am working on an AI for litigation platform. These are evolving issues in US courts.
BTW if you're on enterprise or a paid API plan, Anthropic indemnifies you if its outputs violate copyright. But if you're on free/pro/max, the terms state that YOU agree to indemnify THEM for copyright violation claims.[0]
[0] https://www.anthropic.com/legal/consumer-terms - see para. 11 ("YOU AGREE TO INDEMNIFY AND HOLD HARMLESS THE ANTHROPIC PARTIES FROM AND AGAINST ANY AND ALL LIABILITIES, CLAIMS, DAMAGES, EXPENSES (INCLUDING REASONABLE ATTORNEYS’ FEES AND COSTS), AND OTHER LOSSES ARISING OUT OF … YOUR ACCESS TO, USE OF, OR ALLEGED USE OF THE SERVICES ….")
Also the maintainer's ground-up rewrite argument is very flimsy when they used chardet's test-data and freely admit to:
> I've been the primary maintainer and contributor to this project for >12 years
> I have had extensive exposure to the original codebase: I've been maintaining it for over a decade. A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.
> I reviewed, tested, and iterated on every piece of the result using Claude.
> I was deeply involved in designing, reviewing, and iterating on every aspect of it.
There was a paper that proposed a content-based hashing mask for training.
The idea: pick a window size, maybe 32 tokens. Hash the window's content into a seed for a pseudo-random number generator, then generate a number in the range 0..1 for each token in the window and compare it against a threshold. Don't count the loss for any token whose number is above the threshold.
It learns well enough because you get the gist of reading the meaning of something when the occasional word is missing, especially if you are learning the same thing expressed many ways.
It can't learn verbatim, however. Anything it fills in will be semantically similar, but different enough to push any direct quoting onto another path after just a few words.
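The comment doesn't name the paper, so here's a minimal sketch of the scheme as described; the window size, hash function, and threshold are all illustrative assumptions:

```python
# Content-based hashing mask: deterministic, content-derived dropout of
# loss terms. Because the RNG seed comes from the window's content, the
# SAME text always yields the SAME mask, so masked-out tokens are never
# trained on verbatim no matter how often the text recurs in the corpus.
import hashlib
import random

def loss_mask(tokens, window=32, threshold=0.8):
    """Return a 0/1 mask per token; tokens with mask 0 are excluded
    from the training loss."""
    mask = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        # Hash the window's content into a seed for a PRNG.
        seed = int.from_bytes(
            hashlib.sha256(" ".join(map(str, chunk)).encode()).digest()[:8],
            "big",
        )
        rng = random.Random(seed)
        # One draw in [0, 1) per token; drop tokens above the threshold.
        for _ in chunk:
            mask.append(1 if rng.random() <= threshold else 0)
    return mask

toks = "the quick brown fox jumps over the lazy dog".split()
assert loss_mask(toks) == loss_mask(toks)  # same content, same mask
```

With a threshold of 0.8, roughly 20% of tokens in any given passage are permanently invisible to the loss, which is the property that frustrates verbatim memorization.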
> you get the gist of reading the meaning of something when the occasional word is missing,
I think it's more subtle than that. IIUC all the tokens were present for the purpose of computing the output, and the score is based on the output. It's only in the weight update that some of the tokens get ignored. So the learning is lossy, but the inference driving the learning is not.
Rather than a book that's missing words it's more like a person with a minor learning disability that prevents him from recalling anything perfectly.
However it occurs to me that data augmentation could easily break the scheme if care isn't taken.
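The distinction above (full context in the forward pass, masking only in the loss) can be sketched in a few lines; this is a toy numpy stand-in, not any real framework's API:

```python
# Toy masked cross-entropy: logits are computed from the FULL, unmasked
# context; masked tokens merely contribute no loss term to the update.
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """logits: (T, V) from the full context; targets, mask: length T."""
    # Numerically stable softmax over the vocabulary dimension.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    per_token = -np.log(probs[np.arange(len(targets)), targets])
    # Masked tokens still shaped the logits; they just add no loss.
    return (per_token * mask).sum() / max(mask.sum(), 1)
```

So the model "reads" every token when predicting, but never gets a gradient signal telling it to reproduce the masked ones — which is why the learning is lossy while the inference driving it is not.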
Yeah, it's a bit hard to describe what is happening, because the process doesn't really have a human analogue.
People have a difficult enough time dealing with how loss-reduction learning is or isn't 'seeing' the data. Selectively removing things from the loss while still feeding it all the data takes the non-intuitive situation one layer deeper.
That's partially why I described the hash & masking process. I understand it from a formulaic approach, but I don't really feel like I have a good handle on what is happening semantically. It's like thinking in 5D: you can do the calculations, but it still feels like your brain is not equipped to deal with what it means.
"... If AI-generated code cannot be copyrighted (as the courts suggest) ".
So, the Supreme Court has said that AI-produced code cannot be copyrighted. (Am I right?) Then who's to blame if AI produces code, large portions of which already exist, coded and copyrighted by humans (or corporations)?
I assume it goes something like this:
A) If you distribute code produced by AI, YOU cannot claim copyright to it.
B) If you distribute code produced by AI, YOU CAN be held liable for distributing it.
SCOTUS hasn't ruled on any AI copyright cases yet. But they said in Feist v. Rural (1991) that copyright requires a minimum creative spark. The US Copyright Office maintains that human authorship is required for copyright, and the 9th Circuit in 2019 explicitly held that a non-human animal cannot hold a copyright.
Functionally speaking, AI is viewed as any machine tool. Using, say, Photoshop to draw an image doesn't make that image lose copyright, but nor does it imbue the resulting image with copyright. It's the creativity of the human use of the tool (or lack thereof) that creates copyright.
Whether or not AI-generated output a) infringes the copyright of its training data and b) if so, if it is fair use is not yet settled. There are several pending cases asking this question, and I don't think any of them have reached the appeals court stage yet, much less SCOTUS. But to be honest, there's a lot of evidence of LLMs being able to regurgitate training inputs verbatim that they're capable of infringing copyright (and a few cases have already found infringement in such scenarios), and given the 2023 Warhol decision, arguing that they're fair use is a very steep claim indeed.
The lack thereof (of human use). Prompts are not copyrightable, so neither is the output. Besides, retelling a story is fair use, right? Otherwise we should ban all generative AI and prepare for a Dune/Foundation future. But we're not there, and perhaps we never will be.
So LLM training first needs to be settled; then we can talk about whether retelling a whole software package infringes anyone's rights. And even if it does, there are no laws in place to chase it.
In practice the output of an LLM does not reveal what the prompt was, and the output varies randomly, so it is unlikely you would be sued for copying the prompt. In fact, you would not know what the original's prompt was, if any, unless you copied the prompt from somewhere.
The Supreme Court has not ruled on this issue. A lower court's ruling on this issue was appealed to the Supreme Court, but the Court declined to accept the case.
The Supreme Court has "original jurisdiction" over some types of cases, which means if someone brings such a case to them they have to accept it and rule on it, and it has "discretionary jurisdiction" over many more types of cases, which means if someone brings one of those it can choose whether or not to accept it. AI copyright cases fall under discretionary jurisdiction.
You generally cannot reliably infer what the Supreme Court thinks of the merits of a case when they decline to accept it, because they are often thinking big picture and longer term.
They might think a particular ruling is needed, but the particular case being appealed is not a good case to make that ruling on. They tend to want cases where the important issue is not tangled up in many other things, and where multiple lower appeals courts have hashed out the arguments pro and con.
When the Supreme Court declines the result is that the law in each part of the country where an appeals court has ruled on the issue is whatever that appeals court ruled. In parts of the country where no appeals court has ruled, it will be decided when an appeal reaches their appeals courts.
If appeals courts in different areas go in different directions, the Supreme Court will then be much more likely to accept an appeal from one of those in order to make the law uniform.
IANAL, but I was under the impression that the Supreme Court ruling was very specific to the AI itself copyrighting its own produced code. Once a human is involved, it gets a lot more complicated and rests on whether the human's contribution was substantial enough to make the work copyrightable in their own name.
A fun exercise: When Supreme Court has not ruled on an open legal question of interest, let's ask AI what would be a likely ruling by Supreme Court.
I think SCOTUS might in fact use AI to get a set of possible interpretations of the law, before they come up with their decision. AI might give them good reasons for pros and cons.
Hopefully. If they are smart they know that everybody can be wrong, therefore it is good to hear differing opinions and argumentation from multiple sources, in important matters.
Copyright does not cover ideas. Only specific executions of ideas. So unless it's a line-by-line copy (unlikely) there is no recourse for someone to sue for a re-execution/reimplementation of an idea.
It's not "my model." If someone paraphrases a poem and publishes that paraphrase, the original author will not be able to sue. (Or rather, they can sue, but will almost certainly lose.) There is a body of legal precedent for each category of work you can imagine, and each has come to have its own criteria for the threshold between a derivative work and a unique re-expression; but I am confident, from how that has played out and from the fact that it is well accepted that code tends to be composed of only so many patterns, that a codebase reverse engineered through prompting alone will not be considered a derivative work.
It's obviously an opinion. But I'm confident enough in it, as are, say, Lovable and such companies, that I/they are willing to concretely operate on the hunch that that is how it will play out in court if ever the hand was forced.
You've likely paid attention to the litigation here. Regardless of what remains to be litigated, the training in and of itself has already been deemed fair use (and transformative) by Alsup.
Further, you know that ideas are not protected by copyright. The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.
If it were the case that the LLM ingested the code and regurgitated it (as would be the premise of highlighting the training data provenance), that similarity would be much higher. That is not the case.
You're right, I've followed the litigation closely. I've advocated for years that "training is fair use" and I'm generally an anti-IP hawk who DEFENDS copyright/trademark cases. Only recently have I started to concede the issue might have more nuance than "all training is fair use, hard stop." And I still think Judge Alsup got it right.
That said, even if model training is fair use, model output can still be infringing. There would be a strong case, for example, if the end user guides the LLM to create works in a way that copies another work or mimics an author or artist's style. This case clearly isn't that. On the similarity at issue here, I haven't personally compared. I hope you're right.
I think “strong case” is probably reliant on a few points on the output side, and would have to be more than just author/artists style.
Style itself would be very hard to deem infringement, for obvious reasons (idea). I think it's much more likely an issue when a character has derivative elements (e.g., Iron Man or Spider-Man-esque features), and where the user's prompt had explicit references to those characters (intent).
All that said, even then, on the artistic side I think it would come down to the same analysis that would apply to traditional media - AI is just a vehicle that introduces some novel risks.
Music might be more risky given the litigious nature of the industry.
Code? It’s going to be hard to claim infringement with dramatically different implementations, barring patent coverage.
> The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.
Can I use one AI agent to write detailed tests based on disassembled Windows, and another to write code that passes those same function-level tests? If so, I'm about to relicense Windows 11 - eat my shorts, ReactOS!
I didn't catch that on first read, but I see why you'd say that. LLMs are ridiculous with their constant use of "it's not X, it's Y" -- it's in almost every response from Opus 4.5. "It's not X, it's Y" is ruined for regular writing.
I'm also skeptical of anything that claims to reliably detect AI writing. FWIW, I plugged the comment into Pangram Labs, which claims to be the most reliable and seems to have worked well before. It categorized the comment as 100% human written with medium confidence.
Stated more cynically, many platforms have an interest in attention hijacking. Done well, agents' 'laser focused attention' could help users avoid wasting time (wandering attention) and money (impulse buys). This is a good thing, even if it dings revenue of some existing platforms. If a company's business model is impulse buying and ad revenue (this isn't eBay IMO), then good riddance.
This is a really interesting point, and you're right to say it's complicated. I'm sort of an anti-IP hawk (I actually rep defendants in IP cases) and personally agree with it. But the US Copyright Office's position on GenAI supports the opposite view:
> Nor do we agree that AI training is inherently transformative because it is like human learning. To begin with, the analogy rests on a faulty premise, as fair use does not excuse all human acts done for the purpose of learning. A student could not rely on fair use to copy all the books at the library to facilitate personal education; rather, they would have to purchase or borrow a copy that was lawfully acquired, typically through a sale or license. Copyright law should not afford greater latitude for copying simply because it is done by a computer. Moreover, AI learning is different from human learning in ways that are material to the copyright analysis. Humans retain only imperfect impressions of the works they have experienced, filtered through their own unique personalities, histories, memories, and worldviews. Generative AI training involves the creation of perfect copies with the ability to analyze works nearly instantaneously. The result is a model that can create at superhuman speed and scale. In the words of Professor Robert Brauneis, “Generative model training transcends the human limitations that underlie the structure of the exclusive rights.”[0]
I disagree with the Copyright Office here, but ofc they're the Copyright Office and I could be wrong. More broadly I'm struggling with how to permit and incentivize creation of powerful generative models while not screwing creators in the process. There are startups and other efforts trying to address this through novel licensing, etc., but AFAIK there's no great solution. I'm also cautiously optimistic that there will be some decentralized and/or federated options. It's complicated indeed.
The headline misses the nuance that there will be a trial on Anthropic's gathering "pirated copies to create Anthropic's central library and the resulting damages." But the court got it right IMO that "the use of the [copyrighted] books at issue to train Claude and its precursors was exceedingly transformative and was a fair use..."
PBS Spacetime did an interesting video on DCQE, but it tripped me up trying to fully understand what was happening: https://www.youtube.com/watch?v=8ORLN_KwAgs&t=601s ... Later Sabine Hossenfelder did a video debunking the proposition that DCQE somehow showed that the past was being rewritten. https://www.youtube.com/watch?v=RQv5CVELG3U And Matt from PBS Spacetime acknowledged she was right in this respectful comment:
> Sabine, this is amazing. You are, as usual, 100% right. The delayed choice quantum eraser is a prime example of over-mystification of quantum mechanics, even WITHIN the field of quantum mechanics! I (Matt) was guilty of embracing the quantum woo in that episode 5 years ago. Since then I've obsessed over this family of experiments and my thinking shifted quite a bit.
I’m an IP lawyer & AI dev: my first reaction was, “hmm there are trademark issues here.” From a US perspective: “Perplexity” certainly CAN be a trademark, and the company has applied for one—to my knowledge it’s still pending. If the term was merely “descriptive” of the service provided, like “American Airlines”, then the company would need to show that the term has acquired distinctiveness: ie, that purchasers associate the term with that specific company. But perplexity is probably more than merely descriptive here.
Assuming that they have a valid trademark, the issue becomes whether there is a likelihood of confusion between Perplexity and Perplexica. That is a fact-specific, multifactor test, which I’ll spare you. But there could be arguments both ways IMO
HN is so incredible. The topic can be just about anything and there’s someone here with just the right expertise and/or set of skills to share their two pennies. The current topic is AI and IP law and here comes someone who’s an IP lawyer and AI engineer. I truly love this place.
Sorry to hear this, but congrats to Bob for a life well lived and building a brand that made quality products. We have their muesli multiple times a week, their farro as well, and this morning our kids loved Valentine's Day pancakes made from their mix. Thanks Bob