Anthropic, Copyright, and the Fair Use Divide



A federal judge has ruled that training Claude AI on copyrighted books—even without a license—was transformative and protected under fair use. But storing millions of pirated books in a permanent internal library? That crossed the line. In this episode of The Briefing, Scott Hervey and Tara Sattler break down this nuanced opinion and what this ruling means for AI developers and copyright owners going forward. Watch this episode on YouTube.
Show Notes:
Scott: What happens when an artificial intelligence company trains its models on millions of books, some purchased, some pirated? In a closely watched ruling, a federal judge held that training the AI was fair use, likening the process to how a human learns by reading. But keeping pirated copies of those books in a permanent digital library? Well, that crossed the line. I’m Scott Hervey, a partner with the law firm of Weintraub Tobin, and I’m joined today by my partner and frequent Briefing contributor, Tara Sattler. We are going to break down the recent fair use ruling in the lawsuit over Claude AI, that’s Anthropic’s AI, and explore what it means for the future of AI training on today’s installment of The Briefing. Tara, welcome back to The Briefing. Good to have you.

Tara: Thanks, Scott. I always enjoy being here with you.

Scott: Always enjoy having you. This one is a much-awaited decision, because we have a number of these cases swirling around, challenging the process by which AI companies train their large language models. One of those cases involved Anthropic’s AI, Claude. Why don’t we jump into this one? Tara, maybe you could give us some of the background of this particular case.

Tara: Absolutely. In 2021, Anthropic PBC, a startup founded by former OpenAI employees, set out to create a cutting-edge AI system, and that system would eventually become Claude. Like other large language models, Claude was trained on a vast amount of textual data: books, articles, websites, and more. But unlike many of its competitors, Anthropic took a controversial shortcut.

Scott: Right. Instead of licensing books or building a clean data set, Anthropic downloaded millions of copyrighted works from pirate sites like Books3, Library Genesis, and Pirate Library Mirror. In total, Anthropic downloaded over seven million pirated books, including works by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson.
Anthropic also purchased millions of print books, scanned them, and created a central digital library of searchable files.

Tara: So the plaintiffs sued, alleging that Anthropic infringed their copyrights by copying their works without permission: first, by downloading them from the pirate sites; then by using them to train Claude; and finally, by keeping digital copies of the books in its internal library for potential future use.

Scott: So as we know, the lawsuit was filed, and Anthropic eventually moved for summary judgment on fair use only. In its ruling on Anthropic’s motion, Judge Alsup of the Northern District of California issued a very detailed and nuanced opinion. The opinion splits Anthropic’s conduct into three key uses: first, using the books to train the AI, the large language model; second, scanning and digitizing legally purchased print books; and third, downloading and keeping pirated books in a permanent digital library. Each of these uses was evaluated under the Copyright Act’s four-factor fair use test.

Tara: Right. Let’s walk through how the judge applied the four fair use factors to each use. For anyone who needs a refresher, here are the statutory factors for fair use under Section 107 of the Copyright Act.

Scott: If you need a refresher, you’re not listening to this podcast often enough. Go ahead, Tara.

Tara: Okay, we’ll refresh anyway. First is the purpose and character of the use, including whether it is commercial and whether it’s transformative. Second is the nature of the copyrighted work. Third is the amount and substantiality of the portion of the copyrighted work that’s used. And fourth is the effect of the use upon the potential market for the original work, and that’s the economic analysis.

Scott: And that last factor, as we know, has received more focus since the Supreme Court’s Warhol decision. All right, let’s focus on the first use.
On the first use, the training of the Claude models using books, the court found fair use. It didn’t matter whether the books were the purchased books or the pirated books; the court found the training on these books to be fair use, and it focused most heavily on the first factor. The court called this use spectacularly transformative. The court said: “The purpose and character of using works to train LLMs was transformative. Spectacularly so. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them, but to turn a hard corner and create something different.”

Tara: Right. So even if the AI memorized a lot of the underlying material, the court stressed that the training did not result in infringing outputs. Users weren’t seeing verbatim excerpts from the plaintiffs’ books.

Scott: The court rejected the plaintiffs’ argument that merely memorizing expressive elements was itself infringement. The court asked: if somebody were to read all the modern-day classics because of their exceptional expression, memorize them, and then emulate a blend of their best writing, would that violate the Copyright Act? And the court answered: of course it would not.

Tara: So the court sided with Anthropic on the training issue, holding that using books to train Claude was spectacularly transformative. And the judge drew a direct analogy to human learning. The judge said that everyone reads text, then writes new text; they may need to pay for getting their hands on a text in the first instance, but to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways, would be unthinkable.
Scott: The second and third factors, the nature of the work and the amount used, were considered less significant because of the high degree of transformation. And on the fourth factor, market harm, the judge found no evidence of substitution or competitive damage from the training process. So the result: training was fair use.

Tara: Okay, now turning to the second use the court analyzed, which is digitizing purchased books. Anthropic purchased millions of print books and scanned them into searchable digital files. The plaintiffs argued that changing the format from print to digital was itself infringement.

Scott: But the court disagreed. Because Anthropic had lawfully purchased these books, destroyed the physical copies, and retained a single digital copy in their place without redistributing it, this was fair use. The judge wrote that every purchased print copy was copied in order to save storage space and to enable searchability as a digital copy; the print original was destroyed, and one replaced the other. And there was no evidence that the new digital copy was shown, shared, or sold outside the company.

Tara: So this use was found to be transformative, though narrowly so: not because of LLM training, but because the digitization made the library more efficient and searchable. And importantly, the court distinguished this from the large-scale copying involved in the Napster file-sharing case, noting that this use was even more clearly transformative than the uses in Texaco, Google, and Sony Betamax, and more transformative than the uses rejected in Napster. So the result: digitizing purchased books is fair use.

Scott: So let’s talk about the third use the court analyzed, which was retaining pirated copies of books in a permanent library. This is where Anthropic fell short of establishing fair use.
Anthropic, as we know, downloaded more than seven million books from pirate sites and kept them in its internal library, even when it had no intention of using many of those books to train its models. The company argued that because some of those books were later used in training, which was fair use, keeping the pirated books was also excusable.

Tara: The judge rejected that argument outright. “There is no carve out, however, from the Copyright Act for AI companies,” is what the judge said. According to internal emails cited by the court, Anthropic’s founders were aware of the legal risks. The CEO described purchasing books as a legal, practice, and business slog, and expressed a preference for simply downloading pirated copies. The company downloaded books from pirate sources even after it had the option of purchasing or licensing them.

Scott: Yeah, that legal, practice, and business slog. It fits that tech-company mantra of move fast and break things. But you had better be sure you’re right; otherwise you’re going to end up on the wrong side of a decision like this one, right? And the judge was unambiguous on this point: building a central library of works to be available for any number of further uses was itself the use for which Anthropic acquired these pirated copies, and not a transformative one. He found that this use, building a centralized, permanent library of pirated books, was not transformative and was not justified under fair use. In the judge’s words, pirating copies to build a research library without paying for it, and to retain copies should they prove useful for one thing or another, was its own use and not a transformative one.

Tara: The court was particularly troubled that Anthropic continued to keep pirated copies even after deciding they would not be used for training. They were acquired and retained as “a central library of all the books in the world,” is how the court phrased it.

Scott: Yeah.
And this wasn’t incidental copying. It was deliberate. And because the use wasn’t transformative (failing the first factor), undermined the market for the works (the fourth factor), and involved complete, verbatim copying (the third factor), the court found, as it probably had to find, that this use was not protected. So the result: retaining pirated books was not fair use.

Tara: Let’s talk for a little while now, Scott, about why this matters. This decision is among the most detailed judicial analyses yet of how copyright law intersects with AI training, and it sends a pretty clear message.

Scott: I agree. Training an AI system using copyrighted materials, even expressive works like novels, can be fair use so long as the training is transformative and does not produce infringing outputs. That, I think, is a very important point here: the outputs were not infringing.

Tara: Yeah, I think you’re right. And I think another important point is that companies can’t justify how they acquired the data under the umbrella of fair use. So in the judge’s words in this case, you can’t bless yourself by saying you have a research purpose and then go and take any textbook you want.

Scott: So in practice, this means that AI companies and developers who want to train their LLMs will need to avoid using pirated materials, even if that use is internal only; clean up their training data sets; think hard about licensing or partnering with publishers; and document which works were actually used in training.

Tara: I think you’re right, Scott. And for copyright holders, the ruling confirms that enforcement doesn’t depend on proving an infringing output. The mere act of copying and storing protected works, especially when done unlawfully, can itself be grounds for liability.
Scott: Let’s talk about what this means for the AI industry and, possibly, for the broader community of content owners. For developers of generative AI systems, I think the decision cuts both ways, right? It’s both liberating and cautionary. On the one hand, it confirms that training AI on copyrighted books, even expressive ones like novels, can be fair use if done correctly, responsibly, and without infringing outputs.

Tara: But on the other hand, it sends a strong signal that the source of your training data really does matter. Pirated content, even if used for transformative purposes, won’t be shielded by fair use, especially if it’s retained for future uses.

Scott: So expect this decision to push AI companies toward licensing deals with publishers. It also puts pressure on developers to document and clean up their training data sets. “Everyone else is doing it” is not a defense. You know that thing you said to your kid: would you jump off the roof if your buddy jumped off the roof? It doesn’t hold up if your company is found to be sitting on a trove of pirated materials.

Tara: Scott, it was really interesting talking about this one with you today. I know there was another recent ruling in the Meta case about its AI training. Hopefully, we can talk about that one soon.

Scott: Oh, for sure. I think it’ll be interesting to talk about that one, compare and contrast the two decisions, and see if we’ve got some type of split, some disagreement among the judges as to what aspect of LLM training, how the LLMs are trained or what they’re trained on, qualifies as fair use. So we’ll definitely cover that one next.

Tara: Yeah, either a disagreement or an agreement, and maybe we will finally start to get some guidance.

Scott: Well, that’s all for today’s episode of The Briefing. Thanks to Tara for joining me today. And thank you, the listener or viewer, for tuning in.
We hope you found this episode informative and enjoyable. If you did, please remember to subscribe, leave a review, and share this episode with your friends and colleagues. And if you have any questions about the topics we covered today, please leave us a comment.