In a significant development for the intersection of artificial intelligence and copyright law, Adobe Inc. has been hit with a proposed class-action lawsuit accusing the company of using pirated copyrighted books to train its SlimLM small language model. The suit, filed on December 17, 2025, in California federal court by Oregon-based author Elizabeth Lyon, marks the first major copyright infringement claim against Adobe related to its AI training practices.
Lyon, a non-fiction writer known for instructional books on novel marketing and writing guides, alleges that Adobe unlawfully copied and used her works—along with those of potentially thousands of other authors—without permission or compensation. The complaint seeks unspecified damages and aims to represent a class of all affected copyright owners.
This case adds Adobe to a growing list of tech giants facing scrutiny over AI data sourcing, highlighting ongoing tensions between rapid AI innovation and creators’ rights.
The Lawsuit: Core Allegations
At the heart of the complaint is Adobe’s SlimLM, a series of small language models (SLMs) optimized for on-device document assistance tasks, such as summarization, question suggestion, and question answering on mobile devices. Released in research form in late 2024, SlimLM was designed to run efficiently offline, addressing privacy concerns by processing sensitive documents locally without cloud dependency.
According to an arXiv paper published by Adobe researchers in November 2024, SlimLM was pre-trained on the SlimPajama-627B dataset—a 627 billion-token, open-source collection described as “deduplicated, multi-corpora.” The dataset, developed by Cerebras in 2023, was intended as a cleaned and improved version of earlier open-source efforts.
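The “deduplicated” claim matters technically: deduplication removes repeated text across corpora, but it does not remove the underlying sources, which is why the complaint argues Books3 content survives in SlimPajama. Cerebras reportedly built SlimPajama by filtering out very short documents and applying MinHash-based fuzzy deduplication to RedPajama; the sketch below is a deliberately simplified, exact-hash version of the same idea, with hypothetical function names, not Cerebras’s actual pipeline.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially reformatted copies hash alike.
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, dropping exact
    (post-normalization) duplicates. Production pipelines use fuzzy
    matching (e.g., MinHashLSH) to also catch near-duplicates."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello  World", "hello world", "An entirely different book."]
print(deduplicate(corpus))  # ['Hello  World', 'An entirely different book.']
```

Note what deduplication does not do: a pirated book that appears only once in the upstream corpus passes through such a filter untouched, which is precisely the lawsuit’s derivation argument.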
However, the lawsuit contends that SlimPajama-627B is derived from the RedPajama dataset, which itself incorporates the infamous Books3 collection. Books3, a trove of roughly 191,000 to 196,000 pirated books scraped from shadow libraries (illicit online repositories such as Bibliotik and Library Genesis), has become a lightning rod in AI copyright disputes.
The complaint argues: “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3)… Thus, because it is a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, including the copyrighted works of Plaintiff and the Class members.”
Lyon claims her books were among those illegally included, constituting direct copyright infringement through unauthorized reproduction and use in training.
Background on SlimLM and Adobe’s AI Strategy
Adobe has positioned itself as a leader in “responsible AI,” particularly with its flagship generative tool Firefly, launched in 2023. Unlike many competitors, Firefly was explicitly trained on licensed Adobe Stock images, openly licensed content, and public domain material—earning praise for being “commercially safe” and offering indemnity to users against copyright claims.
This ethical stance was part of Adobe’s broader AI principles: accountability, responsibility, and transparency. The company has advocated for opt-out mechanisms (like “Do Not Train” tags) and emphasized not training on customer data.
SlimLM, however, appears to deviate from this approach. Developed by Adobe researchers in collaboration with academics from Auburn University and Georgia Tech, it focused on efficiency for mobile deployment—models ranging from 125 million to 7 billion parameters, tested on devices like the Samsung Galaxy S24.
While Firefly targeted creative image generation, SlimLM aimed at practical text-based document tools, potentially integrating into products like Acrobat. The use of an open-source dataset like SlimPajama may have been seen as standard in research circles, but it exposed Adobe to risks tied to Books3’s tainted origins.
As of December 18, 2025, Adobe has not publicly responded to the lawsuit.
The Infamous Books3 Dataset: A Brief History
Books3 emerged around 2020-2021 as part of The Pile, an 800GB open-source training corpus compiled by EleutherAI, a nonprofit AI research group. Created by developer Shawn Presser, Books3 was assembled by downloading books from pirate sites using scripts inspired by the late activist Aaron Swartz.
Presser intended it to democratize AI training, allowing open-source projects to compete with closed models from companies like OpenAI. However, its pirated nature quickly drew fire.
- 2023 Takedown: Danish anti-piracy group Rights Alliance issued a DMCA notice, leading host The Eye to remove it. Alternate links persisted, but official access was curtailed.
- Contents: Included works by Stephen King, Margaret Atwood, Sarah Silverman, and thousands more—spanning genres from fiction to non-fiction.
Books3’s use has fueled multiple lawsuits:
- Authors (including Silverman) sued Meta for training Llama models on it.
- Similar claims against OpenAI, Anthropic, NVIDIA, Apple, and others.
- Investigations revealed widespread adoption in early AI development due to its size and availability.
The dataset symbolizes the AI industry’s early “move fast and break things” ethos regarding data, often prioritizing scale over legality.
Broader Wave of AI Copyright Litigation
The Adobe suit joins dozens of cases challenging AI training practices:
- Anthropic Settlement (2025): In a landmark resolution, Anthropic agreed to pay $1.5 billion to settle claims of using pirated books (including from Books3/The Pile) to train Claude. The deal, approved in September 2025, provided roughly $3,000 per infringed work across some 500,000 books (about 500,000 × $3,000 = $1.5 billion), the largest copyright settlement in U.S. history. It covered only past piracy, leaving future training open but signaling high costs for illicit data.
- Meta: Admitted using portions of Books3; faces ongoing suits.
- Apple: Sued in 2025 for alleged Books3 use in Apple Intelligence.
- Other Defendants: OpenAI, Microsoft, Salesforce, and more.
Courts have split on key issues:
- Training on legally acquired copies may qualify as “fair use” (transformative, non-competitive).
- But acquiring works through piracy is “inherently infringing,” as the court ruled in Anthropic’s case.
Outcomes could reshape AI development, pushing toward licensed datasets or robust opt-outs.
Implications for the AI Industry and Creators
This lawsuit underscores a pivotal moment:
- For Companies: Reliance on open-source datasets carries hidden risks if upstream sources are tainted. Even “deduplicated” versions like SlimPajama may inherit liabilities.
- For Creators: Validates concerns over uncompensated use, potentially leading to more settlements or licensing models.
- Policy and Ethics: Adobe’s own blogs emphasize ethical training, yet this case highlights how much harder data provenance is to guarantee for text models than for its image-focused Firefly. Industry-wide, it may accelerate shifts to transparent, permission-based data.
As the case progresses—starting with class certification—the tech world watches. Will Adobe settle like Anthropic, or fight on fair use grounds? The resolution could influence how future AI models balance innovation with intellectual property respect.
In an era where AI promises to transform creativity, cases like this remind us that progress must not come at the expense of those who create the very content fueling it.
