Meta Used Enormous Database of Pirated Books to Train its AI, Court Documents Reveal

Internal messages reveal that the company made use of popular online shadow library LibGen to access millions of pirated books and papers.

Written by Aleksander Hougen (Co-Chief Editor)

Reviewed by Jackie Leavitt (Co-Chief Editor)

Last Updated: 28 Mar'25

Recently unsealed court documents consisting of internal Meta communications have confirmed that the company accessed millions of copyrighted books and papers through the well-known online shadow library LibGen.

What’s more, the documents, which contain numerous messages discussing the practice, make it clear that this was done with the express permission of the company’s CEO and founder Mark Zuckerberg (referred to as “MZ” in the messages).

Meta evidently first tried to acquire the copyrighted material through the legal licensing process, but found that the process took too long. “They take like 4+ weeks to deliver data,” one senior manager said in a message included in exhibit C.

Enter LibGen

LibGen is the oldest online shadow library in the world, with decentralized roots in the Soviet Union of the ’80s and ’90s. It coalesced as “Library Genesis” (or LibGen for short) in 2008 and remains one of the largest repositories of pirated books to this day, boasting over 7.5 million books and 81 million research papers on its digital shelves.

The database has spawned numerous spinoffs over the years. Some of these — most notably ZLibrary, also used by Meta to train its AI — have become much more popular places to download pirated books, as the original LibGen is infamous for poorly organized metadata, duplicate copies and broken files.

AI’s Long Legal Row With Authors

The lawsuit we have to thank for this treasure trove of information was filed in 2023 by three authors: Sarah Silverman, Richard Kadrey and Christopher Golden. Two years later, the group of plaintiffs has grown to 12, including Junot Diaz and Ta-Nehisi Coates, though the lawsuit was partially dismissed in 2024.

This isn’t the first time that an artificial intelligence company has come under fire from authors. Previous litigation against OpenAI showed that it also used LibGen to train its large language model, though the company claims that this practice ended in 2021.

Confronted with evidence that the company used pirated works, Meta’s lawyers argued that this usage falls under the fair use doctrine of copyright law and that the suing authors can’t point to any “concrete injury.”

This legal strategy seems to have been the plan all along for Meta. As noted in the court documents, one team member said when the company was still exploring licensing options for books: “The problem is that people don’t realize that if we license one single book, we won’t be able to lean into fair use strategy.”

For what it’s worth, this strategy seems to be working. While a U.S. judge recently ruled against Meta’s motion to dismiss the case, he also noted that he didn’t favor the plaintiffs’ chances of proving a violation of the DMCA.

The other potential legal problem for Meta as a result of these communications is the method by which the company acquired a copy of the LibGen database: torrenting.

The court documents also show one Meta employee sharing a link to a Quora article, titled “What is the probability of getting arrested for using torrents in the USA?”, in a group chat. They further wrote “not sure we can use meta’s IPs to load through torrents pirate content,” adding, “I think torrenting from a corporate laptop doesn’t feel right.”

AI vs Copyright and Privacy Laws

No matter how this specific legal battle ends, it’s unlikely to be the last time AI companies run into challenges based on existing copyright law.

It’s perhaps not surprising that companies are already moving to amend such legislation and carve out exemptions for AI training. Responding to the White House’s request for an “AI Action Plan”, companies like OpenAI, Meta and Google argued that legislation around copyright and data privacy should be adjusted to accommodate AI research and development.

No matter what one thinks is the future of AI, it’s clear that the time is rapidly approaching for real legislative guidelines on how AI will be handled going forward. Whether those guidelines will preserve existing privacy and copyright protections remains to be seen.