What Data Trained This?

Where did AI learn? Your data is in here somewhere.

Large language models are trained on massive datasets scraped from the internet. But "the internet" isn't abstract—it's copyrighted books, your deleted Reddit posts, code you wrote, videos you made, and websites you published.

Each data source below covers what was collected, how it was used, and why it matters.

The Scale of Data Collection

GPT-3 was trained on roughly 570GB of text, filtered down from about 45TB of raw Common Crawl data. That's roughly 300 billion tokens, the equivalent of over a million novels.
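As a sanity check on that analogy, here is a back-of-envelope sketch; the per-novel figures (roughly 80,000 words, about 0.5MB of plain text) are round assumptions, not measurements:

    # Back-of-envelope scale check; per-novel figures are assumptions.
    dataset_bytes = 570e9        # ~570GB of filtered training text
    bytes_per_novel = 0.5e6      # ~80,000 words at ~6 bytes per word
    print(f"{dataset_bytes / bytes_per_novel:,.0f} novels")  # ~1,140,000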

GPT-4's training data is undisclosed, but estimated to be significantly larger. OpenAI has acknowledged using "publicly available data" and data licensed from third parties, but won't specify sources or composition.

LLaMA (Meta) is one of the few models whose data composition was disclosed. Per its paper:

• 67% Common Crawl

• 15% C4 (cleaned web data)

• 4.5% GitHub

• 4.5% Wikipedia

• 4.5% books

• 2.5% ArXiv papers

• 2% Stack Exchange

The Pile (a popular open-source training dataset): 825GB including Books3, OpenWebText2, Stack Exchange, PubMed, ArXiv, GitHub, YouTube subtitles, and more. Widely used despite copyright concerns.

What This Means

📝 Your Content

If you've posted online—blogs, forums, code, videos, articles—it's likely in a training dataset. Deleting it later doesn't remove it from datasets already collected.

⚖️ Copyright Questions

Courts are still deciding whether training AI on copyrighted content is "fair use." AI companies argue it's transformative; creators argue it's theft. Lawsuits are ongoing.

🔍 No Opt-Out (Yet)

You can't retroactively remove your data from training sets. Major AI crawlers now publish user-agent tokens that robots.txt can block (see the sketch below), but most training data was collected before these controls existed.
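For reference, here is a minimal robots.txt sketch using the crawler tokens the major companies have published (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training). Compliance is voluntary and the token list changes, so verify against each company's current documentation.

    # robots.txt — served from your site's root
    # Asks known AI-training crawlers to skip the entire site.

    User-agent: GPTBot            # OpenAI
    Disallow: /

    User-agent: CCBot             # Common Crawl
    Disallow: /

    User-agent: Google-Extended   # Google AI training
    Disallow: /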

💰 Value Extraction

Creators, authors, and programmers produced this content. AI companies collected it for free and built billion-dollar products. No compensation, no attribution, no consent.

How to Check If Your Data Was Used

For Books & Articles:

Search for your work in datasets like Books3, or check whether your publisher is suing AI companies. The Authors Guild maintains a list of datasets confirmed to contain copyrighted books.

Authors Guild →

For Code:

If you have public GitHub repositories, assume they were used. GitHub has said Copilot was trained on public code, and the Doe v. GitHub class action is testing whether that use violated open-source licenses.

GitHub Copilot Litigation Info →
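To see what scrapers could have collected from you, one quick check is to enumerate your public repositories. A minimal sketch using GitHub's REST API; it is unauthenticated (so rate-limited), and "octocat" is a placeholder username:

    import requests

    def public_repos(user):
        # GitHub REST API: list a user's public repositories, 100 per page.
        url = f"https://api.github.com/users/{user}/repos"
        repos, page = [], 1
        while True:
            resp = requests.get(url, params={"per_page": 100, "page": page},
                                timeout=30)
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                return repos
            repos.extend(r["full_name"] for r in batch)
            page += 1

    for name in public_repos("octocat"):  # placeholder username
        print(name)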

For Web Content:

Search Common Crawl's index for your domain. If your site was publicly accessible, it has likely been crawled; a scripted lookup is sketched after the link below.

Common Crawl Search →
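For that scripted lookup, Common Crawl exposes a CDX index API with one endpoint per crawl. A minimal sketch, assuming CC-MAIN-2024-10 as an example crawl ID (the current list lives at index.commoncrawl.org):

    import json
    import requests

    # One index endpoint per crawl; pick an ID from https://index.commoncrawl.org
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

    def captures(domain):
        # matchType=domain matches the domain and its subdomains;
        # output=json returns one JSON record per line.
        resp = requests.get(INDEX, params={
            "url": domain,
            "matchType": "domain",
            "output": "json",
        }, timeout=60)
        if resp.status_code == 404:  # no captures in this particular crawl
            return []
        resp.raise_for_status()
        return [json.loads(line) for line in resp.text.splitlines()]

    for rec in captures("example.com")[:10]:
        print(rec["timestamp"], rec["url"])

A hit means at least one page from the domain was captured; repeating the query across crawl IDs shows how long the site has been collected.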

For Social Media:

Reddit posts (via Pushshift), Twitter posts (via datasets sold to researchers), and public Facebook posts have all been scraped. Assume anything public was collected.

Learn More:

The Pile: Documentation of Training Dataset →

See exactly what's in a major training dataset

Have I Been Trained? Search Tool →

Check whether your images were used to train AI art models

Questions to Ask

• What specific datasets were used to train this model?

• Was my content included in the training data?

• Did the AI company obtain permission or licenses for this data?

• Can I opt out or request removal from training datasets?

• If my copyrighted work was used, do I have legal recourse?
