Read the Model Card

What don't they tell you?

Below are excerpts from actual model documentation. Click on the highlighted text to understand what's transparent, what's vague, and what's missing.

✓ CLEAR & SPECIFIC: Transparent information
⚠ VAGUE OR EVASIVE: Sounds good but lacks detail
✗ MISSING OR OMITTED: Information withheld

GPT-4 Technical Documentation

Excerpts from actual model card • Click highlights to learn more

Introduction

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks.

Training Data

GPT-4 is trained on a dataset consisting of publicly available data, as well as data licensed from third-party providers. The data was collected using a variety of methods and filtered to improve data quality.

Model Architecture and Scale

Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

Capabilities

We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. GPT-4 performs well on these tests, often scoring in the top 10% of test takers.

Limitations

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it "hallucinates" facts and makes reasoning errors). Care should be taken when using language model outputs, particularly in high-stakes contexts. [No quantified failure rates or demographic breakdowns provided]

Environmental Impact

[This section does not exist in the GPT-4 technical report]

Read the full documentation:

GPT-4 Official Documentation →

What You Just Discovered

Model cards were supposed to standardize AI transparency. Introduced by Google researchers in 2019, they aimed to document: who built it, what data trained it, how it performs across demographics, what it fails at, and ethical considerations.

In practice, leading AI companies provide incomplete documentation. Compare GPT-2 (2019)—which disclosed 1.5B parameters, dataset composition, and training details—to GPT-4 (2023), which explicitly withholds all three. This represents a deliberate shift away from transparency.

Vague language obscures accountability. Terms like "large-scale," "extensive testing," and "filtered for quality" sound scientific but mean nothing without specifics. Who decides what counts as "quality"? How extensive is "extensive"? This language prevents independent verification.

Critical information is systematically omitted. Training data sources, exact model sizes, environmental costs, quantified failure rates, and demographic performance breakdowns are missing from both GPT-4 and Claude documentation—not by accident, but by design.

Better transparency exists. The academic BLOOM project documented roughly 25 tonnes of CO2 emitted by training, its full training data sources, and a comprehensive evaluation suite. Meta's LLaMA paper provided detailed architecture specs. These prove that transparency is possible; companies choose opacity.
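
For scale, a figure like BLOOM's can be recomputed from just two published numbers: total training energy and the carbon intensity of the local grid. Here is a minimal sketch, using values rounded from the BLOOM carbon-footprint study (Luccioni et al., 2022); the variable names are our own:

```python
# Estimating training emissions from disclosed figures. Values are
# rounded from the BLOOM carbon-footprint study (Luccioni et al., 2022).
training_energy_kwh = 433_000        # ~433 MWh of training energy
grid_intensity_kg_per_kwh = 0.057    # French grid, ~57 gCO2eq/kWh

emissions_tonnes = training_energy_kwh * grid_intensity_kg_per_kwh / 1000
print(f"Estimated training emissions: {emissions_tonnes:.1f} tCO2eq")
# -> 24.7 tCO2eq, the ~25 tonnes cited above
```

The arithmetic is trivial; what makes it possible is disclosure. For GPT-4, neither the energy figure nor the hardware is public, so no outside party can run it.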

What Complete Model Cards Include

Training Data

  • Specific dataset names and sources
  • Data composition percentages (see the LLaMA example after this list)
  • Copyright and consent status
  • Language and geographic distribution
  • Documented biases in training data
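
As a concrete instance of the composition item above, Meta's LLaMA paper published its full pre-training mix. A sketch of that disclosure, with percentages rounded from the paper (Touvron et al., 2023):

```python
# LLaMA's published pre-training data mix (rounded from Touvron et al.,
# 2023): the kind of composition disclosure the checklist asks for.
llama_data_mix = {
    "CommonCrawl": 67.0, "C4": 15.0, "GitHub": 4.5, "Wikipedia": 4.5,
    "Books": 4.5, "ArXiv": 2.5, "StackExchange": 2.0,
}
assert abs(sum(llama_data_mix.values()) - 100.0) < 1e-9  # shares sum to 100%
# GPT-4's documentation, by contrast, names no datasets and no percentages.
```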

Model Architecture

  • Exact parameter count
  • Architecture specifications
  • Training compute in FLOPs (see the estimate sketch after this list)
  • Training duration and cost
  • Hardware requirements
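
Training compute is the easiest withheld number to sanity-check when parameter and token counts are public: the scaling-law literature's rule of thumb puts it at roughly 6 × parameters × training tokens. A sketch applied to GPT-3's published figures, since GPT-4's are withheld:

```python
# Rule-of-thumb training compute: FLOPs ≈ 6 * N * D, where N is the
# parameter count and D the number of training tokens (a standard
# scaling-law approximation that ignores architecture details).
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# GPT-3's published figures: 175B parameters, ~300B training tokens.
print(f"{approx_training_flops(175e9, 300e9):.2e} FLOPs")
# -> 3.15e+23, in line with the ~3.14e23 FLOPs OpenAI reported for GPT-3.
# For GPT-4 this check is impossible: both N and D are undisclosed.
```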

Performance & Evaluation

  • Full evaluation methodology
  • Performance by demographic group (see the sketch after this list)
  • Failure analysis and edge cases
  • Cross-lingual capabilities
  • Quantified limitations
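
The core idea behind the evaluation items above, from the original model-cards paper, is disaggregation: one metric per subgroup instead of a single pooled average. A minimal sketch with invented toy data:

```python
from collections import defaultdict

# Disaggregated evaluation: accuracy per subgroup rather than one headline
# number. Records are (group, correct) pairs; the data here is invented.
records = [("en", True), ("en", True), ("en", False),
           ("sw", True), ("sw", False), ("sw", False)]

totals, hits = defaultdict(int), defaultdict(int)
for group, correct in records:
    totals[group] += 1
    hits[group] += correct

for group in sorted(totals):
    print(f"{group}: {hits[group] / totals[group]:.0%} (n={totals[group]})")
# A pooled accuracy of 50% would hide the en/sw gap these rows expose.
```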

Ethical & Environmental

  • Carbon emissions (training + inference)
  • Energy and water consumption
  • Labor conditions for annotators
  • Intended use and restrictions
  • Deployment decision-making process

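Taken together, the four checklists above amount to a schema. As an illustration (the field names are our own shorthand, not any industry standard), a complete model card can be held in a structure where a withheld value is an explicit None rather than a silent omission:

```python
from dataclasses import dataclass

# Illustrative model-card schema; field names are our own shorthand for
# the checklist above, not a standard. None marks a withheld value.
@dataclass
class ModelCard:
    dataset_sources: list[str] | None          # training data
    data_composition: dict[str, float] | None  # source -> share of mix
    parameter_count: int | None                # architecture & scale
    training_flops: float | None
    results_by_group: dict[str, float] | None  # performance & evaluation
    failure_rates: dict[str, float] | None
    training_emissions_tco2eq: float | None    # ethical & environmental
    annotator_labor_notes: str | None

def withheld(card: ModelCard) -> list[str]:
    """Field names the publisher left undisclosed."""
    return [name for name, value in vars(card).items() if value is None]
```

Filling this in from GPT-4's documentation, nearly every field comes back None; filling it in from BLOOM's, almost none do.
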
Learn More:

Model Cards for Model Reporting (Original Paper, 2019) →

The foundational framework for AI documentation

Stanford CRFM: Foundation Model Transparency Index →

Independent scoring of AI model transparency across companies

MIT Tech Review: GPT-4 is bigger and better—but OpenAI won't say why →

Analysis of decreasing transparency in AI models

Questions to Ask Any AI Company

• What specific datasets were used for training?

• How many parameters does this model have?

• How does it perform across different demographics and languages?

• What is the measured failure rate for this use case?

• What are the carbon emissions from training and inference?

• Who decided what information to disclose vs. withhold, and why?
