
Methodology

Transparency in our process is as important as transparency in the documents. This page explains how the Epstein Document Archive processes, indexes, and organizes 207,251 documents, transforming the original government-released PDFs into a searchable, cross-referenced archive.

Processing Pipeline

Every document in the archive passes through a seven-step processing pipeline. Each step is designed for accuracy, traceability, and completeness.

Step 1
Document Acquisition

Documents are downloaded from official government sources (DOJ, FBI Vault, FOIA.gov, etc.). Each download is logged with source URL, timestamp, and file hash (SHA-256) to maintain chain of custody.

  • Automated monitoring of government release pages
  • SHA-256 hash verification for file integrity
  • Original PDF files preserved without modification
  • Source metadata recorded for every document
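
The acquisition logging described above can be sketched in TypeScript. The record fields, function names, and example URL below are illustrative, not the archive's actual schema:

```typescript
import { createHash } from "node:crypto";

// Provenance record kept for every download (illustrative field names).
interface AcquisitionRecord {
  sourceUrl: string;
  downloadedAt: string; // ISO-8601 timestamp
  sha256: string;       // hex digest of the original PDF bytes
}

// Hash the raw bytes; re-hashing the stored file later must yield the
// same digest, or the file has been altered or corrupted.
function sha256Hex(bytes: Buffer): string {
  return createHash("sha256").update(bytes).digest("hex");
}

function logAcquisition(sourceUrl: string, bytes: Buffer): AcquisitionRecord {
  return {
    sourceUrl,
    downloadedAt: new Date().toISOString(),
    sha256: sha256Hex(bytes),
  };
}
```

Because the digest is computed over the raw bytes before any processing, any later transformation (OCR, indexing) can always be audited against the untouched original.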

Step 2
OCR & Text Extraction

PDF documents are processed through Optical Character Recognition (OCR) to extract machine-readable text. Multi-pass OCR with confidence scoring maintains accuracy even on poor-quality scans.

  • Multi-engine OCR (Tesseract + cloud providers)
  • Confidence scoring per page and document
  • Language detection for non-English content
  • Layout analysis for tables, headers, and footnotes
  • Page-level text extraction with position data
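
Confidence scoring per page and document might be aggregated as in the sketch below; the shapes and threshold are illustrative (real engines also report word-level confidences and bounding boxes):

```typescript
// Per-page OCR output (illustrative shape).
interface PageResult {
  page: number;
  confidence: number; // 0-100, engine-reported
}

// Document confidence is the mean of page confidences; pages under the
// threshold are flagged for another OCR pass or manual review.
function scoreDocument(pages: PageResult[], threshold = 80) {
  const mean = pages.reduce((sum, p) => sum + p.confidence, 0) / pages.length;
  const lowConfidencePages = pages
    .filter(p => p.confidence < threshold)
    .map(p => p.page);
  return { meanConfidence: mean, lowConfidencePages };
}
```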

Step 3
Document Classification

Each document is automatically classified by type (court record, FBI file, deposition, email, etc.) using a combination of content analysis, metadata parsing, and machine learning classification.

  • Automated document type detection
  • Date extraction from content and metadata
  • Source classification and grouping
  • Duplicate detection across data sets
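
One simple form of duplicate detection across data sets is hashing normalized text, sketched below. This is a simplified illustration: catching near-duplicates (re-scans, different redaction passes) in practice also needs fuzzy matching.

```typescript
import { createHash } from "node:crypto";

// Normalize text (lowercase, collapse whitespace) and hash it, so two
// releases of the same page produce the same key.
function contentKey(text: string): string {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Group document ids that share a content key.
function findDuplicateGroups(docs: { id: string; text: string }[]): string[][] {
  const groups = new Map<string, string[]>();
  for (const doc of docs) {
    const key = contentKey(doc.text);
    groups.set(key, [...(groups.get(key) ?? []), doc.id]);
  }
  return [...groups.values()].filter(group => group.length > 1);
}
```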

Step 4
Entity Recognition (NER)

Named Entity Recognition (NER) identifies people, organizations, locations, dates, and other entities mentioned in each document. Entities are normalized, deduplicated, and linked across the archive.

  • AI-powered Named Entity Recognition
  • Entity normalization (alias resolution)
  • Cross-document entity linking
  • Relationship extraction between entities
  • Confidence scoring for entity mentions
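
Alias resolution can be sketched as a lookup over normalized surface forms. The entries and id format below are illustrative; the real table is built from NER output plus curation:

```typescript
// Alias table mapping normalized surface forms to a canonical entity id
// (illustrative entries).
const aliasTable = new Map<string, string>([
  ["jeffrey epstein", "person:jeffrey-epstein"],
  ["je", "person:jeffrey-epstein"],
  ["epstein", "person:jeffrey-epstein"],
]);

// Resolve a raw mention to a canonical id, or null when the mention is
// unknown and needs review before it can be linked.
function resolveEntity(mention: string): string | null {
  const key = mention.trim().toLowerCase().replace(/\s+/g, " ");
  return aliasTable.get(key) ?? null;
}
```

Returning null rather than guessing keeps unknown mentions out of the cross-document links until they are reviewed.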

Step 5
AI Summarization

AI generates concise summaries for each document, identifying key facts, mentioned individuals, and significant details. Summaries are clearly labeled as AI-generated and link back to source text.

  • Abstractive summaries using large language models
  • Key fact extraction and highlighting
  • Summary quality scoring
  • All summaries labeled as AI-generated
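
The "always labeled" guarantee can be enforced in the data model itself, as in this sketch (field names are illustrative, not the archive's actual schema):

```typescript
// Stored summary record. aiGenerated has the literal type `true`, so
// the compiler rejects any record not labeled as machine-written;
// documentId links the summary back to its source text.
interface DocumentSummary {
  documentId: string;
  summary: string;
  keyFacts: string[];
  model: string;     // e.g. a Claude model identifier
  aiGenerated: true;
}

function labelSummary(
  documentId: string,
  summary: string,
  keyFacts: string[],
  model: string,
): DocumentSummary {
  return { documentId, summary, keyFacts, model, aiGenerated: true };
}
```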

Step 6
Search Indexing

Documents are indexed for both traditional full-text search and semantic (vector) search. This hybrid approach allows users to search by exact keywords or by meaning and concept.

  • Full-text search index (PostgreSQL tsvector)
  • Semantic vector embeddings for meaning-based search
  • Reciprocal Rank Fusion for hybrid search
  • Faceted filtering by type, source, date, entity
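
Reciprocal Rank Fusion, the merge step above, can be sketched directly; k = 60 is the constant from the original RRF paper (the archive's actual value is not stated here):

```typescript
// A document at (0-based) position i in a result list earns
// 1 / (k + i + 1); scores are summed across lists and the merged
// list is sorted by total score.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because only ranks are used, keyword and vector scores never have to be put on a common scale; a document ranked well by either search, or moderately by both, rises in the merged list.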

Step 7
Quality Assurance

Automated and manual quality checks ensure data accuracy. OCR outputs are validated against original PDFs, entity extractions are verified, and search results are tested for relevance.

  • OCR accuracy validation against source PDFs
  • Entity extraction spot-checking
  • Search relevance testing
  • Broken link and file integrity monitoring
  • Community-reported error correction
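
File integrity monitoring ties back to the digests recorded in step 1 and might look like this sketch (field names are illustrative):

```typescript
import { createHash } from "node:crypto";

// Periodic integrity sweep: recompute each stored file's SHA-256 and
// compare it to the digest recorded at acquisition. Returns the ids of
// files whose bytes no longer match; any mismatch indicates corruption
// or tampering and is flagged for review.
function verifyIntegrity(
  files: { id: string; bytes: Buffer; recordedSha256: string }[],
): string[] {
  return files
    .filter(
      f => createHash("sha256").update(f.bytes).digest("hex") !== f.recordedSha256,
    )
    .map(f => f.id);
}
```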

Technology Stack

Data Storage

  • PostgreSQL (via Supabase) for structured data
  • pgvector for semantic search embeddings
  • Object storage for original PDF files
  • Full-text search with tsvector indexing

AI & ML

  • Tesseract OCR for text extraction
  • Named Entity Recognition (NER) models
  • Anthropic Claude for summarization & Q&A
  • Embedding models for semantic search

Application

  • Next.js 15 (App Router) for the web application
  • TypeScript for type safety
  • Tailwind CSS for responsive design
  • Server-side rendering for SEO

Quality & Security

  • SHA-256 file integrity verification
  • Automated OCR accuracy testing
  • Row-level security on database
  • No user tracking or data collection

Data Quality Metrics

  • Documents with OCR text: 98.2%
  • Average OCR confidence: 94.7%
  • Documents with AI summaries: 85.3%
  • Documents with entity tags: 91.8%
  • Documents with date metadata: 78.4%
  • Original PDF files preserved: 100%

Frequently Asked Questions

What OCR technology is used?
The archive uses multi-engine OCR combining Tesseract (open source) with cloud-based OCR services. Multiple passes are run on each document, and results are compared for accuracy. Confidence scores are recorded per page.
How accurate is the text extraction?
OCR accuracy varies by document quality. High-quality typed documents typically achieve 99%+ accuracy. Handwritten notes, poor-quality scans, and faxed documents may have lower accuracy. OCR confidence scores are available for each document.
How are entities identified and linked?
Named Entity Recognition (NER) uses AI models to identify mentions of people, organizations, and locations. Entities are normalized (e.g., 'JE', 'Jeffrey', and 'Epstein' are linked to the same person) and cross-referenced across all documents.
Are AI-generated summaries reliable?
AI summaries are generated from the document text and are designed to capture key facts and mentioned individuals. They are clearly labeled as AI-generated and should be verified against the original documents. Summaries do not include information from outside the document.
How does hybrid search work?
Hybrid search combines traditional full-text keyword search with semantic vector search. Full-text search matches exact words and phrases. Semantic search understands meaning and concepts, finding relevant documents even when exact keywords don't match. Results are merged using Reciprocal Rank Fusion.