Methodology
Transparency in our process is as important as transparency in the documents. This page explains how the Epstein Document Archive processes, indexes, and organizes 207,251 documents from their original government-released PDF form into a searchable, cross-referenced archive.
Processing Pipeline
Every document in the archive passes through a seven-step processing pipeline. Each step is designed for accuracy, traceability, and completeness.
Step 1: Document Acquisition
Documents are downloaded from official government sources (DOJ, FBI Vault, FOIA.gov, etc.). Each download is logged with its source URL, timestamp, and SHA-256 file hash to maintain chain of custody (a hashing sketch follows the list below).
- Automated monitoring of government release pages
- SHA-256 hash verification for file integrity
- Original PDF files preserved without modification
- Source metadata recorded for every document
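To make the chain-of-custody logging concrete, here is a minimal TypeScript sketch using Node's built-in crypto module. The record shape and function name are illustrative, not the archive's actual code.

```ts
// Hypothetical download log entry; field names are examples only.
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

interface DownloadRecord {
  sourceUrl: string;    // official release page the PDF came from
  downloadedAt: string; // ISO-8601 timestamp
  sha256: string;       // hex digest of the original, unmodified PDF bytes
}

async function recordDownload(path: string, sourceUrl: string): Promise<DownloadRecord> {
  const bytes = await readFile(path);
  const sha256 = createHash("sha256").update(bytes).digest("hex");
  return { sourceUrl, downloadedAt: new Date().toISOString(), sha256 };
}
```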
Step 2: Text Extraction (OCR)
PDF documents are processed with Optical Character Recognition (OCR) to extract machine-readable text. Multi-pass OCR with per-page confidence scoring keeps accuracy high even on poor-quality scans (a simplified example follows the list).
- Multi-engine OCR (Tesseract + cloud providers)
- Confidence scoring per page and document
- Language detection for non-English content
- Layout analysis for tables, headers, and footnotes
- Page-level text extraction with position data
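A simplified, single-engine example of per-page confidence scoring, assuming the tesseract.js bindings; the production pipeline combines multiple engines, and the review threshold below is an invented example value.

```ts
// Single-pass OCR of one page image; low-confidence pages would be
// re-queued for another engine or manual review.
import { createWorker } from "tesseract.js";

async function ocrPage(imagePath: string) {
  const worker = await createWorker("eng");
  const { data } = await worker.recognize(imagePath);
  await worker.terminate();
  return {
    text: data.text,
    confidence: data.confidence,       // Tesseract's 0-100 mean confidence
    needsReview: data.confidence < 80, // example threshold, not the real one
  };
}
```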
Step 3: Document Classification
Each document is automatically classified by type (court record, FBI file, deposition, email, etc.) using a combination of content analysis, metadata parsing, and machine-learning classification (a rule-based stand-in is sketched after this list).
- Automated document type detection
- Date extraction from content and metadata
- Source classification and grouping
- Duplicate detection across data sets
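For illustration, a keyword-based stand-in for the type detector; the real step combines content analysis with metadata parsing and an ML model, and the patterns and type names here are invented.

```ts
type DocType = "court_record" | "fbi_file" | "deposition" | "email" | "unknown";

// Example patterns only; FD-302 is the FBI's standard interview form.
const patterns: Array<[RegExp, DocType]> = [
  [/UNITED STATES DISTRICT COURT/i, "court_record"],
  [/FEDERAL BUREAU OF INVESTIGATION|FD-302/i, "fbi_file"],
  [/DEPOSITION OF/i, "deposition"],
  [/^From:.*\n^To:/im, "email"],
];

function classify(text: string): DocType {
  for (const [pattern, type] of patterns) {
    if (pattern.test(text)) return type;
  }
  return "unknown"; // would fall through to the ML classifier
}
```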
Step 4: Entity Extraction
Named Entity Recognition (NER) identifies people, organizations, locations, dates, and other entities mentioned in each document. Entities are normalized, deduplicated, and linked across the archive (see the alias-resolution sketch below).
- AI-powered Named Entity Recognition
- Entity normalization (alias resolution)
- Cross-document entity linking
- Relationship extraction between entities
- Confidence scoring for entity mentions
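A minimal sketch of alias resolution, assuming a lookup table from normalized surface forms to canonical entity IDs; the table contents and ID scheme are hypothetical.

```ts
// Maps lowercased, whitespace-normalized mentions to a canonical entity ID.
const aliasToCanonical = new Map<string, string>([
  ["j. doe", "person:jane-doe"],
  ["jane doe", "person:jane-doe"],
  ["doe, jane", "person:jane-doe"],
]);

function normalizeMention(mention: string): string {
  const key = mention.trim().toLowerCase().replace(/\s+/g, " ");
  // Unknown mentions get a provisional ID and are queued for review.
  return aliasToCanonical.get(key) ?? `unresolved:${key}`;
}
```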
Step 5: AI Summarization
AI generates a concise summary for each document, identifying key facts, mentioned individuals, and significant details. Summaries are clearly labeled as AI-generated and link back to the source text (an example API call follows the list).
- Abstractive summaries using large language models
- Key fact extraction and highlighting
- Summary quality scoring
- All summaries labeled as AI-generated
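A sketch of the summarization call using the Anthropic TypeScript SDK; the prompt, token limit, and model name are example choices, not necessarily what the archive runs.

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function summarize(documentText: string): Promise<string> {
  const message = await client.messages.create({
    model: "claude-3-5-sonnet-20241022", // example model choice
    max_tokens: 512,
    messages: [{
      role: "user",
      content: `Summarize the key facts, named individuals, and significant details of this document:\n\n${documentText}`,
    }],
  });
  const block = message.content[0];
  // The summary is stored with an explicit AI-generated flag before display.
  return block.type === "text" ? block.text : "";
}
```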
Step 6: Search Indexing
Documents are indexed for both traditional full-text search and semantic (vector) search. This hybrid approach lets users search by exact keyword or by meaning and concept (the rank-fusion step is sketched after this list).
- Full-text search index (PostgreSQL tsvector)
- Semantic vector embeddings for meaning-based search
- Reciprocal Rank Fusion for hybrid search
- Faceted filtering by type, source, date, entity
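The fusion step is small enough to sketch in full: Reciprocal Rank Fusion scores each document by summing 1/(k + rank) across the two ranked lists, so documents that score well under either method surface near the top. The constant k = 60 is the value commonly used in the RRF literature; function and variable names here are illustrative.

```ts
// Merge full-text and vector result lists (arrays of document IDs,
// best match first) into a single RRF-ranked list.
function rrfMerge(fullText: string[], semantic: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [fullText, semantic]) {
    list.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```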
Step 7: Quality Assurance
Automated and manual quality checks guard data accuracy: OCR output is validated against the original PDFs, entity extractions are spot-checked, and search results are tested for relevance (an integrity-check sketch follows the list).
- OCR accuracy validation against source PDFs
- Entity extraction spot-checking
- Search relevance testing
- Broken link and file integrity monitoring
- Community-reported error correction
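As one example of the integrity monitoring, a sketch that recomputes a stored PDF's SHA-256 and compares it with the hash recorded at acquisition; the function name is hypothetical.

```ts
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

async function verifyIntegrity(path: string, expectedSha256: string): Promise<boolean> {
  const bytes = await readFile(path);
  const actual = createHash("sha256").update(bytes).digest("hex");
  return actual === expectedSha256; // a mismatch flags the file for investigation
}
```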
Technology Stack
Data Storage
- PostgreSQL (via Supabase) for structured data
- pgvector for semantic search embeddings
- Object storage for original PDF files
- Full-text search with tsvector indexing
AI & ML
- Tesseract OCR for text extraction
- Named Entity Recognition (NER) models
- Anthropic Claude for summarization & Q&A
- Embedding models for semantic search
Application
- Next.js 15 (App Router) for the web application
- TypeScript for type safety
- Tailwind CSS for responsive design
- Server-side rendering for SEO
Quality & Security
- SHA-256 file integrity verification
- Automated OCR accuracy testing
- Row-level security on database
- No user tracking or data collection
Data Quality Metrics
- Documents with OCR text: 98.2%
- Average OCR confidence: 94.7%
- Documents with AI summaries: 85.3%
- Documents with entity tags: 91.8%
- Documents with date metadata: 78.4%
- Original PDF files preserved: 100%