Methodology
Transparency in our process is as important as transparency in the documents. This page explains how the Epstein Document Archive processes, indexes, and organizes 207,251 documents from their original government-released PDF form into a searchable, cross-referenced archive.
Processing Pipeline
Every document in the archive passes through a seven-step processing pipeline. Each step is designed for accuracy, traceability, and completeness.
Step 1: Document Acquisition
Documents are downloaded from official government sources (DOJ, FBI Vault, FOIA.gov, etc.). Each download is logged with its source URL, timestamp, and SHA-256 file hash to maintain chain of custody (a hashing sketch follows the list below).
- Automated monitoring of government release pages
- SHA-256 hash verification for file integrity
- Original PDF files preserved without modification
- Source metadata recorded for every document
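To make the chain-of-custody logging concrete, here is a minimal TypeScript sketch using Node's built-in crypto module. The record shape and function name are illustrative, not the archive's actual code.

```ts
// Hypothetical download log entry; field names are examples only.
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

interface DownloadRecord {
  sourceUrl: string;    // official release page the PDF came from
  downloadedAt: string; // ISO-8601 timestamp
  sha256: string;       // hex digest of the original, unmodified PDF bytes
}

async function recordDownload(path: string, sourceUrl: string): Promise<DownloadRecord> {
  const bytes = await readFile(path);
  const sha256 = createHash("sha256").update(bytes).digest("hex");
  return { sourceUrl, downloadedAt: new Date().toISOString(), sha256 };
}
```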
Step 2: Text Extraction (OCR)
PDF documents are processed with Optical Character Recognition (OCR) to extract machine-readable text. Multi-pass OCR with per-page confidence scoring keeps accuracy high even on poor-quality scans (a simplified example follows the list).
- Multi-engine OCR (Tesseract + cloud providers)
- Confidence scoring per page and document
- Language detection for non-English content
- Layout analysis for tables, headers, and footnotes
- Page-level text extraction with position data
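A simplified, single-engine example of per-page confidence scoring, assuming the tesseract.js bindings; the production pipeline combines multiple engines, and the review threshold below is an invented example value.

```ts
// Single-pass OCR of one page image; low-confidence pages would be
// re-queued for another engine or manual review.
import { createWorker } from "tesseract.js";

async function ocrPage(imagePath: string) {
  const worker = await createWorker("eng");
  const { data } = await worker.recognize(imagePath);
  await worker.terminate();
  return {
    text: data.text,
    confidence: data.confidence,       // Tesseract's 0-100 mean confidence
    needsReview: data.confidence < 80, // example threshold, not the real one
  };
}
```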
Step 3: Document Classification
Each document is automatically classified by type (court record, FBI file, deposition, email, etc.) using a combination of content analysis, metadata parsing, and machine-learning classification (a rule-based stand-in is sketched after this list).
- Automated document type detection
- Date extraction from content and metadata
- Source classification and grouping
- Duplicate detection across data sets
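For illustration, a keyword-based stand-in for the type detector; the real step combines content analysis with metadata parsing and an ML model, and the patterns and type names here are invented.

```ts
type DocType = "court_record" | "fbi_file" | "deposition" | "email" | "unknown";

// Example patterns only; FD-302 is the FBI's standard interview form.
const patterns: Array<[RegExp, DocType]> = [
  [/UNITED STATES DISTRICT COURT/i, "court_record"],
  [/FEDERAL BUREAU OF INVESTIGATION|FD-302/i, "fbi_file"],
  [/DEPOSITION OF/i, "deposition"],
  [/^From:.*\n^To:/im, "email"],
];

function classify(text: string): DocType {
  for (const [pattern, type] of patterns) {
    if (pattern.test(text)) return type;
  }
  return "unknown"; // would fall through to the ML classifier
}
```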
Step 4: Entity Extraction
Named Entity Recognition (NER) identifies people, organizations, locations, dates, and other entities mentioned in each document. Entities are normalized, deduplicated, and linked across the archive (see the alias-resolution sketch below).
- AI-powered Named Entity Recognition
- Entity normalization (alias resolution)
- Cross-document entity linking
- Relationship extraction between entities
- Confidence scoring for entity mentions
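A minimal sketch of alias resolution, assuming a lookup table from normalized surface forms to canonical entity IDs; the table contents and ID scheme are hypothetical.

```ts
// Maps lowercased, whitespace-normalized mentions to a canonical entity ID.
const aliasToCanonical = new Map<string, string>([
  ["j. doe", "person:jane-doe"],
  ["jane doe", "person:jane-doe"],
  ["doe, jane", "person:jane-doe"],
]);

function normalizeMention(mention: string): string {
  const key = mention.trim().toLowerCase().replace(/\s+/g, " ");
  // Unknown mentions get a provisional ID and are queued for review.
  return aliasToCanonical.get(key) ?? `unresolved:${key}`;
}
```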
Step 5: AI Summarization
AI generates a concise summary for each document, identifying key facts, mentioned individuals, and significant details. Summaries are clearly labeled as AI-generated and link back to the source text (an example API call follows the list).
- Abstractive summaries using large language models
- Key fact extraction and highlighting
- Summary quality scoring
- All summaries labeled as AI-generated
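A sketch of the summarization call using the Anthropic TypeScript SDK; the prompt, token limit, and model name are example choices, not necessarily what the archive runs.

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function summarize(documentText: string): Promise<string> {
  const message = await client.messages.create({
    model: "claude-3-5-sonnet-20241022", // example model choice
    max_tokens: 512,
    messages: [{
      role: "user",
      content: `Summarize the key facts, named individuals, and significant details of this document:\n\n${documentText}`,
    }],
  });
  const block = message.content[0];
  // The summary is stored with an explicit AI-generated flag before display.
  return block.type === "text" ? block.text : "";
}
```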
Step 6: Search Indexing
Documents are indexed for both traditional full-text search and semantic (vector) search. This hybrid approach lets users search by exact keyword or by meaning and concept (the rank-fusion step is sketched after this list).
- Full-text search index (PostgreSQL tsvector)
- Semantic vector embeddings for meaning-based search
- Reciprocal Rank Fusion for hybrid search
- Faceted filtering by type, source, date, entity
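The fusion step is small enough to sketch in full: Reciprocal Rank Fusion scores each document by summing 1/(k + rank) across the two ranked lists, so documents that score well under either method surface near the top. The constant k = 60 is the value commonly used in the RRF literature; function and variable names here are illustrative.

```ts
// Merge full-text and vector result lists (arrays of document IDs,
// best match first) into a single RRF-ranked list.
function rrfMerge(fullText: string[], semantic: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [fullText, semantic]) {
    list.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```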
Step 7: Quality Assurance
Automated and manual quality checks guard data accuracy: OCR output is validated against the original PDFs, entity extractions are spot-checked, and search results are tested for relevance (an integrity-check sketch follows the list).
- OCR accuracy validation against source PDFs
- Entity extraction spot-checking
- Search relevance testing
- Broken link and file integrity monitoring
- Community-reported error correction
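As one example of the integrity monitoring, a sketch that recomputes a stored PDF's SHA-256 and compares it with the hash recorded at acquisition; the function name is hypothetical.

```ts
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

async function verifyIntegrity(path: string, expectedSha256: string): Promise<boolean> {
  const bytes = await readFile(path);
  const actual = createHash("sha256").update(bytes).digest("hex");
  return actual === expectedSha256; // a mismatch flags the file for investigation
}
```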
Technology Stack
Data Storage
- PostgreSQL (via Supabase) for structured data
- pgvector for semantic search embeddings
- Object storage for original PDF files
- Full-text search with tsvector indexing
AI & ML
- Tesseract OCR for text extraction
- Named Entity Recognition (NER) models
- Anthropic Claude for summarization & Q&A
- Embedding models for semantic search
Application
- Next.js 15 (App Router) for the web application
- TypeScript for type safety
- Tailwind CSS for responsive design
- Server-side rendering for SEO
Quality & Security
- SHA-256 file integrity verification
- Automated OCR accuracy testing
- Row-level security on database
- No user tracking or data collection
Data Quality Metrics
- Documents with OCR text: 98.2%
- Average OCR confidence: 94.7%
- Documents with AI summaries: 85.3%
- Documents with entity tags: 91.8%
- Documents with date metadata: 78.4%
- Original PDF files preserved: 100%