Want to build something like this? Here's how we indexed 109,000+ documents from the DOJ Epstein Files release.
EpsteIN by cfinke is the extraction tool that made this possible. It handles the DOJ's paginated PDF releases and extracts text content.
You can search the Meilisearch index directly. Read operations require no authentication.
curl -X POST "https://epstein.dugganusa.com/indexes/epstein_files/search" \
  -H "Content-Type: application/json" \
  -d '{"q": "Prince Andrew", "limit": 20}'
Example response:
{
  "hits": [
    {
      "efta_id": "EFTA00022136",
      "content": "...",
      "dataset": "dataset3",
      "pages": 5,
      "people": ["prince_andrew", "virginia_giuffre"],
      "locations": ["new_york"]
    }
  ],
  "estimatedTotalHits": 228,
  "processingTimeMs": 12
}
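If you would rather call the endpoint from a script than from curl, here is a minimal Python sketch of the same request using only the standard library. The URL, payload, and field names come from the example above; the function names (`build_search_request`, `search`, `summarize_hits`) are our own illustrative choices, and the response is assumed to have the shape shown in the sample.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
SEARCH_URL = "https://epstein.dugganusa.com/indexes/epstein_files/search"


def build_search_request(query, limit=20):
    """Build the same POST request as the curl example, without sending it."""
    body = json.dumps({"q": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, limit=20, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_search_request(query, limit)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def summarize_hits(response):
    """Reduce a response to (efta_id, people) pairs.

    Assumes the response shape from the sample above; missing fields
    fall back to None or an empty list instead of raising.
    """
    return [
        (hit.get("efta_id"), hit.get("people", []))
        for hit in response.get("hits", [])
    ]
```

Calling `search("Prince Andrew")` sends the request. `summarize_hits` works on the returned dict and touches no network, so you can try it against a saved response first.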
Breaking eggs: This index exists because someone decided completeness mattered more than perfect formatting. If you're building something similar, don't let perfect be the enemy of done. Index what you have, improve later.
We indexed:
The source documents are public record from justice.gov. Our index adds searchability and entity extraction on top.
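For anyone building something similar, one simple way to produce a field like `people` is plain alias matching over the extracted text. The sketch below is an illustration under that assumption, not a description of the pipeline behind this index: the alias table, `tag_people`, and `build_document` are hypothetical names, and only the record fields mirror the example response.

```python
import re

# Hypothetical alias table: display name -> slug, in the style of the
# `people` values shown in the example response.
PEOPLE_ALIASES = {
    "Prince Andrew": "prince_andrew",
    "Virginia Giuffre": "virginia_giuffre",
}


def tag_people(text, aliases=PEOPLE_ALIASES):
    """Return sorted slugs whose alias occurs in text as a whole phrase.

    Matching is case-insensitive and bounded by word boundaries, so a
    name embedded inside a longer word does not count.
    """
    found = set()
    for name, slug in aliases.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(slug)
    return sorted(found)


def build_document(efta_id, dataset, pages, content):
    """Assemble a record with the fields seen in the example response."""
    return {
        "efta_id": efta_id,
        "content": content,
        "dataset": dataset,
        "pages": pages,
        "people": tag_people(content),
    }
```

Alias matching is crude (it misses misspellings and OCR errors), but it fits the "index what you have, improve later" approach: the tags can be recomputed whenever the alias table grows.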
Questions? Contact us.