— Methodology
How we built this
1. Source acquisition
We fetch the war.gov manifest at /Portals/1/Interactive/2026/UFO/uap-csv.csv daily and diff it against our index. New rows trigger downloads (PDFs and images directly; video via the DVIDS API). We snapshot every CSV we receive so you can audit exactly what we ingested.
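The daily diff amounts to a set difference over a unique key column. A minimal sketch (the column names here are hypothetical; the real manifest schema may differ):

```python
import csv
import io

def new_rows(previous_csv: str, current_csv: str, key_field: str = "url") -> list[dict]:
    """Return manifest rows present today but absent from yesterday's snapshot.

    `key_field` is a hypothetical unique column, not the actual manifest schema.
    """
    prev_keys = {row[key_field] for row in csv.DictReader(io.StringIO(previous_csv))}
    return [
        row
        for row in csv.DictReader(io.StringIO(current_csv))
        if row[key_field] not in prev_keys
    ]

yesterday = "url,title\n/doc1.pdf,Roswell memo\n"
today = "url,title\n/doc1.pdf,Roswell memo\n/doc2.pdf,New release\n"
print(new_rows(yesterday, today))  # only the /doc2.pdf row is new
```

Each returned row then feeds the download queue; snapshotting both CSVs makes the diff reproducible after the fact.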
2. OCR + entity extraction
Every PDF page is rasterized at 200 DPI via PyMuPDF, then sent to OpenAI's gpt-4o-mini with a structured-output schema asking for: faithful transcription, redaction count, page-quality grade, and named entities (people, places, dates, agencies, case numbers). Pages flagged low-confidence are re-processed with gpt-4o.
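Two pieces of that pipeline can be sketched in isolation: the DPI-to-zoom conversion (PyMuPDF renders at 72 DPI by default, so a zoom matrix scales it up) and the escalation rule. The 0.7 threshold below is an assumption borrowed from the confidence note in the "What can fail" section, not the exact internal cutoff:

```python
def zoom_for_dpi(dpi: int) -> float:
    """PyMuPDF's base render is 72 DPI; zoom is the ratio to the target DPI."""
    return dpi / 72  # 200 DPI -> ~2.78x

# Hypothetical escalation threshold; the production value is internal.
LOW_CONFIDENCE = 0.7

def model_for_page(confidence: float) -> str:
    """Route low-confidence pages to the stronger model for a second pass."""
    return "gpt-4o" if confidence < LOW_CONFIDENCE else "gpt-4o-mini"

print(round(zoom_for_dpi(200), 2))  # 2.78
print(model_for_page(0.55))         # gpt-4o
```

In PyMuPDF itself, the zoom value would be passed as `fitz.Matrix(zoom, zoom)` to the page rasterizer.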
3. Document-level summarization
After OCR completes, the full text is summarized by GPT-4o with strict tone guidelines: calm, factual, journalistic. The model produces a one-sentence headline, a short summary, a long markdown explainer, key entities, cross-references with verbatim evidence, and a tone classification (official_record, redaction_heavy, speculative_content, low_confidence).
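The structured output described above can be modeled as a typed record with the tone classification validated against the four allowed values. Field names here are illustrative, not the exact production schema:

```python
from dataclasses import dataclass, field

# The four tone labels named in the methodology.
TONES = {"official_record", "redaction_heavy", "speculative_content", "low_confidence"}

@dataclass
class DocumentSummary:
    headline: str                 # one-sentence headline
    short_summary: str
    long_summary_md: str          # long markdown explainer
    key_entities: list[str] = field(default_factory=list)
    tone: str = "official_record"

    def __post_init__(self) -> None:
        if self.tone not in TONES:
            raise ValueError(f"unknown tone: {self.tone}")

s = DocumentSummary(
    headline="1952 memo describes radar contact over D.C.",
    short_summary="A short factual recap.",
    long_summary_md="## Context\n...",
    tone="official_record",
)
```

Rejecting unknown tones at construction time keeps bad model output from silently entering the database.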
4. Search
We use Postgres full-text search with tsvector generated columns on documents (title, agency, location, description) and ai_summaries (headline, short summary, long summary). Page-level OCR is also indexed. Ranking uses ts_rank_cd with weighted scores; snippets via ts_headline.
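A sketch of the ranking query, assuming a generated tsvector column named `search` and a `description` column (both names hypothetical):

```python
# Parameterized Postgres full-text query: match, rank with ts_rank_cd,
# and build a snippet with ts_headline. websearch_to_tsquery parses
# user-style queries ("quoted phrases", OR, -exclusions) safely.
SEARCH_SQL = """
SELECT d.id,
       ts_rank_cd(d.search, q) AS rank,
       ts_headline('english', d.description, q) AS snippet
FROM documents d,
     websearch_to_tsquery('english', %(query)s) q
WHERE d.search @@ q
ORDER BY rank DESC
LIMIT 20;
"""
```

Weighting comes from building the generated column with `setweight` per field (e.g. title as weight A, description as C), which `ts_rank_cd` then factors into the score.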
5. Ask-this-document
Each document has an “ask the AI” chat. The model receives only that single document's OCR'd text plus your question. It is instructed to refuse speculation, quote passages where possible, and answer “the document does not say that” when the answer isn't supported.
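The grounding contract can be sketched as prompt assembly: only the one document's text is included, and the refusal behavior is spelled out in the system message. Wording here is illustrative, not the production prompt:

```python
SYSTEM_PROMPT = (
    "Answer only from the document text provided. Quote passages where "
    "possible. If the answer is not supported by the document, reply: "
    "'the document does not say that'. Do not speculate."
)

def build_messages(document_text: str, question: str) -> list[dict]:
    """Assemble a single-document chat: no other documents are in context."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"DOCUMENT:\n{document_text}\n\nQUESTION:\n{question}"},
    ]
```

Scoping the context to one document keeps answers auditable: any quoted passage must appear verbatim in that document's OCR text.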
6. Editorial overrides
When AI summaries get something wrong (rare but possible — particularly on heavily redacted documents), our admin panel lets editors override headlines and long summaries. The original AI output is preserved for transparency.
7. What can fail
- OCR confidence on the worst 1947 carbon copies can drop below 70%. We surface the confidence score on every page.
- Heavily redacted documents may yield uninformative summaries. We tag these as redaction_heavy.
- Date inference from filenames is heuristic. Always defer to the document's internal dates if they conflict.
- Geocoding uses a small lookup table for common UAP incident locations; many records have no precise coordinates.