— Methodology
How we built this
1. Source acquisition
We fetch the war.gov manifest at /Portals/1/Interactive/2026/UFO/uap-csv.csv daily and diff it against our index. New rows trigger downloads (PDFs and images directly; video via the DVIDS API). We snapshot every CSV we receive so you can audit exactly what we ingested.
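The daily diff amounts to a set difference over a unique key column. A minimal sketch (the column names here are hypothetical; the real manifest schema may differ):

```python
import csv
import io

def new_rows(previous_csv: str, current_csv: str, key_field: str = "url") -> list[dict]:
    """Return manifest rows present today but absent from yesterday's snapshot.

    `key_field` is a hypothetical unique column, not the actual manifest schema.
    """
    prev_keys = {row[key_field] for row in csv.DictReader(io.StringIO(previous_csv))}
    return [
        row
        for row in csv.DictReader(io.StringIO(current_csv))
        if row[key_field] not in prev_keys
    ]

yesterday = "url,title\n/doc1.pdf,Roswell memo\n"
today = "url,title\n/doc1.pdf,Roswell memo\n/doc2.pdf,New release\n"
print(new_rows(yesterday, today))  # only the /doc2.pdf row is new
```

Each returned row then feeds the download queue; snapshotting both CSVs makes the diff reproducible after the fact.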
2. OCR + entity extraction
Every PDF page is rasterized at 200 DPI via PyMuPDF, then sent to OpenAI's gpt-4o-mini with a structured-output schema asking for: faithful transcription, redaction count, page-quality grade, and named entities (people, places, dates, agencies, case numbers). Pages flagged low-confidence are re-processed with gpt-4o.
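Two pieces of that pipeline can be sketched in isolation: the DPI-to-zoom conversion (PyMuPDF renders at 72 DPI by default, so a zoom matrix scales it up) and the escalation rule. The 0.7 threshold below is an assumption borrowed from the confidence note in the "What can fail" section, not the exact internal cutoff:

```python
def zoom_for_dpi(dpi: int) -> float:
    """PyMuPDF's base render is 72 DPI; zoom is the ratio to the target DPI."""
    return dpi / 72  # 200 DPI -> ~2.78x

# Hypothetical escalation threshold; the production value is internal.
LOW_CONFIDENCE = 0.7

def model_for_page(confidence: float) -> str:
    """Route low-confidence pages to the stronger model for a second pass."""
    return "gpt-4o" if confidence < LOW_CONFIDENCE else "gpt-4o-mini"

print(round(zoom_for_dpi(200), 2))  # 2.78
print(model_for_page(0.55))         # gpt-4o
```

In PyMuPDF itself, the zoom value would be passed as `fitz.Matrix(zoom, zoom)` to the page rasterizer.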
3. Document-level summarization
After OCR completes, the full text is summarized by GPT-4o with strict tone guidelines: calm, factual, journalistic. The model produces a one-sentence headline, a short summary, a long markdown explainer, key entities, cross-references with verbatim evidence, and a tone classification (official_record, redaction_heavy, speculative_content, low_confidence).
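The structured output described above can be modeled as a typed record with the tone classification validated against the four allowed values. Field names here are illustrative, not the exact production schema:

```python
from dataclasses import dataclass, field

# The four tone labels named in the methodology.
TONES = {"official_record", "redaction_heavy", "speculative_content", "low_confidence"}

@dataclass
class DocumentSummary:
    headline: str                 # one-sentence headline
    short_summary: str
    long_summary_md: str          # long markdown explainer
    key_entities: list[str] = field(default_factory=list)
    tone: str = "official_record"

    def __post_init__(self) -> None:
        if self.tone not in TONES:
            raise ValueError(f"unknown tone: {self.tone}")

s = DocumentSummary(
    headline="1952 memo describes radar contact over D.C.",
    short_summary="A short factual recap.",
    long_summary_md="## Context\n...",
    tone="official_record",
)
```

Rejecting unknown tones at construction time keeps bad model output from silently entering the database.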
4. Search
We use Postgres full-text search with tsvector generated columns on documents (title, agency, location, description) and ai_summaries (headline, short summary, long summary). Page-level OCR is also indexed. Ranking uses ts_rank_cd with weighted scores; snippets via ts_headline.
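A sketch of the ranking query, assuming a generated tsvector column named `search` and a `description` column (both names hypothetical):

```python
# Parameterized Postgres full-text query: match, rank with ts_rank_cd,
# and build a snippet with ts_headline. websearch_to_tsquery parses
# user-style queries ("quoted phrases", OR, -exclusions) safely.
SEARCH_SQL = """
SELECT d.id,
       ts_rank_cd(d.search, q) AS rank,
       ts_headline('english', d.description, q) AS snippet
FROM documents d,
     websearch_to_tsquery('english', %(query)s) q
WHERE d.search @@ q
ORDER BY rank DESC
LIMIT 20;
"""
```

Weighting comes from building the generated column with `setweight` per field (e.g. title as weight A, description as C), which `ts_rank_cd` then factors into the score.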
5. Ask-this-document
Each document has an “ask the AI” chat. The model receives only that single document's OCR'd text plus your question. It is instructed to refuse speculation, quote passages where possible, and answer “the document does not say that” when the answer isn't supported.
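The grounding contract can be sketched as prompt assembly: only the one document's text is included, and the refusal behavior is spelled out in the system message. Wording here is illustrative, not the production prompt:

```python
SYSTEM_PROMPT = (
    "Answer only from the document text provided. Quote passages where "
    "possible. If the answer is not supported by the document, reply: "
    "'the document does not say that'. Do not speculate."
)

def build_messages(document_text: str, question: str) -> list[dict]:
    """Assemble a single-document chat: no other documents are in context."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"DOCUMENT:\n{document_text}\n\nQUESTION:\n{question}"},
    ]
```

Scoping the context to one document keeps answers auditable: any quoted passage must appear verbatim in that document's OCR text.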
6. Editorial overrides
When AI summaries get something wrong (rare but possible — particularly on heavily redacted documents), our admin panel lets editors override headlines and long summaries. The original AI output is preserved for transparency.
7. What can fail
- OCR confidence on the worst 1947 carbon copies can drop below 70%. We surface the confidence score on every page.
- Heavily redacted documents may yield uninformative summaries. We tag these as redaction_heavy.
- Date inference from filenames is heuristic. Always defer to the document's internal dates if they conflict.
- Geocoding uses a small lookup table for common UAP incident locations; many records have no precise coordinates.