Context
No formal schema validation exists for the pipeline's JSON output files. Upstream changes (e.g., images: [] -> images: null) and LLM output drift cause silent failures in downstream consumers (posters, Discord, website). Currently only extract-facts.py has manual field-presence checks.
Approach
Pure Python validators per script (no new dependencies like Pydantic -- respects the script independence pattern). Validation runs after LLM parse, before file write. Failures are logged and recorded in _metadata but don't block file write (degraded operation).
Scope
3A. scripts/etl/extract-facts.py -- Add validate_facts_schema()
~80-line function validating:
- Required top-level: briefing_date (str), overall_summary (str), categories (dict)
- Optional: key_facts (list[str]), open_questions (list[str])
- Tags: themes (list), sentiment (dict with overall + context), story_type (list)
- Categories: github_updates (dict), all others (list)
Call after tags are merged (~line 682), before writing _metadata.
Add _metadata.schema_validation ("passed"/"failed") and a _metadata.schema_errors list.
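A minimal sketch of the validator, assuming the field names listed above and that the merged tag fields live at the top level of the facts dict (the exact location is an assumption to confirm against the script):

```python
def validate_facts_schema(data: dict) -> list[str]:
    """Return a list of schema errors; an empty list means validation passed."""
    errors = []

    # Required top-level fields and their expected types
    for field, expected in (("briefing_date", str),
                            ("overall_summary", str),
                            ("categories", dict)):
        if field not in data:
            errors.append(f"missing required field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")

    # Optional fields: check type only when present
    for field in ("key_facts", "open_questions"):
        if field in data and not isinstance(data[field], list):
            errors.append(f"{field}: expected list")

    # Tag fields (assumed top-level after the tag merge)
    for field in ("themes", "story_type"):
        if field in data and not isinstance(data[field], list):
            errors.append(f"{field}: expected list")
    sentiment = data.get("sentiment")
    if sentiment is not None:
        if not isinstance(sentiment, dict):
            errors.append("sentiment: expected dict")
        else:
            for key in ("overall", "context"):
                if key not in sentiment:
                    errors.append(f"sentiment: missing key {key}")

    # Categories: github_updates is a dict, every other category a list
    categories = data.get("categories")
    if isinstance(categories, dict):
        for name, value in categories.items():
            expected = dict if name == "github_updates" else list
            if not isinstance(value, expected):
                errors.append(f"categories.{name}: expected {expected.__name__}")

    return errors


# At the call site (after the tag merge, before the file write) -- degraded
# operation: record the outcome in _metadata but never block the write.
# errors = validate_facts_schema(facts)
# facts["_metadata"]["schema_validation"] = "failed" if errors else "passed"
# facts["_metadata"]["schema_errors"] = errors
```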
3B. scripts/etl/generate-council-context.py -- Annotate existing validation
Lines 354-387 already validate the full nested schema. Just add:
- _metadata.schema_validation = "passed" on success (~line 402)
- _metadata.schema_validation = "failed" in the ValueError catch (line 419)
- ~5 lines total
3C. scripts/etl/generate-daily-highlights.py -- Add validate_highlights_schema()
~50-line function validating:
- Required:
date (str), highlights (list)
- Each highlight:
headline (str), body (str), character (str), sources (list)
Call in generate_highlights() before return.
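A sketch of the highlights validator under the same conventions as 3A (field names from the spec above; per-item error paths are a suggested format):

```python
def validate_highlights_schema(data: dict) -> list[str]:
    """Return schema errors for a daily-highlights payload; empty means valid."""
    errors = []
    if not isinstance(data.get("date"), str):
        errors.append("date: expected str")
    highlights = data.get("highlights")
    if not isinstance(highlights, list):
        errors.append("highlights: expected list")
        return errors
    for i, item in enumerate(highlights):
        if not isinstance(item, dict):
            errors.append(f"highlights[{i}]: expected dict")
            continue
        # Each highlight must carry all four fields with the right types
        for field, expected in (("headline", str), ("body", str),
                                ("character", str), ("sources", list)):
            if not isinstance(item.get(field), expected):
                errors.append(
                    f"highlights[{i}].{field}: expected {expected.__name__}")
    return errors
```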
Files to modify

| File | Est. lines changed |
| --- | --- |
| scripts/etl/extract-facts.py | ~90 added |
| scripts/etl/generate-council-context.py | ~5 added |
| scripts/etl/generate-daily-highlights.py | ~60 added |
Verification
Run extract-facts.py against daily.json and confirm that _metadata.schema_validation is "passed" in the output.
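The check can be scripted rather than eyeballed; a small sketch (the output file path depends on how the pipeline is invoked, so it is left as a parameter):

```python
import json


def schema_validation_passed(path: str) -> bool:
    """Load a written output file and confirm the schema-validation flag."""
    with open(path) as f:
        data = json.load(f)
    return data.get("_metadata", {}).get("schema_validation") == "passed"
```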