top of page

02_dataset_construction_standard

Purpose

  • defines the required section structure for every dataset

  • prevents drift by centralizing the process definition

  • each dataset “instance” references this standard rather than re-explaining it


Section structure

  • 00 dataset summary

  • 01 final dataset

  • 02 accuracy testing

  • 03 completion testing

  • 04 data output product

  • 05 data generation method

  • 06 source credibility review

  • 07 public domain review

  • 08 source data identification


Section requirements

00 dataset summary

must contain

  • dataset id + dataset name

  • dataset purpose (1–3 lines)

  • scope boundaries (what included / excluded)

  • workpaper list + artifact links (index)

  • reference to this standard (id/version)

optional

  • known limitations

  • current status


01 final dataset

must contain

  • final dataset artifact inventory (files, formats, counts)

  • any adjustments/additions applied after raw output (explicit list)

  • anomaly log (what deviates from pattern, with counts if possible)

  • final integrity check reference (hash/tie-out)

optional

  • field normalization rules used at final stage


02 accuracy testing

must contain

  • test objective

  • population definition

  • sample definition + selection method

  • test method definition (what “agreement” means)

  • results (control totals + exceptions list)

  • conclusion (pass/fail + explanation of exceptions)

optional

  • C02 integrity control (recommended when manual copy/reformat occurs)


03 completion testing

must contain

  • completeness criteria (chapters/verses/rows or equivalent)

  • expected totals source (what benchmark you used)

  • calculated totals

  • reconciliation results (differences = 0 or list)

optional

  • per-book breakdown tables


04 data output product

must contain

  • list of output artifacts (what was produced)

  • naming convention rules

  • schema summary (field list + order)

  • where artifacts live locally

  • disclosure status (publishable / not publishable)


05 data generation method

must contain

  • authoritative script/process identification

  • inputs + dependencies

  • execution instructions (re-run steps)

  • output expectations (what files should be produced)

  • risks/assumptions tied to generation (site structure, runtime, etc.)


06 source credibility review

must contain

  • whether credibility review is applicable

  • if applicable: what criteria used + outcome

  • if not applicable: explicit “not primary source” statement


07 public domain review

must contain

  • whether data is publishable

  • what evidence exists on licensing status

  • builder conclusion (publish / do not publish)

  • constraints for downstream use


08 source data identification

must contain

  • source description + canonical URL(s) / identifier(s)

  • population definition (what universe you pulled from)

  • data elements list (fields pulled and what they represent)

  • exclusions and boundaries


Linking pattern (simple and stable)

Use URL slugs that mirror your IDs so machines don’t have to guess:

  • /standards/02_dataset_construction_standard

  • /standards/02_dataset_construction_standard#00_dataset_summary

  • /standards/02_dataset_construction_standard#05_data_generation_method

And for dataset instances:

  • /datasets/aux01/00_dataset_summary

  • /datasets/aux01/05_data_generation_method

Each dataset page can include one line at top:

  • “follows: 02_dataset_construction_standard”

bottom of page