can a system be built to examine the text? tbd.
02_dataset_construction_standard
Purpose
-
defines the required section structure for every dataset
-
prevents drift by centralizing the process definition
-
each dataset “instance” references this standard rather than re-explaining it
Section structure
-
00 dataset summary
-
01 final dataset
-
02 accuracy testing
-
03 completion testing
-
04 data output product
-
05 data generation method
-
06 source credibility review
-
07 public domain review
-
08 source data identification
Section requirements
00 dataset summary
must contain
-
dataset id + dataset name
-
dataset purpose (1–3 lines)
-
scope boundaries (what included / excluded)
-
workpaper list + artifact links (index)
-
reference to this standard (id/version)
optional
-
known limitations
-
current status
01 final dataset
must contain
-
final dataset artifact inventory (files, formats, counts)
-
any adjustments/additions applied after raw output (explicit list)
-
anomaly log (what deviates from pattern, with counts if possible)
-
final integrity check reference (hash/tie-out)
optional
-
field normalization rules used at final stage
02 accuracy testing
must contain
-
test objective
-
population definition
-
sample definition + selection method
-
test method definition (what “agreement” means)
-
results (control totals + exceptions list)
-
conclusion (pass/fail + explanation of exceptions)
optional
-
C02 integrity control (recommended when manual copy/reformat occurs)
03 completion testing
must contain
-
completeness criteria (chapters/verses/rows or equivalent)
-
expected totals source (what benchmark you used)
-
calculated totals
-
reconciliation results (differences = 0 or list)
optional
-
per-book breakdown tables
04 data output product
must contain
-
list of output artifacts (what was produced)
-
naming convention rules
-
schema summary (field list + order)
-
where artifacts live locally
-
disclosure status (publishable / not publishable)
05 data generation method
must contain
-
authoritative script/process identification
-
inputs + dependencies
-
execution instructions (re-run steps)
-
output expectations (what files should be produced)
-
risks/assumptions tied to generation (site structure, runtime, etc.)
06 source credibility review
must contain
-
whether credibility review is applicable
-
if applicable: what criteria used + outcome
-
if not applicable: explicit “not primary source” statement
07 public domain review
must contain
-
whether data is publishable
-
what evidence exists on licensing status
-
builder conclusion (publish / do not publish)
-
constraints for downstream use
08 source data identification
must contain
-
source description + canonical URL(s) / identifier(s)
-
population definition (what universe you pulled from)
-
data elements list (fields pulled and what they represent)
-
exclusions and boundaries
Linking pattern (simple and stable)
Use URL slugs that mirror your IDs so machines don’t have to guess:
-
/standards/02_dataset_construction_standard
-
/standards/02_dataset_construction_standard#00_dataset_summary
-
/standards/02_dataset_construction_standard#05_data_generation_method
And for dataset instances:
-
/datasets/aux01/00_dataset_summary
-
/datasets/aux01/05_data_generation_method
Each dataset page can include one line at top:
-
“follows: 02_dataset_construction_standard”