# Validation

Fixed public benchmark suite and generated reports

pharmacoml uses benchmark-gated development. The default hybrid workflow is evaluated on a fixed public suite, and the benchmark runner now writes a reusable Markdown/CSV/JSON report bundle for GitHub and release documentation.

## Run the benchmark suite

```bash
PYTHONPATH=. python benchmarks/run_public_benchmarks.py --check
```

By default this command prints the benchmark summary and writes report artifacts to `benchmarks/reports/fixed_public/`.

## Generated report artifacts

| File | Purpose |
| --- | --- |
| `public_benchmark_report.md` | Human-readable benchmark report for GitHub/docs |
| `public_benchmark_summary.csv` | Variant-level metrics summary |
| `public_benchmark_details.csv` | Per-case performance details |
| `public_benchmark_report.json` | Structured machine-readable report bundle |
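The JSON bundle is convenient for downstream tooling. The sketch below shows one way to consume it; the `suite`, `variants`, `name`, and `mean_f1` keys are illustrative assumptions, not the documented schema, so check the actual file before relying on them.

```python
import json

# Hypothetical miniature of the bundle layout; the real file may differ.
bundle_text = json.dumps({
    "suite": "fixed_public",
    "variants": [
        {"name": "baseline", "mean_f1": 0.78},
        {"name": "combined", "mean_f1": 0.91},
    ],
})

# In practice you would read benchmarks/reports/fixed_public/public_benchmark_report.json.
bundle = json.loads(bundle_text)

# Pick the variant with the highest mean F1 across the suite.
best = max(bundle["variants"], key=lambda v: v["mean_f1"])
print(best["name"])
```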

## Current fixed release suite

The current benchmark-backed default workflow has exact agreement on the real/public PK cases and several targeted synthetic checks, with the remaining gaps concentrated in the hardest collinearity-heavy synthetic scenarios.

| Dataset | Target / published covariates | Current agreement | Source / data |
| --- | --- | --- | --- |
| `pheno` | CL/WGT, VC/WGT, VC/ASPHYXIA | Exact | Pharmpy example model/data |
| `eleveld_union` | A1V2, AGE, HGT, M1F2, PMA, TECH, WGT | Exact | Wahlquist public propofol benchmark repo |
| `ggpmx_theophylline` | CL/AGE0, CL/SEX_1, CL/STUD_2, CL/WT0, V/WT0 | Exact | ggPMX theophylline example files |
| `high_shrinkage_user_input` | CL/WT | Exact | Generated in package |
| `age_pma_distinct` | CL/AGE, CL/PMA, CL/WT | Exact | Generated in package |
| `interaction_xor_screening` | CL/COPD, CL/SMK, CL/COPD__xor__SMK | Exact | Generated in package |
| `asiimwe_correlated_small_n` | CL/CRCL, CL/SEX, CL/WT, V/WT | Partial | Generated in package |
| `shapcov_collinear` | CL/AGE, CL/CRCL, CL/WT, V/WT | Partial | Generated in package |
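The Exact/Partial labels can be understood as a set comparison between the covariates the workflow selects and the published target set. The helper below is a hypothetical illustration of that reading, not the package's own scoring code:

```python
def agreement(selected, published):
    """Classify agreement between a selected and a published covariate set.

    Assumed semantics: "Exact" means the sets match, "Partial" means they
    overlap without matching, "None" means no overlap at all.
    """
    selected, published = set(selected), set(published)
    if selected == published:
        return "Exact"
    if selected & published:
        return "Partial"
    return "None"

print(agreement({"CL/WT"}, {"CL/WT"}))                    # Exact
print(agreement({"CL/WT", "CL/AGE"}, {"CL/WT", "V/WT"}))  # Partial
```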

## How to read the benchmark output

### Primary summary

Compares configuration variants such as baseline, RFE, shrinkage-aware, and combined workflows across the fixed suite.

### Per-case details

Shows precision, recall, F1, and FDR for each benchmark case; use these to see where the workflow helps and where it remains conservative.
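Treating covariate selection as a retrieval problem, the per-case metrics follow the standard definitions. A minimal sketch (the `case_metrics` helper is hypothetical, not the package API):

```python
def case_metrics(selected, published):
    """Standard precision/recall/F1/FDR over covariate sets."""
    selected, published = set(selected), set(published)
    tp = len(selected & published)   # correctly selected covariates
    fp = len(selected - published)   # spurious selections
    fn = len(published - selected)   # missed covariates
    precision = tp / (tp + fp) if selected else 0.0
    recall = tp / (tp + fn) if published else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fdr = fp / (tp + fp) if selected else 0.0  # false discovery rate
    return {"precision": precision, "recall": recall, "f1": f1, "fdr": fdr}

# One true hit (CL/WT), one spurious pick, one miss:
print(case_metrics({"CL/WT", "CL/AGE"}, {"CL/WT", "CL/PMA"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'fdr': 0.5}
```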

The benchmark gate is used to choose defaults. New features should only become default behavior if they improve or preserve the pinned public baseline.
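In gate terms, "improve or preserve" means no tracked metric may fall below the pinned baseline. The sketch below illustrates that policy; the metric names and threshold values are made up for illustration and are not the real pinned baseline:

```python
# Hypothetical pinned baseline; higher is better for both metrics shown here.
PINNED_BASELINE = {"mean_f1": 0.90, "exact_cases": 6}

def passes_gate(candidate, baseline=PINNED_BASELINE):
    """A candidate default must match or beat the baseline on every metric."""
    return all(candidate.get(k, 0) >= v for k, v in baseline.items())

print(passes_gate({"mean_f1": 0.92, "exact_cases": 6}))  # True
print(passes_gate({"mean_f1": 0.88, "exact_cases": 7}))  # False: F1 regressed
```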

## Useful command variants

```bash
# write report bundle to the default location
PYTHONPATH=. python benchmarks/run_public_benchmarks.py
```