Problem
policyengine_core/variables/config.py:16-29 sets the default dtypes for integer / float variables to int32 / float32:
- int32 caps at 2,147,483,647 (about 2.147e9). National-level aggregates (e.g. total SSA outlays, total payroll) overflow silently above that ceiling.
- float32 carries a 24-bit significand (~7 decimal digits), so whole dollars are exact only up to 2^24 (about $16.8M). Incomes above that lose dollars; dollar-level tax calculations on multi-million-dollar incomes lose cents. 25_000_000 and 25_000_001 are indistinguishable in float32.
Every country package inherits both limits. Because of the related bug H6 (assert_near also uses float32; a fix is in a separate PR), the test suite cannot currently catch dollar-level regressions on large values.
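Both limits can be reproduced with plain NumPy. The snippet below is a minimal illustration, not code from policyengine_core; the variable names and sample values are made up for the demonstration.

```python
import numpy as np

# int32: the largest representable value is 2_147_483_647 (about 2.147e9).
INT32_MAX = int(np.iinfo(np.int32).max)

# Element-wise array arithmetic wraps around without raising.
payroll = np.array([2_000_000_000, 2_000_000_000], dtype=np.int32)
wrapped = payroll + payroll              # true result is 4_000_000_000
widened = payroll.astype(np.int64) * 2   # same arithmetic in int64

# float32: 24 significand bits, so integers are exact only up to 2**24.
a = np.float32(25_000_000)
b = np.float32(25_000_001)
same_in_float32 = bool(a == b)
same_in_float64 = bool(np.float64(25_000_000) == np.float64(25_000_001))

# Gap between adjacent float32 values near $1M and near $25M.
gap_at_1m = float(np.spacing(np.float32(1_000_000)))
gap_at_25m = float(np.spacing(np.float32(25_000_000)))

print(wrapped[0], widened[0])
print(same_in_float32, same_in_float64)
print(gap_at_1m, gap_at_25m)
```

The gap values show why cents are already unreliable at $1M (adjacent float32 values are 0.0625 apart) and why odd dollar amounts disappear at $25M (adjacent values are 2 apart).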
Why this isn't a drop-in fix
Changing the defaults to int64 / float64 would make pre-existing H5 datasets incompatible with new readers and vice versa. Specifically:
- Holder.put_in_cache / set_input currently cast incoming arrays to the variable's dtype.
- Dataset.save writes arrays as-is into H5, so existing datasets contain float32 / int32 arrays.
- Simulation.calculate returns holder.default_array(), which takes its dtype from the variable.
A naive swap would:
- Read existing H5 as float32, then upcast on every read (memory overhead).
- Read back datasets written by the new code as float64, but tax_benefit_system formulas assume the dtype matches variable.dtype.
- Invalidate every on-disk cache (each variable's .npy file stores the array with its original dtype).
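The cache and upcast points can be seen with NumPy alone. This is a rough sketch that uses an in-memory buffer as a stand-in for a cached .npy file; it does not touch Holder or Dataset, and the array contents are arbitrary.

```python
import io

import numpy as np

# Stand-in for an array persisted under the current float32 default.
legacy = np.arange(1_000, dtype=np.float32)

# np.save records the dtype in the .npy header, so a reload returns float32
# regardless of what the variable declares today.
buffer = io.BytesIO()
np.save(buffer, legacy)
buffer.seek(0)
reloaded = np.load(buffer)

# Upcasting on read allocates a second array that is twice as large.
upcast = reloaded.astype(np.float64)

# Upcasting also cannot restore digits that were lost at write time.
stored = np.float32(25_000_001)   # already rounded to 25_000_000
recovered = float(np.float64(stored))

print(reloaded.dtype, upcast.dtype)
print(reloaded.nbytes, upcast.nbytes)
print(recovered)
```

The last line is the reason a migration has to regenerate affected values from source data where possible: promoting a stored float32 array widens the container but keeps the rounded value.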
Proposed migration plan
- Add an opt-in flag on TaxBenefitSystem (e.g. use_extended_precision: bool = False) that forces all newly-built variables to int64 / float64.
- For one release, emit a DeprecationWarning when a country package constructs a variable with the default int32/float32 dtype.
- Bump that default to int64/float64 the release after.
- Provide a migration utility: policyengine-core data migrate-dtype <path-to-h5> that promotes arrays in-place.
- Update country-package CI to validate that their data still round-trips through the new defaults.
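One possible shape for the first two steps and for the promotion logic is sketched below. Everything here is hypothetical: resolve_dtype, promote, and the two mapping tables are illustrative names, not existing policyengine_core APIs, and the real defaults live in policyengine_core/variables/config.py.

```python
import warnings

import numpy as np

# Hypothetical tables mirroring the current and proposed defaults.
LEGACY_DTYPES = {int: np.int32, float: np.float32}
EXTENDED_DTYPES = {int: np.int64, float: np.float64}


def resolve_dtype(value_type, use_extended_precision=False):
    """Pick the array dtype for a variable of the given Python value type."""
    if use_extended_precision:
        return EXTENDED_DTYPES[value_type]
    legacy = LEGACY_DTYPES[value_type]
    # Transition release: keep the old default but tell the caller it will change.
    warnings.warn(
        f"The default {legacy.__name__} dtype is deprecated; "
        "set use_extended_precision=True to opt in to 64-bit dtypes.",
        DeprecationWarning,
        stacklevel=2,
    )
    return legacy


def promote(array):
    """Return a 64-bit copy of a 32-bit array; leave other dtypes untouched."""
    if array.dtype == np.int32:
        return array.astype(np.int64)
    if array.dtype == np.float32:
        return array.astype(np.float64)
    return array


print(resolve_dtype(float, use_extended_precision=True))
print(promote(np.zeros(3, dtype=np.float32)).dtype)
```

Keeping the warning on the legacy path only means packages that have already opted in see no noise, and a migrate-dtype command could apply something like promote to each array it rewrites.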
References
Identified in the 2026-04 bug hunt (finding H5). Related: H6 (assert_near float32 downcast) is being fixed separately in a non-breaking way.