
int32 / float32 default dtypes risk silent overflow and precision loss #467

@MaxGhenis


Problem

policyengine_core/variables/config.py:16-29 sets the default dtypes for integer / float variables to int32 / float32:

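  • int32 caps at 2,147,483,647 (about 2.147e9). National-level aggregates (e.g. total SSA outlays, total payroll) exceed that ceiling, and NumPy array arithmetic wraps around silently instead of raising.
  • float32 carries a 24-bit significand, about 7 decimal digits of precision. Above 2^24 (about $16.8M) adjacent representable values are at least $2 apart, so whole dollars are lost: 25_000_000 and 25_000_001 are indistinguishable in float32. Cents go much earlier: at $1M the spacing between representable values is already $0.0625.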
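Both limits can be reproduced with plain NumPy. The snippet below is a minimal illustration, not code from policyengine_core; the array values are made up for the demonstration.

```python
import numpy as np

# int32: array arithmetic wraps around silently past 2**31 - 1.
a = np.array([1_500_000_000], dtype=np.int32)
b = np.array([1_500_000_000], dtype=np.int32)
wrapped = a + b                                        # stays int32, no error raised
correct = a.astype(np.int64) + b.astype(np.int64)      # 3_000_000_000

# float32: adjacent whole dollars above 2**24 collapse to the same value.
same_in_float32 = bool(np.float32(25_000_000) == np.float32(25_000_001))
same_in_float64 = bool(np.float64(25_000_000) == np.float64(25_000_001))

# Gap between neighbouring float32 values at $1M and at $25M.
spacing_1m = float(np.spacing(np.float32(1_000_000)))
spacing_25m = float(np.spacing(np.float32(25_000_000)))

print(wrapped, correct)
print(same_in_float32, same_in_float64)
print(spacing_1m, spacing_25m)
```

Note that `np.sum` on an int32 array may accumulate in the platform integer by default, so the overflow tends to show up in elementwise arithmetic rather than in every reduction.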

Every country package inherits this precision limit. The related bug H6 (assert_near also uses float32, being fixed in a separate PR) means the test suite cannot currently catch dollar-level regressions on large values.

Why this isn't a drop-in fix

Changing the defaults to int64 / float64 would make existing HDF5 (.h5) datasets incompatible with new readers, and datasets written by new code incompatible with old readers. Specifically:

  • Holder.put_in_cache / set_input currently casts incoming arrays to the variable's dtype.
  • Dataset.save writes arrays as-is into H5 — existing datasets contain float32 / int32 arrays.
  • Simulation.calculate returns holder.default_array(), whose dtype follows the variable's dtype.

A naive swap would:

  1. Read existing H5 as float32, then upcast on every read (memory overhead).
  2. Read back datasets written by the new code as float64, but tax_benefit_system formulas assume the dtype matches variable.dtype.
  3. Invalidate every on-disk cache (each variable's .npy file stores the array with its original dtype).
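The storage side of these points can be sketched with NumPy alone. This is an illustration under assumptions: it uses an in-memory buffer in place of a real cache file and does not touch Holder or Dataset.

```python
import io
import numpy as np

# A stand-in for an array written by the current (float32) code.
legacy = np.arange(1_000, dtype=np.float32)

# The .npy format records the dtype, so a round-trip returns float32.
buffer = io.BytesIO()
np.save(buffer, legacy)
buffer.seek(0)
reloaded = np.load(buffer)

# Upcasting on read allocates a second array twice the size.
upcast = reloaded.astype(np.float64)

# Upcasting does not restore precision that was lost when the value was stored.
stored = np.array([25_000_001], dtype=np.float32)
restored = stored.astype(np.float64)

print(reloaded.dtype, reloaded.nbytes, upcast.nbytes)
print(restored[0])
```

The last two lines matter for any migration: promoting an existing float32 array changes its dtype but cannot recover the original dollar amount, so affected values would need to be regenerated from source data.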

Proposed migration plan

  1. Add an opt-in flag on TaxBenefitSystem — e.g. use_extended_precision: bool = False — that forces all newly-built variables to int64 / float64.
  2. For one release, emit a DeprecationWarning when a country package constructs a variable with the default int32/float32 dtype.
  3. Bump that default to int64/float64 the release after.
  4. Provide a migration utility: policyengine-core data migrate-dtype <path-to-h5> that promotes arrays in-place.
  5. Update country-package CI to validate that their data still round-trips through the new defaults.
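Steps 1, 2 and 4 could look roughly like the sketch below. All names here (`LEGACY_DTYPES`, `EXTENDED_DTYPES`, `resolve_dtype`, `promote`) are hypothetical and chosen for illustration; they are not the actual contents of config.py or an existing policyengine-core API.

```python
import warnings
import numpy as np

# Hypothetical dtype tables; the real mapping lives in variables/config.py.
LEGACY_DTYPES = {int: np.int32, float: np.float32}
EXTENDED_DTYPES = {int: np.int64, float: np.float64}


def resolve_dtype(value_type, use_extended_precision=False):
    """Pick the array dtype for a variable, warning on the legacy default."""
    if use_extended_precision:
        return EXTENDED_DTYPES[value_type]
    warnings.warn(
        "Default int32/float32 dtypes are deprecated; "
        "set use_extended_precision=True to opt in to int64/float64.",
        DeprecationWarning,
        stacklevel=2,
    )
    return LEGACY_DTYPES[value_type]


# Promotion table a migration utility might apply to each stored array.
PROMOTIONS = {
    np.dtype(np.int32): np.int64,
    np.dtype(np.float32): np.float64,
}


def promote(array):
    """Return a widened copy for int32/float32 input, else the array unchanged."""
    target = PROMOTIONS.get(array.dtype)
    return array if target is None else array.astype(target)
```

Keeping the promotion table explicit means booleans, strings and enum-encoded arrays pass through untouched, which is one way to limit the blast radius of a migration.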

References

Identified in the 2026-04 bug hunt (finding H5). Related: H6 (assert_near float32 downcast) is being fixed separately in a non-breaking way.
