int32 / float32 default dtypes risk silent overflow and precision loss

## Problem

``policyengine_core/variables/config.py:16-29`` sets the default dtypes for integer / float variables to ``int32`` / ``float32``:

- ``int32`` caps at 2.147e9. National-level aggregates (e.g. total SSA outlays, total payroll) overflow silently above that ceiling.
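- ``float32`` carries ~7 decimal digits of precision (a 24-bit significand). Whole dollars stop being exactly representable above 2^24 = 16,777,216 (~\$16.8M), and the gap between adjacent values already exceeds one cent above roughly \$131k, so dollar-level tax calculations on multi-million-dollar incomes lose cents and, eventually, dollars. 25_000_000 and 25_000_001 are indistinguishable in float32.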
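
Both failure modes can be reproduced with plain NumPy. The snippet below is only an illustrative sketch: the variable names are made up and it does not use any ``policyengine_core`` code.

```python
import numpy as np

# int32: elementwise array arithmetic wraps around without raising or warning.
payroll = np.array([2_000_000_000], dtype=np.int32)
total = payroll + payroll  # the true value is 4_000_000_000
print(total.dtype, total)  # still int32, and the value has wrapped negative

# float32: above 2**24 the gap between adjacent values is larger than 1.
a = np.float32(25_000_000)
b = np.float32(25_000_001)
print(a == b)  # the two incomes compare equal

# np.spacing gives the distance to the next representable value.
gap_at_25m = np.spacing(np.float32(25_000_000))
gap_at_200k = np.spacing(np.float32(200_000))
print(gap_at_25m, gap_at_200k)
```

Note that ``np.sum`` on an ``int32`` array typically accumulates in the platform integer, so whether an aggregate overflows can depend on the operation and the platform, which makes the problem harder to spot.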

Every country package inherits this precision limit. The related bug H6 (``assert_near`` also uses float32, being fixed in a separate PR) means the test suite cannot currently catch dollar-level regressions on large values.

## Why this isn't a drop-in fix

Changing the defaults to ``int64`` / ``float64`` would make pre-existing HDF5 (``.h5``) datasets incompatible with new readers, and newly written datasets incompatible with old readers. Specifically:

- ``Holder.put_in_cache`` / ``set_input`` currently cast incoming arrays to the variable's dtype.
- ``Dataset.save`` writes arrays as-is into H5 — existing datasets contain ``float32`` / ``int32`` arrays.
- ``Simulation.calculate`` returns ``holder.default_array()`` which is shaped by the dtype.

A naive swap would:

1. Read existing H5 as float32, then upcast on every read (memory overhead).
2. Read back datasets written by the new code as float64, but ``tax_benefit_system`` formulas assume the dtype matches ``variable.dtype``.
3. Invalidate every on-disk cache (each variable's ``.npy`` file stores the array with its original dtype).
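
A minimal sketch of why upcasting on read does not help, using ``np.save`` / ``np.load`` on an in-memory buffer as a stand-in for an on-disk ``.npy`` cache file. It is an assumption that the real cache and dataset paths behave the same way; nothing here calls ``policyengine_core``.

```python
import io

import numpy as np

# A value written by "old" code with the float32 default.
old = np.array([25_000_001.0], dtype=np.float32)

# The .npy format records the dtype, so the array reads back as float32.
buffer = io.BytesIO()
np.save(buffer, old)
buffer.seek(0)
cached = np.load(buffer)
print(cached.dtype)

# Upcasting on read doubles the memory but cannot restore the lost dollar.
upcast = cached.astype(np.float64)
print(upcast[0], upcast.nbytes, cached.nbytes)

# The reverse direction: float64 data cast down to float32 by an old reader.
new = np.array([25_000_001.0], dtype=np.float64)
downcast = new.astype(np.float32)
print(float(downcast[0]))
```

Precision lost at write time stays lost, so existing datasets only benefit from wider dtypes once they are regenerated from their original sources.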

## Proposed migration plan

1. Add an opt-in flag on ``TaxBenefitSystem`` — e.g. ``use_extended_precision: bool = False`` — that forces all newly-built variables to ``int64`` / ``float64``.
2. For one release, emit a ``DeprecationWarning`` when a country package constructs a variable with the default ``int32``/``float32`` dtype.
3. Bump that default to ``int64``/``float64`` the release after.
4. Provide a migration utility: ``policyengine-core data migrate-dtype <path-to-h5>`` that promotes arrays in-place.
5. Update country-package CI to validate that their data still round-trips through the new defaults.
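
Steps 1, 2 and 4 could be sketched roughly as follows. Everything here is hypothetical: ``PROMOTIONS``, ``resolve_default_dtype`` and ``promote_array`` are illustrative names rather than existing ``policyengine_core`` APIs, and the sketch works on in-memory arrays only, without reading or writing H5 files.

```python
import warnings

import numpy as np

# Hypothetical mapping from the current defaults to their wider counterparts.
PROMOTIONS = {
    np.dtype(np.int32): np.dtype(np.int64),
    np.dtype(np.float32): np.dtype(np.float64),
}


def resolve_default_dtype(default, use_extended_precision=False):
    """Pick a variable's dtype, honouring the proposed opt-in flag."""
    default = np.dtype(default)
    if default not in PROMOTIONS:
        return default  # bool, str, etc. are left alone
    if use_extended_precision:
        return PROMOTIONS[default]
    # Transition release: keep the narrow dtype but tell the caller.
    warnings.warn(
        f"Default dtype {default} will become {PROMOTIONS[default]} "
        "in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    return default


def promote_array(array):
    """Return the array widened to int64/float64 if it uses a narrow default."""
    array = np.asarray(array)
    target = PROMOTIONS.get(array.dtype)
    return array if target is None else array.astype(target)


print(resolve_default_dtype(np.float32, use_extended_precision=True))
print(promote_array(np.array([1, 2], dtype=np.int32)).dtype)
```

``astype`` returns a copy, so a real ``migrate-dtype`` command would still need to rewrite each dataset in the file rather than literally modifying it in place.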

## References

Identified in the 2026-04 bug hunt (finding H5). Related: H6 (``assert_near`` float32 downcast) is being fixed separately in a non-breaking way.

int32 / float32 default dtypes risk silent overflow and precision loss #467
