In Preparation Tier A*

Calibrated Predictions for Welfare Decisions

Why Model Choice Matters in Targeting Australian Financial Hardship

Balloch, A.

In preparation. This paper is currently being drafted for the HILDA Survey Research Conference 2026 and a Tier-A finance journal. Findings reported below are from completed analysis runs; the manuscript and external links (SSRN, ResearchGate, journal DOI) will be added on first public release. Contact the author for the working draft.

Abstract

Australian welfare agencies increasingly use predictive models to target financial-hardship support (KiwiSaver-style supplements, food vouchers, hardship grants) at the household level. The standard model-evaluation literature in this setting privileges discrimination metrics (AUC, C-statistic), which capture how well a model orders households by risk; the Brier score, where it is reported, blends discrimination and calibration into a single number without separating them. What goes under-reported is a second property that matters as much for policy: calibration, the agreement between the predicted probability of an event and the observed event rate at every level of predicted risk. A model that discriminates well but is poorly calibrated may concentrate intervention on people who will not actually experience the predicted outcome, or under-allocate to people who will.
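The gap between discrimination and calibration can be illustrated with a toy example. The sketch below is not drawn from the paper's analysis: the ten households, their outcomes, and both sets of predicted risks are invented for illustration. Model B applies a monotone transform (squaring) to Model A's probabilities, so the ranking of households is unchanged while the risk levels shift.

```python
from itertools import product

def auc(scores, outcomes):
    """Share of (event, non-event) pairs in which the event case is scored higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 10 households, 4 of which experience hardship.
outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
model_a = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.75]
# Model B squares every probability: identical ordering, systematically lower risks.
model_b = [p ** 2 for p in model_a]

auc_a, auc_b = auc(model_a, outcomes), auc(model_b, outcomes)
event_rate = sum(outcomes) / len(outcomes)
mean_a = sum(model_a) / len(model_a)
mean_b = sum(model_b) / len(model_b)

print(f"AUC  A={auc_a:.3f}  B={auc_b:.3f}")
print(f"Mean predicted risk  A={mean_a:.3f}  B={mean_b:.3f}  observed={event_rate:.3f}")
```

Both models receive the same AUC because AUC depends only on the ordering of scores, yet Model B's average predicted risk falls well below the observed event rate. An agency that sets eligibility by a probability threshold would therefore treat the two models very differently.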

This paper asks whether the choice of survival-ML model materially changes the calibration of post-shock financial-hardship predictions, and whether the discrimination-best model is also the calibration-best. The conceptual frame is the Van Calster, McLernon, van Smeden, Wynants and Steyerberg argument that calibration is the under-attended Achilles heel of predictive analytics; the methodological frame is the Austin and Steyerberg Integrated Calibration Index plus the Crowson, Atkinson and Therneau calibration-in-the-large / calibration-slope decomposition, applied in a competing-risks context per Fine and Gray.
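As a rough illustration of what the Integrated Calibration Index measures, the sketch below computes the mean absolute gap between each predicted risk and a smoothed estimate of the observed event rate at that risk. It makes several simplifying assumptions that are not the paper's method: the outcome is a binary event at a fixed horizon (censoring and competing risks are ignored), the smoother is a Gaussian kernel average swapped in for the loess fit used by Austin and Steyerberg, and the data are simulated. The function names and the bandwidth of 0.1 are hypothetical choices.

```python
import math
import random

def smoothed_event_rate(p, preds, outcomes, bandwidth=0.1):
    """Gaussian-kernel (Nadaraya-Watson) estimate of the observed event rate near predicted risk p."""
    weights = [math.exp(-0.5 * ((p - q) / bandwidth) ** 2) for q in preds]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)

def integrated_calibration_index(preds, outcomes, bandwidth=0.1):
    """Mean absolute difference between predicted risk and the smoothed observed rate."""
    gaps = [abs(p - smoothed_event_rate(p, preds, outcomes, bandwidth)) for p in preds]
    return sum(gaps) / len(gaps)

# Simulated data (assumption): outcomes are drawn from known true risks.
rng = random.Random(0)
true_risk = [rng.uniform(0.05, 0.6) for _ in range(400)]
outcomes = [1 if rng.random() < r else 0 for r in true_risk]

calibrated = true_risk                               # predictions equal the true risks
inflated = [min(0.99, 1.8 * r) for r in true_risk]   # every risk overstated

ici_cal = integrated_calibration_index(calibrated, outcomes)
ici_inf = integrated_calibration_index(inflated, outcomes)
print(f"ICI calibrated={ici_cal:.3f}  inflated={ici_inf:.3f}")
```

The inflated predictions preserve the ordering of the true risks, so discrimination is untouched, but their ICI is much larger. A production analysis would replace the kernel average with a loess or spline smoother and handle censoring and competing events explicitly.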

The empirical setting is the rebuilt HILDA analytical panel covering waves 1–22 (2001–2022): 454,861 person-wave rows with 140 columns of pre-shock household covariates, with the competing-risks outcome family built from Module C2 hardship items. We train three model classes: Cox proportional hazards (the discipline's default); DeepHit (Lee, Zame, Yoon, van der Schaar 2018), a deep-learning competing-risks survival model with no proportional-hazards assumption; and a hybrid Cox + DeepHit ensemble that passes the Cox linear predictor as an input feature to the DeepHit head. Train / validation / test splits preserve the temporal structure of the panel to avoid leakage.
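A minimal sketch of a wave-based split and of the hybrid feature construction follows. Everything specific in it is an assumption for illustration: the wave cutoffs (16 and 19), the column names `person_id`, `wave`, `debt_ratio` and `unemployed`, and the coefficient values are hypothetical and are not taken from the paper's code base. The Cox coefficients would in practice come from a model fitted on the training split only.

```python
def temporal_split(rows, train_last_wave=16, val_last_wave=19):
    """Assign person-wave rows to train/val/test by wave, so later waves never inform earlier ones."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        if row["wave"] <= train_last_wave:
            splits["train"].append(row)
        elif row["wave"] <= val_last_wave:
            splits["val"].append(row)
        else:
            splits["test"].append(row)
    return splits

def add_cox_linear_predictor(rows, coefs):
    """Return copies of rows with the Cox linear predictor (sum of coef * covariate) appended."""
    return [dict(row, cox_lp=sum(b * row[name] for name, b in coefs.items())) for row in rows]

# Toy panel (hypothetical): three people observed in waves 1-22 with two made-up covariates.
rows = [
    {"person_id": pid, "wave": w, "debt_ratio": 0.1 * pid, "unemployed": w % 2}
    for pid in (1, 2, 3)
    for w in range(1, 23)
]
splits = temporal_split(rows)

# Hypothetical coefficients standing in for a Cox model fitted on the training split.
coefs = {"debt_ratio": 0.8, "unemployed": 0.5}
train_features = add_cox_linear_predictor(splits["train"], coefs)

print({name: len(part) for name, part in splits.items()})
print(train_features[0])
```

Splitting by wave keeps every validation and test row strictly later than the training rows. It does not by itself address the same person appearing in more than one split, or outcome windows that straddle a cutoff; those need separate handling in a real pipeline.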

The contribution is policy-grounded: we report not only which model wins on AUC, but also which wins on ICI, on calibration-in-the-large at the high-risk decile (where welfare targeting concentrates), and on the calibration slope. The paper is positioned as a methodological flagship for the broader programme on household-finance ML and as the calibration backbone for the three companion HILDA 2026 papers on hardship sequencing, residential mobility, and health-shock pathways, all of which share the same rebuilt analytical panel. Submission target: Review of Financial Studies, with Journal of Banking & Finance as a fallback.
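The two remaining quantities can be sketched for a binary outcome at a fixed horizon. This is a simplified analogue, not the paper's implementation: it ignores censoring and competing risks, it measures calibration-in-the-large in the top decile as the observed event rate minus the mean predicted risk, and it estimates the calibration slope by regressing the outcome on the logit of the prediction with a hand-written Newton-Raphson fit. The data are simulated and all function names are hypothetical.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def top_decile_gap(preds, outcomes):
    """Observed event rate minus mean predicted risk among the 10% highest-risk cases."""
    ranked = sorted(zip(preds, outcomes), reverse=True)
    top = ranked[: max(1, len(ranked) // 10)]
    observed = sum(y for _, y in top) / len(top)
    expected = sum(p for p, _ in top) / len(top)
    return observed - expected

def calibration_slope(preds, outcomes, iters=25):
    """Fit outcome ~ a + b * logit(pred) by Newton-Raphson; b is the calibration slope."""
    x = [logit(p) for p in preds]
    a, b = 0.0, 1.0
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b).
        g0 = sum(y - m for y, m in zip(outcomes, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(outcomes, mu, x))
        # Fisher information matrix entries.
        w = [m * (1.0 - m) for m in mu]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system directly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated data (assumption): outcomes are drawn from known true risks.
rng = random.Random(0)
true_risk = [rng.uniform(0.02, 0.7) for _ in range(5000)]
outcomes = [1 if rng.random() < r else 0 for r in true_risk]

# "Too extreme" predictions: the logit of every risk is doubled.
extreme = [1.0 / (1.0 + math.exp(-2.0 * logit(r))) for r in true_risk]

a_cal, b_cal = calibration_slope(true_risk, outcomes)
a_ext, b_ext = calibration_slope(extreme, outcomes)
gap_cal = top_decile_gap(true_risk, outcomes)
gap_ext = top_decile_gap(extreme, outcomes)
print(f"slope calibrated={b_cal:.2f}  extreme={b_ext:.2f}")
print(f"top-decile gap calibrated={gap_cal:+.3f}  extreme={gap_ext:+.3f}")
```

A slope near 1 indicates predictions with the right spread, while a slope well below 1 indicates predictions that are too extreme. A negative top-decile gap means the highest-risk group experiences fewer events than predicted, which is the region where a targeting rule would concentrate support.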

Data & Methods

Data Source
Rebuilt HILDA analytical panel (waves 1–22, 2001–2022; 454,861 person-wave rows × 140 columns) with competing-risks outcome family constructed from HILDA Module C2 hardship items
Methods (existing)
Cox proportional hazards; DeepHit deep-learning survival neural network (Lee, Zame, Yoon, van der Schaar 2018); hybrid Cox + DeepHit ensemble; Integrated Calibration Index (Austin & Steyerberg 2019); calibration-in-the-large and calibration slope (Crowson, Atkinson, Therneau 2016); competing-risks subdistribution (Fine & Gray 1999); train/val/test split with temporal-leakage controls
Primary target
Review of Financial Studies (fallback: Journal of Banking & Finance, Review of Finance)
Preparation status
Code base complete (Cox + DeepHit + hybrid training scripts); cross-model predictions extracted to test/val CSVs; calibration_summary.md written; trained weight checkpoints saved 2026-05-10. Abstract draft pending.