Why Model Choice Matters in Targeting Australian Financial Hardship
Australian welfare agencies increasingly use predictive models to target financial-hardship support (KiwiSaver-style supplements, food vouchers, hardship grants) at the household level. The standard model-evaluation literature in this setting privileges discrimination metrics (AUC, C-statistic), which capture how well a model orders households by risk, sometimes alongside the Brier score, an overall accuracy measure that blends discrimination and calibration without separating them. None of these isolates a second property that matters as much for policy: calibration, the agreement between the predicted probability of an event and the observed event rate at every level of predicted risk. A model that discriminates well but is poorly calibrated ranks households correctly yet misstates their absolute risk, so a rule that allocates support above a risk threshold may concentrate intervention on people who will not actually experience the predicted outcome, or under-allocate to people who will.
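The gap between ranking and calibration can be illustrated with a small simulation. The sketch below is not drawn from the HILDA panel: the risk distribution, the sample size, and the square-root distortion are all assumptions chosen for illustration. It applies a strictly monotone transform to a set of well-calibrated risks, which leaves the ordering (and therefore the AUC) unchanged while inflating every predicted probability.

```python
import random

def auc(y, p):
    """Rank-based AUC (Mann-Whitney form); assumes no tied predictions."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    rank_sum_pos = sum(rank for rank, i in enumerate(order, start=1) if y[i] == 1)
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Simulated households (illustrative only): each has a true event probability.
rng = random.Random(0)
true_risk = [rng.uniform(0.01, 0.6) for _ in range(20000)]
y = [1 if rng.random() < r else 0 for r in true_risk]

# A strictly increasing distortion: same ranking, inflated probabilities.
inflated = [r ** 0.5 for r in true_risk]

auc_true = auc(y, true_risk)
auc_inflated = auc(y, inflated)

# Mean predicted risk minus observed event rate, on the probability scale.
observed_rate = sum(y) / len(y)
gap_true = sum(true_risk) / len(true_risk) - observed_rate
gap_inflated = sum(inflated) / len(inflated) - observed_rate

print(f"AUC: {auc_true:.4f} vs {auc_inflated:.4f}")
print(f"mean predicted minus observed: {gap_true:+.3f} vs {gap_inflated:+.3f}")
```

Under these assumptions the two sets of predictions are indistinguishable on AUC, while the inflated set overstates the average event rate by a wide margin. A discrimination-only evaluation would not detect the difference.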
This paper asks whether the choice of survival-ML model materially changes the calibration of post-shock financial-hardship predictions, and whether the model that discriminates best is also the model that is best calibrated. The conceptual frame is the argument of Van Calster, McLernon, van Smeden, Wynants and Steyerberg that calibration is the under-attended Achilles heel of predictive analytics. The methodological frame combines the Integrated Calibration Index of Austin and Steyerberg with the calibration-in-the-large / calibration-slope decomposition of Crowson, Atkinson and Therneau, applied in a competing-risks context per Fine and Gray.
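To make the three calibration quantities concrete, the following sketch computes them for a binary outcome at a single fixed horizon. It is a simplification, not the paper's estimator: it ignores censoring and competing risks, it uses simulated data, and it substitutes a Gaussian kernel smoother (with an assumed bandwidth of 0.05) for the loess smoother used by Austin and Steyerberg. All function names are hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def calibration_slope(y, p, n_iter=25):
    """Newton-Raphson fit of logit P(y=1) = a + b * logit(p); returns (a, b)."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def calibration_in_the_large(y, p, n_iter=25):
    """Intercept of a logistic model with logit(p) as a fixed offset (slope = 1)."""
    x = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(n_iter):
        mu = [expit(a + xi) for xi in x]
        grad = sum(yi - mi for yi, mi in zip(y, mu))
        hess = sum(mi * (1.0 - mi) for mi in mu)
        a += grad / hess
    return a

def ici(y, p, bandwidth=0.05, grid_size=100):
    """Mean |smoothed observed rate - predicted risk|; kernel smoother stands in for loess."""
    lo, hi = min(p), max(p)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + step * k for k in range(grid_size)]
    smooth = []
    for g in grid:
        w = [math.exp(-0.5 * ((g - pi) / bandwidth) ** 2) for pi in p]
        smooth.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    total = 0.0
    for pi in p:
        # Linear interpolation of the smoothed curve at each prediction.
        k = min(int((pi - lo) / step), grid_size - 2)
        frac = (pi - grid[k]) / step
        observed = smooth[k] + frac * (smooth[k + 1] - smooth[k])
        total += abs(observed - pi)
    return total / len(p)

# Simulated data (illustrative only): true logits, and predictions whose
# logits are doubled, mimicking an overfit model with too-extreme risks.
rng = random.Random(1)
lp = [rng.gauss(-1.5, 1.0) for _ in range(5000)]
true_risk = [expit(v) for v in lp]
y = [1 if rng.random() < r else 0 for r in true_risk]
too_extreme = [expit(2.0 * v) for v in lp]

intercept, slope = calibration_slope(y, too_extreme)
citl = calibration_in_the_large(y, too_extreme)
ici_good = ici(y, true_risk)
ici_bad = ici(y, too_extreme)
print(f"slope={slope:.2f}  CITL={citl:+.2f}  ICI good={ici_good:.3f}  ICI bad={ici_bad:.3f}")
```

A calibration slope below 1 indicates predictions that are too extreme, a non-zero calibration-in-the-large indicates systematic over- or under-prediction, and the ICI summarises the average absolute miscalibration across the whole risk range. In a survival setting with competing risks, the observed rates would have to come from an estimator that handles censoring, which this sketch does not attempt.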
The empirical setting is the rebuilt HILDA analytical panel covering waves 1–22 (2001–2022): 454,861 person-wave rows with 140 columns of pre-shock household covariates, validated against Module C2 hardship items as the competing-risks outcome family. We train three model classes: Cox proportional hazards (the discipline's default); DeepHit (Lee, Zame, Yoon, van der Schaar 2018), a deep-learning competing-risks survival model with no proportional-hazards assumption; and a hybrid Cox + DeepHit ensemble that feeds the Cox linear predictor as an input feature to the DeepHit head. Train / validation / test splits preserve the temporal structure of the panel to avoid leakage.
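One way such a split and the hybrid feature could be organised is sketched below. The wave cut-offs (train through wave 16, validation through wave 19, test thereafter), the covariate names, and the coefficient values are all hypothetical placeholders, not taken from the paper. The sketch assigns person-wave rows to splits by wave number rather than at random, then appends a Cox-style linear predictor as an extra column that a downstream network could consume.

```python
def temporal_split(rows, train_end, valid_end):
    """Assign person-wave rows to splits by wave number, never by random draw."""
    splits = {"train": [], "valid": [], "test": []}
    for row in rows:
        if row["wave"] <= train_end:
            splits["train"].append(row)
        elif row["wave"] <= valid_end:
            splits["valid"].append(row)
        else:
            splits["test"].append(row)
    return splits

def with_cox_feature(row, coefs):
    """Return a copy of the row with the linear predictor sum(beta * x) appended."""
    linear_predictor = sum(beta * row[name] for name, beta in coefs.items())
    return {**row, "cox_lp": linear_predictor}

# Toy panel (illustrative only): 3 people observed in each of 22 waves.
rows = [
    {"person_id": pid, "wave": wave, "income_z": 0.1 * pid - 0.05 * wave, "renter": pid % 2}
    for pid in range(1, 4)
    for wave in range(1, 23)
]

splits = temporal_split(rows, train_end=16, valid_end=19)

# Placeholder coefficients; a real pipeline would estimate them on training waves.
placeholder_coefs = {"income_z": -0.4, "renter": 0.3}
train_aug = [with_cox_feature(r, placeholder_coefs) for r in splits["train"]]

print({name: len(part) for name, part in splits.items()})
print(train_aug[0])
```

Every validation and test row in this arrangement comes from a later wave than every training row. If the Cox coefficients were estimated on all waves before being passed to the second-stage model, information from the test period could leak into training, so fitting them on the training waves only seems the safer design.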
The contribution is policy-grounded: we report not only which model wins on AUC, but which wins on ICI, on calibration-in-the-large at the high-risk decile (where welfare targeting concentrates), and on the calibration slope. The paper is positioned as a methodological flagship for the broader programme on household-finance ML and as the calibration backbone for the three companion HILDA 2026 papers on hardship sequencing, residential mobility, and health-shock pathways, all of which share the same rebuilt analytical panel. Submission target: Review of Financial Studies / Journal of Banking & Finance.