AI News Hub

What properties of reasoning supervision are associated with improved downstream model quality?

arXiv
Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

arXiv:2605.13290v1

Abstract: Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.
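The core evaluation the abstract describes — checking whether intrinsic metrics measured on a dataset before training correlate with accuracy measured after fine-tuning — can be sketched as follows. This is an illustrative sketch, not the paper's code: the metric names (`alignment`, `redundancy`), the variant values, and the accuracy numbers are all hypothetical placeholders.

```python
# Sketch: correlate intrinsic, pre-training data metrics with downstream
# accuracy across dataset variants. All names and numbers are hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# One row per dataset variant: intrinsic metrics computed before training,
# plus downstream accuracy observed after fine-tuning (placeholder values).
variants = [
    {"alignment": 0.91, "redundancy": 0.30, "accuracy": 0.62},
    {"alignment": 0.85, "redundancy": 0.45, "accuracy": 0.58},
    {"alignment": 0.70, "redundancy": 0.60, "accuracy": 0.51},
    {"alignment": 0.60, "redundancy": 0.75, "accuracy": 0.47},
]

for metric in ("alignment", "redundancy"):
    r = pearson([v[metric] for v in variants],
                [v["accuracy"] for v in variants])
    print(f"{metric} vs downstream accuracy: r = {r:+.2f}")
```

A strong correlation for a metric would suggest it can stand in for a full fine-tuning run when ranking candidate training sets; per the abstract, which metrics matter is expected to differ between smaller and larger models.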