Layer-Wise Probing of Paralinguistic Attributes in Fine-Tuned Whisper for Kazakh Speech
Corresponding author: Shirali Kadyrov
Abstract
Large pre-trained speech models such as Whisper are now widely used for speech recognition and related tasks, yet how paralinguistic information, including emotion and speaker characteristics, is distributed across model layers remains unclear, particularly for low-resource languages. This study probes each encoder layer of a Kazakh-adapted Whisper model for its ability to recognize emotional expression, speaker identity, age, and gender. We extract fixed-length representations from every encoder layer and evaluate them with both linear and Multilayer Perceptron (MLP) probes, reporting accuracy, macro-averaged F1-score (Macro-F1), and balanced accuracy, and using non-parametric statistical tests to assess the significance of differences across layers. Emotion recognition is evaluated on KazEmoTTS, while Common Voice (Kazakh) serves for speaker identification and demographic attribute analysis. The results show that age and gender information is strongly encoded at all layers, with little change in representation across depth; speaker identity exhibits statistically significant but weak variation between layers; and emotion information concentrates in the middle layers, where probing is most effective. These findings clarify how Whisper represents Kazakh speech and help practitioners select appropriate layers for paralinguistic speech applications.
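The layer-wise probing pipeline described above can be illustrated with a minimal sketch. The snippet below uses the Hugging Face transformers and scikit-learn libraries; the checkpoint name, mean-pooling over frames, and probe hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of layer-wise probing of a Whisper encoder.
# Assumptions: "openai/whisper-small" stands in for the Kazakh-adapted
# checkpoint; mean pooling and logistic-regression settings are illustrative.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from transformers import WhisperFeatureExtractor, WhisperModel

MODEL_NAME = "openai/whisper-small"  # placeholder for the fine-tuned model
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)
encoder = WhisperModel.from_pretrained(MODEL_NAME).encoder.eval()

@torch.no_grad()
def layer_embeddings(waveform: np.ndarray, sr: int = 16_000) -> list[np.ndarray]:
    """Return one mean-pooled, fixed-length vector per encoder layer."""
    inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    out = encoder(inputs.input_features, output_hidden_states=True)
    # out.hidden_states: tuple of (1, frames, dim) tensors,
    # one per layer plus the initial embedding output.
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def probe_layer(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit a linear probe on frozen features; report the abstract's metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return {
        "accuracy": clf.score(X_te, y_te),
        "macro_f1": f1_score(y_te, pred, average="macro"),
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
    }
```

Stacking the per-layer vectors for a labeled corpus and calling probe_layer once per layer yields the depth-wise metric curves; an MLP probe would substitute a small feed-forward classifier (e.g., scikit-learn's MLPClassifier) for the logistic regression.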
Keywords:
layer-wise probing, speech foundation models, Whisper, emotion recognition, speaker attributes, speaker identification, age prediction, gender prediction, low-resource languages, Kazakh speech
References
S. Khan, S. A. Ali, and J. Sallar, "Analysis of Children's Prosodic Features Using Emotion Based Utterances in Urdu Language," Engineering, Technology & Applied Science Research, vol. 8, no. 3, pp. 2954–2957, June 2018. DOI: https://doi.org/10.48084/etasr.1902
Z. Zhu and Y. Sato, "Deep Investigation of Intermediate Representations in Self-Supervised Learning Models for Speech Emotion Recognition," in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, Rhodes Island, Greece, 2023, pp. 1–5. DOI: https://doi.org/10.1109/ICASSPW59220.2023.10193018
M. Kim, J. S. Um, and H. Kim, "How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer." arXiv, Jan. 25, 2026.
H. Roubhi, A. H. Gharbi, K. Rouabah, and P. Ravier, "Mutual Information-based Feature Selection Strategy for Speech Emotion Recognition using Machine Learning Algorithms Combined with the Voting Rules Method," Engineering, Technology & Applied Science Research, vol. 15, no. 1, pp. 19207–19213, Feb. 2025. DOI: https://doi.org/10.48084/etasr.9066
S. Yang et al., "A Large-Scale Evaluation of Speech Foundation Models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2884–2899, 2024. DOI: https://doi.org/10.1109/TASLP.2024.3389631
A. Pasad, B. Shi, and K. Livescu, "Comparative Layer-Wise Analysis of Self-Supervised Speech Models," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5. DOI: https://doi.org/10.1109/ICASSP49357.2023.10096149
A. Singh and A. Gupta, "Decoding Emotions: A Comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition." arXiv, Aug. 17, 2023.
S. G. Upadhyay, C. Busso, and C.-C. Lee, "A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition," in Proceedings of Interspeech 2024, Kos Island, Greece, 2024, pp. 4693–4697. DOI: https://doi.org/10.21437/Interspeech.2024-469
A. Y. F. Chiu, K. C. Fung, R. T. Y. Li, J. Li, and T. Lee, "A Large-Scale Probing Analysis of Speaker-Specific Attributes in Self-Supervised Speech Representations." arXiv, Sept. 18, 2025.
A. Waheed, H. Atwany, B. Raj, and R. Singh, "What Do Speech Foundation Models Not Learn About Speech?" arXiv, Oct. 16, 2024.
B. Kynabay, A. Aldabergen, S. Kadyrov, and A. Shalkarbay-Uly, "Fine-tuning OpenAI's Whisper Model for Kazakh Speech Recognition." ResearchGate, Dec. 13, 2025.
A. Abilbekov, S. Mussakhojayeva, R. Yeshpanov, and H. A. Varol, "KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis," in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 2024, pp. 9626–9632. DOI: https://doi.org/10.63317/4fmgnmr2hfia
R. Ardila et al., "Common Voice: A Massively-Multilingual Speech Corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 4218–4222.
M. Hollander, D. A. Wolfe, and E. Chicken, Nonparametric Statistical Methods, 3rd ed. Hoboken, NJ, USA: John Wiley & Sons, 2013.
M. Tomczak and E. Tomczak, "The need to report effect size estimates revisited. An overview of some recommended measures of effect size," Trends in Sport Sciences, vol. 21, no. 1, pp. 19–25, Feb. 2014.
License
Copyright (c) 2026 Aimoldir Aldabergen, Bakdaulet Kynabay, Shirali Kadyrov

This work is licensed under a Creative Commons Attribution 4.0 International License.
