Layer-Wise Probing of Paralinguistic Attributes in Fine-Tuned Whisper for Kazakh Speech
Corresponding author: Shirali Kadyrov
Abstract
Large pre-trained speech models such as Whisper are now widely used for speech recognition and related tasks, yet how paralinguistic information, including emotion and speaker characteristics, is distributed across model layers remains unclear, particularly for low-resource languages. This study probes each encoder layer of a Kazakh-adapted Whisper model for its ability to recognize emotional expression, speaker identity, age, and gender. We extract fixed-length representations from every encoder layer and evaluate them with both linear and Multilayer Perceptron (MLP) probes, reporting accuracy, macro-averaged F1-score (Macro-F1), and balanced accuracy, and using non-parametric statistical tests to assess the significance of differences across layers. Emotion recognition is evaluated on KazEmoTTS, while Common Voice (Kazakh) serves for speaker identification and demographic attribute analysis. The results show that age and gender information is strongly encoded at all layers, with little change in representation across depth; speaker identity exhibits statistically significant but weak variation between layers; and emotion information concentrates in the middle layers, where probing is most effective. These findings clarify how Whisper represents Kazakh speech and help practitioners select appropriate layers for paralinguistic speech applications.
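The layer-wise probing pipeline described above can be illustrated with a minimal sketch. The snippet below uses the Hugging Face transformers and scikit-learn libraries; the checkpoint name, mean-pooling over frames, and probe hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of layer-wise probing of a Whisper encoder.
# Assumptions: "openai/whisper-small" stands in for the Kazakh-adapted
# checkpoint; mean pooling and logistic-regression settings are illustrative.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from transformers import WhisperFeatureExtractor, WhisperModel

MODEL_NAME = "openai/whisper-small"  # placeholder for the fine-tuned model
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)
encoder = WhisperModel.from_pretrained(MODEL_NAME).encoder.eval()

@torch.no_grad()
def layer_embeddings(waveform: np.ndarray, sr: int = 16_000) -> list[np.ndarray]:
    """Return one mean-pooled, fixed-length vector per encoder layer."""
    inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    out = encoder(inputs.input_features, output_hidden_states=True)
    # out.hidden_states: tuple of (1, frames, dim) tensors,
    # one per layer plus the initial embedding output.
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def probe_layer(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit a linear probe on frozen features; report the abstract's metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return {
        "accuracy": clf.score(X_te, y_te),
        "macro_f1": f1_score(y_te, pred, average="macro"),
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
    }
```

Stacking the per-layer vectors for a labeled corpus and calling probe_layer once per layer yields the depth-wise metric curves; an MLP probe would substitute a small feed-forward classifier (e.g., scikit-learn's MLPClassifier) for the logistic regression.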
Keywords:
layer-wise probing, speech foundation models, Whisper, emotion recognition, speaker attributes, speaker identification, age prediction, gender prediction, low-resource languages, Kazakh speech
References
S. Khan, S. A. Ali, and J. Sallar, "Analysis of Children's Prosodic Features Using Emotion Based Utterances in Urdu Language," Engineering, Technology & Applied Science Research, vol. 8, no. 3, pp. 2954–2957, June 2018. DOI: https://doi.org/10.48084/etasr.1902
Z. Zhu and Y. Sato, "Deep Investigation of Intermediate Representations in Self-Supervised Learning Models for Speech Emotion Recognition," in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, Rhodes Island, Greece, 2023, pp. 1–5. DOI: https://doi.org/10.1109/ICASSPW59220.2023.10193018
M. Kim, J. S. Um, and H. Kim, "How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer." arXiv, Jan. 25, 2026.
H. Roubhi, A. H. Gharbi, K. Rouabah, and P. Ravier, "Mutual Information-based Feature Selection Strategy for Speech Emotion Recognition using Machine Learning Algorithms Combined with the Voting Rules Method," Engineering, Technology & Applied Science Research, vol. 15, no. 1, pp. 19207–19213, Feb. 2025. DOI: https://doi.org/10.48084/etasr.9066
S. Yang et al., "A Large-Scale Evaluation of Speech Foundation Models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2884–2899, 2024. DOI: https://doi.org/10.1109/TASLP.2024.3389631
A. Pasad, B. Shi, and K. Livescu, "Comparative Layer-Wise Analysis of Self-Supervised Speech Models," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5. DOI: https://doi.org/10.1109/ICASSP49357.2023.10096149
A. Singh and A. Gupta, "Decoding Emotions: A Comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition." arXiv, Aug. 17, 2023.
S. G. Upadhyay, C. Busso, and C.-C. Lee, "A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition," in Proceedings of Interspeech 2024, Kos Island, Greece, 2024, pp. 4693–4697. DOI: https://doi.org/10.21437/Interspeech.2024-469
A. Y. F. Chiu, K. C. Fung, R. T. Y. Li, J. Li, and T. Lee, "A Large-Scale Probing Analysis of Speaker-Specific Attributes in Self-Supervised Speech Representations." arXiv, Sept. 18, 2025.
A. Waheed, H. Atwany, B. Raj, and R. Singh, "What Do Speech Foundation Models Not Learn About Speech?" arXiv, Oct. 16, 2024.
B. Kynabay, A. Aldabergen, S. Kadyrov, and A. Shalkarbay-Uly, "Fine-tuning OpenAI's Whisper Model for Kazakh Speech Recognition." ResearchGate, Dec. 13, 2025.
A. Abilbekov, S. Mussakhojayeva, R. Yeshpanov, and H. A. Varol, "KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis," in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 2024, pp. 9626–9632. DOI: https://doi.org/10.63317/4fmgnmr2hfia
R. Ardila et al., "Common Voice: A Massively-Multilingual Speech Corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 4218–4222.
M. Hollander, D. A. Wolfe, and E. Chicken, Nonparametric Statistical Methods, 3rd ed. Hoboken, NJ, USA: John Wiley & Sons, 2013.
M. Tomczak and E. Tomczak, "The need to report effect size estimates revisited. An overview of some recommended measures of effect size," Trends in Sport Sciences, vol. 21, no. 1, pp. 19–25, Feb. 2014.
License
Copyright (c) 2026 Aimoldir Aldabergen, Bakdaulet Kynabay, Shirali Kadyrov

This work is licensed under a Creative Commons Attribution 4.0 International License.
