Leveraging Cross-Attention and Speech Separation for Enhanced Stress Detection in Children's Multi-Speaker Environments

Phie Chyan; Heni Gerda Pesau; Norbertus Tri Suswanto Saptadi

doi:10.48084/etasr.17765

Authors

Phie Chyan, Department of Informatics, Atma Jaya Makassar University, Makassar, South Sulawesi, Indonesia
Heni Gerda Pesau, Department of Psychology, Atma Jaya Makassar University, Makassar, South Sulawesi, Indonesia
Norbertus Tri Suswanto Saptadi, Department of Informatics, Atma Jaya Makassar University, Makassar, South Sulawesi, Indonesia

Volume: 16 | Issue: 3 | Pages: 35238-35246 | June 2026 | https://doi.org/10.48084/etasr.17765

Received: 26 January 2026 | Revised: 27 February 2026 | Accepted: 15 March 2026 | Online: 6 June 2026

Corresponding author: Phie Chyan

Abstract

Stress detection in children presents unique challenges due to their limited ability to articulate emotional distress, necessitating automated and multimodal assessment approaches. This study presents a framework for stress recognition in noisy, multi-speaker environments by integrating speech separation with cross-attention-based multimodal fusion. The pipeline first employs a speech separation module to disentangle overlapping voices and suppress environmental noise, enabling reliable extraction of discriminative acoustic features. In parallel, transcripts generated via an Automatic Speech Recognition (ASR) system are transformed into linguistic representations using GloVe embeddings enhanced with TF-IDF weighting. The acoustic and linguistic features are projected into a shared latent space and fused through a cross-attention mechanism to model complementary cross-modal interactions. To address domain variability in children's vocal characteristics, the model is pretrained on adult emotional speech data and subsequently fine-tuned on child-specific samples to facilitate domain adaptation. Experimental results demonstrate that the proposed system achieves an accuracy of 89.5%, significantly outperforming unimodal baselines. Ablation studies further validate the critical contributions of speech separation and dynamic multimodal fusion to overall performance. These findings underscore the potential of the proposed framework as a supportive, non-invasive tool for early stress awareness in child-centered environments.
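The linguistic branch and the fusion step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names are hypothetical, the two-dimensional word vectors are toy stand-ins for pretrained GloVe embeddings, the smoothed IDF formula is one common variant chosen here as an assumption, and the learned projections into the shared latent space are omitted (both modalities are assumed to already share one dimensionality). Attention is single-head, with the text vector as the query over acoustic frames.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """TF-IDF weight for each distinct token in doc_tokens (smoothed IDF, an assumption)."""
    n_docs = len(corpus)
    weights = {}
    for tok, count in Counter(doc_tokens).items():
        df = sum(1 for doc in corpus if tok in doc)        # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0      # smoothed inverse document frequency
        weights[tok] = (count / len(doc_tokens)) * idf
    return weights


def weighted_embedding(doc_tokens, corpus, vectors):
    """TF-IDF-weighted average of word vectors; out-of-vocabulary tokens are skipped."""
    weights = tfidf_weights(doc_tokens, corpus)
    dim = len(next(iter(vectors.values())))
    total, norm = [0.0] * dim, 0.0
    for tok, w in weights.items():
        if tok in vectors:
            total = [t + w * v for t, v in zip(total, vectors[tok])]
            norm += w
    return [t / norm for t in total] if norm > 0 else total


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over all keys/values."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        fused.append([sum(a * v[j] for a, v in zip(attn, values))
                      for j in range(len(values[0]))])
    return fused


# Toy transcripts (hypothetical ASR output) and toy word vectors standing in for GloVe.
corpus = [["i", "feel", "scared"], ["i", "feel", "fine"], ["i", "am", "tired"]]
vectors = {"i": [0.0, 0.0], "feel": [0.1, 0.1], "scared": [1.0, 0.0], "fine": [0.0, 1.0]}

# Linguistic representation of the first utterance.
text_vec = weighted_embedding(corpus[0], corpus, vectors)

# Toy acoustic frame features, assumed to already live in the shared latent space.
acoustic_frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The text vector queries the acoustic frames; the result is an attention-weighted frame summary.
fused = cross_attention([text_vec], acoustic_frames, acoustic_frames)[0]
```

In this toy setup the rarer word "scared" receives the largest TF-IDF weight, so the text vector leans toward its direction and the attention output favors the acoustic frame aligned with it. A trained model would replace the identity projections with learned query, key, and value matrices and would typically attend in both directions.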

Keywords

multimodal stress detection, cross-attention mechanism, speech separation, acoustic-linguistic fusion
