Robustness Analysis of a Lightweight Audio Spectrogram Transformer for Speaker Identification in Noisy Conditions

I Kadek Arya Sugianta; Gde Palguna Reganata

doi:10.14421/jiska.6170

Authors

I Kadek Arya Sugianta, Universitas Bali Internasional
Gde Palguna Reganata, Universitas Bali Internasional

Keywords:

Speaker Identification, Audio Spectrogram Transformer, Edge AI, Robustness Analysis, Deep Learning

Abstract

Deploying deep learning models for speaker identification on devices with limited computational resources requires substantial architectural optimization. This study evaluates the performance and robustness of a Lightweight Audio Spectrogram Transformer (AST) architecture compressed to 570,536 parameters. The proposed method feeds low-resolution Mel-spectrogram representations (64x64 pixels) to a global self-attention mechanism. Testing used a 5-fold cross-validation scheme on a dataset injected with non-stationary environmental noise from the ESC-50 corpus at several Signal-to-Noise Ratio (SNR) levels. Under ideal conditions, the model achieves an average validation accuracy of 70.86% ± 2.69% and a macro-average F1-score of 0.68 ± 0.03. However, its performance falls sharply to 17.61% at an SNR of 5 dB and to 9.21% at an SNR of 0 dB. These findings reveal a critical trade-off: radical parameter compression removes the spectral feature redundancy that acts as an implicit noise filter. The study concludes that although lightweight Transformer mechanisms are highly efficient for Edge AI, pre-processing modules or noise-robust training strategies are necessary to maintain identification integrity in noisy real-world environments.
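
The abstract does not state how the noise was scaled to reach each SNR level. As a minimal sketch, assuming the common power-ratio definition SNR_dB = 10 log10(P_speech / P_noise), the injection step could look like the following. The function names `mix_at_snr` and `measured_snr_db`, and the choice to loop short noise clips to the utterance length, are illustrative assumptions and not the authors' implementation.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Assumption: SNR is defined on mean signal power over the whole utterance.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Assumption: loop the noise clip if it is shorter than the utterance,
    # then trim it to exactly the utterance length.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent; cannot scale it to a target SNR")

    # Noise power needed for the target SNR, and the amplitude gain that yields it.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


def measured_snr_db(speech, added_noise):
    """Return the SNR in dB between a clean signal and the noise added to it."""
    speech = np.asarray(speech, dtype=np.float64)
    added_noise = np.asarray(added_noise, dtype=np.float64)
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(added_noise ** 2))
```

Calling `mix_at_snr(clean, noise_clip, 5)` and `mix_at_snr(clean, noise_clip, 0)` would produce test utterances at the two noisy conditions named in the abstract; at 0 dB the scaled noise carries the same mean power as the speech.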
