Robustness Analysis of a Lightweight Audio Spectrogram Transformer for Speaker Identification under Noisy Conditions
DOI:
https://doi.org/10.14421/jiska.6170
Keywords:
Speaker Identification, Audio Spectrogram Transformer, Edge AI, Robustness Analysis, Deep Learning
Abstract
Deploying deep learning models for speaker identification on devices with limited computational resources requires substantial architectural optimization. This study evaluates the performance and robustness of a Lightweight Audio Spectrogram Transformer (AST) architecture compressed to 570,536 parameters. The proposed method feeds low-resolution Mel-spectrogram representations (64x64 pixels) to a global self-attention mechanism. Testing used a 5-fold cross-validation scheme on a dataset injected with non-stationary environmental noise from the ESC-50 corpus at several signal-to-noise ratio (SNR) levels. Under ideal conditions, the model achieves an average validation accuracy of 70.86% ± 2.69% with a macro-averaged F1-score of 0.68 ± 0.03. However, accuracy degrades sharply to 17.61% at an SNR of 5 dB and drops further to 9.21% at an SNR of 0 dB. These findings reveal a critical trade-off: radical parameter compression removes the spectral feature redundancy that acts as an implicit noise filter. The study concludes that, although lightweight Transformer mechanisms are highly efficient for Edge AI, pre-processing modules or noise-robust training strategies are necessary to maintain identification accuracy in noisy real-world environments.
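The noise-injection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact mixing procedure, so the following is an assumption: the noise clip is tiled or trimmed to the speech length and scaled so that the ratio of average speech power to average noise power matches the target SNR. The function name `mix_at_snr` and the synthetic signals are hypothetical and stand in for real speech and ESC-50 clips.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Tile the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent; cannot scale to a target SNR")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Synthetic stand-ins: a 220 Hz tone as "speech", white noise as "environment".
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)

mixed = mix_at_snr(speech, noise, snr_db=5.0)

# Recover the realized SNR from the mixture to confirm the scaling.
residual = mixed - speech
measured_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```

Computing the powers after the noise has been tiled and trimmed keeps the realized SNR equal to the requested one over the full utterance; other conventions (for example, measuring power only over voiced segments) would give different effective noise levels.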
License
Copyright (c) 2026 I Kadek Arya Sugianta, Gde Palguna Reganata

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors who publish with this journal agree to the following terms as stated in http://creativecommons.org/licenses/by-nc/4.0
a. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
b. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
c. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.




