Research on stuttering type detection based on YOLOv5

CHENG Zhen, JIA Jia-min, JIANG Zuo, WANG Xin

Journal of Yunnan University of Nationalities (Natural Sciences Edition) ›› 2025, Vol. 34 ›› Issue (01): 84-92. DOI: 10.3969/j.issn.1672-8513.2025.01.011
Information and Computer Science



Abstract

The language communication efficiency score is a method for quantifying the severity of stuttering, and it requires the time at which stuttering occurs. However, current related research can only determine whether stuttering is present in a speech segment and cannot accurately localize where it occurs, which hinders the assessment of stuttering severity. To address the problem that current deep learning approaches to stuttering type detection cannot visually localize the target, this paper first converts speech into spectrograms using the short-time Fourier transform, then annotates the spectrograms with stuttering types, and finally detects stuttering types with YOLOv5. Within the basic YOLOv5 framework, four models of different depths and widths (YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x) are evaluated for the classification and localization of stuttering types, and the efficient channel attention (ECA) mechanism and the CIoU bounding-box loss function are introduced into the best-performing model, YOLOv5l, to improve on the baseline. Experimental results show that the improved YOLOv5l model yields a clear reduction in training loss and increases precision, recall, and mAP_0.5 by 1.2, 0.6, and 0.4 percentage points respectively, with fewer missed detections than the original model.
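The pipeline described above first turns each speech clip into a log-magnitude spectrogram via the short-time Fourier transform, which YOLOv5 then treats as an ordinary image. A minimal sketch of that front-end step in Python using `scipy` (the 25 ms window, 10 ms hop, and 16 kHz sample rate are typical speech-analysis assumptions, not parameters stated in the paper):

```python
import numpy as np
from scipy.signal import stft

# Hypothetical 1-second speech clip at 16 kHz (random noise as a stand-in
# for real audio loaded from a stuttered-speech corpus).
sr = 16000
speech = np.random.randn(sr).astype(np.float32)

# Short-time Fourier transform: 25 ms windows (400 samples) with a
# 10 ms hop (overlap of 240 samples) -- common speech-analysis settings.
f, t, Z = stft(speech, fs=sr, nperseg=400, noverlap=240)

# Log-magnitude spectrogram, i.e. the image that would be fed to YOLOv5.
spectrogram = 20 * np.log10(np.abs(Z) + 1e-10)
print(spectrogram.shape)  # (frequency bins, time frames)
```

Each spectrogram would then be rendered as an image and annotated with bounding boxes over the stuttered regions, so that YOLOv5 can learn both the stuttering type (class) and its time-frequency location (box).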


Key words

YOLOv5 / stuttering recognition / spectrogram / target detection

CLC number

TN912.34

Cite this article

CHENG Zhen, JIA Jia-min, JIANG Zuo, et al. Research on stuttering type detection based on YOLOv5[J]. Journal of Yunnan University of Nationalities (Natural Sciences Edition), 2025, 34(01): 84-92. https://doi.org/10.3969/j.issn.1672-8513.2025.01.011


Funding

National Natural Science Foundation of China (61866040)
