本文经arXiv每日学术速递授权转载
链接:https://arxiv.org/abs/2501.01384
摘要:随着大型语言模型的快速发展,研究人员已经创建了越来越先进的口语对话系统,可以自然地与人类交谈。然而,这些系统仍然难以处理真实世界对话的全部复杂性,包括音频事件、音乐背景和情感表达,主要是因为当前的对话数据集在规模和场景多样性方面都受到限制。在本文中,我们提出利用合成数据来增强跨不同场景的对话模型。我们介绍了ShareChatX,这是第一个全面的、大规模的、涵盖多种场景的口语对话数据集。基于这个数据集,我们提出了OmniChat,一个带有异构特征融合模块的多轮对话系统,旨在优化不同对话情境下的特征选择。此外,我们还探讨了使用合成数据训练对话系统的关键问题。通过全面的实验,我们确定了合成数据和真实数据之间的理想平衡,并在真实世界对话数据集DailyTalk上取得了最先进的结果。我们还强调了合成数据在处理多样化、复杂的对话场景(特别是涉及音频和音乐的场景)中的关键作用。有关详细信息,请访问我们的演示页面,网址为\url{https://sharechatx.github.io/}。
摘要:With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at \url{https://sharechatx.github.io/}.
标题: AdaptVC:具有自适应学习的高质量语音转换
链接:https://arxiv.org/abs/2501.01347
备注:4 pages, 3 figures. Audio samples are available in the demo page: this https URL
摘要:语音转换的目标是将源说话人的语音转换为听起来像参考说话人的语音,同时保留原始内容。一个关键的挑战是从源语音中提取解耦的语言内容,并从参考语音中提取音色风格。虽然现有方法利用各种手段来分离两者,但其泛化能力仍需进一步关注,特别是zero-shot场景下的鲁棒性。在本文中,我们通过使用适配器微调自监督语音特征,成功实现了内容特征与说话人特征的解耦。适配器经过训练,可以从丰富的自监督特征中动态编码细致的特征,解码器再将它们融合,以生成与参考语音高度相似的语音,同时将内容损失降至最低。此外,我们采用带有交叉注意力说话人条件的条件流匹配解码器,以进一步提高合成质量和效率。在zero-shot场景下的主观和客观评估表明,所提出的方法在语音质量以及与参考语音的相似度方面均优于现有模型。
摘要:The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, a generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
标题: RingFormer:具有环注意力和卷积增强Transformer的神经声码器
链接:https://arxiv.org/abs/2501.01182
摘要:虽然Transformer在各种音频任务中表现出出色的性能,但它们在神经声码器中的应用仍然具有挑战性。神经声码器需要在样本级生成长音频信号,这要求很高的时间分辨率,导致注意力图生成带来显著的计算成本,并限制了其高效处理全局和局部信息的能力。此外,神经声码器中样本生成的顺序性给实时处理带来了困难,使得直接采用Transformer不切实际。为了解决这些挑战,我们提出了RingFormer,一种将环注意力机制纳入轻量级Transformer变体——卷积增强Transformer(Conformer)的神经声码器。环注意力在整合全局信息的同时有效地捕捉局部细节,使其非常适合处理长序列并实现实时音频生成。RingFormer使用带有两个鉴别器的对抗训练进行训练。该模型被应用于文本到语音模型VITS的解码器,并在相同条件下使用多种客观和主观指标,与HiFi-GAN、iSTFT-Net和BigVGAN等最先进的声码器进行比较。实验结果表明,RingFormer达到了与现有模型相当或更优的性能,尤其在实时音频生成方面表现出色。我们的代码和音频示例可以在GitHub上找到。
摘要:While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real-time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution-augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well-suited for processing long sequences and enabling real-time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text-to-speech model VITS and compared with state-of-the-art vocoders such as HiFi-GAN, iSTFT-Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real-time audio generation. Our code and audio samples are available on GitHub.
标题: 使用深度神经决策树和森林从咳嗽声音中稳健地检测COVID-19:全面的跨数据集评估
链接:https://arxiv.org/abs/2501.01117
备注:39 pages
摘要:这项研究提出了一种使用尖端机器学习技术对COVID-19咳嗽声进行分类的稳健方法。利用深度神经决策树和深度神经决策森林,我们的方法在不同的咳嗽声音数据集上表现出一致的性能。我们从全面提取特征开始,以捕捉来自个人的广泛音频特征,无论是COVID-19阳性还是阴性。为了确定最重要的特征,我们使用递归特征消除和交叉验证。贝叶斯优化微调深度神经决策树和深度神经决策森林模型的超参数。此外,我们在训练过程中集成了SMOTE,以确保正面和负面数据的平衡表示。通过阈值优化实现模型性能优化,最大化ROC-AUC评分。我们的方法在五个数据集上进行了全面的评估:Cambridge,Coswara,COUGHVID,Virufy以及Virufy与NoCoCoDa数据集的组合。与最先进的方法相比,我们提出的方法在各个数据集上产生了显著的AUC分数,分别为0.97、0.98、0.92、0.93、0.99和0.99。将所有数据集合并为一个组合数据集,我们的方法使用深度神经决策森林分类器,实现了0.97的AUC。此外,我们的研究还包括全面的跨数据集分析,揭示了与COVID-19相关的咳嗽声的人口统计学和地理差异。这些差异凸显了在不同数据集之间转移学习特征的挑战,并强调了数据集集成的潜在好处,提高了可推广性并增强了从音频信号中检测COVID-19的能力。
摘要:This research presents a robust approach to classifying COVID-19 cough sounds using cutting-edge machine-learning techniques. Leveraging deep neural decision trees and deep neural decision forests, our methodology demonstrates consistent performance across diverse cough sound datasets. We begin with a comprehensive extraction of features to capture a wide range of audio features from individuals, whether COVID-19 positive or negative. To determine the most important features, we use recursive feature elimination along with cross-validation. Bayesian optimization fine-tunes hyper-parameters of deep neural decision tree and deep neural decision forest models. Additionally, we integrate the SMOTE during training to ensure a balanced representation of positive and negative data. Model performance refinement is achieved through threshold optimization, maximizing the ROC-AUC score. Our approach undergoes a comprehensive evaluation in five datasets: Cambridge, Coswara, COUGHVID, Virufy, and the combined Virufy with the NoCoCoDa dataset. Consistently outperforming state-of-the-art methods, our proposed approach yields notable AUC scores of 0.97, 0.98, 0.92, 0.93, 0.99, and 0.99 across the respective datasets. Merging all datasets into a combined dataset, our method, using a deep neural decision forest classifier, achieves an AUC of 0.97. Also, our study includes a comprehensive cross-datasets analysis, revealing demographic and geographic differences in the cough sounds associated with COVID-19. These differences highlight the challenges in transferring learned features across diverse datasets and underscore the potential benefits of dataset integration, improving generalizability and enhancing COVID-19 detection from audio signals.
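为便于理解上面提到的流程(RFE-CV特征选择、SMOTE过采样、阈值优化),下面给出一个极简的示意草图。注意:论文使用的深度神经决策树/森林此处用scikit-learn的随机森林暂代,数据为合成数据,仅说明流程顺序,并非论文的官方实现。

```python
# 示意性流程草图(非论文官方实现): 特征选择(RFECV) + SMOTE 过采样 + 阈值选择。
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# 用合成数据代替真实的咳嗽声学特征
X, y = make_classification(n_samples=1000, n_features=40, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) 递归特征消除 + 交叉验证, 选出最重要的特征
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="roc_auc")
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# 2) SMOTE 平衡正负样本
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)

# 3) 训练分类器(此处用随机森林占位, 代替论文的深度神经决策森林), 评估 AUC
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
prob = clf.predict_proba(X_te_sel)[:, 1]
print("AUC:", roc_auc_score(y_te, prob))

# 4) 基于 ROC 曲线选择工作阈值(此处以 Youden J 作为示例准则)
fpr, tpr, thr = roc_curve(y_te, prob)
best_thr = thr[np.argmax(tpr - fpr)]
print("selected threshold:", best_thr)
```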
标题: MuQ:使用Mel残差矢量量化的自监督音乐表示学习
链接:https://arxiv.org/abs/2501.01108
摘要:近年来,使用自监督学习(SSL)预训练的基础模型在各种音乐信息学理解任务中取得了成功,包括音乐标注、乐器分类、调性检测等。在本文中,我们提出了一个用于音乐理解的自监督音乐表示学习模型。与以往采用随机投影或现有神经编解码器的研究不同,所提出的模型名为MuQ,通过预测Mel残差矢量量化(Mel-RVQ)生成的令牌进行训练。我们的Mel-RVQ采用残差线性投影结构对梅尔频谱进行量化,以提高目标提取的稳定性和效率,并带来更好的性能。在大量下游任务中的实验表明,仅使用0.9K小时的开源预训练数据,MuQ就优于以前的自监督音乐表示模型。将数据扩展到超过16万小时并采用迭代训练,可持续提升模型性能。为了进一步验证我们模型的能力,我们提出了MuQ-MuLan,一个基于对比学习的音乐-文本联合嵌入模型,它在MagnaTagATune数据集上的zero-shot音乐标注任务中实现了最先进的性能。代码和检查点在https://github.com/tencent-ailab/MuQ上开源。
摘要:Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
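下面是残差矢量量化(RVQ)思想的一个通用最小示意:逐级用最近邻码字量化残差。它只演示"预测Mel-RVQ令牌"中令牌的由来,码本为随机初始化,且未包含论文Mel-RVQ特有的残差线性投影结构,维度与级数均为假设值。

```python
# 通用 RVQ 的最小示意, 用随机码本对梅尔帧做多级量化; 并非论文 Mel-RVQ 的实现。
import numpy as np

rng = np.random.default_rng(0)
n_stages, codebook_size, dim = 4, 256, 128          # 级数/码本大小/梅尔维度均为假设值
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(frame, codebooks):
    """返回每一级选中的码字索引; frame 形状为 (dim,)。"""
    residual, tokens = frame.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # 最近邻码字
        tokens.append(idx)
        residual = residual - cb[idx]                              # 下一级量化残差
    return tokens

def rvq_decode(tokens, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

mel_frame = rng.normal(size=dim)                     # 用随机向量代替真实梅尔帧
tokens = rvq_encode(mel_frame, codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens, float(np.linalg.norm(mel_frame - recon)))
```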
标题: FAST:快速音频频谱图Transformer
链接:https://arxiv.org/abs/2501.01104
备注:Accepted at ICASSP 2025
摘要:在音频分类中,开发高效且鲁棒的模型对于实时应用至关重要。受MobileViT设计原则的启发,我们提出了FAST(快速音频频谱图Transformer),这是一种结合卷积神经网络(CNN)和Transformer的新架构,以利用两者的优势。FAST将CNN高效的局部特征提取与Transformer的全局上下文建模能力相结合,从而得到一个强大而轻量的模型,非常适合实时或移动用例。此外,我们采用Lipschitz连续注意力机制来提高训练稳定性并加速收敛。我们在ADIMA数据集(一个面向实时亵渎和辱骂检测的多语言语料库)以及更传统的AudioSet上评估FAST。我们的研究结果表明,FAST在ADIMA和AudioSet分类任务上都达到了最先进的性能,并在某些情况下超过了现有基准,同时参数量最多减少150倍。
摘要:In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the local feature extraction efficiencies of CNNs with the global context modeling capabilities of transformers, resulting in a model that is powerful yet lightweight, well-suited to a real-time or mobile use case. Additionally, we incorporate Lipschitz continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus towards real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves state-of-the-art performance on both the ADIMA and AudioSet classification tasks and in some cases surpasses existing benchmarks while using up to 150x fewer parameters.
标题: MMVA:基于图像、音乐和音乐描述之间效价与唤醒的多模态匹配
链接:https://arxiv.org/abs/2501.01094
备注:Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025
摘要:我们提出了基于效价和唤醒的多模态匹配(MMVA),这是一个三模态编码器框架,旨在捕获图像、音乐和音乐描述中的情感内容。为了支持这个框架,我们扩展了Image-Music-Emotion-Matching-Net(IMEMNet)数据集,创建了IMEMNet-C,其中包括24,756张图像和25,944个带有相应音乐描述的音乐片段。我们采用基于连续效价(情绪积极性)和唤醒(情绪强度)值的多模态匹配分数。这种连续的匹配分数允许在训练期间通过计算不同模态效价-唤醒值之间的相似度来随机采样图像-音乐对。因此,所提出的方法在效价-唤醒预测任务中实现了最先进的性能。此外,该框架在各种zero-shot任务中证明了其有效性,突出了效价和唤醒预测在下游应用中的潜力。
摘要:We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zeroshot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
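下面用一小段代码示意"由效价-唤醒值计算连续匹配分数并据此随机采样图像-音乐对"的思路。其中基于欧氏距离的指数打分只是本示例的假设,论文的具体打分公式可能不同。

```python
# 示意: 由图像与音乐片段的 (效价, 唤醒) 值计算连续匹配分数, 并按分数随机采样配对。
import numpy as np

rng = np.random.default_rng(0)
img_va = rng.uniform(-1, 1, size=(8, 2))      # 8 张图像的 (valence, arousal)
mus_va = rng.uniform(-1, 1, size=(10, 2))     # 10 段音乐的 (valence, arousal)

dist = np.linalg.norm(img_va[:, None, :] - mus_va[None, :, :], axis=-1)
score = np.exp(-dist)                         # 距离越近, 匹配分数越高(示例打分)

# 训练时可将分数归一化为采样概率, 为每张图像随机抽取一段音乐
probs = score / score.sum(axis=1, keepdims=True)
sampled = [int(rng.choice(len(mus_va), p=p)) for p in probs]
print(sampled)
```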
标题: 推进新加坡式英语的理解:用数据集和多模态模型弥合差距
链接:https://arxiv.org/abs/2501.01034
备注:Open-Source: this https URL
摘要:新加坡式英语是一种源于英语的克里奥尔语,是多语言和多元文化背景下语言学研究的重点。然而,它的口语形式仍未得到充分探索,限制了对其语言结构和应用的深入了解。为了填补这一空白,我们对最大的新加坡式英语口语语料库进行了标准化和标注,推出了多任务国家语音语料库(MNSC)。这些数据支持多种任务,包括自动语音识别(ASR)、口语问答(SQA)、口语对话摘要(SDS)和副语言问答(PQA)。我们发布了标准化的数据划分和经人工验证的测试集,以促进进一步的研究。此外,我们提出了SingAudioLLM,一个利用多模态大型语言模型同时处理这些任务的多任务多模态模型。实验表明,我们的模型能够很好地适应新加坡式英语语境,实现了最先进的性能,与其他AudioLLM和级联方案相比,性能超出先前模型10-30%。
摘要:Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our models adaptability to Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.
标题: U-GIFT:针对Few-Shot场景中有毒言论的不确定性引导防火墙
链接:https://arxiv.org/abs/2501.00907
备注:16 pages, 6 figures and 10 tables. Comments are welcome
摘要:随着社交媒体的广泛使用,在线平台上的用户生成内容激增。当此类内容包含仇恨、辱骂、攻击或网络欺凌行为时,它被归类为有毒言论,对在线生态系统的完整性和安全构成重大威胁。虽然人工内容审核仍然很普遍,但海量的内容和人类审核员承受的心理压力凸显了自动有毒言论检测的必要性。以前提出的检测方法通常依赖于大型标注数据集;然而,在实践中获取这样的数据集既昂贵又具有挑战性。为了解决这个问题,我们提出了U-GIFT,一种面向Few-Shot场景的有毒言论不确定性引导防火墙,它利用自训练在标注数据有限的情况下提升检测性能。具体来说,U-GIFT将主动学习与贝叶斯神经网络(BNN)相结合,从未标注数据中自动识别高质量样本,并根据模型预测得出的不确定性估计,优先选择置信度更高的伪标签用于训练。大量实验表明,U-GIFT在Few-Shot检测场景中显著优于有竞争力的基线。在5-shot设置中,它比基础模型实现了14.92%的性能提升。重要的是,U-GIFT对用户友好,并可适配各种预训练语言模型(PLM)。它在样本不平衡和跨域设置下也表现出稳健的性能,同时在各种语言应用中展示了强大的泛化能力。我们相信,U-GIFT为Few-Shot有毒言论检测提供了有效的解决方案,为网络空间的自动内容审核提供了实质性支持,从而充当促进网络安全进步的防火墙。
摘要:With the widespread use of social media, user-generated content has surged on online platforms. When such content includes hateful, abusive, offensive, or cyberbullying behavior, it is classified as toxic speech, posing a significant threat to the online ecosystem's integrity and safety. While manual content moderation is still prevalent, the overwhelming volume of content and the psychological strain on human moderators underscore the need for automated toxic speech detection. Previously proposed detection methods often rely on large annotated datasets; however, acquiring such datasets is both costly and challenging in practice. To address this issue, we propose an uncertainty-guided firewall for toxic speech in few-shot scenarios, U-GIFT, that utilizes self-training to enhance detection performance even when labeled data is limited. Specifically, U-GIFT combines active learning with Bayesian Neural Networks (BNNs) to automatically identify high-quality samples from unlabeled data, prioritizing the selection of pseudo-labels with higher confidence for training based on uncertainty estimates derived from model predictions. Extensive experiments demonstrate that U-GIFT significantly outperforms competitive baselines in few-shot detection scenarios. In the 5-shot setting, it achieves a 14.92\% performance improvement over the basic model. Importantly, U-GIFT is user-friendly and adaptable to various pre-trained language models (PLMs). It also exhibits robust performance in scenarios with sample imbalance and cross-domain settings, while showcasing strong generalization across various language applications. We believe that U-GIFT provides an efficient solution for few-shot toxic speech detection, offering substantial support for automated content moderation in cyberspace, thereby acting as a firewall to promote advancements in cybersecurity.
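下面给出"利用不确定性挑选高置信伪标签用于自训练"这一思路的通用示意。为简化,用MC Dropout近似贝叶斯神经网络的不确定性估计,模型结构、文本编码和阈值均为假设,并非U-GIFT的实现。

```python
# 示意: 用 MC Dropout 近似不确定性, 从无标注样本中挑选高置信伪标签。
import torch, torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, dim=128, n_cls=2, p=0.3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Dropout(p), nn.Linear(64, n_cls))
    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    model.train()                                   # 保持 dropout 开启以采样多次前向
    probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    mean = probs.mean(0)                            # 预测均值
    uncertainty = probs.std(0).mean(-1)             # 用预测方差近似不确定性
    return mean, uncertainty

model = TextClassifier()
unlabeled = torch.randn(100, 128)                   # 用随机向量代替文本编码
mean, unc = mc_dropout_predict(model, unlabeled)
conf, pseudo = mean.max(-1)

# 只保留"高置信且低不确定性"的样本用于自训练(阈值为示例值)
keep = (conf > 0.9) & (unc < 0.05)
print(int(keep.sum()), "pseudo-labeled samples selected")
```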
标题: SoundBrush:声音作为视觉场景编辑的画笔
链接:https://arxiv.org/abs/2501.00645
备注:AAAI 2025
摘要:我们提出了SoundBrush,一种把声音当作画笔来编辑和操纵视觉场景的模型。我们扩展了潜在扩散模型(LDM)的生成能力,将音频信息用于编辑视觉场景。受现有图像编辑工作的启发,我们将此任务视为监督学习问题,并利用各种现成模型构建声音-视觉场景配对数据集用于训练。这一内容丰富的生成数据集使SoundBrush能够学习将音频特征映射到LDM的文本空间,从而允许在各种真实环境声音的引导下进行视觉场景编辑。与现有方法不同,SoundBrush可以在保留原始内容的同时,准确地操纵整体场景,甚至插入发声物体以最佳匹配音频输入。此外,通过与新视角合成技术相结合,我们的框架可以扩展到编辑3D场景,实现声音驱动的3D场景操纵。演示可在https://soundbrush.github.io/上获得。
摘要:We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at https://soundbrush.github.io/.
标题: 使用口语训练和评估抑郁症风险模型的数据库大小要求
链接:https://arxiv.org/abs/2501.00617
摘要:心理健康风险预测是语音领域一个不断发展的方向,但许多研究都基于小型语料库。本研究说明了测试集和训练集规模的变化如何在受控研究中影响性能。我们使用包含超过65K标注数据点的语料库,给出了不同训练/测试集规模组合的完全交叉设计结果。研究包括两种模型:一种基于语言,另一种基于语音声学,两者都使用该领域当前通行的方法,并且还包括一个年龄不匹配的测试集。结果表明:(1)测试集小于1K样本时结果噪声较大,即使训练集较大也是如此;(2)稳定的结果需要至少2K的训练集规模;(3)NLP模型和声学模型随训练/测试集规模变化的表现相似;(4)不匹配的测试集显示出与匹配测试集相同的模式。文中还讨论了其他因素,包括标签先验、模型能力和预训练、说话人数量以及数据长度。虽然没有任何单一研究能够给出确切的规模要求,但结果表明,未来基于语音和语言的心理健康风险预测研究需要规模适当的训练集和测试集。
摘要:Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.
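下面用一个极简脚本示意"训练/测试集规模完全交叉设计"的实验组织方式。分类器与数据均为占位(逻辑回归+合成特征),仅说明实验设计,与论文所用的语言/声学模型无关。

```python
# 示意: 对不同训练集/测试集规模做完全交叉实验, 观察结果随规模的稳定性。
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
train_sizes, test_sizes = [500, 2000, 8000], [250, 1000, 4000]   # 示例规模

for n_tr in train_sizes:
    for n_te in test_sizes:
        idx = rng.permutation(len(y))                 # 每个组合重新随机划分
        tr, te = idx[:n_tr], idx[n_tr:n_tr + n_te]
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        auc = roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1])
        print(f"train={n_tr:5d} test={n_te:5d} AUC={auc:.3f}")
```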
标题: Fotheidil:爱尔兰语自动转录系统
链接:https://arxiv.org/abs/2501.00509
备注:Accepted to the 5th Celtic Language Technology Workshop within COLING 2025
摘要:本文介绍了第一个基于网络的爱尔兰语转录系统——Fotheidil,该系统作为ABAIR计划的一部分,利用了与语音相关的人工智能技术。该系统既包括现成的预训练语音活动检测和说话人分割模型,也包括专门为爱尔兰语自动语音识别以及大小写和标点恢复训练的模型。我们探索了用半监督学习来改进模块化TDNN-HMM ASR系统的声学模型,对在监督训练集中代表性不足的域外测试集和方言带来了实质性改进。我们还将一种基于序列到序列模型的大小写与标点恢复新方法,与使用分类模型的传统方法进行了比较,实验结果同样显示出显著的性能提升。该系统将免费供公众使用,是研究人员和其他转录爱尔兰语材料人士的重要资源。随着系统的使用,经人工校正的转写文本将被收集并纳入训练数据集,从而以周期性的、社区驱动的方式逐步改进ASR模型。
摘要:This paper sets out the first web-based transcription system for the Irish language - Fotheidil, a system that utilises speech-related AI technologies as part of the ABAIR initiative. The system includes both off-the-shelf pre-trained voice activity detection and speaker diarisation models and models trained specifically for Irish automatic speech recognition and capitalisation and punctuation restoration. Semi-supervised learning is explored to improve the acoustic model of a modular TDNN-HMM ASR system, yielding substantial improvements for out-of-domain test sets and dialects that are underrepresented in the supervised training set. A novel approach to capitalisation and punctuation restoration involving sequence-to-sequence models is compared with the conventional approach using a classification model. Experimental results show here also substantial improvements in performance. The system will be made freely available for public use, and represents an important resource to researchers and others who transcribe Irish language materials. Human-corrected transcriptions will be collected and included in the training dataset as the system is used, which should lead to incremental improvements to the ASR model in a cyclical, community-driven fashion.
标题: 展开创造性对抗网络,生成新颖的音乐片段
链接:https://arxiv.org/abs/2501.00452
摘要:近年来,音乐生成已成为人工智能和机器学习的一个重要主题。最近的工作大多将基于RNN的神经网络方法应用于序列生成;相比之下,只有很少的研究人员将生成对抗网络(GANs)及其变体用于音乐生成。本文将一个经典系统与一个新系统一起用于生成具有创造性的音乐。这两个系统都基于对抗网络设计,通过从样例中学习来生成音乐。经典系统被训练来学习一组音乐作品而不区分类别,而新系统被训练来学习不同作曲家及其风格,通过偏离所学作曲家的风格来生成有创意的音乐作品。所采用的基础结构是生成对抗网络(GANs),它能够在给定一组输入的情况下学习并模仿其分布,从而生成新的输出。以往工作表明,GANs的原始设计在创造性输出方面存在局限。本文在创造性对抗网络(CAN)的基础上,将其应用于音乐领域而非视觉艺术领域,并引入了展开式(unrolled)CAN以防止模式崩溃。我们在GAN和CAN上进行了音乐生成实验,并根据与输入集的偏差来衡量它们的能力。
摘要:Music generation has been established as a prominent topic in artificial intelligence and machine learning over recent years. In most recent works on RNN-based neural network methods have been applied for sequence generation. In contrast, generative adversarial networks (GANs) and their counterparts have been explored by very few researchersfor music generation. In this paper, a classical system was employed alongside a new system to generate creative music. Both systems were designed based on adversarial networks to generate music by learning from examples. The classical system was trained to learn a set of music pieces without differentiating between classes, whereas the new system was trained to learn the different composers and their styles to generate a creative music piece by deviating from the learned composers' styles. The base structure utilized was generative adversarial networks (GANs), which are capable of generating novel outputs given a set of inputs to learn from and mimic their distribution. It has been shown in previous work that GANs are limited in their original design with respect to creative outputs. Building on the Creative Adversarial Networks (CAN) , this work applied them in the music domain rather than the visual art domain. Additionally, unrolled CAN was introduced to prevent mode collapse. Experiments were conducted on both GAN and CAN for generating music, and their capabilities were measured in terms of deviation from the input set.
标题: Whisper变得更强大:增强Wav2Vec 2.0以在低资源语言中实现卓越的ASR
链接:https://arxiv.org/abs/2501.00425
备注:15 pagesm 3 figures
摘要:在低资源语言中处理语音转文本和自动语音识别问题是众所周知的挑战,原因在于缺乏经过验证的数据集以及方言的多样性。阿拉伯语、俄语和葡萄牙语就是这些困难的例子:由于这些语言在世界各大洲有众多方言,它们属于资源匮乏的语言。此外,这些语言口音和发音的多样性也使ASR模型难以取得成功。随着深度学习和Transformer的日益普及,与最先进的方法相比,著名的Wav2Vec2等声学模型在语音识别领域取得了卓越的性能。然而,尽管Wav2Vec2比传统方法效率更高,但对于代表性不足的语言,其性能显著下降,即使它所需的标注数据明显更少。本文介绍了一个端到端框架,通过数据增强技术来增强基于Wav2Vec2微调的ASR系统。为了验证我们框架的有效性,我们使用Mozilla Common Voice项目中阿拉伯语、俄语和葡萄牙语的三个数据集进行了详细的实验评估。此外,本文提出的框架对不同变音符号表现出鲁棒性。最终,我们的方法优于之前的两个基线模型,即预训练的Wav2Vec2和著名的Whisper ASR模型,词错误率平均相对降低33.9%,字符错误率相对降低53.2%。
摘要:Approaching Speech-to-Text and Automatic Speech Recognition problems in low-resource languages is notoriously challenging due to the scarcity of validated datasets and the diversity of dialects. Arabic, Russian, and Portuguese exemplify these difficulties, being low-resource languages due to the many dialects of these languages across different continents worldwide. Moreover, the variety of accents and pronunciations of such languages complicate ASR models' success. With the increasing popularity of Deep Learning and Transformers, acoustic models like the renowned Wav2Vec2 have achieved superior performance in the Speech Recognition field compared to state-of-the-art approaches. However, despite Wav2Vec2's improved efficiency over traditional methods, its performance significantly declines for under-represented languages, even though it requires significantly less labeled data. This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. To validate our framework's effectiveness, we conducted a detailed experimental evaluation using three datasets from Mozilla's Common Voice project in Arabic, Russian, and Portuguese. Additionally, the framework presented in this paper demonstrates robustness to different diacritics. Ultimately, our approach outperforms two previous baseline models, which are the pre-trained Wav2Vec2 and the well-known Whisper ASR model, resulting in an average relative improvement of 33.9\% in Word Error Rate and a 53.2\% relative improvement in Character Error Rate.
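摘要未给出具体的数据增强细节,下面仅列出语音增强中几种常见操作(加噪、随机增益、时间平移)的最小实现,作为"通过数据增强改进Wav2Vec2微调"这一思路的通用示意,并非论文方法。

```python
# 示意: 语音数据增强的几个常见操作(加噪、随机增益、时间平移)。
import numpy as np

def add_noise(wav, rng, snr_db=20.0):
    """按给定信噪比向波形叠加高斯白噪声。"""
    noise = rng.normal(size=wav.shape)
    scale = np.sqrt((wav.var() + 1e-12) / (noise.var() * 10 ** (snr_db / 10)))
    return wav + scale * noise

def random_gain(wav, rng, low_db=-6.0, high_db=6.0):
    """在给定分贝范围内随机缩放音量。"""
    return wav * 10 ** (rng.uniform(low_db, high_db) / 20)

def time_shift(wav, rng, max_shift=1600):
    """在 ±max_shift 个采样点内循环平移波形。"""
    return np.roll(wav, rng.integers(-max_shift, max_shift + 1))

rng = np.random.default_rng(0)
wav = rng.normal(size=16000).astype(np.float32)      # 用随机信号代替 1 秒真实语音
augmented = time_shift(random_gain(add_noise(wav, rng), rng), rng)
print(augmented.shape, float(augmented.std()))
```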
标题: TSPE:用于改进Zero-Shot音频分类的特定任务提示集合
链接:https://arxiv.org/abs/2501.00398
备注:5 pages
摘要:音频语言模型(ALM)擅长zero-shot音频分类,即模型利用描述性自然语言提示,在测试时对此前未见过的音频片段进行分类。我们提出了TSPE(任务特定提示集成),一种简单的、无需训练的硬提示方法,通过为不同的音频分类任务定制提示来提升ALM的zero-shot性能。我们不使用"汽车的声音"这类基于通用模板的提示,而是生成上下文丰富的提示,例如"从隧道里驶来的汽车的声音"。具体来说,我们利用标签信息来识别合适的声音属性(如"响亮"和"微弱")以及恰当的声源(如"隧道"和"街道"),并将这些信息纳入音频语言模型(ALM)用于音频分类的提示中。此外,为了增强音频-文本对齐,我们对TSPE生成的任务特定提示执行提示集成。在12个不同的音频分类数据集上进行评估时,TSPE相比普通zero-shot评估带来1.23-16.36%的绝对提升,改善了各ALM的性能。
摘要:Audio-language models (ALMs) excel in zero-shot audio classification, a task where models classify previously unseen audio clips at test time by leveraging descriptive natural language prompts. We introduce TSPE (Task-Specific Prompt Ensemble), a simple, training-free hard prompting method that boosts ALEs' zero-shot performance by customizing prompts for diverse audio classification tasks. Rather than using generic template-based prompts like "Sound of a car" we generate context-rich prompts, such as "Sound of a car coming from a tunnel". Specifically, we leverage label information to identify suitable sound attributes, such as "loud" and "feeble", and appropriate sound sources, such as "tunnel" and "street" and incorporate this information into the prompts used by Audio-Language Models (ALMs) for audio classification. Further, to enhance audio-text alignment, we perform prompt ensemble across TSPE-generated task-specific prompts. When evaluated on 12 diverse audio classification datasets, TSPE improves performance across ALMs by showing an absolute improvement of 1.23-16.36% over vanilla zero-shot evaluation.
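下面示意任务特定提示集成(TSPE)的零样本打分流程:为每个类别构造多条上下文丰富的提示,对文本嵌入取平均后与音频嵌入计算相似度。其中embed_text/embed_audio是假设的占位函数,代表某个CLAP类音频-语言模型的文本/音频编码接口。

```python
# 示意: 任务特定提示集成的零样本分类流程; 编码函数为占位假设, 非任何真实模型的 API。
import numpy as np

def embed_text(prompts):                      # 假设: 返回 (N, D) 的 L2 归一化文本嵌入
    rng = np.random.default_rng(abs(hash(tuple(prompts))) % (2**32))
    e = rng.normal(size=(len(prompts), 512))
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def embed_audio(wav):                         # 假设: 返回 (D,) 的 L2 归一化音频嵌入
    e = np.random.default_rng(0).normal(size=512)
    return e / np.linalg.norm(e)

# 为每个类别生成多条上下文丰富的提示, 再对其嵌入取平均(提示集成)
label_prompts = {
    "car":   ["Sound of a car coming from a tunnel", "Sound of a loud car on the street"],
    "siren": ["Sound of a feeble siren far away", "Sound of a loud siren on the street"],
}
class_embs = {c: embed_text(p).mean(axis=0) for c, p in label_prompts.items()}

wav = np.zeros(16000)                         # 占位音频
audio_emb = embed_audio(wav)
scores = {c: float(audio_emb @ (e / np.linalg.norm(e))) for c, e in class_embs.items()}
print(max(scores, key=scores.get), scores)
```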
标题: 用于语音分类的脉冲神经网络中的时间重构和非对齐残差
链接:https://arxiv.org/abs/2501.00348
备注:9 pages, 5 figures
摘要:近年来可以注意到,大多数基于脉冲神经网络(spiking neural networks,SNNs)的模型在处理语音分类问题时,都只使用同一时间尺度的时间分辨率,这使得这些模型无法学习输入数据在不同时间尺度上的信息。此外,由于许多模型的子模块前后数据的时间长度不同,无法应用有效的残差连接来优化这些模型的训练过程。为了解决这些问题,一方面,本文借鉴人脑理解语音的分层处理过程,提出了一种名为时间重构(Temporal Reconstruction,TR)的新方法,对音频频谱的时间维度进行重构。带有TR的重构SNN模型能够在不同时间分辨率下学习输入信息,从而在不同时间尺度上学习输入数据的信息,并从音频数据中建模更全面的语义信息。另一方面,我们通过分析音频数据提出了非对齐残差(Non-Aligned Residual,NAR)方法,使残差连接可以用于两段时间长度不同的音频数据之间。我们在Spiking Speech Commands(SSC)、Spiking Heidelberg Digits(SHD)和Google Speech Commands v0.02(GSC)数据集上进行了大量实验。实验结果表明,我们在SSC上取得了所有SNN模型中最优(SOTA)的测试分类准确率81.02%,在SHD上取得了所有模型中最优的分类准确率96.04%。
摘要:Recently, it can be noticed that most models based on spiking neural networks (SNNs) only use a same level temporal resolution to deal with speech classification problems, which makes these models cannot learn the information of input data at different temporal scales. Additionally, owing to the different time lengths of the data before and after the sub-modules of many models, the effective residual connections cannot be applied to optimize the training processes of these models.To solve these problems, on the one hand, we reconstruct the temporal dimension of the audio spectrum to propose a novel method named as Temporal Reconstruction (TR) by referring the hierarchical processing process of the human brain for understanding speech. Then, the reconstructed SNN model with TR can learn the information of input data at different temporal scales and model more comprehensive semantic information from audio data because it enables the networks to learn the information of input data at different temporal resolutions. On the other hand, we propose the Non-Aligned Residual (NAR) method by analyzing the audio data, which allows the residual connection can be used in two audio data with different time lengths. We have conducted plentiful experiments on the Spiking Speech Commands (SSC), the Spiking Heidelberg Digits (SHD), and the Google Speech Commands v0.02 (GSC) datasets. According to the experiment results, we have achieved the state-of-the-art (SOTA) result 81.02\% on SSC for the test classification accuracy of all SNN models, and we have obtained the SOTA result 96.04\% on SHD for the classification accuracy of all models.
标题: VoxVietnam:用于越南语说话人识别的大规模多体裁数据集
链接:https://arxiv.org/abs/2501.00328
备注:Accepted to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
摘要:说话人识别领域的最新研究旨在解决注册语音与测试语音之间差异带来的脆弱性,特别是两者属于不同语音体裁的多体裁现象。以往用于越南语说话人识别的资源要么规模有限,要么不关注体裁多样性,使得多体裁影响的研究尚属空白。本文介绍了VoxVietnam,这是第一个用于越南语说话人识别的多体裁数据集,包含来自1,406个说话人的超过187,000条语音,以及一条可从公开来源大规模构建数据集的自动化流水线。我们的实验展示了多体裁现象给在单一体裁数据集上训练的模型带来的挑战,并证明将VoxVietnam纳入训练过程后性能显著提升。这些实验旨在研究多体裁现象给说话人识别带来的挑战,以及使用所提数据集进行多体裁训练时的性能增益。
摘要:Recent research in speaker recognition aims to address vulnerabilities due to variations between enrolment and test utterances, particularly in the multi-genre phenomenon where the utterances are in different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving studies in multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition with over 187,000 utterances from 1,406 speakers and an automated pipeline to construct a dataset on a large scale from public sources. Our experiments show the challenges posed by the multi-genre phenomenon to models trained on a single-genre dataset, and demonstrate a significant increase in performance upon incorporating the VoxVietnam into the training process. Our experiments are conducted to study the challenges of the multi-genre phenomenon in speaker recognition and the performance gain when the proposed dataset is used for multi-genre training.
标题: 语音评估分类器的集成
链接:https://arxiv.org/abs/2501.00067
摘要:本文描述了应用二元分类器集成来解决医学语音评估问题的一次尝试。基于对音节发音质量的定量评估和专家评估构建了一个数据集。选取7个定量指标作为特征:动态时间规整(DTW)距离、Minkowski距离、相关系数、最长公共子序列(LCSS)、实序列编辑距离(EDR)、实惩罚编辑距离(ERP)和合并分裂(MSM)。专家对发音质量的评估被用作类别标签:类别1表示高质量语音,类别0表示失真语音。比较了五种分类方法的训练结果:逻辑回归(LR)、支持向量机(SVM)、朴素贝叶斯(NB)、决策树(DT)和K最近邻(KNN)。文中还给出了使用混合(mixture)方法构建分类器集成的结果。与单独使用各二元分类器相比,在所研究的数据集上使用分类器集成使分类精度略有提升。
摘要:The article describes an attempt to apply an ensemble of binary classifiers to solve the problem of speech assessment in medicine. A dataset was compiled based on quantitative and expert assessments of syllable pronunciation quality. Quantitative assessments of 7 selected metrics were used as features: dynamic time warp distance, Minkowski distance, correlation coefficient, longest common subsequence (LCSS), edit distance of real se-quence (EDR), edit distance with real penalty (ERP), and merge split (MSM). Expert as-sessment of pronunciation quality was used as a class label: class 1 means high-quality speech, class 0 means distorted. A comparison of training results was carried out for five classification methods: logistic regression (LR), support vector machine (SVM), naive Bayes (NB), decision trees (DT), and K-nearest neighbors (KNN). The results of using the mixture method to build an ensemble of classifiers are also presented. The use of an en-semble for the studied data sets allowed us to slightly increase the classification accuracy compared to the use of individual binary classifiers.
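下面给出用scikit-learn将摘要中列出的五种基分类器(LR/SVM/NB/DT/KNN)组合为投票集成的最小示意。输入的7维特征(DTW、Minkowski距离等)此处用随机数占位;论文采用的"混合(mixture)"集成方式可能与此处的软投票不同。

```python
# 示意: 五种基分类器的软投票集成; 特征为占位随机数。
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))        # 7 列: DTW、Minkowski、相关系数、LCSS、EDR、ERP、MSM
y = rng.integers(0, 2, size=200)     # 1 = 高质量发音, 0 = 失真

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier()),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",                   # 软投票: 平均各分类器的预测概率
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```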
标题: Lungmix:一种基于Mixup的呼吸音分类泛化策略
链接:https://arxiv.org/abs/2501.00064
备注:4pages, 3 figures, conference paper
摘要:呼吸音分类在呼吸系统疾病的诊断中起着至关重要的作用。虽然深度学习模型在各种呼吸音数据集上取得了成功,但我们的实验表明,在一个数据集上训练的模型通常无法有效地泛化到其他数据集,这主要是由于数据收集和标注的不一致。为了解决这一局限,我们提出了Lungmix,一种受Mixup启发的新型数据增强技术。Lungmix通过使用响度和随机掩码混合波形来生成增强数据,同时根据标签的语义进行插值,帮助模型学习更具泛化性的表示。在ICBHI、SPR和HF三个数据集上的综合评估表明,Lungmix显著增强了模型对未见数据的泛化能力。特别是,Lungmix将4类分类得分最多提高3.55%,达到与直接在目标数据集上训练的模型相当的性能。
摘要:Respiratory sound classification plays a pivotal role in diagnosing respiratory diseases. While deep learning models have shown success with various respiratory sound datasets, our experiments indicate that models trained on one dataset often fail to generalize effectively to others, mainly due to data collection and annotation \emph{inconsistencies}. To address this limitation, we introduce \emph{Lungmix}, a novel data augmentation technique inspired by Mixup. Lungmix generates augmented data by blending waveforms using loudness and random masks while interpolating labels based on their semantic meaning, helping the model learn more generalized representations. Comprehensive evaluations across three datasets, namely ICBHI, SPR, and HF, demonstrate that Lungmix significantly enhances model generalization to unseen data. In particular, Lungmix boosts the 4-class classification score by up to 3.55\%, achieving performance comparable to models trained directly on the target dataset.
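按摘要的描述,下面给出Lungmix的一种可能实现示意:用响度归一化与随机掩码混合两段波形,并按比例插值标签。掩码形状、权重定义等细节是本示例的假设,请以论文为准。

```python
# 示意: 用响度权重与随机掩码混合两段波形, 并按混合比例插值标签(细节为假设)。
import numpy as np

def lungmix(wav_a, wav_b, label_a, label_b, rng):
    lam = rng.beta(1.0, 1.0)                         # 混合系数
    # 响度归一化: 让两段波形在混合前能量相当
    wav_b = wav_b * (np.sqrt(wav_a.var() + 1e-12) / np.sqrt(wav_b.var() + 1e-12))
    mask = (rng.random(wav_a.shape) < lam).astype(wav_a.dtype)   # 随机掩码
    mixed_wav = mask * wav_a + (1 - mask) * wav_b
    mixed_label = lam * label_a + (1 - lam) * label_b            # 标签插值
    return mixed_wav, mixed_label

rng = np.random.default_rng(0)
a, b = rng.normal(size=16000), 0.1 * rng.normal(size=16000)
la, lb = np.array([1, 0, 0, 0], float), np.array([0, 0, 1, 0], float)  # 4 类 one-hot
wav, lab = lungmix(a, b, la, lb, rng)
print(wav.shape, lab)
```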
标题: 基于声音的触摸手势和情绪识别以增强人机交互
链接:https://arxiv.org/abs/2501.00038
备注:ICASSP 2025
摘要:情绪识别和触摸手势解码对于推进人机交互(HRI)至关重要,特别是在情感线索和触觉感知发挥重要作用的社交环境中。然而,许多类人机器人,如Pepper、Nao和Furhat,缺乏全身触觉皮肤,限制了它们进行基于触摸的情感和手势交互的能力。此外,基于视觉的情绪识别方法由于需要收集个人面部数据,通常面临严格的GDPR合规挑战。为了解决这些局限并避免隐私问题,本文研究了利用HRI过程中触摸产生的声音来识别触觉手势,并沿唤醒和效价维度对情绪进行分类的潜力。基于28名参与者与类人机器人Pepper进行触觉手势和情感交互的数据集,我们设计了一个仅用音频的轻量级触摸手势与情绪识别模型,参数量仅为0.24M,模型大小为0.94MB,FLOPs为0.7G。实验结果表明,在输入音频长度变化时,所提出的基于声音的触摸手势和情绪识别模型能够有效识别不同情绪的唤醒和效价状态以及各种触觉手势。该模型延迟低,取得了与知名的预训练音频神经网络(PANNs)相近的结果,而FLOPs、参数量和模型大小都要小得多。
摘要:Emotion recognition and touch gesture decoding are crucial for advancing human-robot interaction (HRI), especially in social environments where emotional cues and tactile perception play important roles. However, many humanoid robots, such as Pepper, Nao, and Furhat, lack full-body tactile skin, limiting their ability to engage in touch-based emotional and gesture interactions. In addition, vision-based emotion recognition methods usually face strict GDPR compliance challenges due to the need to collect personal facial data. To address these limitations and avoid privacy issues, this paper studies the potential of using the sounds produced by touching during HRI to recognise tactile gestures and classify emotions along the arousal and valence dimensions. Using a dataset of tactile gestures and emotional interactions from 28 participants with the humanoid robot Pepper, we design an audio-only lightweight touch gesture and emotion recognition model with only 0.24M parameters, 0.94MB model size, and 0.7G FLOPs. Experimental results show that the proposed sound-based touch gesture and emotion recognition model effectively recognises the arousal and valence states of different emotions, as well as various tactile gestures, when the input audio length varies. The proposed model is low-latency and achieves similar results as well-known pretrained audio neural networks (PANNs), but with much smaller FLOPs, parameters, and model size.
标题: SECodec:用于语音语言模型的基于结构信息的压缩语音表示编解码器
链接:https://arxiv.org/abs/2501.00018
备注:Accepted to the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
摘要:随着大型语言模型(LLM)的快速发展,离散语音表示已成为将语音集成到LLM中的关键。现有的语音表示离散化方法依赖于预定义的码本大小和基于欧氏距离的量化。然而,1)码本大小是同时影响编解码器性能和下游任务训练效率的关键参数;2)当码本大小被控制在合理范围内时,基于欧氏距离的量化可能导致音频失真。事实上,在信息压缩领域,结构信息和熵引导至关重要,但以往方法在很大程度上忽略了这些因素。因此,我们从信息论的角度出发解决上述问题,提出了SECodec,一种基于结构熵(SE)的新型语音表示编解码器,用于构建语音语言模型。具体来说,我们首先将语音建模为一个图,对图中的语音特征节点进行聚类,并通过分层、解耦地最小化二维SE来提取相应的码本。然后,为了解决音频失真问题,我们提出了一种新的量化方法。该方法仍然遵循二维SE最小化原则,为每个输入的原始语音节点自适应地选择与其所属簇对应的最合适令牌。此外,我们开发了一个利用SECodec的基于结构熵的语音语言模型(SESLM)。实验结果表明,SECodec在语音重建中的表现与EnCodec相当,SESLM在zero-shot文本到语音任务中超越了VALL-E。代码、演示语音、语音特征图、SE码本和模型可在https://github.com/wlq2019/SECodec上获得。
摘要:With the rapid advancement of large language models (LLMs), discrete speech representations have become crucial for integrating speech into LLMs. Existing methods for speech representation discretization rely on a predefined codebook size and Euclidean distance-based quantization. However, 1) the size of codebook is a critical parameter that affects both codec performance and downstream task training efficiency. 2) The Euclidean distance-based quantization may lead to audio distortion when the size of the codebook is controlled within a reasonable range. In fact, in the field of information compression, structural information and entropy guidance are crucial, but previous methods have largely overlooked these factors. Therefore, we address the above issues from an information-theoretic perspective, we present SECodec, a novel speech representation codec based on structural entropy (SE) for building speech language models. Specifically, we first model speech as a graph, clustering the speech features nodes within the graph and extracting the corresponding codebook by hierarchically and disentangledly minimizing 2D SE. Then, to address the issue of audio distortion, we propose a new quantization method. This method still adheres to the 2D SE minimization principle, adaptively selecting the most suitable token corresponding to the cluster for each incoming original speech node. Furthermore, we develop a Structural Entropy-based Speech Language Model (SESLM) that leverages SECodec. Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks. Code, demo speeches, speech feature graph, SE codebook, and models are available at https://github.com/wlq2019/SECodec.
标题: 变化声环境下房间脉冲响应的灵敏度
链接:https://arxiv.org/abs/2501.01206
摘要:室内声学的变化,例如表面吸收的改变或散射物体的插入,会显著影响测得的房间脉冲响应(RIR)。这些变化可能影响回声消除和主动声学系统的性能,并支持导航和目标跟踪等任务。因此,识别和量化这些变化对于推进基于室内声学的技术至关重要。本研究介绍了一种通过评估连续记录的RIR之间的相似性来分析声环境变化的方法。采用短时相干性来表征各种改变,包括墙壁吸收的变化或房间中有人移动。进一步使用敏感度评级来量化这些变化的幅度。结果清楚地区分了不同类型的变化——大气变化、吸收变化和人的存在。所描述的方法为分析和解释房间声学提供了一种新途径,强调RIR相似性以及从时域和频域信号属性中提取信息。
摘要:Changes in room acoustics, such as modifications to surface absorption or the insertion of a scattering object, significantly impact measured room impulse responses (RIRs). These changes can affect the performance of systems used in echo cancellation and active acoustics and support tasks such as navigation and object tracking. Recognizing and quantifying such changes is, therefore, critical for advancing technologies based on room acoustics. This study introduces a method for analyzing acoustic environment changes by evaluating the similarity of consecutively recorded RIRs. Short-time coherence is employed to characterize modifications, including changes in wall absorption or the presence of a moving person in the room. A sensitivity rating is further used to quantify the magnitude of these changes. The results clearly differentiate between types of modifications -- atmospheric variation, changes in absorption, and human presence. The methods described provide a novel approach to analyzing and interpreting room acoustics, emphasizing RIR similarity and extracting information from temporal and spectral signal properties.
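下面示意"对连续记录的两条RIR逐帧计算短时相干性"的做法:分帧后用scipy计算幅值平方相干谱,再对频率取平均。RIR用指数衰减噪声模拟,帧长、重叠等参数均为示例设置,并非论文的具体配置。

```python
# 示意: 对两条 RIR 逐帧计算短时相干性, 得到随时间变化的相似度曲线。
import numpy as np
from scipy.signal import coherence

fs = 48000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
rir1 = rng.normal(size=fs) * np.exp(-6 * t)          # 用指数衰减噪声模拟 RIR
rir2 = rir1 + 0.05 * rng.normal(size=fs)             # 模拟房间发生微小变化后的 RIR

frame, hop = 2048, 1024
sim = []
for start in range(0, fs - frame, hop):
    f, Cxy = coherence(rir1[start:start + frame], rir2[start:start + frame],
                       fs=fs, nperseg=512)
    sim.append(Cxy.mean())                            # 频率平均的相干性, 取值 0~1
print(np.round(sim[:5], 3))
```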
标题: 利用中心损失从谱图中学习区分特征进行语音情感识别
链接:https://arxiv.org/abs/2501.01103
备注:Accepted at ICASSP 2019
摘要:从语音中识别情绪状态对于机器与说话者的自然交互至关重要。然而,提取有效的情感识别特征是困难的,因为情感是模糊的。我们提出了一种新的方法,通过将softmax交叉熵损失和中心损失结合起来,从可变长度谱图中学习区分特征以进行情感识别。softmax交叉熵损失使不同情感类别的特征可分离,中心损失有效地将属于同一情感类别的特征拉到它们的中心。通过将这两种损失结合在一起,区分能力将大大增强,从而使网络学习更有效的情感识别特征。实验结果表明,引入中心损失后,在Mel谱输入下,未加权精度和加权精度均提高了3%以上,在短时傅里叶变换谱输入下,提高了4%以上。
摘要:Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker. However, extracting effective features for emotion recognition is difficult, as emotions are ambiguous. We propose a novel approach to learn discriminative features from variable length spectrograms for emotion recognition by cooperating softmax cross-entropy loss and center loss together. The softmax cross-entropy loss enables features from different emotion categories separable, and center loss efficiently pulls the features belonging to the same emotion category to their center. By combining the two losses together, the discriminative power will be highly enhanced, which leads to network learning more effective features for emotion recognition. As demonstrated by the experimental results, after introducing center loss, both the unweighted accuracy and weighted accuracy are improved by over 3\% on Mel-spectrogram input, and more than 4\% on Short Time Fourier Transform spectrogram input.
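下面给出softmax交叉熵损失与中心损失联合训练的最小PyTorch示意。中心损失把同类特征拉向可学习的类中心;此处的特征提取网络、权重系数均为占位设置,仅说明两种损失的组合方式。

```python
# 示意: softmax 交叉熵损失 + 中心损失的联合训练(网络与超参为占位)。
import torch, torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
    def forward(self, feats, labels):
        # 将样本特征拉向其类别中心
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean() / 2

num_classes, feat_dim = 4, 128
backbone = nn.Sequential(nn.Linear(40, feat_dim), nn.ReLU())   # 占位的特征提取网络
classifier = nn.Linear(feat_dim, num_classes)
center_loss = CenterLoss(num_classes, feat_dim)
ce_loss = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(backbone.parameters()) + list(classifier.parameters())
                       + list(center_loss.parameters()), lr=1e-3)

x = torch.randn(16, 40)                  # 用随机向量代替谱图特征
y = torch.randint(0, num_classes, (16,))
feats = backbone(x)
loss = ce_loss(classifier(feats), y) + 0.01 * center_loss(feats, y)  # 0.01 为示例权重
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```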
标题: 基于预训练BERT提取语义特征的端到端框架中的中文多音字消歧
链接:https://arxiv.org/abs/2501.01102
备注:Accepted at INTERSPEECH 2019
摘要:字音转换(G2P)是汉语普通话文语转换(TTS)系统的重要组成部分,其中多音字消歧是核心问题。在本文中,我们提出了一个端到端框架来预测多音字的读音,它以汉字序列形式接受包含多音字的句子作为输入,无需任何预处理。该方法由一个预训练的BERT(基于Transformer的双向编码器表示)模型和一个基于神经网络(NN)的分类器组成。预训练的BERT模型从原始汉字序列中提取语义特征,基于神经网络的分类器根据BERT的输出预测多音字的读音。在实验中,我们实现了三种分类器:基于全连接网络的分类器、基于长短期记忆(LSTM)网络的分类器和基于Transformer块的分类器。与基于LSTM的基线方法相比,实验结果表明预训练模型提取了有效的语义特征,大大提高了多音字消歧的性能。此外,我们还探讨了上下文信息对多音字消歧的影响。
摘要:Grapheme-to-phoneme (G2P) conversion serves as an essential component in Chinese Mandarin text-to-speech (TTS) system, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts sentence containing polyphonic character as input in the form of Chinese character sequence without the necessity of any preprocessing. The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from a raw Chinese character sequence and the NN based classifier predicts the polyphonic character's pronunciation according to BERT output. In out experiments, we implemented three classifiers, a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier and a Transformer block based classifier. The experimental results compared with the baseline approach based on LSTM demonstrate that, the pre-trained model extracts effective semantic features, which greatly enhances the performance of polyphone disambiguation. In addition, we also explored the impact of contextual information on polyphone disambiguation.
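下面给出"预训练BERT特征+神经网络分类器"这一框架的最小示意:取多音字所在位置的BERT隐状态,送入线性分类头在候选读音中做选择。候选读音表、示例句子与线性分类头均为本示例的假设,分类头未经训练。

```python
# 示意: 用中文 BERT 提取语义特征, 再用线性分类头预测多音字读音(分类头为未训练的占位)。
import torch, torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

candidates = ["zhong1", "zhong4"]            # "中"的候选读音(示例)
classifier = nn.Linear(bert.config.hidden_size, len(candidates))

text, char_pos = "他一枪就打中了目标", 5        # 多音字"中"在句中的位置(从 0 起)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state      # (1, seq_len, hidden)
char_feat = hidden[0, char_pos + 1]                # +1 跳过 [CLS]; 中文 BERT 按单字切分
logits = classifier(char_feat)
print(candidates[int(logits.argmax())])
```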
标题: SLIDE:将语音语言模型与LLM集成,以实现自发口语对话生成
链接:https://arxiv.org/abs/2501.00805
备注:Accepted by ICASSP 2025
摘要:近年来,基于语音单元的"无文本"语音语言模型(SLM)在生成自然语音(包括非言语发声)方面取得了巨大进展。然而,生成的语音样本往往缺乏语义连贯性。在本文中,我们提出了用于自发口语对话生成的SLM与LLM集成方法(SLIDE)。具体来说,我们首先利用LLM生成口语对话的文本内容;接下来,我们将文本对话转换为音素序列,并使用基于双塔Transformer的时长预测器来预测每个音素的时长;最后,使用以音素序列为条件的SLM将文本对话转换为语音。Fisher数据集上的实验结果表明,我们的系统可以在保持高度语义连贯性的同时生成自然的口语对话。
摘要:Recently, ``textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
标题: VoiceRestore:用于语音录音质量恢复的流匹配Transformer
链接:https://arxiv.org/abs/2501.00794
摘要:我们提出了VoiceRestore,一种恢复语音录音质量的新方法,它使用以自监督方式在合成数据上训练的流匹配Transformer。我们的方法在单个统一模型中解决了短时和长时语音录音中常见的多种退化,包括背景噪声、混响、压缩伪影和带宽限制。利用条件流匹配和无分类器引导,该模型学习将退化语音映射到高质量录音,而无需配对的干净与退化数据集。我们描述了训练过程、条件流匹配框架和模型架构,还展示了该模型在真实世界语音恢复任务(包括短语音以及较长的独白或对话)上的泛化能力。定性和定量评估表明,我们的方法为提升不同长度和不同退化类型的语音录音质量提供了灵活而有效的解决方案。
摘要:We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long-form speech recordings, including background noise, reverberation, compression artifacts, and bandwidth limitations - all within a single, unified model. Leveraging conditional flow matching and classifier free guidance, the model learns to map degraded speech to high quality recordings without requiring paired clean and degraded datasets. We describe the training process, the conditional flow matching framework, and the model's architecture. We also demonstrate the model's generalization to real-world speech restoration tasks, including both short utterances and extended monologues or dialogues. Qualitative and quantitative evaluations show that our approach provides a flexible and effective solution for enhancing the quality of speech recordings across varying lengths and degradation types.
标题: 解决语音认知障碍检测问题:提交给PROCESS挑战赛的方案
链接:https://arxiv.org/abs/2501.00145
摘要:这项工作描述了我们小组提交给PROCESS Challenge 2024的方案,其目标是利用三个有引导的临床任务,通过自发言语评估认知衰退。这项联合工作采用整体性方法,既包括基于知识的声学特征集和文本特征集,也包括基于LLM的宏观语言学描述符、基于停顿的声学生物标志物以及多种神经表示(例如LongFormer、ECAPA-TDNN和Trillson嵌入)。将这些特征集与不同的分类器相结合产生了大量模型,我们从中选择了在训练集、开发集以及各类别性能之间取得最佳平衡的模型。结果表明,我们表现最好的系统对应于相互补充的模型组合,它们依赖于来自全部三个临床任务的声学和文本信息。
摘要:This work describes our group's submission to the PROCESS Challenge 2024, with the goal of assessing cognitive decline through spontaneous speech, using three guided clinical tasks. This joint effort followed a holistic approach, encompassing both knowledge-based acoustic and text-based feature sets, as well as LLM-based macrolinguistic descriptors, pause-based acoustic biomarkers, and multiple neural representations (e.g., LongFormer, ECAPA-TDNN, and Trillson embeddings). Combining these feature sets with different classifiers resulted in a large pool of models, from which we selected those that provided the best balance between train, development, and individual class performance. Our results show that our best performing systems correspond to combinations of models that are complementary to each other, relying on acoustic and textual information from all three clinical tasks.
标题: DiCoW:用于目标说话人自动语音识别的说话人分割条件化Whisper
链接:https://arxiv.org/abs/2501.00114
摘要:多说话人环境中的说话人归属自动语音识别(ASR)仍然是一个重大挑战,特别是当以说话人嵌入为条件的系统无法泛化到未见过的说话人时。在这项工作中,我们提出了说话人分割条件化Whisper(DiCoW),这是一种将说话人分割(diarization)输出用作条件信息的目标说话人ASR新方法。DiCoW通过直接集成分割标签来扩展预训练的Whisper模型,消除了对说话人嵌入的依赖,并减少了对大量特定说话人训练数据的需求。我们的方法引入了帧级分割相关变换(FDDT)和查询-键偏置(QKb)技术,在有效处理重叠语音的同时,使模型更好地聚焦于目标说话人。通过利用分割输出作为条件信号,DiCoW简化了多说话人ASR的工作流程,提高了对未见说话人的泛化能力,并使真实世界多说话人录音的转写更加可靠。此外,我们探索了在Whisper中集成连接时序分类(CTC)头,并证明其能够通过混合解码提高转写效率。值得注意的是,我们的方法并不局限于Whisper:应用于Branchformer模型时也带来类似收益。我们在真实世界数据集上验证了DiCoW,包括来自CHiME-8挑战的AMI和NOTSOFAR-1,以及Libri2Mix和LibriCSS等合成基准,从而可与以往方法直接比较。结果表明,DiCoW增强了模型的目标说话人ASR能力,同时保持了Whisper在单说话人数据上的准确性和鲁棒性。
摘要:Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head to Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.
标题: 使用强化学习使LLM语音识别适应言语障碍语音
链接:https://arxiv.org/abs/2501.00039
备注:Accepted at ICASSP 2025
摘要:我们介绍了一个能够处理语音输入的大型语言模型(LLM),并表明通过基于人类偏好的强化学习(RLHF)对其进行进一步调整,可以使其比传统微调更好地适应言语障碍语音。我们的方法用音频令牌替换LLM词表中的低频文本令牌,并通过在带转写文本的语音上进行微调使模型能够识别语音。然后,我们使用以句法和语义准确性度量为奖励的强化学习,使LLM进一步泛化到识别言语障碍语音。虽然由此得到的LLM并没有优于现有的语音识别系统,但我们发现,使用自定义奖励的强化学习调整比语言模型的监督微调带来显著更好的性能,特别是在适应不同环境下的语音时。这为使用大型语言模型的语音识别提供了一种有吸引力的替代调优策略。
摘要:We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.
标题: VoiceVector:用于说话人分离的多模态注册向量
链接:https://arxiv.org/abs/2501.01401
摘要:我们提出了一种基于Transformer的架构,用于从多个其他说话人和环境噪声中分离出目标说话人的语音。我们通过两个独立的神经网络来实现这一点:(A)一个注册网络,利用音频与视觉模态的各种组合生成特定于说话人的嵌入;(B)一个分离网络,接受含噪信号和注册向量作为输入,输出目标说话人的干净信号。创新点在于:(i)注册向量可以由以下数据生成:仅音频、视听数据(利用唇部运动)或仅视觉数据(利用无声视频中的唇部运动);(ii)可以灵活地以多个正、负注册向量为条件进行分离。我们与以往方法进行了比较,并取得了更优的性能。
摘要:We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) A separation network that accepts both the noisy signal and enrolment vectors as inputs, outputting the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from: audio only, audio-visual data (using lip movements) or visual data alone (using lip movements from silent video); and (ii) the flexibility in conditioning the separation on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.
标题: 到达时间差源定位:一般3D问题的精确线性解
链接:https://arxiv.org/abs/2501.01076
备注:5 pages, 2 figures
摘要:当存在4个或5个传感器且需要在三维空间中确定单个声源位置时,到达时间差(TDOA)问题存在精确的、纯代数的解。这些解的精确性体现在求解过程中不涉及最小二乘运算(即投影),也不涉及线性化或迭代,并且通过笛卡尔坐标下的向量代数在代数上完全透明。5传感器解不需要解决符号二义性;4传感器解需要解决一个符号二义性。求解仅使用TDOA,而不使用例如到达频率差(FDOA)或到达角(AOA)。我们首先给出5传感器解,然后给出4传感器方案,随后通过数值实验展示无噪声情况下的计算性能,最后给出结论。计算性能在数值误差范围内是精确的;在少数未能成功定位声源的情况中,失败原因是在没有先验信息时符号二义性被错误判定。因此,我们认为这些计算方法凭借其速度与精确性具有很大的实用价值。
摘要:The time difference of arrival (TDOA) problem admits exact, purely algebraic solutions for the situation in which there are 4 and 5 sensors and a single source whose position is to be determined in 3 dimensions. The solutions are exact in the sense that there is no least squares operation (i.e., projection) involved in the solution. The solutions involve no linearization or iteration, and are algebraically transparent via vector algebra in Cartesian coordinates. The solution with 5 sensors requires no resolution of sign ambiguities; the solution with 4 sensors requires resolution of one sign ambiguity. Solutions are effected using only TDOA and not, e.g., frequency difference of arrival (FDOA) or angle of arrival (AOA). We first present the 5-sensor solution and then follow with the 4-sensor scenario. Numerical experiments are presented showing the performance of the calculations in the case of no noise, before closing with conclusions. Performance of the calculations is exact within numerical error, and in the small fraction of cases in which source localization does not occur, it is driven by misidentification in resolution of sign ambiguity without priors. We therefore believe the calculations have substantial practical utility for their speed and exactness.
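下面用一个简短的NumPy片段说明TDOA的测量模型(并非论文中的闭式代数解法):对声速c与传感器位置s_i,声源x产生的到达时间差为 tau_ij = (|x - s_i| - |x - s_j|) / c;其中的传感器几何与声源位置均为示例假设。

```python
# A small illustration (not the paper's closed-form solver) of the TDOA
# measurement model used as input to such localization methods.
import numpy as np

c = 343.0                                    # speed of sound in m/s
sensors = np.array([[0, 0, 0],
                    [1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 1, 1]], dtype=float)  # 5 sensors (example geometry)
source = np.array([0.3, 0.7, 0.2])           # unknown position to be recovered

ranges = np.linalg.norm(sensors - source, axis=1)
# TDOAs relative to sensor 0; these are the only measurements the paper uses.
tdoas = (ranges[1:] - ranges[0]) / c
print(tdoas)
```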
标题: SLIDE:将语音语言模型与LLM集成,以实现自发口语对话生成
链接:https://arxiv.org/abs/2501.00805
备注:Accepted by ICASSP 2025
摘要:近年来,基于语音单元的"无文本"语音语言模型(SLM)在生成自然语音(包括非言语发声)方面取得了巨大进展。然而,生成的语音样本往往缺乏语义连贯性。在本文中,我们提出了用于自发口语对话生成的SLM与LLM集成方法(SLIDE)。具体来说,我们首先利用LLM生成口语对话的文本内容;接着将文本对话转换为音素序列,并使用基于双塔Transformer的时长预测器预测每个音素的时长;最后,以语音音素序列为条件的SLM将文本对话转换为语音。在Fisher数据集上的实验结果表明,我们的系统能够生成自然的口语对话,同时保持高度的语义连贯性。
摘要:Recently, ``textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
标题: 文本发音相关自动生成及其在上下文偏置中的应用
链接:https://arxiv.org/abs/2501.00804
备注:Accepted by ICASSP 2025
摘要:有效区分不同书面文本之间的发音相关性是语言声学中的一个重要课题。传统上,这种发音相关性通过人工设计的发音词典获得。在本文中,我们提出了一种自动获取这些发音相关性的数据驱动方法,称为自动文本发音相关(ATPC)。该方法所需的监督与训练端到端自动语音识别(E2E-ASR)系统所需的监督一致,即语音及其对应的文本标注。首先,采用迭代训练的时间戳估计器(ITSE)算法将语音与其对应的标注文本符号对齐;然后,使用语音编码器将语音转换为语音嵌入;最后,我们比较不同文本符号的语音嵌入距离以获得ATPC。普通话上的实验结果表明,ATPC提升了E2E-ASR在上下文偏置方面的性能,并为缺乏人工发音词典的方言或语言带来了希望。
摘要:Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). The supervision required for this method is consistent with the supervision needed for training end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with their corresponding annotated text symbols. Then, a speech encoder is used to convert the speech into speech embeddings. Finally, we compare the speech embeddings distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking artificial pronunciation lexicons.
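下面是一个示意性片段(其中的对齐结果与嵌入均为随机占位数据,以余弦距离作为度量只是一种假设),演示摘要最后一步"比较不同文本符号的语音嵌入距离"的做法。

```python
# A hedged sketch of the final step described above: average the frame embeddings
# attributed to each text symbol, then compare symbols by cosine distance.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical alignment output: symbol -> frame embeddings assigned to it.
symbol_frames = {
    "吗": rng.normal(size=(30, 256)),
    "妈": rng.normal(size=(28, 256)),
    "书": rng.normal(size=(25, 256)),
}
centroids = {s: f.mean(axis=0) for s, f in symbol_frames.items()}

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

symbols = list(centroids)
for i in range(len(symbols)):
    for j in range(i + 1, len(symbols)):
        d = cosine_distance(centroids[symbols[i]], centroids[symbols[j]])
        print(symbols[i], symbols[j], round(d, 3))
```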
标题: VoiceRestore:用于语音录音质量恢复的流匹配Transformer
链接:https://arxiv.org/abs/2501.00794
摘要:我们提出了VoiceRestore,一种恢复语音录音质量的新方法,它使用以自监督方式在合成数据上训练的流匹配Transformer。我们的方法在单一统一模型中处理短时和长时语音录音中常见的各类退化,包括背景噪声、混响、压缩伪影和带宽限制。借助条件流匹配和无分类器引导,模型学习将退化语音映射到高质量录音,而无需成对的干净与退化数据集。我们描述了训练过程、条件流匹配框架以及模型架构,并展示了该模型在真实世界语音恢复任务中的泛化能力,包括短话语以及较长的独白或对话。定性和定量评估表明,我们的方法为提升不同长度和退化类型的语音录音质量提供了灵活而有效的解决方案。
摘要:We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long-form speech recordings, including background noise, reverberation, compression artifacts, and bandwidth limitations - all within a single, unified model. Leveraging conditional flow matching and classifier free guidance, the model learns to map degraded speech to high quality recordings without requiring paired clean and degraded datasets. We describe the training process, the conditional flow matching framework, and the model's architecture. We also demonstrate the model's generalization to real-world speech restoration tasks, including both short utterances and extended monologues or dialogues. Qualitative and quantitative evaluations show that our approach provides a flexible and effective solution for enhancing the quality of speech recordings across varying lengths and degradation types.
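下面给出一个通用的"流匹配采样中使用无分类器引导"的示意片段(并非VoiceRestore的实现):对速度场分别做有条件与无条件预测,再按引导权重混合;模型用一个玩具函数代替,引导权重w和步数均为假设值。

```python
# A generic sketch of classifier-free guidance in a flow-matching sampler
# (not the authors' implementation).
import torch

def guided_velocity(model, x, t, cond, w: float = 2.0):
    """Mix conditional and unconditional velocity predictions (CFG)."""
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, None)            # condition dropped, as during CFG training
    return v_uncond + w * (v_cond - v_uncond)

def euler_sample(model, cond, shape, steps: int = 32, w: float = 2.0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (clean estimate)."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * guided_velocity(model, x, t, cond, w)
    return x

# Toy stand-in model so the sketch runs end to end.
model = lambda x, t, cond: -x if cond is None else -0.5 * x
restored = euler_sample(model, cond=torch.zeros(1), shape=(1, 80, 100))
print(restored.shape)
```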
标题: 基于语音的认知障碍检测:提交至PROCESS挑战赛的方案
链接:https://arxiv.org/abs/2501.00145
摘要:这项工作描述了我们小组提交的PROCESS Challenge 2024,其目标是使用三个指导性临床任务通过自发言语评估认知下降。这项联合努力遵循了一种整体方法,包括基于知识的声学和基于文本的特征集,以及基于LLM的宏观语言学描述符,基于停顿的声学生物标志物和多种神经表示(例如,LongFormer、ECAPA-TDNN和Trillson嵌入)。将这些特征集与不同的分类器相结合,产生了大量的模型,我们从中选择了那些在训练、开发和单个类性能之间提供最佳平衡的模型。我们的研究结果表明,我们表现最好的系统对应于相互补充的模型组合,依赖于所有三个临床任务的声学和文本信息。
摘要:This work describes our group's submission to the PROCESS Challenge 2024, with the goal of assessing cognitive decline through spontaneous speech, using three guided clinical tasks. This joint effort followed a holistic approach, encompassing both knowledge-based acoustic and text-based feature sets, as well as LLM-based macrolinguistic descriptors, pause-based acoustic biomarkers, and multiple neural representations (e.g., LongFormer, ECAPA-TDNN, and Trillson embeddings). Combining these feature sets with different classifiers resulted in a large pool of models, from which we selected those that provided the best balance between train, development, and individual class performance. Our results show that our best performing systems correspond to combinations of models that are complementary to each other, relying on acoustic and textual information from all three clinical tasks.
标题: DiCoW:用于目标说话人自动语音识别的说话人日志条件Whisper
链接:https://arxiv.org/abs/2501.00114
摘要:多说话人环境中的说话人归属自动语音识别(ASR)仍然是一个重大挑战,特别是当以说话人嵌入为条件的系统无法泛化到未见过的说话人时。在这项工作中,我们提出了说话人日志条件Whisper(DiCoW),这是一种利用说话人日志输出作为条件信息的目标说话人ASR新方法。DiCoW通过直接集成日志标签来扩展预训练的Whisper模型,消除了对说话人嵌入的依赖,并减少了对大量说话人特定训练数据的需求。我们的方法引入了帧级日志相关变换(FDDT)和查询-键偏置(QKb)技术,以强化模型对目标说话人的关注,同时有效处理重叠语音。通过将日志输出用作条件信号,DiCoW简化了多说话人ASR的工作流程,提升了对未见说话人的泛化能力,并使真实世界多说话人录音的转录更加可靠。此外,我们探索了在Whisper中集成连接时序分类(CTC)头,并证明其能够通过混合解码提高转录效率。值得注意的是,我们表明该方法并不局限于Whisper:将其应用于Branchformer模型时也能带来类似的收益。我们在真实世界数据集(包括CHiME-8挑战赛的AMI和NOTSOFAR-1)以及Libri2Mix和LibriCSS等合成基准上验证了DiCoW,从而可以与以前的方法直接比较。结果表明,DiCoW增强了模型的目标说话人ASR能力,同时保持了Whisper在单说话人数据上的准确性和鲁棒性。
摘要:Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head to Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.
标题: 使用强化学习使LLM语音识别适应无序语音
链接:https://arxiv.org/abs/2501.00039
备注:Accepted at ICASSP 2025
摘要:我们介绍了一个能够处理语音输入的大型语言模型(LLM),并表明通过基于人类偏好的强化学习(RLHF)对其进一步调优,使其比传统微调更好地适应无序语音。我们的方法用音频令牌替换LLM词表中的低频文本令牌,并通过在带转录文本的语音上微调使模型能够识别语音。随后,我们使用以句法和语义准确性度量为奖励的强化学习,使LLM进一步泛化以识别无序语音。虽然最终得到的LLM并没有优于现有的语音识别系统,但我们发现,使用自定义奖励的强化学习调优比对语言模型进行监督微调带来了明显更好的性能,特别是在适应不同环境下的语音时。这为使用大型语言模型进行语音识别提供了一种有吸引力的替代调优策略。
摘要:We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.
标题: OmniChat:利用可扩展的合成数据增强语音对话系统,以适应不同场景
链接:https://arxiv.org/abs/2501.01384
摘要:随着大型语言模型的快速发展,研究人员已经创建了越来越先进的口语对话系统,可以自然地与人类交谈。然而,这些系统仍然难以处理真实世界对话的全部复杂性,包括音频事件,音乐上下文和情感表达,主要是因为当前的对话数据集在规模和场景多样性方面都受到限制。在本文中,我们建议利用合成数据来增强跨不同场景的对话模型。我们介绍了ShareChatX,这是第一个全面的,大规模的口语对话数据集,涵盖了不同的场景。基于这个数据集,我们介绍了OmniChat,一个多回合的对话系统,具有异构特征融合模块,旨在优化不同对话环境中的特征选择。此外,我们还探讨了使用合成数据训练对话系统的关键方面。通过全面的实验,我们确定了合成数据和真实数据之间的理想平衡,在真实世界的对话数据集DailyTalk上实现了最先进的结果。我们还强调了合成数据在处理多样化、复杂的对话场景中的至关重要性,特别是涉及音频和音乐的场景。有关详细信息,请访问我们的演示页面,网址为\url{https://sharechatx.github.io/}。
摘要:With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at \url{https://sharechatx.github.io/}.
标题: AdaptVC:具有自适应学习的高质量语音转换
链接:https://arxiv.org/abs/2501.01347
备注:4 pages, 3 figures. Audio samples are available in the demo page: this https URL
摘要:语音转换的目标是将源说话人的语音转换为参考说话人的语音,同时保留原始内容。一个关键的挑战是从源中提取分离的语言内容和从参考中提取语音风格。虽然现有的方法利用各种方法来隔离两者,但仍需要进一步关注泛化,特别是对于zero-shot场景中的鲁棒性。在本文中,我们实现了成功的解开的内容和扬声器功能调整自监督语音功能与适配器。适配器经过训练,可以从丰富的自监督特征中动态编码细微差别的特征,解码器将它们融合,以产生与参考准确相似的语音,同时将内容损失降至最低。此外,我们利用具有交叉注意扬声器条件反射的条件流匹配解码器来进一步提高合成质量和效率。在zero-shot场景下的主观和客观评价表明,所提出的方法优于现有的模型在语音质量和相似度的参考语音。
摘要:The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, a generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
标题: RingFormer:具有环注意力和卷积增强Transformer的神经声码器
链接:https://arxiv.org/abs/2501.01182
摘要:虽然Transformers在各种音频任务中表现出出色的性能,但它们在神经声码器中的应用仍然具有挑战性。神经声码器需要在样本级生成长音频信号,这需要高的时间分辨率。这导致了注意力地图生成的显著计算成本,并限制了它们有效处理全局和局部信息的能力。此外,神经声码器中样本生成的顺序性质给实时处理带来了困难,使得直接采用Transformers不切实际。为了解决这些挑战,我们提出了RingFormer,一个神经声码器,它将环注意力机制纳入一个轻量级的Transformer变体,卷积增强的Transformer(Conformer)。Ring attention在整合全局信息的同时有效地捕捉局部细节,使其非常适合处理长序列并实现实时音频生成。RingFormer使用具有两个鉴别器的对抗训练进行训练。该模型应用于文本到语音模型VITS的解码器,并与最先进的声码器,如HiFi-GAN,iSTFT-Net和BigVGAN在相同的条件下,使用各种客观和主观的指标进行比较。实验结果表明,RingFormer实现了与现有模型相当或更高的性能,特别是在实时音频生成方面表现出色。我们的代码和音频示例可以在GitHub上找到。
摘要:While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real-time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution-augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well-suited for processing long sequences and enabling real-time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text-to-speech model VITS and compared with state-of-the-art vocoders such as HiFi-GAN, iSTFT-Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real-time audio generation. Our code and audio samples are available on GitHub.
标题: 使用深度神经决策树和森林从咳嗽声音中稳健地检测COVID-19:全面的跨数据集评估
链接:https://arxiv.org/abs/2501.01117
备注:39 pages
摘要:这项研究提出了一种使用尖端机器学习技术对COVID-19咳嗽声进行分类的稳健方法。利用深度神经决策树和深度神经决策森林,我们的方法在不同的咳嗽声音数据集上表现出一致的性能。我们从全面提取特征开始,以捕捉来自个人的广泛音频特征,无论是COVID-19阳性还是阴性。为了确定最重要的特征,我们使用递归特征消除和交叉验证。贝叶斯优化微调深度神经决策树和深度神经决策森林模型的超参数。此外,我们在训练过程中集成了SMOTE,以确保正面和负面数据的平衡表示。通过阈值优化实现模型性能优化,最大化ROC-AUC评分。我们的方法在五个数据集上进行了全面的评估:Cambridge,Coswara,COUGHVID,Virufy以及Virufy与NoCoCoDa数据集的组合。与最先进的方法相比,我们提出的方法在各个数据集上产生了显著的AUC分数,分别为0.97、0.98、0.92、0.93、0.99和0.99。将所有数据集合并为一个组合数据集,我们的方法使用深度神经决策森林分类器,实现了0.97的AUC。此外,我们的研究还包括全面的跨数据集分析,揭示了与COVID-19相关的咳嗽声的人口统计学和地理差异。这些差异凸显了在不同数据集之间转移学习特征的挑战,并强调了数据集集成的潜在好处,提高了可推广性并增强了从音频信号中检测COVID-19的能力。
摘要:This research presents a robust approach to classifying COVID-19 cough sounds using cutting-edge machine-learning techniques. Leveraging deep neural decision trees and deep neural decision forests, our methodology demonstrates consistent performance across diverse cough sound datasets. We begin with a comprehensive extraction of features to capture a wide range of audio features from individuals, whether COVID-19 positive or negative. To determine the most important features, we use recursive feature elimination along with cross-validation. Bayesian optimization fine-tunes hyper-parameters of deep neural decision tree and deep neural decision forest models. Additionally, we integrate the SMOTE during training to ensure a balanced representation of positive and negative data. Model performance refinement is achieved through threshold optimization, maximizing the ROC-AUC score. Our approach undergoes a comprehensive evaluation in five datasets: Cambridge, Coswara, COUGHVID, Virufy, and the combined Virufy with the NoCoCoDa dataset. Consistently outperforming state-of-the-art methods, our proposed approach yields notable AUC scores of 0.97, 0.98, 0.92, 0.93, 0.99, and 0.99 across the respective datasets. Merging all datasets into a combined dataset, our method, using a deep neural decision forest classifier, achieves an AUC of 0.97. Also, our study includes a comprehensive cross-datasets analysis, revealing demographic and geographic differences in the cough sounds associated with COVID-19. These differences highlight the challenges in transferring learned features across diverse datasets and underscore the potential benefits of dataset integration, improving generalizability and enhancing COVID-19 detection from audio signals.
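下面用scikit-learn/imbalanced-learn给出摘要中预处理与评估流程的示意(以随机森林代替深度神经决策森林,特征为合成数据,阈值选择用Youden指数作为一种假设的实现方式),仅作流程草图,并非论文原始实现。

```python
# A hedged sketch of the recipe described above: RFE with cross-validation,
# SMOTE balancing, training, and decision-threshold selection on the ROC curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=600, n_features=40, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Recursive feature elimination with cross-validation.
selector = RFECV(RandomForestClassifier(n_estimators=100, random_state=0), cv=3)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# 2) Balance the classes with SMOTE before training.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)

# 3) Train (random forest stands in for the deep neural decision forest) and
#    pick the decision threshold maximizing Youden's J on the ROC curve.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
probs = clf.predict_proba(X_te_sel)[:, 1]
fpr, tpr, thr = roc_curve(y_te, probs)
best_thr = thr[np.argmax(tpr - fpr)]
print("AUC:", round(roc_auc_score(y_te, probs), 3), "threshold:", round(best_thr, 3))
```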
标题: MuQ:使用Mel残差矢量量化的自监督音乐表示学习
链接:https://arxiv.org/abs/2501.01108
摘要:近年来,采用自监督学习(SSL)预训练的基础模型在音乐标注、乐器分类、调性检测等各类音乐信息理解任务中取得了成功。在本文中,我们提出了一个面向音乐理解的自监督音乐表示学习模型。与以往采用随机投影或现成神经编解码器的研究不同,我们提出的模型MuQ被训练用于预测由Mel残差矢量量化(Mel-RVQ)生成的令牌。我们的Mel-RVQ采用残差线性投影结构对梅尔频谱进行量化,以提高目标提取的稳定性和效率,从而带来更好的性能。在大量下游任务上的实验表明,仅使用0.9K小时的开源预训练数据,MuQ就优于以往的自监督音乐表示模型;将数据扩展到超过16万小时并采用迭代训练,可以持续提升模型性能。为了进一步验证模型的能力,我们提出了MuQ-MuLan,一个基于对比学习的音乐-文本联合嵌入模型,它在MagnaTagATune数据集上的zero-shot音乐标注任务中取得了最先进的性能。代码和检查点已在https://github.com/tencent-ailab/MuQ开源。
摘要:Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
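下面是一个通用的残差矢量量化(RVQ)示意片段,用于说明上文所述"预测Mel-RVQ生成的令牌"中离散目标的形态;论文中Mel-RVQ采用的残差线性投影结构此处未复现,码本与梅尔特征均为随机占位数据。

```python
# A minimal, generic residual vector quantization (RVQ) sketch over mel frames.
import numpy as np

rng = np.random.default_rng(0)
n_mels, frames, n_quantizers, codebook_size = 80, 200, 4, 256
mel = rng.normal(size=(frames, n_mels))                 # placeholder mel features
codebooks = rng.normal(size=(n_quantizers, codebook_size, n_mels))

def rvq_encode(x, codebooks):
    """Quantize each frame in stages; each stage encodes the remaining residual."""
    residual, tokens = x.copy(), []
    for cb in codebooks:
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)                      # token id per frame for this stage
        tokens.append(idx)
        residual = residual - cb[idx]                   # pass the residual to the next stage
    return np.stack(tokens, axis=0)                     # (n_quantizers, frames)

tokens = rvq_encode(mel, codebooks)
print(tokens.shape, tokens[:, :5])
```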
标题: FAST:快速音频频谱图Transformer
链接:https://arxiv.org/abs/2501.01104
备注:Accepted at ICASSP 2025
摘要:在音频分类中,开发高效和鲁棒的模型对于实时应用至关重要。受MobileViT设计原则的启发,我们提出了FAST(快速音频频谱图Transformer),这是一种结合卷积神经网络(CNN)和Transformers的新架构,以利用两者的优势。FAST将CNN的本地特征提取效率与Transformers的全局上下文建模功能相结合,从而产生了一个功能强大但重量轻的模型,非常适合实时或移动用例。此外,我们采用Lipschitz连续注意力机制来提高训练稳定性和加速收敛。我们在ADIMA数据集上评估FAST,ADIMA数据集是一个面向实时亵渎和滥用检测的多语言语料库,以及更传统的AudioSet。我们的研究结果表明,FAST在ADIMA和AudioSet分类任务上都达到了最先进的性能,在某些情况下,它超过了现有的基准,同时使用的参数减少了150倍。
摘要:In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the local feature extraction efficiencies of CNNs with the global context modeling capabilities of transformers, resulting in a model that is powerful yet lightweight, well-suited to a real-time or mobile use case. Additionally, we incorporate Lipschitz continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus towards real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves state-of-the-art performance on both the ADIMA and AudioSet classification tasks and in some cases surpasses existing benchmarks while using up to 150x fewer parameters.
标题: MMVA:基于效价和唤醒的图像、音乐与音乐描述多模态匹配
链接:https://arxiv.org/abs/2501.01094
备注:Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025
摘要:我们引入了基于效价与唤醒的多模态匹配(MMVA),这是一个三模态编码器框架,旨在捕获图像、音乐和音乐描述文本中的情感内容。为支持该框架,我们扩展了Image-Music-Emotion-Matching-Net(IMEMNet)数据集,构建了IMEMNet-C,其中包括24,756张图像和25,944个带有相应描述文本的音乐片段。我们采用基于连续效价(情绪积极程度)和唤醒(情绪强度)值的多模态匹配分数。这种连续的匹配分数允许在训练期间通过计算不同模态之间效价-唤醒值的相似度来随机采样图像-音乐对。因此,所提出的方法在效价-唤醒预测任务中实现了最先进的性能。此外,该框架在各种zero-shot任务中也证明了其有效性,突显了效价和唤醒预测在下游应用中的潜力。
摘要:We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zeroshot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
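下面的小示例说明"基于效价-唤醒值计算跨模态匹配分数"的思路;其中把距离映射为相似度的指数函数以及具体数值均为示意性假设,并非论文给出的打分函数。

```python
# A hedged sketch of a valence-arousal (VA) matching score between modalities.
import numpy as np

def va_match_score(va_a, va_b):
    """Map VA distance to a similarity in (0, 1]; closer emotions -> higher score."""
    d = np.linalg.norm(np.asarray(va_a) - np.asarray(va_b))
    return float(np.exp(-d))

image_va = (0.8, 0.6)        # e.g. a bright, energetic image
music_a_va = (0.75, 0.65)    # upbeat track
music_b_va = (0.2, 0.3)      # calm, sad track

print(va_match_score(image_va, music_a_va))  # high -> good candidate pair for training
print(va_match_score(image_va, music_b_va))  # low  -> unlikely to be sampled as a pair
```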
标题: 推进新加坡式英语的理解:用数据集和多模式模型弥合差距
链接:https://arxiv.org/abs/2501.01034
备注:Open-Source: this https URL
摘要:新加坡式英语(Singlish)是一种源于英语的克里奥尔语,是多语言、多元文化背景下语言学研究的重点。然而,其口语形式仍然探索不足,限制了对其语言结构和应用的深入了解。为了填补这一空白,我们对最大的新加坡式英语口语语料库进行了标准化和标注,推出了多任务国家语音语料库(MNSC)。该数据集支持多种任务,包括自动语音识别(ASR)、口语问答(SQA)、口语对话摘要(SDS)和副语言问答(PQA)。我们发布了标准化的数据划分和经人工验证的测试集,以促进进一步研究。此外,我们提出了SingAudioLLM,一个利用多模态大语言模型同时处理这些任务的多任务多模态模型。实验表明,我们的模型能够很好地适应新加坡式英语语境,取得了最先进的性能,与其他AudioLLM和级联方案相比,领先先前模型10-30%。
摘要:Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our models adaptability to Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.
标题: U-Gift:针对Few-Shot场景中有毒言语的不确定性引导防火墙
链接:https://arxiv.org/abs/2501.00907
备注:16 pages, 6 figures and 10 tables. Comments are welcome
摘要:随着社交媒体的广泛使用,在线平台上的用户生成内容激增。当此类内容包含仇恨、辱骂、攻击或网络欺凌行为时,它被归类为有毒言论,对在线生态系统的完整性和安全构成重大威胁。虽然人工内容审核仍然很普遍,但庞大的内容规模和人类审核员承受的心理压力凸显了自动有毒言论检测的必要性。以往提出的检测方法通常依赖大型标注数据集;然而,在实践中获取此类数据集既昂贵又困难。为了解决这个问题,我们提出了U-GIFT,一种面向少样本场景的有毒言论不确定性引导防火墙,利用自训练在标注数据有限的情况下提升检测性能。具体来说,U-GIFT将主动学习与贝叶斯神经网络(BNN)相结合,根据模型预测得到的不确定性估计,从未标注数据中自动识别高质量样本,并优先选择置信度更高的伪标签用于训练。大量实验表明,U-GIFT在少样本检测场景中显著优于有竞争力的基线:在5-shot设置下,它比基础模型取得了14.92%的性能提升。重要的是,U-GIFT易于使用,并可适配各种预训练语言模型(PLM)。它在样本不平衡和跨域设置下也表现出稳健的性能,同时在各种语言应用中展示了很强的泛化能力。我们相信,U-GIFT为少样本有毒言论检测提供了有效的解决方案,为网络空间的自动内容审核提供了实质性支持,从而如同防火墙一般促进网络安全的发展。
摘要:With the widespread use of social media, user-generated content has surged on online platforms. When such content includes hateful, abusive, offensive, or cyberbullying behavior, it is classified as toxic speech, posing a significant threat to the online ecosystem's integrity and safety. While manual content moderation is still prevalent, the overwhelming volume of content and the psychological strain on human moderators underscore the need for automated toxic speech detection. Previously proposed detection methods often rely on large annotated datasets; however, acquiring such datasets is both costly and challenging in practice. To address this issue, we propose an uncertainty-guided firewall for toxic speech in few-shot scenarios, U-GIFT, that utilizes self-training to enhance detection performance even when labeled data is limited. Specifically, U-GIFT combines active learning with Bayesian Neural Networks (BNNs) to automatically identify high-quality samples from unlabeled data, prioritizing the selection of pseudo-labels with higher confidence for training based on uncertainty estimates derived from model predictions. Extensive experiments demonstrate that U-GIFT significantly outperforms competitive baselines in few-shot detection scenarios. In the 5-shot setting, it achieves a 14.92\% performance improvement over the basic model. Importantly, U-GIFT is user-friendly and adaptable to various pre-trained language models (PLMs). It also exhibits robust performance in scenarios with sample imbalance and cross-domain settings, while showcasing strong generalization across various language applications. We believe that U-GIFT provides an efficient solution for few-shot toxic speech detection, offering substantial support for automated content moderation in cyberspace, thereby acting as a firewall to promote advancements in cybersecurity.
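下面给出一个不确定性引导伪标签筛选的示意片段,用蒙特卡洛dropout作为贝叶斯近似(论文使用BNN);其中的网络结构、置信度与不确定性阈值均为假设值,仅用于说明思路。

```python
# A hedged sketch of uncertainty-guided pseudo-label selection for self-training,
# using Monte Carlo dropout as a simple Bayesian approximation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3), nn.Linear(256, 2))

def mc_predict(model, x, passes: int = 20):
    model.train()                       # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(-1) for _ in range(passes)])
    return probs.mean(0), probs.std(0)  # predictive mean and a simple uncertainty proxy

unlabeled = torch.randn(100, 768)       # placeholder sentence embeddings
mean, std = mc_predict(model, unlabeled)
conf, pseudo = mean.max(-1)
uncert = std.gather(-1, pseudo[:, None]).squeeze(-1)

# Keep only confident, low-uncertainty samples for the next self-training round.
keep = (conf > 0.9) & (uncert < 0.05)
print("selected pseudo-labeled samples:", int(keep.sum()))
```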
标题: SoundBrush:声音作为视觉场景编辑的画笔
链接:https://arxiv.org/abs/2501.00645
备注:AAAI 2025
摘要:我们提出了SoundBrush,一种使用声音作为画笔来编辑和操纵视觉场景的模型。我们扩展了潜在扩散模型(LDM)的生成能力,将音频信息用于编辑视觉场景。受现有图像编辑工作的启发,我们将此任务视为监督学习问题,并利用各种现成的模型来构建声音配对的视觉场景数据集进行训练。这个丰富生成的数据集使SoundBrush能够学习将音频特征映射到LDM的文本空间中,从而允许在各种野外声音的指导下进行视觉场景编辑。与现有方法不同,SoundBrush可以准确地操纵整体场景,甚至插入发声对象,以最佳匹配音频输入,同时保留原始内容。此外,通过与新的视图合成技术相结合,我们的框架可以扩展到编辑3D场景,促进声音驱动的3D场景操作。演示可在https://soundbrush.github.io/上获得。
摘要:We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at https://soundbrush.github.io/.
标题: 使用口语训练和评估抑郁症风险模型的数据库大小要求
链接:https://arxiv.org/abs/2501.00617
备注:None
摘要:心理健康风险预测是语音领域一个不断发展的方向,但许多研究都基于小型语料库。本研究在一项受控实验中说明了测试集和训练集规模的变化如何影响性能。我们使用一个包含超过6.5万个带标注数据点的语料库,给出了不同训练/测试集规模组合的完全交叉设计结果。研究包含两类模型:一类基于语言,另一类基于语音声学,两者均采用该领域的当前方法,并额外加入了一个年龄不匹配的测试集。结果表明:(1)测试集小于1K样本时,即使训练集较大,结果也会带有噪声;(2)稳定的结果至少需要2K的训练集规模;(3)NLP模型和声学模型随训练/测试规模变化的表现相似;(4)不匹配的测试集表现出与匹配测试集相同的规律。文中还讨论了其他因素,包括标签先验、模型能力与预训练、独立说话人数量以及数据长度。虽然没有哪一项研究能给出确切的规模要求,但结果表明,未来基于语音和语言的心理健康风险预测研究需要规模适当的训练集和测试集。
摘要:Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.
标题: 优化语音输入长度以实现与说话者无关的抑郁症分类
链接:https://arxiv.org/abs/2501.00608
备注:None
摘要:用于基于语音的抑郁分类的机器学习模型在医疗保健应用中前景可期。尽管抑郁分类研究日益增多,但人们对语音输入长度如何影响模型性能知之甚少。我们使用来自一款人机健康筛查应用、总时长超过1400小时的语音语料库,分析了与说话人无关的抑郁分类结果,并针对两个整体性能不同的NLP系统,研究了性能随应答输入长度的变化。两个系统的结果均表明,性能取决于应答的自然长度、已用时长以及其在会话中的顺序。两个系统具有相同的最小长度阈值,但应答饱和阈值不同,性能更好的系统饱和阈值更高。在达到饱和后,向说话人提出新问题比继续当前应答更有利。这些结果以及文中报告的其他结果提示了应用程序应如何更好地设计,以便为抑郁分类引出并处理最佳长度的输入。
摘要:Machine learning models for speech-based depression classification offer promise for health care applications. Despite growing work on depression classification, little is understood about how the length of speech-input impacts model performance. We analyze results for speaker-independent depression classification using a corpus of over 1400 hours of speech from a human-machine health screening application. We examine performance as a function of response input length for two NLP systems that differ in overall performance. Results for both systems show that performance depends on natural length, elapsed length, and ordering of the response within a session. Systems share a minimum length threshold, but differ in a response saturation threshold, with the latter higher for the better system. At saturation it is better to pose a new question to the speaker, than to continue the current response. These and additional reported results suggest how applications can be better designed to both elicit and process optimal input lengths for depression classification.
标题: Fotheidil:爱尔兰语自动转录系统
链接:https://arxiv.org/abs/2501.00509
备注:Accepted to the 5th Celtic Language Technology Workshop within COLING 2025
摘要:本文介绍了第一个基于网络的爱尔兰语转录系统Fotheidil,该系统作为ABAIR计划的一部分,利用了语音相关的人工智能技术。系统既包括现成的预训练语音活动检测和说话人日志模型,也包括专门为爱尔兰语训练的自动语音识别以及大小写与标点恢复模型。我们探索了半监督学习来改进模块化TDNN-HMM ASR系统的声学模型,为监督训练集中代表性不足的域外测试集和方言带来了显著提升。我们还将一种基于序列到序列模型的大小写与标点恢复新方法与使用分类模型的传统方法进行了比较,实验结果表明这里同样带来了显著的性能提升。该系统将免费向公众开放,是研究人员及其他转录爱尔兰语材料人员的重要资源。系统使用过程中将收集经人工校正的转录文本并纳入训练数据集,从而以周期性、社区驱动的方式逐步改进ASR模型。
摘要:This paper sets out the first web-based transcription system for the Irish language - Fotheidil, a system that utilises speech-related AI technologies as part of the ABAIR initiative. The system includes both off-the-shelf pre-trained voice activity detection and speaker diarisation models and models trained specifically for Irish automatic speech recognition and capitalisation and punctuation restoration. Semi-supervised learning is explored to improve the acoustic model of a modular TDNN-HMM ASR system, yielding substantial improvements for out-of-domain test sets and dialects that are underrepresented in the supervised training set. A novel approach to capitalisation and punctuation restoration involving sequence-to-sequence models is compared with the conventional approach using a classification model. Experimental results show here also substantial improvements in performance. The system will be made freely available for public use, and represents an important resource to researchers and others who transcribe Irish language materials. Human-corrected transcriptions will be collected and included in the training dataset as the system is used, which should lead to incremental improvements to the ASR model in a cyclical, community-driven fashion.
标题: 展开式创意对抗网络用于生成新颖的音乐片段
链接:https://arxiv.org/abs/2501.00452
摘要:近年来,音乐生成已成为人工智能和机器学习中的一个重要课题。最近的工作大多将基于RNN的神经网络方法应用于序列生成;相比之下,只有极少数研究人员将生成对抗网络(GAN)及其变体用于音乐生成。本文将一个经典系统与一个新系统结合使用,以生成具有创意的音乐。两个系统都基于对抗网络设计,通过从示例中学习来生成音乐:经典系统在不区分类别的情况下学习一组乐曲,而新系统则学习不同作曲家及其风格,通过偏离所学作曲家的风格来生成有创意的乐曲。二者采用的基础结构是生成对抗网络(GAN),它能够在给定一组输入的情况下学习并模仿其分布,从而生成新的输出。先前的工作表明,GAN的原始设计在创意输出方面存在局限。本工作在创意对抗网络(CAN)的基础上,将其从视觉艺术领域移植到音乐领域;此外,还引入了展开式CAN以防止模式崩溃。我们在GAN和CAN上进行了音乐生成实验,并以与输入集的偏离程度来衡量它们的能力。
摘要:Music generation has been established as a prominent topic in artificial intelligence and machine learning over recent years. In most recent works, RNN-based neural network methods have been applied for sequence generation. In contrast, generative adversarial networks (GANs) and their counterparts have been explored by very few researchers for music generation. In this paper, a classical system was employed alongside a new system to generate creative music. Both systems were designed based on adversarial networks to generate music by learning from examples. The classical system was trained to learn a set of music pieces without differentiating between classes, whereas the new system was trained to learn the different composers and their styles to generate a creative music piece by deviating from the learned composers' styles. The base structure utilized was generative adversarial networks (GANs), which are capable of generating novel outputs given a set of inputs to learn from and mimic their distribution. It has been shown in previous work that GANs are limited in their original design with respect to creative outputs. Building on the Creative Adversarial Networks (CAN), this work applied them in the music domain rather than the visual art domain. Additionally, unrolled CAN was introduced to prevent mode collapse. Experiments were conducted on both GAN and CAN for generating music, and their capabilities were measured in terms of deviation from the input set.
标题: Whisper变得更强大:增强Wav2Vec 2.0以在低资源语言中实现卓越的ASR
链接:https://arxiv.org/abs/2501.00425
备注:15 pagesm 3 figures
摘要:在低资源语言中处理语音到文本和自动语音识别问题是出了名的困难,原因在于缺乏经过验证的数据集以及方言的多样性。阿拉伯语、俄语和葡萄牙语就是这些困难的典型例子:由于这些语言在世界各大洲存在众多方言,它们属于低资源语言。此外,这些语言口音和发音的多样性也使ASR模型难以取得成功。随着深度学习和Transformer的日益普及,与最先进的方法相比,著名的Wav2Vec2等声学模型在语音识别领域取得了卓越的性能。然而,尽管Wav2Vec2比传统方法效率更高、所需标注数据显著更少,其在代表性不足的语言上的性能仍会明显下降。本文提出了一个端到端框架,通过数据增强技术来增强在Wav2Vec2上微调的ASR系统。为了验证框架的有效性,我们使用Mozilla Common Voice项目中的阿拉伯语、俄语和葡萄牙语三个数据集进行了详细的实验评估。此外,本文提出的框架对不同的变音符号表现出鲁棒性。最终,我们的方法优于预训练Wav2Vec2和著名的Whisper ASR模型这两个基线,词错误率平均相对降低33.9%,字符错误率相对降低53.2%。
摘要:Approaching Speech-to-Text and Automatic Speech Recognition problems in low-resource languages is notoriously challenging due to the scarcity of validated datasets and the diversity of dialects. Arabic, Russian, and Portuguese exemplify these difficulties, being low-resource languages due to the many dialects of these languages across different continents worldwide. Moreover, the variety of accents and pronunciations of such languages complicate ASR models' success. With the increasing popularity of Deep Learning and Transformers, acoustic models like the renowned Wav2Vec2 have achieved superior performance in the Speech Recognition field compared to state-of-the-art approaches. However, despite Wav2Vec2's improved efficiency over traditional methods, its performance significantly declines for under-represented languages, even though it requires significantly less labeled data. This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. To validate our framework's effectiveness, we conducted a detailed experimental evaluation using three datasets from Mozilla's Common Voice project in Arabic, Russian, and Portuguese. Additionally, the framework presented in this paper demonstrates robustness to different diacritics. Ultimately, our approach outperforms two previous baseline models, which are the pre-trained Wav2Vec2 and the well-known Whisper ASR model, resulting in an average relative improvement of 33.9\% in Word Error Rate and a 53.2\% relative improvement in Character Error Rate.
标题: TSPE:用于改进Zero-Shot音频分类的任务特定提示集成
链接:https://arxiv.org/abs/2501.00398
备注:5 pages
摘要:音频语言模型(ALM)擅长zero-shot音频分类,即模型在测试时利用描述性的自然语言提示对此前未见过的音频片段进行分类。我们提出了TSPE(任务特定提示集成,Task-Specific Prompt Ensemble),这是一种简单、无需训练的硬提示方法,通过为不同的音频分类任务定制提示来提升ALM的zero-shot性能。我们不使用"汽车的声音"这类基于通用模板的提示,而是生成上下文丰富的提示,例如"从隧道中传来的汽车的声音"。具体而言,我们利用标签信息来识别合适的声音属性(如"响亮"和"微弱")以及合适的声源(如"隧道"和"街道"),并将这些信息融入ALM用于音频分类的提示中。此外,为了增强音频-文本对齐,我们对TSPE生成的任务特定提示进行提示集成。在12个不同的音频分类数据集上的评估表明,TSPE使ALM的性能相比普通zero-shot评估获得了1.23%-16.36%的绝对提升。
摘要:Audio-language models (ALMs) excel in zero-shot audio classification, a task where models classify previously unseen audio clips at test time by leveraging descriptive natural language prompts. We introduce TSPE (Task-Specific Prompt Ensemble), a simple, training-free hard prompting method that boosts ALMs' zero-shot performance by customizing prompts for diverse audio classification tasks. Rather than using generic template-based prompts like "Sound of a car", we generate context-rich prompts, such as "Sound of a car coming from a tunnel". Specifically, we leverage label information to identify suitable sound attributes, such as "loud" and "feeble", and appropriate sound sources, such as "tunnel" and "street", and incorporate this information into the prompts used by Audio-Language Models (ALMs) for audio classification. Further, to enhance audio-text alignment, we perform prompt ensemble across TSPE-generated task-specific prompts. When evaluated on 12 diverse audio classification datasets, TSPE improves performance across ALMs by showing an absolute improvement of 1.23-16.36% over vanilla zero-shot evaluation.
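下面是一个提示集成(prompt ensemble)用于zero-shot音频分类的示意片段;embed_text与embed_audio是假设的占位编码器,用来代表CLAP类ALM的文本与音频编码器,提示内容取自摘要中的示例,并非论文的原始实现。

```python
# A hedged sketch of prompt ensembling for zero-shot audio classification.
import numpy as np

rng = np.random.default_rng(0)

def embed_text(prompt: str) -> np.ndarray:    # hypothetical text encoder stand-in
    return rng.normal(size=256)

def embed_audio(clip) -> np.ndarray:          # hypothetical audio encoder stand-in
    return rng.normal(size=256)

def normalize(v):
    return v / np.linalg.norm(v)

# Task-specific, attribute/source-enriched prompts per class, then ensembled.
prompts = {
    "car": ["Sound of a car coming from a tunnel", "Loud sound of a car on a street"],
    "siren": ["Feeble sound of a siren far away", "Sound of a siren coming from a street"],
}
class_embs = {c: normalize(np.mean([normalize(embed_text(p)) for p in ps], axis=0))
              for c, ps in prompts.items()}

audio_emb = normalize(embed_audio("clip.wav"))
scores = {c: float(audio_emb @ e) for c, e in class_embs.items()}
print(max(scores, key=scores.get), scores)
```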
标题: 用于语音分类的脉冲神经网络中的时间重构与非对齐残差
链接:https://arxiv.org/abs/2501.00348
备注:9 pages, 5 figures
摘要:近年来可以注意到,大多数基于脉冲神经网络(spiking neural networks, SNN)的模型在处理语音分类问题时只使用同一时间尺度的时间分辨率,这使得这些模型无法学习输入数据在不同时间尺度上的信息。此外,由于许多模型的子模块前后数据的时间长度不同,无法应用有效的残差连接来优化这些模型的训练过程。为了解决这些问题,一方面,我们借鉴人脑理解语音的分层处理过程,对音频频谱的时间维度进行重构,提出了一种名为时间重构(Temporal Reconstruction, TR)的新方法。采用TR的重构SNN模型能够学习输入数据在不同时间尺度上的信息,并从音频数据中建模更全面的语义信息,因为它使网络能够学习不同时间分辨率下的输入信息。另一方面,我们通过分析音频数据提出了非对齐残差(Non-Aligned Residual, NAR)方法,使残差连接能够用于两段时间长度不同的音频数据之间。我们在Spiking Speech Commands(SSC)、Spiking Heidelberg Digits(SHD)和Google Speech Commands v0.02(GSC)数据集上进行了大量实验。实验结果显示,在SSC上我们取得了所有SNN模型中最先进(SOTA)的测试分类准确率81.02%,在SHD上取得了所有模型中SOTA的分类准确率96.04%。
摘要:Recently, it can be noticed that most models based on spiking neural networks (SNNs) only use a same level temporal resolution to deal with speech classification problems, which makes these models cannot learn the information of input data at different temporal scales. Additionally, owing to the different time lengths of the data before and after the sub-modules of many models, the effective residual connections cannot be applied to optimize the training processes of these models.To solve these problems, on the one hand, we reconstruct the temporal dimension of the audio spectrum to propose a novel method named as Temporal Reconstruction (TR) by referring the hierarchical processing process of the human brain for understanding speech. Then, the reconstructed SNN model with TR can learn the information of input data at different temporal scales and model more comprehensive semantic information from audio data because it enables the networks to learn the information of input data at different temporal resolutions. On the other hand, we propose the Non-Aligned Residual (NAR) method by analyzing the audio data, which allows the residual connection can be used in two audio data with different time lengths. We have conducted plentiful experiments on the Spiking Speech Commands (SSC), the Spiking Heidelberg Digits (SHD), and the Google Speech Commands v0.02 (GSC) datasets. According to the experiment results, we have achieved the state-of-the-art (SOTA) result 81.02\% on SSC for the test classification accuracy of all SNN models, and we have obtained the SOTA result 96.04\% on SHD for the classification accuracy of all models.
标题: VoxVietnam:用于越南语说话人识别的大规模多体裁数据集
链接:https://arxiv.org/abs/2501.00328
备注:Accepted to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
摘要:说话人识别领域的最新研究旨在解决注册话语与测试话语之间差异所带来的脆弱性,特别是话语属于不同语音体裁的多体裁现象。以往用于越南语说话人识别的资源要么规模有限,要么不关注体裁多样性,使得多体裁影响的研究尚未展开。本文介绍了VoxVietnam,这是第一个用于越南语说话人识别的多体裁数据集,包含来自1,406名说话人的超过187,000条话语,并提供了一条从公开来源大规模构建数据集的自动化流水线。我们的实验展示了多体裁现象给在单体裁数据集上训练的模型带来的挑战,并证明将VoxVietnam纳入训练过程后性能显著提升。这些实验用于研究说话人识别中多体裁现象的挑战,以及使用所提数据集进行多体裁训练时的性能增益。
摘要:Recent research in speaker recognition aims to address vulnerabilities due to variations between enrolment and test utterances, particularly in the multi-genre phenomenon where the utterances are in different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving studies in multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition with over 187,000 utterances from 1,406 speakers and an automated pipeline to construct a dataset on a large scale from public sources. Our experiments show the challenges posed by the multi-genre phenomenon to models trained on a single-genre dataset, and demonstrate a significant increase in performance upon incorporating the VoxVietnam into the training process. Our experiments are conducted to study the challenges of the multi-genre phenomenon in speaker recognition and the performance gain when the proposed dataset is used for multi-genre training.
标题: 语音评估分类器的集成
链接:https://arxiv.org/abs/2501.00067
摘要:本文描述了一项应用二元分类器集成来解决医学语音评估问题的尝试。我们基于对音节发音质量的定量评估和专家评估构建了一个数据集,选取7个定量指标作为特征:动态时间规整(DTW)距离、Minkowski距离、相关系数、最长公共子序列(LCSS)、实序列编辑距离(EDR)、带实数惩罚的编辑距离(ERP)和合并分裂(MSM)。专家对发音质量的评估被用作类别标签:类别1表示高质量语音,类别0表示失真语音。我们比较了五种分类方法的训练结果:逻辑回归(LR)、支持向量机(SVM)、朴素贝叶斯(NB)、决策树(DT)和K近邻(KNN),同时给出了使用混合方法构建分类器集成的结果。在所研究的数据集上,与使用单个二元分类器相比,使用集成使分类准确率略有提升。
摘要:The article describes an attempt to apply an ensemble of binary classifiers to solve the problem of speech assessment in medicine. A dataset was compiled based on quantitative and expert assessments of syllable pronunciation quality. Quantitative assessments of 7 selected metrics were used as features: dynamic time warp distance, Minkowski distance, correlation coefficient, longest common subsequence (LCSS), edit distance of real sequence (EDR), edit distance with real penalty (ERP), and merge split (MSM). Expert assessment of pronunciation quality was used as a class label: class 1 means high-quality speech, class 0 means distorted. A comparison of training results was carried out for five classification methods: logistic regression (LR), support vector machine (SVM), naive Bayes (NB), decision trees (DT), and K-nearest neighbors (KNN). The results of using the mixture method to build an ensemble of classifiers are also presented. The use of an ensemble for the studied data sets allowed us to slightly increase the classification accuracy compared to the use of individual binary classifiers.
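下面用scikit-learn给出对摘要中五种分类器进行软投票集成的示意(特征矩阵为随机占位数据,真实特征应来自上述7个相似度指标的计算);论文采用的"混合方法"细节未给出,此处以VotingClassifier作为一种常见的替代实现。

```python
# A hedged sketch of a soft-voting ensemble over the five classifiers listed above.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))            # 7 similarity metrics per syllable sample (placeholder)
y = rng.integers(0, 2, size=200)         # 1 = high-quality speech, 0 = distorted

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],
    voting="soft",                        # average predicted probabilities
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```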
标题: Lungmix:一种基于混合的呼吸音分类泛化策略
链接:https://arxiv.org/abs/2501.00064
备注:4pages, 3 figures, conference paper
摘要:呼吸音分类在呼吸系统疾病的诊断中起着至关重要的作用。虽然深度学习模型在各种呼吸音数据集上取得了成功,但我们的实验表明,在一个数据集上训练的模型通常无法有效泛化到其他数据集,这主要是由于数据收集和标注的不一致。为了解决这一局限,我们引入了Lungmix,一种受Mixup启发的新型数据增强技术。Lungmix利用响度和随机掩码混合波形来生成增强数据,同时根据标签的语义对标签进行插值,帮助模型学习更具泛化性的表示。在ICBHI、SPR和HF三个数据集上的综合评估表明,Lungmix显著增强了模型对未见数据的泛化能力。特别是,Lungmix将4类分类得分最多提高了3.55%,达到了与直接在目标数据集上训练的模型相当的性能。
摘要:Respiratory sound classification plays a pivotal role in diagnosing respiratory diseases. While deep learning models have shown success with various respiratory sound datasets, our experiments indicate that models trained on one dataset often fail to generalize effectively to others, mainly due to data collection and annotation inconsistencies. To address this limitation, we introduce Lungmix, a novel data augmentation technique inspired by Mixup. Lungmix generates augmented data by blending waveforms using loudness and random masks while interpolating labels based on their semantic meaning, helping the model learn more generalized representations. Comprehensive evaluations across three datasets, namely ICBHI, SPR, and HF, demonstrate that Lungmix significantly enhances model generalization to unseen data. In particular, Lungmix boosts the 4-class classification score by up to 3.55%, achieving performance comparable to models trained directly on the target dataset.
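下面是一个Lungmix风格增强的示意片段:用基于响度的权重和随机掩码混合两段呼吸音波形,并对标签做相应插值;具体的混合与标签插值规则为假设,可能与论文实现不同。

```python
# A hedged sketch of a Lungmix-style waveform augmentation with label interpolation.
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(x ** 2) + 1e-8)

def lungmix(x1, y1, x2, y2, mask_ratio=0.5):
    lam = rms(x1) / (rms(x1) + rms(x2))          # loudness-based mixing weight
    mask = rng.random(len(x1)) < mask_ratio      # random sample-level mask
    mixed = np.where(mask, lam * x1 + (1 - lam) * x2, x1)
    eff = lam * mask.mean() + (1 - mask.mean())  # effective share of x1 in the mix
    y = eff * np.asarray(y1) + (1 - eff) * np.asarray(y2)
    return mixed, y

x1, x2 = rng.normal(size=16000), 0.3 * rng.normal(size=16000)   # two 1-second clips
y1, y2 = np.array([1, 0, 0, 0]), np.array([0, 0, 1, 0])          # one-hot 4-class labels
x_aug, y_aug = lungmix(x1, y1, x2, y2)
print(x_aug.shape, np.round(y_aug, 2))
```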
标题: 基于声音的触摸手势和情绪识别以增强人机交互
链接:https://arxiv.org/abs/2501.00038
备注:ICASSP 2025
摘要:情感识别和触摸手势解码对于推进人机交互(HRI)至关重要,特别是在情感线索和触觉感知发挥重要作用的社交环境中。然而,许多类人机器人,如Pepper,Nao和Furhat,缺乏全身触觉皮肤,限制了它们参与基于触摸的情感和手势交互的能力。此外,由于需要收集个人面部数据,基于视觉的情感识别方法通常面临严格的GDPR合规性挑战。为了解决这些局限性,避免隐私问题,本文研究了使用触摸HRI过程中产生的声音来识别触觉手势,并沿着唤醒和效价维度对情绪进行分类的潜力。使用来自28名参与者与人形机器人Pepper的触觉手势和情感交互的数据集,我们设计了一个仅含音频的轻量级触摸手势和情感识别模型,参数仅为0.24M,模型大小为0.94MB,FLOPs为0.7G。实验结果表明,当输入音频长度变化时,所提出的基于声音的触摸手势和情感识别模型能够有效地识别不同情感的唤醒和效价状态,以及各种触摸手势。所提出的模型是低延迟的,并实现了与众所周知的预训练音频神经网络(PANN)类似的结果,但具有更小的FLOP,参数和模型大小。
摘要:Emotion recognition and touch gesture decoding are crucial for advancing human-robot interaction (HRI), especially in social environments where emotional cues and tactile perception play important roles. However, many humanoid robots, such as Pepper, Nao, and Furhat, lack full-body tactile skin, limiting their ability to engage in touch-based emotional and gesture interactions. In addition, vision-based emotion recognition methods usually face strict GDPR compliance challenges due to the need to collect personal facial data. To address these limitations and avoid privacy issues, this paper studies the potential of using the sounds produced by touching during HRI to recognise tactile gestures and classify emotions along the arousal and valence dimensions. Using a dataset of tactile gestures and emotional interactions from 28 participants with the humanoid robot Pepper, we design an audio-only lightweight touch gesture and emotion recognition model with only 0.24M parameters, 0.94MB model size, and 0.7G FLOPs. Experimental results show that the proposed sound-based touch gesture and emotion recognition model effectively recognises the arousal and valence states of different emotions, as well as various tactile gestures, when the input audio length varies. The proposed model is low-latency and achieves similar results as well-known pretrained audio neural networks (PANNs), but with much smaller FLOPs, parameters, and model size.
标题: SECodec:用于语音语言模型的基于结构信息的压缩语音表示编解码器
链接:https://arxiv.org/abs/2501.00018
备注:Accepted to the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
摘要:随着大型语言模型(LLM)的快速发展,离散语音表示已成为将语音集成到LLM中的关键。现有的语音表示离散化方法依赖于预定义的码本大小和基于欧氏距离的量化。然而,1)码本大小是同时影响编解码器性能和下游任务训练效率的关键参数;2)当码本大小被控制在合理范围内时,基于欧氏距离的量化可能导致音频失真。事实上,在信息压缩领域,结构信息和熵引导至关重要,但以往方法在很大程度上忽略了这些因素。因此,我们从信息论的角度解决上述问题,提出了SECodec,一种基于结构熵(SE)的新型语音表示编解码器,用于构建语音语言模型。具体来说,我们首先将语音建模为一个图,对图中的语音特征节点进行聚类,并通过分层且解耦地最小化二维结构熵来提取相应的码本。然后,针对音频失真问题,我们提出了一种新的量化方法:该方法仍遵循二维结构熵最小化原则,自适应地为每个输入的原始语音节点选择其所属聚类对应的最合适令牌。此外,我们开发了基于结构熵的语音语言模型(SESLM),以利用SECodec。实验结果表明,SECodec在语音重建方面的表现与EnCodec相当,SESLM在zero-shot文本到语音任务中超越了VALL-E。代码、演示语音、语音特征图、SE码本和模型可在https://github.com/wlq2019/SECodec获取。
摘要:With the rapid advancement of large language models (LLMs), discrete speech representations have become crucial for integrating speech into LLMs. Existing methods for speech representation discretization rely on a predefined codebook size and Euclidean distance-based quantization. However, 1) the size of codebook is a critical parameter that affects both codec performance and downstream task training efficiency. 2) The Euclidean distance-based quantization may lead to audio distortion when the size of the codebook is controlled within a reasonable range. In fact, in the field of information compression, structural information and entropy guidance are crucial, but previous methods have largely overlooked these factors. Therefore, we address the above issues from an information-theoretic perspective, we present SECodec, a novel speech representation codec based on structural entropy (SE) for building speech language models. Specifically, we first model speech as a graph, clustering the speech features nodes within the graph and extracting the corresponding codebook by hierarchically and disentangledly minimizing 2D SE. Then, to address the issue of audio distortion, we propose a new quantization method. This method still adheres to the 2D SE minimization principle, adaptively selecting the most suitable token corresponding to the cluster for each incoming original speech node. Furthermore, we develop a Structural Entropy-based Speech Language Model (SESLM) that leverages SECodec. Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks. Code, demo speeches, speech feature graph, SE codebook, and models are available at https://github.com/wlq2019/SECodec.