语音/音频处理学术速递[12.24]

文摘   2024-12-24 18:14   北京  
今日论文合集:cs.SD语音21篇,eess.AS音频处理32篇。

本文经arXiv每日学术速递授权转载

微信公众号:arXiv_Daily

cs.SD语音

【1】 VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

标题: VERSA:语音、音频和音乐的多功能评估工具包
链接:https://arxiv.org/abs/2412.17667
作者: Jiatong Shi,  Hye-jin Shim,  Jinchuan Tian,  Siddhant Arora,  Haibin Wu,  Darius Petermann,  Jia Qi Yip,  You Zhang,  Yuxun Tang,  Wangyou Zhang,  Dareen Safar Alharthi,  Yichen Huang,  Koichi Saito,  Jionghao Han,  Yiwen Zhao,  Chris Donahue,  Shinji Watanabe
摘要:在这项工作中,我们介绍VERSA,一个为各类语音、音频和音乐信号设计的统一且标准化的评估工具包。该工具包提供Pythonic接口,支持灵活的配置和依赖管理,使用方便且高效。完整安装后,VERSA提供63个指标,并可依据不同配置派生出711种指标变体。这些指标涵盖利用多种外部资源的评估,包括匹配和不匹配的参考音频、文本转录以及文本描述。作为一个轻量而全面的工具包,VERSA用途广泛,可支持多种下游场景的评估。为了展示其能力,本工作给出了VERSA的示例用例,包括音频编解码、语音合成、语音增强、歌唱合成和音乐生成。该工具包可在https://github.com/shinjiwlab/versa获取。
摘要:In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 63 metrics with 711 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.
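下面给出一个极简示意,展示此类工具包所汇集的"匹配参考音频"类指标的典型形态,以尺度不变信噪比(SI-SNR)为例;该代码并非VERSA的实际API,函数名与实现均为示例假设,具体用法请以官方仓库为准。

```python
import numpy as np

def si_snr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """计算尺度不变信噪比(SI-SNR,单位 dB)。仅为示意,非 VERSA 官方实现。"""
    # 去均值,消除直流偏置
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # 将估计信号投影到参考信号方向,得到目标分量
    target = np.dot(estimate, reference) / (np.dot(reference, reference) + eps) * reference
    noise = estimate - target
    return float(10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)                # 1 秒 16 kHz 的参考信号(占位数据)
    degraded = clean + 0.1 * rng.standard_normal(16000)
    print(f"SI-SNR: {si_snr(clean, degraded):.2f} dB")
```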

【2】 Multiple Consistency-guided Test-Time Adaptation for Contrastive  Audio-Language Models with Unlabeled Audio
标题: 具有未标记音频的对比音频语言模型的多重一致性引导测试时自适应
链接:https://arxiv.org/abs/2412.17306
作者: Gongyu Chen,  Haomin Zhang,  Chaofan Ding,  Zihao Chen,  Xinhan Di
备注:6 pages, 1 figure, accepted by ICASSP 2025
摘要:预训练音频-语言模型(ALM)的一个引人注目的特点是其出色的zero-shot泛化能力,而测试时自适应(TTA)方法旨在无需标注即可提升领域性能。然而,以往用于ALM zero-shot分类的TTA方法往往陷入错误的模型预测。为了进一步提升性能,我们提出在无标注的情况下对提示学习(prompt learning)施加多重指导。首先,对ALM的上下文token和领域token设置一致性指导。其次,在每个测试样本的多个增强视图之间设置一致性指导,并在不同测试样本之间设置对比学习指导。第三,我们为所提出的无标注测试时自适应方法给出了相应的端到端学习框架。我们在跨领域的12个下游任务上进行了广泛评估,与最先进的模型相比,所提自适应方法使平均zero-shot性能提升4.41%(最高7.50%)。
摘要:One fascinating aspect of pre-trained Audio-Language Models (ALMs) learning is their impressive zero-shot generalization capability and test-time adaptation (TTA) methods aiming to improve domain performance without annotations. However, previous test time adaptation (TTA) methods for ALMs in zero-shot classification tend to be stuck in incorrect model predictions. In order to further boost the performance, we propose multiple guidance on prompt learning without annotated labels. First, guidance of consistency on both context tokens and domain tokens of ALMs is set. Second, guidance of both consistency across multiple augmented views of each single test sample and contrastive learning across different test samples is set. Third, we propose a corresponding end-end learning framework for the proposed test-time adaptation method without annotated labels. We extensively evaluate our approach on 12 downstream tasks across domains, our proposed adaptation method leads to 4.41% (max 7.50%) average zero-shot performance improvement in comparison with the state-of-the-art models.
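下面是一个最小示意,说明"同一测试样本多个增强视图之间保持预测一致"这类无标注指导的一种常见实现方式(以对称KL为例);其中的张量形状与示例数据均为占位假设,并非论文中ALM/TTA方法的原始实现。

```python
import torch
import torch.nn.functional as F

def multi_view_consistency_loss(logits_views: list[torch.Tensor]) -> torch.Tensor:
    """多个增强视图的预测分布与其平均分布之间的对称 KL 一致性损失(示意)。"""
    probs = [F.softmax(l, dim=-1) for l in logits_views]
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)
    loss = 0.0
    for p in probs:
        # KL(p || mean) + KL(mean || p),鼓励各视图的预测相互一致
        loss = loss + F.kl_div(mean_prob.log(), p, reduction="batchmean") \
                    + F.kl_div(p.clamp_min(1e-8).log(), mean_prob, reduction="batchmean")
    return loss / len(probs)

if __name__ == "__main__":
    torch.manual_seed(0)
    views = [torch.randn(4, 10) for _ in range(3)]  # 3 个增强视图、4 个样本、10 类(占位数据)
    print(multi_view_consistency_loss(views))
```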

【3】 Trainingless Adaptation of Pretrained Models for Environmental Sound  Classification
标题: 环境声音分类预训练模型的免训练适应
链接:https://arxiv.org/abs/2412.17212
作者: Noriyuki Tonami,  Wataru Kohno,  Keisuke Imoto,  Yoshiyuki Yajima,  Sakiko Mishima,  Reishi Kondo,  Tomoyuki Hino
备注:Accepted to ICASSP2025
摘要:用于环境声音分类的基于深度神经网络(DNN)的模型,对训练数据所不属于的域(即分布外或未见数据)不够鲁棒。为了让预训练模型适用于未见域,通常需要借助丰富的计算资源(例如GPU)进行微调、迁移学习等自适应。然而,由于最先进的模型日益依赖大量计算资源,计算资源匮乏者越来越难以跟上研究趋势。本文提出了一种面向环境声音分类预训练模型的免训练自适应方法。为此,我们首先提出一种在DNN模型中间层恢复类时频(TF-ish)结构的操作;随后提出免训练的频率滤波域自适应方法,它不同于广泛使用的基于梯度的优化。在ESC-50数据集上的实验表明,与传统方法相比,所提自适应方法将分类准确率提高了20.40个百分点。
摘要:Deep neural network (DNN)-based models for environmental sound classification are not robust against a domain to which training data do not belong, that is, out-of-distribution or unseen data. To utilize pretrained models for the unseen domain, adaptation methods, such as finetuning and transfer learning, are used with rich computing resources, e.g., the graphical processing unit (GPU). However, it is becoming more difficult to keep up with research trends for those who have poor computing resources because state-of-the-art models are becoming computationally resource-intensive. In this paper, we propose a trainingless adaptation method for pretrained models for environmental sound classification. To introduce the trainingless adaptation method, we first propose an operation of recovering time--frequency-ish (TF-ish) structures in intermediate layers of DNN models. We then propose the trainingless frequency filtering method for domain adaptation, which is not a gradient-based optimization widely used. The experiments conducted using the ESC-50 dataset show that the proposed adaptation method improves the classification accuracy by 20.40 percentage points compared with the conventional method.
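下面是一个极简示意,展示"无需梯度训练、直接在(类)时频特征上做频率滤波"这一思路的一种可能形式;掩码形状、作用的层以及特征维度顺序均为说明性假设,与论文的具体操作无关。

```python
import numpy as np

def frequency_band_mask(feature_map: np.ndarray, low_bin: int, high_bin: int) -> np.ndarray:
    """对恢复出的类时频特征沿频率维施加固定带通掩码(无梯度、无训练,仅为示意)。
    feature_map 形状假定为 (通道, 频率, 时间)。"""
    mask = np.zeros(feature_map.shape[1])
    mask[low_bin:high_bin] = 1.0          # 固定的 0/1 频带掩码,无需任何参数更新
    return feature_map * mask[None, :, None]

if __name__ == "__main__":
    feat = np.abs(np.random.randn(8, 64, 100))   # 占位的中间层类时频特征
    print(frequency_band_mask(feat, low_bin=4, high_bin=32).shape)   # (8, 64, 100)
```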

【4】 InterDance:Reactive 3D Dance Generation with Realistic Duet Interactions
标题: InterDance:具有逼真双人互动的反应式3D舞蹈生成
链接:https://arxiv.org/abs/2412.16982
作者: Ronghui Li,  Youliang Zhang,  Yachao Zhang,  Yuxiang Zhang,  Mingyang Su,  Jie Guo,  Ziwei Liu,  Yebin Liu,  Xiu Li
备注:this https URL
摘要:人类会做出各种各样的交互动作,其中双人舞是最具挑战性的交互之一。然而,就人体运动生成模型而言,现有工作仍然无法生成高质量的交互动作,在双人舞领域尤其如此。这一方面是由于缺乏大规模的高质量数据集;另一方面则源于交互运动表示不完整,以及缺乏对交互的细粒度优化。为了解决这些挑战,我们提出了InterDance,一个大规模双人舞数据集,在运动质量、数据规模和舞蹈流派多样性上均有显著提升。基于该数据集,我们提出了一种能够准确而全面地描述交互运动的新运动表示。我们进一步引入了一个带有交互细化指导策略的扩散框架,逐步优化交互的真实感。大量实验证明了我们的数据集和算法的有效性。
摘要:Humans perform a variety of interactive motions, among which duet dance is one of the most challenging interactions. However, in terms of human motion generative models, existing works are still unable to generate high-quality interactive motions, especially in the field of duet dance. On the one hand, it is due to the lack of large-scale high-quality datasets. On the other hand, it arises from the incomplete representation of interactive motion and the lack of fine-grained optimization of interactions. To address these challenges, we propose, InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres. Built upon this dataset, we propose a new motion representation that can accurately and comprehensively describe interactive motion. We further introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively. Extensive experiments demonstrate the effectiveness of our dataset and algorithm.

【5】 AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory  Estimation and Classification
标题: AV-DTEC:用于无人机轨迹估计和分类的自监督视听融合
链接:https://arxiv.org/abs/2412.16928
作者: Zhenyuan Xiao,  Yizhuo Yang,  Guili Xu,  Xianglong Zeng,  Shenghai Yuan
备注:Submitted to ICRA 2025
摘要:紧凑型无人机的使用越来越多,对公共安全造成了重大威胁,而传统的无人机探测系统往往体积庞大且成本高昂。为了应对这些挑战,我们提出了AV-DTEC,一个轻量级的、基于自监督视听融合的反无人机系统。AV-DTEC使用LiDAR生成的标签进行自监督学习训练,并通过并行的选择性状态空间模型同时学习音频和视觉特征。在所学特征之上,特别设计的即插即用主-辅特征增强模块将视觉特征融入音频特征,以在跨光照条件下获得更好的鲁棒性。为了减少对辅助特征的依赖并对齐模态,我们提出了一种自适应调整视觉特征权重的教师-学生模型。AV-DTEC在真实世界的多模态数据上表现出卓越的准确性和有效性。代码和训练好的模型已在GitHub公开:https://github.com/AmazingDay1/AV-DETC。
摘要:The increasing use of compact UAVs has created significant threats to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we propose AV-DTEC, a lightweight self-supervised audio-visual fusion-based anti-UAV system. AV-DTEC is trained using self-supervised learning with labels generated by LiDAR, and it simultaneously learns audio and visual features through a parallel selective state-space model. With the learned features, a specially designed plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features for better robustness in cross-lighting conditions. To reduce reliance on auxiliary features and align modalities, we propose a teacher-student model that adaptively adjusts the weighting of visual features. AV-DTEC demonstrates exceptional accuracy and effectiveness in real-world multi-modality data. The code and trained models are publicly accessible on GitHub  \url{https://github.com/AmazingDay1/AV-DETC}.

【6】 FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG  Distillation
标题: FADA:基于混合监督多CFG蒸馏的快速扩散头像合成
链接:https://arxiv.org/abs/2412.16915
作者: Tianyun Zhong,  Chao Liang,  Jianwen Jiang,  Gaojie Lin,  Jiaqi Yang,  Zhou Zhao
摘要:基于扩散的音频驱动说话头像方法最近因其高保真、生动且富有表现力的结果而受到关注。然而,其缓慢的推理速度限制了实际应用。尽管针对扩散模型已经发展出多种蒸馏技术,我们发现朴素的扩散蒸馏方法并不能产生令人满意的结果:与教师模型相比,蒸馏模型对开放集输入图像的鲁棒性下降,音频与视频之间的相关性减弱,削弱了扩散模型的优势。为了解决这个问题,我们提出了FADA(基于混合监督多CFG蒸馏的快速扩散头像合成)。我们首先设计了混合监督损失,以利用不同质量的数据并增强模型的整体能力和鲁棒性。此外,我们提出了带可学习token的多CFG蒸馏,利用音频与参考图像条件之间的相关性,在可接受的质量损失下消除多CFG带来的三倍推理开销。在多个数据集上的大量实验表明,FADA生成的视频与近期基于扩散模型的方法同样生动,同时实现了4.17-12.5倍的NFE加速。演示可在我们的网页http://fadavatar.github.io上获得。
摘要:Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first designed a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times. Demos are available at our webpage http://fadavatar.github.io.

【7】 Temporal-Frequency State Space Duality: An Efficient Paradigm for Speech  Emotion Recognition
标题: 时频状态空间二元性:语音情感识别的有效范式
链接:https://arxiv.org/abs/2412.16904
作者: Jiaqi Zhao,  Fei Wang,  Kun Li,  Yanyan Wei,  Shengeng Tang,  Shu Zhao,  Xiao Sun
备注:Accepted by ICASSP 2025
摘要:语音情感识别(SER)在增强人机交互中的用户体验方面起着至关重要的作用。然而,现有方法偏重时域分析,忽略了频域中对鲁棒情感识别同样重要的包络结构。为克服这一局限,我们提出了TF-Mamba,一种在时间和频率两个维度上捕获情感表达的新型多域框架。具体而言,我们提出时间-频率Mamba块来提取时间与频率感知的情感特征,在计算效率和模型表达能力之间取得了良好平衡。此外,我们设计了度量距离三元组(Complex Metric-Distance Triplet, CMDT)损失,使模型能够捕捉对SER具有代表性的情感线索。在IEMOCAP和MELD数据集上的大量实验表明,TF-Mamba在模型大小和延迟方面优于现有方法,为未来的SER应用提供了更实用的解决方案。
摘要:Speech Emotion Recognition (SER) plays a critical role in enhancing user experience within human-computer interaction. However, existing methods are overwhelmed by temporal domain analysis, overlooking the valuable envelope structures of the frequency domain that are equally important for robust emotion recognition. To overcome this limitation, we propose TF-Mamba, a novel multi-domain framework that captures emotional expressions in both temporal and frequency dimensions.Concretely, we propose a temporal-frequency mamba block to extract temporal- and frequency-aware emotional features, achieving an optimal balance between computational efficiency and model expressiveness. Besides, we design a Complex Metric-Distance Triplet (CMDT) loss to enable the model to capture representative emotional clues for SER. Extensive experiments on the IEMOCAP and MELD datasets show that TF-Mamba surpasses existing methods in terms of model size and latency, providing a more practical solution for future SER applications.
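作为参考,下面给出标准三元组(triplet margin)损失的最小用法:anchor与positive属于同一情感类别,negative属于不同类别,损失鼓励"拉近同类、推远异类"。论文提出的CMDT损失在度量与距离的定义上与此不同,此处仅示意三元组约束的基本形式,示例数据均为占位。

```python
import torch
import torch.nn as nn

# 标准三元组损失:anchor 与 positive 同情感类别,negative 为不同类别(示意数据)
triplet = nn.TripletMarginLoss(margin=1.0, p=2)

torch.manual_seed(0)
anchor = torch.randn(8, 128)                     # 8 条语音的情感嵌入(占位)
positive = anchor + 0.05 * torch.randn(8, 128)   # 同类样本:与 anchor 接近
negative = torch.randn(8, 128)                   # 异类样本

loss = triplet(anchor, positive, negative)
print(loss.item())
```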

【8】 SoundLoc3D: Invisible 3D Sound Source Localization and Classification  Using a Multimodal RGB-D Acoustic Camera
标题: SoundLoc3D:使用多模态RGB-D声学相机的不可见3D声源定位与分类
链接:https://arxiv.org/abs/2412.16861
作者: Yuhang He,  Sangyun Shin,  Anoop Cherian,  Andrew Markham
备注:Accepted by WACV2025
摘要:准确定位3D声源并估计其语义标签——其中声源可能不可见,但假设位于场景中物体的物理表面上——具有许多实际应用,包括检测气体泄漏和机械故障。这种情形下的视听弱相关性带来了新的挑战:能否以及如何利用跨模态信息来解决该任务,需要新的方法来回答。为此,我们提出使用由针孔RGB-D相机和共面四通道麦克风阵列(Mic-Array)组成的声学相机装置。通过使用该装置从多视角记录视听信号,我们可以利用跨模态线索估计声源的3D位置。具体来说,我们的框架SoundLoc3D将该任务视为集合预测问题,集合中的每个元素对应一个潜在声源。鉴于视听弱相关性,集合表示首先从单视角麦克风阵列信号中学习,然后通过主动结合多视角RGB-D图像所揭示的物理表面线索进行细化。我们在大规模仿真数据集上展示了SoundLoc3D的效率和优越性,并进一步展示了其对RGB-D测量误差和环境噪声干扰的鲁棒性。
摘要:Accurately localizing 3D sound sources and estimating their semantic labels -- where the sources may not be visible, but are assumed to lie on the physical surface of objects in the scene -- have many real applications, including detecting gas leak and machinery malfunction. The audio-visual weak-correlation in such setting poses new challenges in deriving innovative methods to answer if or how we can use cross-modal information to solve the task. Towards this end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array~(Mic-Array). By using this rig to record audio-visual signals from multiviews, we can use the cross-modal cues to estimate the sound sources 3D locations. Specifically, our framework SoundLoc3D treats the task as a set prediction problem, each element in the set corresponds to a potential sound source. Given the audio-visual weak-correlation, the set representation is initially learned from a single view microphone array signal, and then refined by actively incorporating physical surface cues revealed from multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on large-scale simulated dataset, and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.

【9】 Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement
标题: Mamba-SEUNet:用于单通道语音增强的Mamba UNet
链接:https://arxiv.org/abs/2412.16626
作者: Junyu Wang,  Zizhen Lin,  Tianrui Wang,  Meng Ge,  Longbiao Wang,  Jianwu Dang
摘要:在最近的语音增强(SE)研究中,Transformer及其变体已经成为主流方法。然而,自我注意机制的二次复杂性对实际部署施加了一定的限制。Mamba作为一种新型的状态空间模型(SSM),以其较强的长序列建模能力和较低的计算复杂度在自然语言处理和计算机视觉领域得到了广泛的应用。在这项工作中,我们介绍了Mamba-SEUNet,一个创新的架构,将Mamba与U-Net集成为SE任务。通过利用双向Mamba对不同分辨率下语音信号的前向和后向依赖关系进行建模,并结合跳过连接来捕获多尺度信息,我们的方法实现了最先进的(SOTA)性能。在VCTK+DEMAND数据集上的实验结果表明,Mamba-SEUNet的PESQ得分为3.59,同时保持了较低的计算复杂度。当与感知对比度拉伸技术相结合时,Mamba-SEUNet进一步将PESQ评分提高到3.73。
摘要:In recent speech enhancement (SE) research, transformer and its variants have emerged as the predominant methodologies. However, the quadratic complexity of the self-attention mechanism imposes certain limitations on practical deployment. Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision due to its strong capabilities in modeling long sequences and relatively low computational complexity. In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks. By leveraging bidirectional Mamba to model forward and backward dependencies of speech signals at different resolutions, and incorporating skip connections to capture multi-scale information, our approach achieves state-of-the-art (SOTA) performance. Experimental results on the VCTK+DEMAND dataset indicate that Mamba-SEUNet attains a PESQ score of 3.59, while maintaining low computational complexity. When combined with the Perceptual Contrast Stretching technique, Mamba-SEUNet further improves the PESQ score to 3.73.

【10】 Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech  Translation
标题: 提高直接视听语音到语音翻译中的唇形同步性
链接:https://arxiv.org/abs/2412.16530
作者: Lucas Goncalves,  Prashant Mathur,  Xing Niu,  Brady Houston,  Chandrashekhar Lavania,  Srikanth Vishnubhotla,  Lijia Sun,  Anthony Ferritto
备注:Accepted at ICASSP, 4 pages
摘要:视听语音到语音翻译通常优先考虑提高翻译质量和自然度。然而,在视听内容中,一个同样重要的方面是唇形同步——确保嘴唇的动作与所说内容相匹配——这对保持配音视频的真实感至关重要。尽管其很重要,在AVS2S模型中引入唇形同步约束在很大程度上仍被忽视。本研究通过将唇形同步损失整合到AVS2S模型的训练过程中来弥补这一空白。我们提出的方法显著增强了直接视听语音到语音翻译中的唇形同步,取得了10.67的平均LSE-D得分,在四个语言对上相对强基线将LSE-D降低了9.2%。此外,当译出语音叠加到原始视频上时,该方法保持了其自然度和高质量,且不降低翻译质量。
摘要:Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony-ensuring that the movements of the lips match the spoken content-essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
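下面用一个极简示意说明"在AVS2S训练目标中加入唇形同步损失"的一般形式:总损失为翻译损失与唇同步损失的加权和;权重lambda_sync以及两个损失的具体计算方式均为假设,论文中的唇同步度量(如LSE-D)需由预训练的唇同步评估网络给出。

```python
import torch

def total_loss(translation_loss: torch.Tensor,
               lip_sync_loss: torch.Tensor,
               lambda_sync: float = 0.1) -> torch.Tensor:
    """AVS2S 训练目标的示意:翻译损失 + 加权的唇形同步损失(权重为假设值)。"""
    return translation_loss + lambda_sync * lip_sync_loss

if __name__ == "__main__":
    # 占位数值:实际应分别来自翻译模型与唇同步打分网络
    l_trans = torch.tensor(2.31)
    l_sync = torch.tensor(0.87)
    print(total_loss(l_trans, l_sync).item())
```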

【11】 Text2midi: Generating Symbolic Music from Captions
标题: text2midi:从文本描述生成符号音乐
链接:https://arxiv.org/abs/2412.16526
作者: Keshav Bhandari,  Abhinaba Roy,  Kyra Wang,  Geeta Puri,  Simon Colton,  Dorien Herremans
备注:9 pages, 3 figures, Accepted at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)
摘要:本文介绍了text2midi,一个根据文本描述生成MIDI文件的端到端模型。顺应多模态生成方法日益流行的趋势,text2midi利用了文本数据的广泛可得性和大型语言模型(LLM)的成功。我们的端到端系统借助LLM的能力,以MIDI文件的形式生成符号音乐。具体来说,我们利用预训练的LLM编码器处理文本描述,再以其为条件驱动自回归Transformer解码器,生成准确反映所给描述的MIDI序列。这种直观且用户友好的方法允许用户用文本提示生成音乐片段,显著简化了音乐创作过程。我们进行了结合自动评测与人工评测的全面实证评估,结果表明我们的模型能生成高质量的MIDI文件,并且确实可以由文本描述控制,其中可以包含和弦、调性和速度等音乐理论术语。我们在演示页面(https://github.com/AMAAI-Lab/Text2midi)上发布了代码和音乐示例,供用户与text2midi交互。
摘要:This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo. We release the code and music samples on our demo page (https://github.com/AMAAI-Lab/Text2midi) for users to interact with text2midi.
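下面是一个与论文架构同类但高度简化的示意(非官方实现):文本编码器的输出通过交叉注意力为自回归Transformer解码器提供条件,逐步生成离散的MIDI事件token;模型中的词表大小、维度、层数以及用随机嵌入代替预训练LLM编码器等均为占位假设。

```python
import torch
import torch.nn as nn

class TinyCaption2Midi(nn.Module):
    """示意模型:文本条件(此处用简单嵌入代替预训练 LLM 编码器)通过交叉注意力驱动 AR 解码器。"""
    def __init__(self, text_vocab=1000, midi_vocab=512, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.midi_emb = nn.Embedding(midi_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, midi_vocab)

    def forward(self, text_ids, midi_ids):
        memory = self.text_emb(text_ids)                         # 文本条件(占位编码器)
        tgt = self.midi_emb(midi_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(midi_ids.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)         # 自回归解码,交叉注意力到文本
        return self.head(out)                                    # 预测下一个 MIDI 事件 token

if __name__ == "__main__":
    model = TinyCaption2Midi()
    logits = model(torch.randint(0, 1000, (2, 16)), torch.randint(0, 512, (2, 32)))
    print(logits.shape)   # (2, 32, 512)
```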

【12】 Adapting Whisper for Code-Switching through Encoding Refining and  Language-Aware Decoding
标题: 通过编码细化和语言感知解码使Whisper适应语码转换
链接:https://arxiv.org/abs/2412.16507
作者: Jiahui Zhao,  Hao Shi,  Chenrui Cui,  Tianrui Wang,  Hexin Liu,  Zhaoheng Ni,  Lingxuan Ye,  Longbiao Wang
摘要:由于口音、听觉相似性和无缝语言切换所导致的语言混淆,语码转换(CS)自动语音识别面临挑战。对预训练多语言模型进行适配已在CS-ASR上展现出良好性能。在本文中,我们从编码器和解码器两个部分对大规模多语种预训练语音识别模型Whisper进行语码转换适配。首先,我们提出编码器细化模块,以增强编码器对句内语言切换的建模能力。其次,我们提出使用两组带有不同语言提示嵌入的语言感知适配器,在每个解码器层获得语言特定的解码信息,并增加一个融合模块来融合语言感知的解码结果。使用SEAME数据集的实验结果表明,与基线模型相比,所提方法在dev_man和dev_sge测试集上分别实现了4.1%和7.2%的相对MER降低,超过了最先进的方法。通过实验我们发现,该方法显著提升了CS语音中非母语部分的识别性能,表明我们的方法使Whisper能够更好地区分两种语言。
摘要:Code-switching (CS) automatic speech recognition (ASR) faces challenges due to the language confusion resulting from accents, auditory similarity, and seamless language switches. Adaptation on the pre-trained multi-lingual model has shown promising performance for CS-ASR. In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. First, we propose an encoder refiner to enhance the encoder's capacity of intra-sentence swithching. Second, we propose using two sets of language-aware adapters with different language prompt embeddings to achieve language-specific decoding information in each decoder layer. Then, a fusion module is added to fuse the language-aware decoding. The experimental results using the SEAME dataset show that, compared with the baseline model, the proposed approach achieves a relative MER reduction of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Through experiments, we found that the proposed method significantly improves the performance on non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
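下面是一个示意性的"语言感知适配器"模块:瓶颈适配器加上按语言ID选取的可学习语言提示嵌入;维度、结构以及提示与隐状态的结合方式均为假设,仅用于说明思路,并非论文的实现。

```python
import torch
import torch.nn as nn

class LanguageAwareAdapter(nn.Module):
    """瓶颈适配器 + 语言提示嵌入(示意)。按语言 ID 取出对应提示,与隐状态相加后过适配器。"""
    def __init__(self, d_model=768, bottleneck=64, num_langs=2):
        super().__init__()
        self.lang_prompt = nn.Embedding(num_langs, d_model)   # 每种语言一套提示嵌入
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden, lang_id):
        # hidden: (batch, time, d_model);lang_id: (batch,)
        prompt = self.lang_prompt(lang_id).unsqueeze(1)        # (batch, 1, d_model)
        residual = hidden
        h = self.up(self.act(self.down(hidden + prompt)))
        return residual + h                                    # 残差连接保留原始解码信息

if __name__ == "__main__":
    adapter = LanguageAwareAdapter()
    x = torch.randn(2, 10, 768)
    print(adapter(x, torch.tensor([0, 1])).shape)   # (2, 10, 768)
```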

【13】 A Classification Benchmark for Artificial Intelligence Detection of  Laryngeal Cancer from Patient Speech
标题: 人工智能从患者语音检测喉癌的分类基准
链接:https://arxiv.org/abs/2412.16267
作者: Mary Paterson,  James Moor,  Luisa Cutillo
备注:24 pages, 6 figures, 7 tables
摘要:据预测,喉癌病例在未来几年将显著增加。目前的诊断路径导致许多患者被错误地转诊到紧急疑似癌症通道,给患者和医疗系统带来了不必要的压力。人工智能提供了一个很有前景的解决方案,可以从患者语音中无创检测喉癌,这有助于更有效地安排转诊优先级,并减少非癌症患者的不当转诊。要实现这一潜力,开放科学至关重要。该领域的一个主要障碍是缺乏开源数据集和可复现的基准,迫使研究人员从零开始。我们的工作通过引入一个基准套件来应对这一挑战,该套件包含在开源数据集上训练和评估的36个模型。这些模型可在公共仓库中获取,为未来研究提供基础;它们评估了三种不同算法和三种音频特征集,构成一个全面的基准测试框架。我们提出了标准化的指标和评估方法,以确保未来研究结果的一致性和可比性。所提供的模型既包括仅音频输入,也包括引入人口统计和症状数据的多模态输入,使其能够应用于具有不同患者信息的数据集。借助这些基准,未来的研究人员可以评估自己的数据集、改进模型,并以此为基础发展更先进的方法。这项工作旨在为建立可复现的基准提供基线,使研究人员能够将新方法与这些标准进行比较,最终推动用于检测喉癌的人工智能工具的发展。
摘要:Cases of laryngeal cancer are predicted to rise significantly in the coming years. Current diagnostic pathways cause many patients to be incorrectly referred to urgent suspected cancer pathways, putting undue stress on both patients and the medical system.  Artificial intelligence offers a promising solution by enabling non-invasive detection of laryngeal cancer from patient speech, which could help prioritise referrals more effectively and reduce inappropriate referrals of non-cancer patients. To realise this potential, open science is crucial. A major barrier in this field is the lack of open-source datasets and reproducible benchmarks, forcing researchers to start from scratch. Our work addresses this challenge by introducing a benchmark suite comprising 36 models trained and evaluated on open-source datasets. These models are accessible in a public repository, providing a foundation for future research. They evaluate three different algorithms and three audio feature sets, offering a comprehensive benchmarking framework. We propose standardised metrics and evaluation methodologies to ensure consistent and comparable results across future studies.  The presented models include both audio-only inputs and multimodal inputs that incorporate demographic and symptom data, enabling their application to datasets with diverse patient information. By providing these benchmarks, future researchers can evaluate their datasets, refine the models, and use them as a foundation for more advanced approaches. This work aims to provide a baseline for establishing reproducible benchmarks, enabling researchers to compare new methods against these standards and ultimately advancing the development of AI tools for detecting laryngeal cancer.

【14】 Decoding Poultry Vocalizations -- Natural Language Processing and  Transformer Models for Semantic and Emotional Analysis
标题: 家禽发声解码--用于语义和情感分析的自然语言处理和Transformer模型
链接:https://arxiv.org/abs/2412.16182
作者: Venkatraman Manikandan,  Suresh Neethirajan
备注:28 Pages, 14 figures
摘要:破译鸡的声音语言为动物福利和生态信息学提供了新的机会。它们微妙的声音信号编码了健康状况、情绪状态以及生态系统内的动态互动。理解这些叫声的语义,为解读其功能性词汇、阐明每种声音在社会和环境背景下的特定作用提供了有价值的工具。我们应用先进的自然语言处理和基于Transformer的模型,将生物声学数据转化为有意义的洞见。我们的方法将用于原始音频特征提取的Wave2Vec 2.0与经过微调的BERT(Bidirectional Encoder Representations from Transformers)模型相结合,后者在大规模动物声音语料库上预训练并针对家禽任务进行了适配。该流程将家禽发声解码为可解释的类别,包括求救叫声、觅食信号和交配发声,揭示了传统分析经常忽略的情感细微差别。我们的方法在关键发声类型分类上达到了92%的准确率,证明了对鸡群健康和应激进行实时自动监测的可行性。通过跟踪这些功能性词汇,养殖者可以主动应对环境或行为变化,改善家禽福利,减少与应激相关的生产力损失,并支持更可持续的农场管理。在农业之外,这项研究还增进了我们对计算生态学的理解:获取动物叫声的语义基础可以反映生物多样性、环境压力源和物种间相互作用,为生态系统层面的综合决策提供依据。
摘要:Deciphering the acoustic language of chickens offers new opportunities in animal welfare and ecological informatics. Their subtle vocal signals encode health conditions, emotional states, and dynamic interactions within ecosystems. Understanding the semantics of these calls provides a valuable tool for interpreting their functional vocabulary and clarifying how each sound serves a specific purpose in social and environmental contexts. We apply advanced Natural Language Processing and transformer based models to translate bioacoustic data into meaningful insights. Our method integrates Wave2Vec 2.0 for raw audio feature extraction with a fine tuned Bidirectional Encoder Representations from Transformers model, pretrained on a broad corpus of animal sounds and adapted to poultry tasks. This pipeline decodes poultry vocalizations into interpretable categories including distress calls, feeding signals, and mating vocalizations, revealing emotional nuances often overlooked by conventional analyses. Achieving 92 percent accuracy in classifying key vocalization types, our approach demonstrates the feasibility of real time automated monitoring of flock health and stress. By tracking this functional vocabulary, farmers can respond proactively to environmental or behavioral changes, improving poultry welfare, reducing stress related productivity losses, and supporting more sustainable farm management. Beyond agriculture, this research enhances our understanding of computational ecology. Accessing the semantic foundation of animal calls may indicate biodiversity, environmental stressors, and species interactions, informing integrative ecosystem level decision making.

【15】 Efficient VoIP Communications through LLM-based Real-Time Speech  Reconstruction and Call Prioritization for Emergency Services
标题: 通过基于LLM的实时语音重建和紧急服务呼叫优先级实现高效的IP电话通信
链接:https://arxiv.org/abs/2412.16176
作者: Danush Venkateshperumal,  Rahman Abdul Rafi,  Shakil Ahmed,  Ashfaq Khokhar
备注:15 pages,8 figures
摘要:紧急通信系统面临由VoIP系统中的丢包、带宽限制、信号质量差、时延和抖动导致的中断,造成实时服务质量下降。处于险境的求助者往往因恐慌、言语障碍和背景噪声而难以传达关键信息,进一步增加了调度员准确评估情况的难度;应急中心的人员短缺又加剧了协调和救援的延误。本文提出利用大型语言模型(LLM)来应对这些挑战:重建不完整的语音、填补上下文空白,并根据严重程度对呼叫进行优先级排序。该系统将实时转录与检索增强生成(RAG)相结合以生成上下文响应,并使用Twilio和AssemblyAI API实现无缝集成。评估结果显示,该模型具有高精度、良好的BLEU和ROUGE得分,并与现实需求保持一致,表明其在优化应急响应工作流程和有效优先处理危重案例方面的潜力。
摘要:Emergency communication systems face disruptions due to packet loss, bandwidth constraints, poor signal quality, delays, and jitter in VoIP systems, leading to degraded real-time service quality. Victims in distress often struggle to convey critical information due to panic, speech disorders, and background noise, further complicating dispatchers' ability to assess situations accurately. Staffing shortages in emergency centers exacerbate delays in coordination and assistance. This paper proposes leveraging Large Language Models (LLMs) to address these challenges by reconstructing incomplete speech, filling contextual gaps, and prioritizing calls based on severity. The system integrates real-time transcription with Retrieval-Augmented Generation (RAG) to generate contextual responses, using Twilio and AssemblyAI APIs for seamless implementation. Evaluation shows high precision, favorable BLEU and ROUGE scores, and alignment with real-world needs, demonstrating the model's potential to optimize emergency response workflows and prioritize critical cases effectively.

【16】 Investigating Prosodic Signatures via Speech Pre-Trained Models for  Audio Deepfake Source Attribution
标题: 基于语音预训练模型的韵律特征探究用于音频Deepfake来源归因
链接:https://arxiv.org/abs/2412.17796
作者: Orchid Chetia Phukan,  Drishti Singh,  Swarup Ranjan Behera,  Arun Balaji Buduru,  Rajesh Sharma
摘要:在这项工作中,我们研究了各种最先进的(SOTA)语音预训练模型(PTM)捕获生成源韵律特征的能力,以用于音频深度伪造来源归因(ADSD)。这些韵律特征可被视为ADSD的主要线索之一,且对每个来源都是独特的;因此,PTM捕获韵律特征的能力越强,ADSD性能就越好。我们选取了在不同韵律任务中表现最佳的多种SOTA PTM,在基准数据集ASVSpoof 2019和CFAD上进行实验。在所有被考察的PTM中,x-vector(说话人识别PTM)尽管模型参数量最少,却取得了最高性能。这种更高的性能可能得益于其说话人识别预训练,使其能够更好地捕获各来源独特的韵律特征。此外,受音频deepfake检测和语音识别等任务中融合PTM表示可提升性能的启发,我们进行了同样的探索,并提出FINDER来有效融合此类表示。通过FINDER融合Whisper和x-vector表示,我们取得了优于所有单个PTM及基线融合技术的最佳结果,达到SOTA性能。
摘要:In this work, we investigate various state-of-the-art (SOTA) speech pre-trained models (PTMs) for their capability to capture prosodic signatures of the generative sources for audio deepfake source attribution (ADSD). These prosodic characteristics can be considered one of major signatures for ADSD, which is unique to each source. So better is the PTM at capturing prosodic signs better the ADSD performance. We consider various SOTA PTMs that have shown top performance in different prosodic tasks for our experiments on benchmark datasets, ASVSpoof 2019 and CFAD. x-vector (speaker recognition PTM) attains the highest performance in comparison to all the PTMs considered despite consisting lowest model parameters. This higher performance can be due to its speaker recognition pre-training that enables it for capturing unique prosodic characteristics of the sources in a better way. Further, motivated from tasks such as audio deepfake detection and speech recognition, where fusion of PTMs representations lead to improved performance, we explore the same and propose FINDER for effective fusion of such representations. With fusion of Whisper and x-vector representations through FINDER, we achieved the topmost performance in comparison to all the individual PTMs as well as baseline fusion techniques and attaining SOTA performance.

【17】 Analysis of Speech Temporal Dynamics in the Context of Speaker  Verification and Voice Anonymization
标题: 说话人验证和语音匿名化背景下的语音时间动态分析
链接:https://arxiv.org/abs/2412.17164
作者: Natalia Tomashenko,  Emmanuel Vincent,  Marc Tommasi
备注:Accepted at ICASSP 2025
摘要:在本文中,我们研究语音时间动态特性在自动说话人验证和说话人语音匿名化任务中的影响。我们提出了几种仅基于音素时长进行自动说话人验证的度量。实验结果表明,音素时长会泄露部分说话人信息,并且无论对原始语音还是匿名化语音都能据此揭示说话人身份。因此,这项工作强调了考虑说话人语速、尤其是说话人音素时长特性的重要性,以及为开发具有强隐私保护能力的匿名化系统而对其加以修改的必要性。
摘要:In this paper, we investigate the impact of speech temporal dynamics in application to automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak some speaker information and can reveal speaker identity from both original and anonymized speech. Thus, this work emphasizes the importance of taking into account the speaker's speech rate and, more importantly, the speaker's phonetic duration characteristics, as well as the need to modify them in order to develop anonymization systems with strong privacy protection capacity.
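下面是一个极简示意,说明"仅用音素时长做说话人验证"这类度量的一种可能形式:为每位说话人统计各音素的平均时长向量,再用余弦相似度打分;音素集合、统计方式与打分函数均为假设,论文中的具体度量以原文为准。

```python
import numpy as np

PHONEMES = ["a", "e", "i", "o", "u", "s", "t", "n"]   # 简化的音素集合(假设)

def duration_profile(segments: list[tuple[str, float]]) -> np.ndarray:
    """把 (音素, 时长秒) 序列汇总成按音素排列的平均时长向量。"""
    sums = {p: [] for p in PHONEMES}
    for ph, dur in segments:
        if ph in sums:
            sums[ph].append(dur)
    return np.array([np.mean(sums[p]) if sums[p] else 0.0 for p in PHONEMES])

def verify_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """余弦相似度作为验证打分(示意)。"""
    return float(np.dot(enroll, test) / (np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-8))

if __name__ == "__main__":
    spk1 = duration_profile([("a", 0.09), ("s", 0.12), ("t", 0.05), ("a", 0.10)])
    spk2 = duration_profile([("a", 0.06), ("s", 0.08), ("t", 0.04), ("e", 0.07)])
    print(f"score = {verify_score(spk1, spk2):.3f}")
```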

【18】 Uncovering the Visual Contribution in Audio-Visual Speech Recognition
标题: 揭示视听语音识别中的视觉贡献
链接:https://arxiv.org/abs/2412.17129
作者: Zhaofeng Lin,  Naomi Harte
备注:5 pages, 2 figures. Accepted to ICASSP 2025
摘要:视听语音识别(AVSR)结合听觉和视觉语音线索,以提高语音识别系统的准确性和鲁棒性。与纯音频系统相比,AVSR的最新进展提升了在嘈杂环境中的性能。然而,视觉贡献的真实程度,以及AVSR系统是否充分利用了视觉域中的可用线索,仍不清楚。本文从人类语音感知的角度,以不同的视角评估AVSR系统。我们使用三个系统:Auto-AVSR、AVEC和AV-RelScore。我们首先使用0 dB下的有效SNR增益量化视觉贡献,然后从时间分布和词级信息量两方面考察视觉信息的利用情况。我们表明,低WER并不能保证高SNR增益。结果表明,目前的方法并未充分利用视觉信息,我们建议未来的研究在报告WER的同时报告有效SNR增益。
摘要:Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to audio-only counterparts. However, the true extent of the visual contribution, and whether AVSR systems fully exploit the available cues in the visual domain, remains unclear. This paper assesses AVSR systems from a different perspective, by considering human speech perception. We use three systems: Auto-AVSR, AVEC and AV-RelScore. We first quantify the visual contribution using effective SNR gains at 0 dB and then investigate the use of visual information in terms of its temporal distribution and word-level informativeness. We show that low WER does not guarantee high SNR gains. Our results suggest that current methods do not fully exploit visual information, and we recommend future research to report effective SNR gains alongside WERs.

【19】 Scalable Speech Enhancement with Dynamic Channel Pruning
标题: 利用动态通道修剪的可扩展语音增强
链接:https://arxiv.org/abs/2412.17121
作者: Riccardo Miccini,  Clement Laroche,  Tobias Piechowiak,  Luca Pezzarossa
备注:Accepted for publication at the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
摘要:语音增强(SE)对于提高远程协作环境中的工作效率至关重要。虽然深度学习模型在SE上非常有效,但其计算需求使其难以在嵌入式系统上实用。此外,声学条件的难度可能显著变化,而神经网络执行的计算量通常是固定不变的。为此,我们首次将动态通道剪枝引入音频领域,并将其应用于一个为SE定制的卷积架构。我们的方法在运行时识别不必要的卷积通道,通过不计算这些通道的激活、也不读取其滤波器来节省计算资源。当训练为只使用25%的通道时,我们节省了29.6%的MAC,而PESQ仅下降0.75%。因此,DynCP为在资源受限的设备上部署更大、更强的SE方案提供了一条有前景的路径。
摘要:Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by not computing the activations for these channels and retrieving their filters. When trained to only use 25% of channels, we save 29.6% of MACs while only causing a 0.75% drop in PESQ. Thus, DynCP offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.
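下面是运行时通道门控的一个最小示意:由一个轻量打分器决定哪些卷积通道"开/关",被关闭的通道不参与后续计算;门控结构、阈值与训练方式均为假设,并非论文中DynCP的原始实现(实际训练通常需要对硬阈值做可微松弛)。

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """运行时动态通道门控(示意):根据输入统计量预测每个输出通道的开关。"""
    def __init__(self, in_ch=16, out_ch=32):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Linear(in_ch, out_ch)   # 轻量打分器(假设:基于全局平均池化)

    def forward(self, x):
        # x: (batch, in_ch, time)
        scores = self.gate(x.mean(dim=-1))                    # (batch, out_ch)
        keep = (torch.sigmoid(scores) > 0.5).float()          # 运行时决定保留哪些通道(硬阈值,推理用)
        y = self.conv(x)                                      # 演示中仍计算全部通道;
        return y * keep.unsqueeze(-1)                         # 实际部署可直接跳过 keep=0 的通道计算

if __name__ == "__main__":
    block = GatedConvBlock()
    out = block(torch.randn(2, 16, 100))
    active = (out.abs().sum(dim=(0, 2)) > 0).float().mean().item()
    print(out.shape, "活跃通道比例:", active)
```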

【20】 Why Do Speech Language Models Fail to Generate Semantically Coherent  Outputs? A Modality Evolving Perspective
标题: 为什么语音语言模型无法生成语义连贯的输出?一个模态演变的视角
链接:https://arxiv.org/abs/2412.17048
作者: Hankun Wang,  Haoran Wang,  Yiwei Guo,  Zhihan Li,  Chenpeng Du,  Xie Chen,  Kai Yu
摘要:虽然基于文本的大型语言模型展现出人类水平的写作能力和卓越的智能,语音语言模型(SLM)仍然难以生成语义连贯的输出。这种性能下降有几个潜在原因:(A)语音token主要提供语音学信息而非语义信息;(B)语音序列的长度远长于文本序列;(C)韵律等副语言信息引入了额外的复杂性和可变性。本文通过以渐进的方式将模态从文本过渡到语音,分别探讨这三个关键因素的影响。我们的研究结果表明,三个因素的影响各不相同:因素A的影响相对较小,因素B对句法和语义建模的影响更为明显,而因素C的影响最为显著,尤其是在基本的词汇建模方面。基于这些发现,我们阐述了训练SLM所面临的独特挑战,并指出了开发更有效的端到端SLM的路径。
摘要:Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much longer than that of text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of three key factors separately by transiting the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in the basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.

【21】 Autoregressive Speech Synthesis with Next-Distribution Prediction
标题: 具有下一分布预测的自回归语音合成
链接:https://arxiv.org/abs/2412.16846
作者: Xinfa Zhu,  Wenjie Tian,  Lei Xie
备注:Technical report, work in progress
摘要:我们介绍了KALL-E,一种用于文本到语音(TTS)合成、采用下一分布预测的新型自回归(AR)语言建模方法。与现有方法不同,KALL-E直接建模并预测以文本为条件的连续语音分布,而不依赖基于VAE或扩散的组件。具体来说,我们使用WaveVAE从波形中提取连续语音分布,而不是使用离散语音token。单个AR语言模型从文本预测这些连续语音分布,并以Kullback-Leibler散度损失作为约束。实验结果表明,在zero-shot TTS场景下,KALL-E在自然度和说话人相似度方面优于YourTTS、VALL-E、NaturalSpeech 2和CosyVoice的开源实现。此外,KALL-E在情感和口音克隆方面展示了出色的zero-shot能力。重要的是,KALL-E为在TTS中使用连续语音表示提供了一个更直接有效的范式。音频示例见:https://zxf-icpc.github.io/kalle/。
摘要:We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS. Audio samples are available at: \url{https://zxf-icpc.github.io/kalle/}.
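下面用对角高斯之间的闭式KL散度作一个示意,说明"以KL散度为约束预测连续语音分布"这一形式:预测分布(mu_q, logvar_q)向目标分布(mu_p, logvar_p)对齐;分布的参数化方式与张量形状均为假设,并非论文中KALL-E/WaveVAE的具体实现。

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """对角高斯 KL(q || p) 的闭式解(示意);对特征维求和,对批次与帧求均值。"""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # 占位数据:预测分布 q 与目标分布 p,形状 (batch, frames, dim)
    mu_q, logvar_q = torch.randn(2, 50, 64), torch.zeros(2, 50, 64)
    mu_p, logvar_p = torch.randn(2, 50, 64), torch.zeros(2, 50, 64)
    print(gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).item())
```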

eess.AS音频处理

【1】 Investigating Prosodic Signatures via Speech Pre-Trained Models for  Audio Deepfake Source Attribution
标题: 基于语音预训练模型的韵律特征探究用于音频Deepfake来源归因
链接:https://arxiv.org/abs/2412.17796
作者: Orchid Chetia Phukan,  Drishti Singh,  Swarup Ranjan Behera,  Arun Balaji Buduru,  Rajesh Sharma
摘要:在这项工作中,我们研究了各种最先进的(SOTA)语音预训练模型(PTM)捕获生成源韵律特征的能力,以用于音频深度伪造来源归因(ADSD)。这些韵律特征可被视为ADSD的主要线索之一,且对每个来源都是独特的;因此,PTM捕获韵律特征的能力越强,ADSD性能就越好。我们选取了在不同韵律任务中表现最佳的多种SOTA PTM,在基准数据集ASVSpoof 2019和CFAD上进行实验。在所有被考察的PTM中,x-vector(说话人识别PTM)尽管模型参数量最少,却取得了最高性能。这种更高的性能可能得益于其说话人识别预训练,使其能够更好地捕获各来源独特的韵律特征。此外,受音频deepfake检测和语音识别等任务中融合PTM表示可提升性能的启发,我们进行了同样的探索,并提出FINDER来有效融合此类表示。通过FINDER融合Whisper和x-vector表示,我们取得了优于所有单个PTM及基线融合技术的最佳结果,达到SOTA性能。
摘要:In this work, we investigate various state-of-the-art (SOTA) speech pre-trained models (PTMs) for their capability to capture prosodic signatures of the generative sources for audio deepfake source attribution (ADSD). These prosodic characteristics can be considered one of major signatures for ADSD, which is unique to each source. So better is the PTM at capturing prosodic signs better the ADSD performance. We consider various SOTA PTMs that have shown top performance in different prosodic tasks for our experiments on benchmark datasets, ASVSpoof 2019 and CFAD. x-vector (speaker recognition PTM) attains the highest performance in comparison to all the PTMs considered despite consisting lowest model parameters. This higher performance can be due to its speaker recognition pre-training that enables it for capturing unique prosodic characteristics of the sources in a better way. Further, motivated from tasks such as audio deepfake detection and speech recognition, where fusion of PTMs representations lead to improved performance, we explore the same and propose FINDER for effective fusion of such representations. With fusion of Whisper and x-vector representations through FINDER, we achieved the topmost performance in comparison to all the individual PTMs as well as baseline fusion techniques and attaining SOTA performance.

【2】 An Investigation on the Potential of KAN in Speech Enhancement
标题: KAN在语音增强中的潜力研究
链接:https://arxiv.org/abs/2412.17778
作者: Haoyang Li,  Yuchen Hu,  Chen Chen,  Eng Siong Chng
备注:5 pages, 2 figure, 4 tables
摘要:高保真语音增强通常需要复杂的建模来捕获精细的多尺度模式。标准激活函数虽然引入了非线性,却缺乏充分应对这种复杂性的灵活性。Kolmogorov-Arnold网络(KAN)是一种在图的边上使用可学习激活函数的新兴方法,提供了一个有前景的替代方案。本文研究了两种分别基于有理函数和径向基函数的新型KAN变体在语音增强中的应用。我们将有理变体集成到Demucs的1D CNN块和MP-SENet的GRU-Transformer块中,而将径向基变体用于MP-SENet基于2D CNN的解码器。在VoiceBank-DEMAND数据集上的实验表明,用基于KAN的激活替换标准激活可以同时提升时域和时频域方法的语音质量,且对模型大小和FLOP的影响极小,凸显了KAN改进语音增强模型的潜力。
摘要:High-fidelity speech enhancement often requires sophisticated modeling to capture intricate, multiscale patterns. Standard activation functions, while introducing nonlinearity, lack the flexibility to fully address this complexity. Kolmogorov-Arnold Networks (KAN), an emerging methodology that employs learnable activation functions on graph edges, present a promising alternative. This work investigates two novel KAN variants based on rational and radial basis functions for speech enhancement. We integrate the rational variant into the 1D CNN blocks of Demucs and the GRU-Transformer blocks of MP-SENet, while the radial variant is adapted to the 2D CNN-based decoders of MP-SENet. Experiments on the VoiceBank-DEMAND dataset show that replacing standard activations with KAN-based activations improves speech quality across both the time-domain and time-frequency domain methods with minimal impact on model size and FLOP, underscoring KAN's potential to improve speech enhancement models.
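下面是一个径向基函数(RBF)形式可学习激活的极简示意,展示KAN类方法"用一组可学习系数加权的基函数构成激活"的基本构件;网格点数量、带宽、取值范围等均为假设,并非论文中变体的原始实现。

```python
import torch
import torch.nn as nn

class RBFActivation(nn.Module):
    """可学习的 RBF 激活(示意):phi(x) = sum_k w_k * exp(-((x - c_k)/h)^2)。"""
    def __init__(self, num_centers=8, x_range=(-2.0, 2.0)):
        super().__init__()
        centers = torch.linspace(*x_range, num_centers)
        self.register_buffer("centers", centers)
        self.h = (x_range[1] - x_range[0]) / (num_centers - 1)     # 固定带宽(假设)
        self.weights = nn.Parameter(torch.randn(num_centers) * 0.1)  # 可学习的基函数系数

    def forward(self, x):
        # x: 任意形状;在最后补一维与各中心做差,再按可学习系数加权求和
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.h) ** 2)
        return (basis * self.weights).sum(dim=-1)

if __name__ == "__main__":
    act = RBFActivation()
    y = act(torch.randn(4, 16, 100))   # 可直接替换卷积块中的标准激活
    print(y.shape)   # (4, 16, 100)
```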

【3】 UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic  Speech Recognition
标题: UME:升级改造为混合专家以实现可扩展且高效的自动语音识别
链接:https://arxiv.org/abs/2412.17507
作者: Li Fu,  Shanyong Yu,  Siqi Li,  Lu Fan,  Youzheng Wu,  Xiaodong He
备注:ICASSP 2025
摘要:模型规模扩展方面的最新进展显著提高了自动语音识别(ASR)任务的性能。然而,从零开始训练大型ASR模型仍然代价高昂。为了解决这个问题,我们提出了UME,一种将预训练的稠密ASR检查点高效升级改造为更大的混合专家(MoE)架构的新方法。首先,前馈网络被转换为MoE层;通过复用预训练权重,我们为扩展后的模型建立了稳健的基础,显著减少了优化时间。然后,采用层冻结和专家均衡策略继续训练模型,进一步提升性能。在共计17万小时的中英文数据集上的实验表明,UME:1)在保持相当延迟的同时,相对错误率比预训练基线降低11.9%;2)与从零开始训练相同规模的模型相比,训练时间最多减少86.7%,并取得更高的准确率。
摘要:Recent advancements in scaling up models have significantly improved performance in Automatic Speech Recognition (ASR) tasks. However, training large ASR models from scratch remains costly. To address this issue, we introduce UME, a novel method that efficiently Upcycles pretrained dense ASR checkpoints into larger Mixture-of-Experts (MoE) architectures. Initially, feed-forward networks are converted into MoE layers. By reusing the pretrained weights, we establish a robust foundation for the expanded model, significantly reducing optimization time. Then, layer freezing and expert balancing strategies are employed to continue training the model, further enhancing performance. Experiments on a mixture of 170k-hour Mandarin and English datasets show that UME: 1) surpasses the pretrained baseline by a margin of 11.9% relative error rate reduction while maintaining comparable latency; 2) reduces training time by up to 86.7% and achieves superior accuracy compared to training models of the same size from scratch.
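下面是"把预训练稠密FFN升级改造为MoE层"这一思路的极简示意:将同一份预训练FFN权重复制为N个专家的初始值,并新增一个随机初始化的路由器;路由策略、负载均衡与稀疏加速等细节均为假设,并非UME的原始实现。

```python
import copy
import torch
import torch.nn as nn

def upcycle_ffn_to_moe(ffn: nn.Sequential, num_experts: int, d_model: int):
    """用预训练 FFN 权重初始化每个专家,并新建一个路由器(示意)。"""
    experts = nn.ModuleList(copy.deepcopy(ffn) for _ in range(num_experts))
    router = nn.Linear(d_model, num_experts)
    return experts, router

def moe_forward(x, experts, router, top_k=1):
    """Top-k 路由的 MoE 前向(示意,为清晰起见未做稀疏加速)。"""
    gates = torch.softmax(router(x), dim=-1)                   # (batch, time, E)
    topv, topi = gates.topk(top_k, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = (topi == e).any(dim=-1, keepdim=True).float()   # 该位置是否路由到专家 e
        weight = gates[..., e:e + 1] * mask
        out = out + weight * expert(x)
    return out

if __name__ == "__main__":
    d = 64
    pretrained_ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    experts, router = upcycle_ffn_to_moe(pretrained_ffn, num_experts=4, d_model=d)
    y = moe_forward(torch.randn(2, 10, d), experts, router)
    print(y.shape)   # (2, 10, 64)
```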

【4】 Domain-Incremental Learning for Audio Classification
标题: 音频分类领域增量学习
链接:https://arxiv.org/abs/2412.17424
作者: Manjunath Mulimani,  Annamaria Mesaros
备注:Accepted to ICASSP 2025
摘要:在这项工作中,我们提出了一种从一系列在不同声学条件下录制的数据集中进行音频分类域增量学习的方法。在不断演化的域或数据集序列上微调模型会导致遗忘先前学到的知识;另一方面,冻结模型的所有层则会使模型无法适应新域。本工作中,我们新颖的动态网络架构保留各域共享的同质声学特性,并在增量步骤中学习域特定的声学特性。我们的方法在保留已学域知识与获取新域知识之间取得了良好的平衡。我们在欧洲城市与韩国声学场景的单标签分类增量学习,以及Audioset与FSD50K数据集音频录音的多标签分类上验证了所提方法的有效性。该方法在增量学习声学场景分类时,"欧洲城市->韩国"顺序下的平均准确率为71.9%,"韩国->欧洲城市"为83.4%;在多标签音频分类设置中,Audioset->FSD50K的平均lwlrap为47.5%,FSD50K->Audioset为40.7%。
摘要:In this work, we propose a method for domain-incremental learning for audio classification from a sequence of datasets recorded in different acoustic conditions. Fine-tuning a model on a sequence of evolving domains or datasets leads to forgetting of previously learned knowledge. On the other hand, freezing all the layers of the model leads to the model not adapting to the new domain. In this work, our novel dynamic network architecture keeps the shared homogeneous acoustic characteristics of domains, and learns the domain-specific acoustic characteristics in incremental steps. Our approach achieves a good balance between retaining the knowledge of previously learned domains and acquiring the knowledge of the new domain. We demonstrate the effectiveness of the proposed method on incremental learning of single-label classification of acoustic scenes from European cities and Korea, and multi-label classification of audio recordings from Audioset and FSD50K datasets. The proposed approach learns to classify acoustic scenes incrementally with an average accuracy of 71.9% for the order: European cities -> Korea, and 83.4% for Korea -> European cities. In a multi-label audio classification setup, it achieves an average lwlrap of 47.5% for Audioset -> FSD50K and 40.7% for FSD50K -> Audioset.
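下面是"共享主干 + 按域增量添加域特定参数"这一思路的极简示意:新域到来时冻结共享层,仅新增并训练该域专属的层(此处以域特定分类头为例);哪些参数共享、哪些域特定以及具体网络结构均为假设,并非论文中的动态架构。

```python
import torch
import torch.nn as nn

class DomainIncrementalClassifier(nn.Module):
    """共享特征提取器 + 每个域独立的分类头(示意)。"""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        self.domain_heads = nn.ModuleDict()
        self.num_classes = num_classes

    def add_domain(self, name: str):
        """新域到来:冻结共享层,仅训练新加的域特定头(首个域通常仍需训练共享层)。"""
        for p in self.shared.parameters():
            p.requires_grad = False
        self.domain_heads[name] = nn.Linear(self.shared[0].out_features, self.num_classes)

    def forward(self, x, domain: str):
        return self.domain_heads[domain](self.shared(x))

if __name__ == "__main__":
    model = DomainIncrementalClassifier()
    model.add_domain("european_cities")
    model.add_domain("korea")
    print(model(torch.randn(4, 128), "korea").shape)   # (4, 10)
```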

【5】 Analysis of Speech Temporal Dynamics in the Context of Speaker  Verification and Voice Anonymization
标题: 说话人验证和语音匿名化背景下的语音时间动态分析
链接:https://arxiv.org/abs/2412.17164
作者: Natalia Tomashenko,  Emmanuel Vincent,  Marc Tommasi
备注:Accepted at ICASSP 2025
摘要:在本文中,我们研究语音时间动态特性在自动说话人验证和说话人语音匿名化任务中的影响。我们提出了几种仅基于音素时长进行自动说话人验证的度量。实验结果表明,音素时长会泄露部分说话人信息,并且无论对原始语音还是匿名化语音都能据此揭示说话人身份。因此,这项工作强调了考虑说话人语速、尤其是说话人音素时长特性的重要性,以及为开发具有强隐私保护能力的匿名化系统而对其加以修改的必要性。
摘要:In this paper, we investigate the impact of speech temporal dynamics in application to automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak some speaker information and can reveal speaker identity from both original and anonymized speech. Thus, this work emphasizes the importance of taking into account the speaker's speech rate and, more importantly, the speaker's phonetic duration characteristics, as well as the need to modify them in order to develop anonymization systems with strong privacy protection capacity.

【6】 Tandem spoofing-robust automatic speaker verification based on  time-domain embeddings
标题: 基于时域嵌入的串联抗欺骗自动说话人验证
链接:https://arxiv.org/abs/2412.17133
作者: Avishai Weizman,  Yehuda Ben-Shimol,  Itshak Lapidot
备注:11 pages, 8 figures
摘要:抗欺骗自动说话人验证(SASV)系统是防范欺骗语音的关键技术。在这项研究中,我们聚焦逻辑访问攻击,并为SASV任务引入一种新方法:基于时域波形幅度的概率质量函数(PMF),提出一种真实语音和欺骗语音的新表示。该方法从训练集中选定分组的PMF导出新的时间嵌入。本文强调了按性别划分的作用及其对性能的积极影响。我们提出了一种对抗措施(CM)系统,它采用由欺骗语音和真实语音的PMF导出的时域嵌入,并基于男性和女性的时间嵌入进行性别识别。该方法具有显著的性别识别能力,男性和女性的错配率分别为0.94%和1.79%;男性和女性CM系统的等错误率(EER)分别为8.67%和10.12%。通过将该方法与传统说话人验证系统相结合,我们在ASVspoof2019挑战数据库上展示了更好的泛化能力和串联检测代价函数评估结果。此外,我们研究了将时间嵌入方法与传统CM融合的影响,并说明这种融合如何增强SASV架构的泛化能力。
摘要:Spoofing-robust automatic speaker verification (SASV) systems are a crucial technology for the protection against spoofed speech. In this study, we focus on logical access attacks and introduce a novel approach to SASV tasks. A novel representation of genuine and spoofed speech is employed, based on the probability mass function (PMF) of waveform amplitudes in the time domain. This methodology generates novel time embeddings derived from the PMF of selected groups within the training set. This paper highlights the role of gender segregation and its positive impact on performance. We propose a countermeasure (CM) system that employs time-domain embeddings derived from the PMF of spoofed and genuine speech, as well as gender recognition based on male and female time-based embeddings. The method exhibits notable gender recognition capabilities, with mismatch rates of 0.94% and 1.79% for males and females, respectively. The male and female CM systems achieve an equal error rate (EER) of 8.67% and 10.12%, respectively. By integrating this approach with traditional speaker verification systems, we demonstrate improved generalization ability and tandem detection cost function evaluation using the ASVspoof2019 challenge database. Furthermore, we investigate the impact of fusing the time embedding approach with traditional CM and illustrate how this fusion enhances generalization in SASV architectures.
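下面是"时域波形幅度的概率质量函数(PMF)"这一表示的最小示意:把幅度量化到固定区间并归一化为概率,得到的向量即可作为时间嵌入;区间数量、量化范围与后续如何分组统计均为假设。

```python
import numpy as np

def amplitude_pmf(waveform: np.ndarray, num_bins: int = 64) -> np.ndarray:
    """把波形幅度量化为 num_bins 个区间并统计归一化直方图,得到 PMF 时间嵌入(示意)。"""
    hist, _ = np.histogram(waveform, bins=num_bins, range=(-1.0, 1.0))
    return hist / max(hist.sum(), 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = np.clip(0.3 * rng.standard_normal(16000), -1.0, 1.0)   # 占位的 1 秒波形
    pmf = amplitude_pmf(speech)
    print(pmf.shape, pmf.sum())   # (64,) 1.0
```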

【7】 Uncovering the Visual Contribution in Audio-Visual Speech Recognition
标题: 揭示视听语音识别中的视觉贡献
链接:https://arxiv.org/abs/2412.17129
作者: Zhaofeng Lin,  Naomi Harte
备注:5 pages, 2 figures. Accepted to ICASSP 2025
摘要:视听语音识别(AVSR)结合听觉和视觉语音线索,以提高语音识别系统的准确性和鲁棒性。与纯音频系统相比,AVSR的最新进展提升了在嘈杂环境中的性能。然而,视觉贡献的真实程度,以及AVSR系统是否充分利用了视觉域中的可用线索,仍不清楚。本文从人类语音感知的角度,以不同的视角评估AVSR系统。我们使用三个系统:Auto-AVSR、AVEC和AV-RelScore。我们首先使用0 dB下的有效SNR增益量化视觉贡献,然后从时间分布和词级信息量两方面考察视觉信息的利用情况。我们表明,低WER并不能保证高SNR增益。结果表明,目前的方法并未充分利用视觉信息,我们建议未来的研究在报告WER的同时报告有效SNR增益。
摘要:Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to audio-only counterparts. However, the true extent of the visual contribution, and whether AVSR systems fully exploit the available cues in the visual domain, remains unclear. This paper assesses AVSR systems from a different perspective, by considering human speech perception. We use three systems: Auto-AVSR, AVEC and AV-RelScore. We first quantify the visual contribution using effective SNR gains at 0 dB and then investigate the use of visual information in terms of its temporal distribution and word-level informativeness. We show that low WER does not guarantee high SNR gains. Our results suggest that current methods do not fully exploit visual information, and we recommend future research to report effective SNR gains alongside WERs.

【8】 Scalable Speech Enhancement with Dynamic Channel Pruning
标题: 利用动态通道修剪的可扩展语音增强
链接:https://arxiv.org/abs/2412.17121
作者: Riccardo Miccini,  Clement Laroche,  Tobias Piechowiak,  Luca Pezzarossa
备注:Accepted for publication at the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
摘要:语音增强(SE)对于提高远程协作环境中的生产力至关重要。虽然深度学习模型在SE上非常有效,但其计算需求使其难以在嵌入式系统上实用。此外,声学条件的难度可能显著变化,而神经网络执行的计算量通常是固定不变的。为此,我们首次将动态通道剪枝引入音频领域,并将其应用于一个为SE定制的卷积架构。我们的方法在运行时识别不必要的卷积通道,通过不计算这些通道的激活、也不读取其滤波器来节省计算资源。当训练为只使用25%的通道时,我们节省了29.6%的MAC,而PESQ仅下降0.75%。因此,DynCP为在资源受限的设备上部署更大、更强的SE方案提供了一条有前景的路径。
摘要:Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by not computing the activations for these channels and retrieving their filters. When trained to only use 25% of channels, we save 29.6% of MACs while only causing a 0.75% drop in PESQ. Thus, DynCP offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.

【9】 Why Do Speech Language Models Fail to Generate Semantically Coherent  Outputs? A Modality Evolving Perspective
标题: 为什么语音语言模型无法生成语义连贯的输出?一个模态演变的视角
链接:https://arxiv.org/abs/2412.17048
作者: Hankun Wang,  Haoran Wang,  Yiwei Guo,  Zhihan Li,  Chenpeng Du,  Xie Chen,  Kai Yu
摘要:虽然基于文本的大型语言模型展现出人类水平的写作能力和卓越的智能,语音语言模型(SLM)仍然难以生成语义连贯的输出。这种性能下降有几个潜在原因:(A)语音token主要提供语音学信息而非语义信息;(B)语音序列的长度远长于文本序列;(C)韵律等副语言信息引入了额外的复杂性和可变性。本文通过以渐进的方式将模态从文本过渡到语音,分别探讨这三个关键因素的影响。我们的研究结果表明,三个因素的影响各不相同:因素A的影响相对较小,因素B对句法和语义建模的影响更为明显,而因素C的影响最为显著,尤其是在基本的词汇建模方面。基于这些发现,我们阐述了训练SLM所面临的独特挑战,并指出了开发更有效的端到端SLM的路径。
摘要:Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much longer than that of text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of three key factors separately by transiting the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in the basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.

【10】 Incremental Disentanglement for Environment-Aware Zero-Shot  Text-to-Speech Synthesis
标题: 环境感知Zero-Shot文本到语音合成的增量解纠缠
链接:https://arxiv.org/abs/2412.16977
作者: Ye-Xin Lu,  Hui-Peng Du,  Zheng-Yan Sheng,  Yang Ai,  Zhen-Hua Ling
备注:Accepted to ICASSP 2025
摘要:本文提出了一种基于增量解纠缠的环境感知zero-shot文语转换(TTS)方法,简称IDEA-TTS,该方法能够在保留给定环境参考语音声学特性的同时,为未见说话人合成语音。IDEA-TTS采用VITS作为TTS主干。为了有效解耦环境、说话人和文本因素,我们提出了一个增量解纠缠过程:设计环境估计器,首先将环境频谱图分解为环境掩码和增强频谱图;随后由环境编码器处理环境掩码以提取环境嵌入,而增强频谱图则在说话人嵌入的条件下促进说话人与文本因素的后续解纠缠,其中说话人嵌入由预训练的环境鲁棒说话人编码器从环境语音中提取。最后,说话人嵌入和环境嵌入均作为条件输入解码器,用于环境感知的语音生成。实验结果表明,IDEA-TTS在环境感知TTS任务中取得了优异的性能,在语音质量、说话人相似度和环境相似度方面均表现出色。此外,IDEA-TTS还能够完成声学环境转换任务,并取得了最先进的性能。
摘要:This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram facilitates the subsequent disentanglement of the speaker and text factors with the condition of the speaker embeddings, which are extracted from the environmental speech using a pretrained environment-robust speaker encoder. Finally, both the speaker and environment embeddings are conditioned into the decoder for environment-aware speech generation. Experimental results demonstrate that IDEA-TTS achieves superior performance in the environment-aware TTS task, excelling in speech quality, speaker similarity, and environmental similarity. Additionally, IDEA-TTS is also capable of the acoustic environment conversion task and achieves state-of-the-art performance.

【11】 Speech-Based Depression Prediction Using Encoder-Weight-Only Transfer  Learning and a Large Corpus
标题: 使用仅编码器权重迁移学习和大型语料库进行基于语音的抑郁症预测
链接:https://arxiv.org/abs/2412.16900
作者: Amir Harati,  Elizabeth Shriberg,  Tomasz Rutowski,  Piotr Chlebek,  Yang Lu,  Ricardo Oliveira
摘要:基于语音的算法在抑郁症等行为健康状况的管理方面引起了关注。我们探索了一种基于语音的迁移学习方法,该方法使用轻量级编码器并且只迁移编码器权重,从而得到简化的运行时模型。我们的研究使用了一个大型数据集,其说话人和会话数量比以往工作多约两个数量级;大规模数据使得我们能够可靠地估计迁移学习带来的改进。PHQ-8标签的预测结果显示,二分类的相对性能增益高达27%;这些增益具有统计显著性,p值接近于零。回归任务上也观察到改进。此外,迁移学习的收益似乎并不依赖于强大的源任务性能。结果表明该方法灵活,有望实现高效的部署。
摘要:Speech-based algorithms have gained interest for the management of behavioral health conditions such as depression. We explore a speech-based transfer learning approach that uses a lightweight encoder and that transfers only the encoder weights, enabling a simplified run-time model. Our study uses a large data set containing roughly two orders of magnitude more speakers and sessions than used in prior work. The large data set enables reliable estimation of improvement from transfer learning. Results for the prediction of PHQ-8 labels show up to 27% relative performance gains for binary classification; these gains are statistically significant with a p-value close to zero. Improvements were also found for regression. Additionally, the gain from transfer learning does not appear to require strong source task performance. Results suggest that this approach is flexible and offers promise for efficient implementation.
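下面是"仅迁移编码器权重"这一通用做法的极简示意:从源任务模型的state_dict中筛选出编码器前缀的参数加载到目标模型,其余部分(如任务头)保持随机初始化;模型结构、前缀名与任务设置均为假设,并非论文所用的具体模型。

```python
import torch
import torch.nn as nn

class SpeechClassifier(nn.Module):
    """示意模型:encoder + 任务头。仅 encoder 的权重会从源模型迁移。"""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 128))
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.head(self.encoder(x))

if __name__ == "__main__":
    source = SpeechClassifier(num_classes=5)          # 源任务模型(假设已训练)
    target = SpeechClassifier(num_classes=2)          # 目标任务(例如 PHQ-8 二分类)
    # 仅保留 "encoder." 前缀的参数,strict=False 允许任务头缺失
    encoder_weights = {k: v for k, v in source.state_dict().items() if k.startswith("encoder.")}
    missing, unexpected = target.load_state_dict(encoder_weights, strict=False)
    print("已迁移编码器参数,任务头保持随机初始化;未加载:", missing)
```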

【12】 Autoregressive Speech Synthesis with Next-Distribution Prediction
标题: 具有下一分布预测的自回归语音合成
链接:https://arxiv.org/abs/2412.16846
作者: Xinfa Zhu,  Wenjie Tian,  Lei Xie
备注:Technical report, work in progress
摘要:我们介绍了KALL-E,一种用于文本到语音(TTS)合成、采用下一分布预测的新型自回归(AR)语言建模方法。与现有方法不同,KALL-E直接建模并预测以文本为条件的连续语音分布,而不依赖基于VAE或扩散的组件。具体来说,我们使用WaveVAE从波形中提取连续语音分布,而不是使用离散语音token。单个AR语言模型从文本预测这些连续语音分布,并以Kullback-Leibler散度损失作为约束。实验结果表明,在zero-shot TTS场景下,KALL-E在自然度和说话人相似度方面优于YourTTS、VALL-E、NaturalSpeech 2和CosyVoice的开源实现。此外,KALL-E在情感和口音克隆方面展示了出色的zero-shot能力。重要的是,KALL-E为在TTS中使用连续语音表示提供了一个更直接有效的范式。音频示例见:https://zxf-icpc.github.io/kalle/。
摘要:We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS. Audio samples are available at: \url{https://zxf-icpc.github.io/kalle/}.
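下面是一个示意性片段,说明"以KL散度损失约束预测的连续语音分布"的一种常见写法:假设WaveVAE给出逐帧对角高斯后验(mu_q、logvar_q),AR语言模型从文本预测另一组高斯参数(mu_p、logvar_p),则可以用torch.distributions直接计算两者之间的KL并取平均。张量形状、变量名及KL的方向均为本文假设,仅用于说明损失的构造方式,并非KALL-E的官方实现。

```python
import torch
from torch.distributions import Normal, kl_divergence

def kl_constraint(mu_p, logvar_p, mu_q, logvar_q):
    """AR模型预测分布 p 与 WaveVAE 后验 q 之间的逐帧KL散度(取均值)。
    所有输入形状均为 (batch, time, dim)。"""
    p = Normal(mu_p, (0.5 * logvar_p).exp())   # 预测的对角高斯
    q = Normal(mu_q, (0.5 * logvar_q).exp())   # WaveVAE 提取的对角高斯
    return kl_divergence(q, p).mean()

# 示例:batch=2,100帧,每帧64维的连续语音分布
B, T, D = 2, 100, 64
mu_q, logvar_q = torch.randn(B, T, D), torch.randn(B, T, D)
mu_p, logvar_p = torch.randn(B, T, D), torch.randn(B, T, D)
print(kl_constraint(mu_p, logvar_p, mu_q, logvar_q).item())
```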

【13】 Time-Graph Frequency Representation with Singular Value Decomposition  for Neural Speech Enhancement
标题: 神经语音增强的奇异值分解时间图频率表示
链接:https://arxiv.org/abs/2412.16823
作者: Tingting Wang,  Tianrui Wang,  Meng Ge,  Qiquan Zhang,  Zirui Ge,  Zhen Yang
备注:5 pages, 4 figures
摘要:用于单声道语音增强的时频(T-F)域方法受益于深度学习的成功。近年来,研究的重点是设计双流网络模型,分别预测幅度掩码和相位,或者将幅度和相位耦合到笛卡尔坐标系中,构造实部和虚部。然而,大多数方法都受限于双流网络框架中幅度与相位(实部与虚部对)的对齐建模问题,这不可避免地带来性能限制。在本文中,我们引入了基于奇异值分解定义的图傅里叶变换(GFT-SVD),得到用于神经语音增强的实值时间图表示。这种基于GFT-SVD的实值表示能够对齐幅度与相位的建模,从而无需恢复目标语音的相位信息。我们的研究结果证明了基于GFT-SVD的实值时间图表示在神经语音增强中的效果。大量语音增强实验表明,GFT-SVD与DNN的组合优于基于特征分解的GFT(GFT-EVD)与幅度估计UNet的组合,并且在客观可懂度和感知质量方面优于短时傅里叶变换(STFT)与DNN的组合。我们的源代码发布在:https://github.com/Wangfighting0015/GFT_project。
摘要:Time-frequency (T-F) domain methods for monaural speech enhancement have benefited from the success of deep learning. Recently, focus has been put on designing two-stream network models to predict amplitude mask and phase separately, or, coupling the amplitude and phase into Cartesian coordinates and constructing real and imaginary pairs. However, most methods suffer from the alignment modeling of amplitude and phase (real and imaginary pairs) in a two-stream network framework, which inevitably incurs performance restrictions. In this paper, we introduce a graph Fourier transform defined with the singular value decomposition (GFT-SVD), resulting in real-valued time-graph representation for neural speech enhancement. This real-valued representation-based GFT-SVD provides an ability to align the modeling of amplitude and phase, leading to avoiding recovering the target speech phase information. Our findings demonstrate the effects of real-valued time-graph representation based on GFT-SVD for neural speech enhancement. The extensive speech enhancement experiments establish that the combination of GFT-SVD and DNN outperforms the combination of GFT with the eigenvector decomposition (GFT-EVD) and magnitude estimation UNet, and outperforms the short-time Fourier transform (STFT) and DNN, regarding objective intelligibility and perceptual quality. We release our source code at: https://github.com/Wangfighting0015/GFT_project.
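下面用numpy给出"基于SVD定义图傅里叶变换"的一个极简示意:对帧间相邻的路径图邻接矩阵做SVD,以左奇异向量为图频率基,整个正/逆变换保持实值。这里的图构造方式(简单路径图)是本文的假设,论文中实际使用的图定义可能不同。

```python
import numpy as np

def path_graph_adjacency(n):
    """构造简单路径图的邻接矩阵(相邻帧相连),仅作示意。"""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A

def gft_svd(X, A):
    """基于SVD定义的图傅里叶变换:A = U S Vt,用U作为图频率基。
    X: (n_nodes, n_features) 的实值信号;返回实值的图频域表示。"""
    U, S, Vt = np.linalg.svd(A)
    X_hat = U.T @ X          # 正变换
    return X_hat, U, S

def igft_svd(X_hat, U):
    """逆变换:回到节点(时间帧)域。"""
    return U @ X_hat

# 示例:100帧、每帧40维的语音特征,整个流程保持实值
X = np.random.randn(100, 40)
A = path_graph_adjacency(100)
X_hat, U, S = gft_svd(X, A)
X_rec = igft_svd(X_hat, U)
print(np.allclose(X, X_rec))   # True:U为正交阵,变换可精确重构
```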

【14】 Speech Retrieval-Augmented Generation without Automatic Speech  Recognition
标题: 无需自动语音识别的语音检索增强生成
链接:https://arxiv.org/abs/2412.16500
作者: Do June Min,  Karel Mundnich,  Andy Lapastora,  Erfan Soltanmohammadi,  Srikanth Ronanki,  Kyu Han
摘要:针对语音数据的问答,一种常见做法是先用自动语音识别(ASR)转录语音,再在转录文本上使用基于文本的检索增强生成(RAG)。虽然这种级联流水线在许多实际场景中被证明有效,但ASR错误会传播到检索和生成步骤。为克服这一限制,我们提出了SpeechRAG,一个专为口语数据开放式问答设计的新框架。我们的方法将预训练的语音编码器微调为语音适配器,并将其接入一个冻结的、基于大语言模型(LLM)的检索模型。通过对齐文本和语音的嵌入空间,我们的语音检索器可直接根据基于文本的查询检索音频段落,从而利用冻结文本检索器的检索能力。在口语问答数据集上的检索实验表明,直接语音检索的性能不低于基于文本的基线,并优于使用ASR的级联系统。在生成方面,我们使用语音语言模型(SLM)作为生成器,以音频段落而非转录文本为条件。在不微调SLM的情况下,当转录文本的WER较高时,该方法优于级联的基于文本的模型。
摘要:One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)--based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade over the text-based baseline, and outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded text-based models when there is high WER in the transcripts.
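下面的片段示意"将语音适配器的输出对齐到冻结文本检索器的嵌入空间"时常用的对比学习目标(对称InfoNCE):同一问答对的文本查询嵌入与语音段落嵌入互为正样本,批内其余为负样本。编码器结构、维度与温度系数均为本文假设,并非SpeechRAG的原始实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechAdapter(nn.Module):
    """把语音编码器输出的帧级特征映射到文本检索器的嵌入空间(示意)。"""
    def __init__(self, in_dim=512, out_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                  nn.Linear(out_dim, out_dim))
    def forward(self, speech_feats):             # (batch, time, in_dim)
        pooled = speech_feats.mean(dim=1)         # 简单平均池化
        return F.normalize(self.proj(pooled), dim=-1)

def alignment_loss(text_emb, speech_emb, temperature=0.05):
    """对称InfoNCE:让配对的文本查询与语音段落在共享空间中互相检索到。"""
    text_emb = F.normalize(text_emb, dim=-1)      # 冻结文本检索器给出的查询嵌入
    logits = text_emb @ speech_emb.t() / temperature
    labels = torch.arange(text_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# 示例:批大小8,语音特征512维,文本检索器嵌入768维
adapter = SpeechAdapter()
speech_emb = adapter(torch.randn(8, 300, 512))
text_emb = torch.randn(8, 768)                    # 实际中来自冻结的LLM检索器
print(alignment_loss(text_emb, speech_emb).item())
```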

【15】 Enhancing Multilingual ASR for Unseen Languages via Language Embedding  Modeling
标题: 通过语言嵌入建模增强多语言ASR对未见语言的识别
链接:https://arxiv.org/abs/2412.16474
作者: Shao-Syuan Huang,  Kuan-Po Huang,  Andy T. Liu,  Hung-yi Lee
备注:Accepted by ICASSP 2025
摘要:多语言自动语音识别(ASR)的目标是在一个系统中识别并转录多种语言的语音。Whisper作为最先进的ASR模型之一,在这一领域表现出色:它利用海量数据有效处理99种语言,并将语言标签作为前缀来引导识别过程。然而,尽管取得了成功,Whisper在未见语言(即未包含在其预训练中的语言)上仍表现不佳。基于许多语言具有共同语言学特征这一观察,我们提出利用这些关系来提升未见语言的ASR性能。具体来说,我们引入了一种加权求和方法,利用Whisper预测的语言概率计算各语言标签嵌入的加权和。此外,我们还开发了一种基于预测器的方法,对加权和嵌入进行精化,使其更接近未见语言的真实嵌入。实验结果表明,无论在zero-shot还是微调设置下,ASR性能均有显著提升。我们提出的方法优于基线方法,为多语言ASR中的未见语言问题提供了有效的解决方案。
摘要:Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.
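下面以PyTorch张量运算示意上述"加权求和"思路:用模型对各已知语言的预测概率,对相应语言标签嵌入做加权平均,得到未见语言的近似嵌入;随后可再用一个小型预测器对该嵌入进行精化。这里的嵌入矩阵、维度与预测器结构均为本文假设,并非Whisper的内部实现。

```python
import torch
import torch.nn as nn

num_langs, emb_dim = 99, 384
lang_embeddings = torch.randn(num_langs, emb_dim)   # 假设的各语言标签嵌入表

def weighted_sum_embedding(lang_probs, lang_embeddings):
    """lang_probs: (batch, num_langs),模型对各已知语言的预测概率。
    返回 (batch, emb_dim) 的加权和嵌入,作为未见语言的近似语言标签。"""
    return lang_probs @ lang_embeddings

class EmbeddingRefiner(nn.Module):
    """摘要中"基于预测器"思路的示意:进一步精化加权和嵌入。"""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.net(x)     # 残差式修正

probs = torch.softmax(torch.randn(4, num_langs), dim=-1)   # 4条未见语言的语音
coarse = weighted_sum_embedding(probs, lang_embeddings)
refined = EmbeddingRefiner(emb_dim)(coarse)
print(coarse.shape, refined.shape)   # torch.Size([4, 384]) torch.Size([4, 384])
```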

【16】 VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
标题: VERSA:语音、音频和音乐的多功能评估工具包
链接:https://arxiv.org/abs/2412.17667
作者: Jiatong Shi,  Hye-jin Shim,  Jinchuan Tian,  Siddhant Arora,  Haibin Wu,  Darius Petermann,  Jia Qi Yip,  You Zhang,  Yuxun Tang,  Wangyou Zhang,  Dareen Safar Alharthi,  Yichen Huang,  Koichi Saito,  Jionghao Han,  Yiwen Zhao,  Chris Donahue,  Shinji Watanabe
摘要:在这项工作中,我们介绍VERSA,一个统一且标准化的评估工具包,适用于各种语音、音频和音乐信号。该工具包提供Pythonic接口,支持灵活的配置和依赖控制,用户友好且高效。完整安装后,VERSA提供63个指标,并可根据不同配置衍生出711种指标变体。这些指标涵盖利用各类外部资源的评估,包括匹配与不匹配的参考音频、文本转录和文本标题。作为一个轻量级而全面的工具包,VERSA具有很强的通用性,可支持广泛下游场景的评估。为展示其功能,本文给出了VERSA的示例用例,包括音频编码、语音合成、语音增强、歌唱合成和音乐生成。该工具包可在https://github.com/shinjiwlab/versa上获取。
摘要:In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 63 metrics with 711 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.

【17】 Multiple Consistency-guided Test-Time Adaptation for Contrastive  Audio-Language Models with Unlabeled Audio
标题: 具有未标记音频的对比音频语言模型的多重一致性引导测试时自适应
链接:https://arxiv.org/abs/2412.17306
作者: Gongyu Chen,  Haomin Zhang,  Chaofan Ding,  Zihao Chen,  Xinhan Di
备注:6 pages, 1 figure, accepted by ICASSP 2025
摘要:预训练音频-语言模型(ALM)的一个迷人之处在于其令人印象深刻的zero-shot泛化能力,以及旨在无需标注即可提升域内性能的测试时自适应(TTA)方法。然而,以往针对ALM零样本分类的测试时自适应(TTA)方法往往会陷入错误的模型预测。为了进一步提升性能,我们提出在没有标注标签的情况下对提示学习施加多重指导。首先,对ALM的上下文token和域token设置一致性指导。其次,设置每个测试样本多个增强视图之间的一致性指导,以及不同测试样本之间的对比学习指导。第三,我们为所提出的无标注测试时自适应方法构建了相应的端到端学习框架。我们在跨领域的12个下游任务上进行了广泛评估,与最先进的模型相比,所提出的自适应方法带来了4.41%(最高7.50%)的平均zero-shot性能提升。
摘要:One fascinating aspect of pre-trained Audio-Language Models (ALMs) learning is their impressive zero-shot generalization capability and test-time adaptation (TTA) methods aiming to improve domain performance without annotations. However, previous test time adaptation (TTA) methods for ALMs in zero-shot classification tend to be stuck in incorrect model predictions. In order to further boost the performance, we propose multiple guidance on prompt learning without annotated labels. First, guidance of consistency on both context tokens and domain tokens of ALMs is set. Second, guidance of both consistency across multiple augmented views of each single test sample and contrastive learning across different test samples is set. Third, we propose a corresponding end-end learning framework for the proposed test-time adaptation method without annotated labels. We extensively evaluate our approach on 12 downstream tasks across domains, our proposed adaptation method leads to 4.41% (max 7.50%) average zero-shot performance improvement in comparison with the state-of-the-art models.

【18】 Trainingless Adaptation of Pretrained Models for Environmental Sound  Classification
标题: 环境声音分类预训练模型的免训练适应
链接:https://arxiv.org/abs/2412.17212
作者: Noriyuki Tonami,  Wataru Kohno,  Keisuke Imoto,  Yoshiyuki Yajima,  Sakiko Mishima,  Reishi Kondo,  Tomoyuki Hino
备注:Accepted to ICASSP2025
摘要:基于深度神经网络(DNN)的环境声音分类模型,对于训练数据所不属于的域(即分布外或未见数据)并不鲁棒。为了在未见域上利用预训练模型,微调和迁移学习等自适应方法通常需要依赖丰富的计算资源,例如图形处理器(GPU)。然而,由于最先进的模型正变得越来越消耗计算资源,计算资源匮乏的研究者越来越难以跟上研究趋势。在本文中,我们提出了一种面向环境声音分类预训练模型的免训练自适应方法。为了引入该免训练自适应方法,我们首先提出了一种在DNN模型中间层中恢复类时频(TF-ish)结构的操作。然后,我们提出了用于域自适应的免训练频率滤波方法,它不同于广泛使用的基于梯度的优化。在ESC-50数据集上进行的实验表明,与传统方法相比,所提出的自适应方法将分类精度提高了20.40个百分点。
摘要:Deep neural network (DNN)-based models for environmental sound classification are not robust against a domain to which training data do not belong, that is, out-of-distribution or unseen data. To utilize pretrained models for the unseen domain, adaptation methods, such as finetuning and transfer learning, are used with rich computing resources, e.g., the graphical processing unit (GPU). However, it is becoming more difficult to keep up with research trends for those who have poor computing resources because state-of-the-art models are becoming computationally resource-intensive. In this paper, we propose a trainingless adaptation method for pretrained models for environmental sound classification. To introduce the trainingless adaptation method, we first propose an operation of recovering time--frequency-ish (TF-ish) structures in intermediate layers of DNN models. We then propose the trainingless frequency filtering method for domain adaptation, which is not a gradient-based optimization widely used. The experiments conducted using the ESC-50 dataset show that the proposed adaptation method improves the classification accuracy by 20.40 percentage points compared with the conventional method.

【19】 InterDance:Reactive 3D Dance Generation with Realistic Duet Interactions
标题: InterDance:具有逼真双人舞交互的反应式3D舞蹈生成
链接:https://arxiv.org/abs/2412.16982
作者: Ronghui Li,  Youliang Zhang,  Yachao Zhang,  Yuxiang Zhang,  Mingyang Su,  Jie Guo,  Ziwei Liu,  Yebin Liu,  Xiu Li
备注:this https URL
摘要:人类会进行各种各样的交互动作,其中双人舞是最具挑战性的交互之一。然而,就人体运动生成模型而言,现有工作仍然无法生成高质量的交互动作,在双人舞领域尤其如此。这一方面是由于缺乏大规模高质量数据集,另一方面则源于交互运动表示的不完整以及缺乏对交互的细粒度优化。为了解决这些挑战,我们提出了大规模双人舞数据集InterDance,它在运动质量、数据规模和舞蹈类型多样性上都有显著提升。基于该数据集,我们提出了一种能够准确而全面地描述交互运动的新运动表示。我们进一步引入了一个带有交互细化指导策略的基于扩散的框架,以逐步优化交互的真实感。大量实验证明了我们的数据集和算法的有效性。
摘要:Humans perform a variety of interactive motions, among which duet dance is one of the most challenging interactions. However, in terms of human motion generative models, existing works are still unable to generate high-quality interactive motions, especially in the field of duet dance. On the one hand, it is due to the lack of large-scale high-quality datasets. On the other hand, it arises from the incomplete representation of interactive motion and the lack of fine-grained optimization of interactions. To address these challenges, we propose, InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres. Built upon this dataset, we propose a new motion representation that can accurately and comprehensively describe interactive motion. We further introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively. Extensive experiments demonstrate the effectiveness of our dataset and algorithm.

【20】 AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory  Estimation and Classification
标题: AV-DTEC:用于无人机轨迹估计和分类的自监督视听融合
链接:https://arxiv.org/abs/2412.16928
作者: Zhenyuan Xiao,  Yizhuo Yang,  Guili Xu,  Xianglong Zeng,  Shenghai Yuan
备注:Submitted to ICRA 2025
摘要:小型无人机的使用日益增多,对公共安全构成了重大威胁,而传统的无人机探测系统往往体积庞大且成本高昂。为应对这些挑战,我们提出了AV-DTEC,一个基于自监督视听融合的轻量级反无人机系统。AV-DTEC使用LiDAR生成的标签进行自监督学习训练,并通过并行的选择性状态空间模型同时学习音频和视觉特征。在所学特征的基础上,专门设计的即插即用主-辅特征增强模块将视觉特征融入音频特征,以在不同光照条件下获得更好的鲁棒性。为了减少对辅助特征的依赖并对齐模态,我们提出了一种教师-学生模型,自适应地调整视觉特征的权重。AV-DTEC在真实世界的多模态数据上表现出卓越的准确性和有效性。代码和训练好的模型可在GitHub上公开获取:https://github.com/AmazingDay1/AV-DETC。
摘要:The increasing use of compact UAVs has created significant threats to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we propose AV-DTEC, a lightweight self-supervised audio-visual fusion-based anti-UAV system. AV-DTEC is trained using self-supervised learning with labels generated by LiDAR, and it simultaneously learns audio and visual features through a parallel selective state-space model. With the learned features, a specially designed plug-and-play primary-auxiliary feature enhancement module integrates visual features into audio features for better robustness in cross-lighting conditions. To reduce reliance on auxiliary features and align modalities, we propose a teacher-student model that adaptively adjusts the weighting of visual features. AV-DTEC demonstrates exceptional accuracy and effectiveness in real-world multi-modality data. The code and trained models are publicly accessible on GitHub  \url{https://github.com/AmazingDay1/AV-DETC}.
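摘要中"即插即用的主-辅特征增强模块"把视觉特征融入音频特征,下面给出这种主-辅门控融合的一个极简示意:音频为主特征,视觉为辅特征,由逐帧门控决定注入强度。结构与维度均为本文假设,并非AV-DTEC的原始实现。

```python
import torch
import torch.nn as nn

class PrimaryAuxiliaryEnhancer(nn.Module):
    """主-辅特征增强(示意):以音频为主,按门控强度注入投影后的视觉特征。"""
    def __init__(self, audio_dim=256, visual_dim=512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, audio_dim)          # 视觉特征投影到音频空间
        self.gate = nn.Sequential(nn.Linear(audio_dim * 2, audio_dim), nn.Sigmoid())

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, T, audio_dim)  visual_feat: (batch, T, visual_dim)
        v = self.proj(visual_feat)
        g = self.gate(torch.cat([audio_feat, v], dim=-1))     # 逐帧门控,控制视觉信息的注入强度
        return audio_feat + g * v                              # 残差式注入辅助特征

enhancer = PrimaryAuxiliaryEnhancer()
fused = enhancer(torch.randn(2, 100, 256), torch.randn(2, 100, 512))
print(fused.shape)   # torch.Size([2, 100, 256])
```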

【21】 FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG  Distillation
标题: FADA:基于混合监督多CFG蒸馏的快速扩散头像合成
链接:https://arxiv.org/abs/2412.16915
作者: Tianyun Zhong,  Chao Liang,  Jianwen Jiang,  Gaojie Lin,  Jiaqi Yang,  Zhou Zhao
摘要:基于扩散的音频驱动说话头像方法最近因其高保真、生动且富有表现力的结果而受到关注。然而,较慢的推理速度限制了其实际应用。尽管针对扩散模型已经发展出多种蒸馏技术,我们发现朴素的扩散蒸馏方法并不能产生令人满意的结果:与教师模型相比,蒸馏后的模型对开放集输入图像的鲁棒性下降,音频与视频之间的相关性也降低,削弱了扩散模型的优势。为了解决这一问题,我们提出了FADA(基于混合监督多CFG蒸馏的快速扩散头像合成)。我们首先设计了混合监督损失,以利用不同质量的数据,增强模型的整体能力和鲁棒性。此外,我们提出了带可学习token的多CFG蒸馏,利用音频与参考图像条件之间的相关性,在可接受的质量损失下消除了多CFG带来的三倍推理开销。在多个数据集上的大量实验表明,FADA生成的视频与近期基于扩散模型的方法相当,同时实现了4.17-12.5倍的NFE加速。演示可在我们的网页http://fadavatar.github.io上获得。
摘要:Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first designed a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times. Demos are available at our webpage http://fadavatar.github.io.
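摘要提到多CFG会带来三倍推理开销,下面用一个玩具化的PyTorch片段示意其原因:对音频条件与参考图像条件做分级的classifier-free guidance时,每个去噪步需要三次网络前向(无条件、仅参考图像、音频+参考图像),再线性组合。这只是多条件CFG的一种常见写法,具体组合形式与蒸馏后的单次推理方案以论文为准。

```python
import torch
import torch.nn as nn

def multi_cfg_denoise(eps_net: nn.Module, x_t, t, audio, ref_img,
                      s_ref=2.0, s_audio=4.5):
    """两级条件的CFG组合:每个时间步需要三次网络前向,这正是蒸馏希望消除的开销。"""
    eps_uncond = eps_net(x_t, t, audio=None, ref=None)      # 1) 无条件
    eps_ref    = eps_net(x_t, t, audio=None, ref=ref_img)   # 2) 仅参考图像
    eps_full   = eps_net(x_t, t, audio=audio, ref=ref_img)  # 3) 音频+参考图像
    return (eps_uncond
            + s_ref * (eps_ref - eps_uncond)
            + s_audio * (eps_full - eps_ref))

class ToyEps(nn.Module):
    """用于验证形状的玩具噪声预测网络(真实模型为视频扩散网络)。"""
    def forward(self, x, t, audio=None, ref=None):
        return torch.zeros_like(x)

x_t = torch.randn(1, 4, 32, 32)
out = multi_cfg_denoise(ToyEps(), x_t, t=torch.tensor([10]),
                        audio=torch.randn(1, 128), ref_img=torch.randn(1, 4, 32, 32))
print(out.shape)   # torch.Size([1, 4, 32, 32])
```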

【22】 Temporal-Frequency State Space Duality: An Efficient Paradigm for Speech  Emotion Recognition
标题: 时频状态空间二元性:语音情感识别的有效范式
链接:https://arxiv.org/abs/2412.16904
作者: Jiaqi Zhao,  Fei Wang,  Kun Li,  Yanyan Wei,  Shengeng Tang,  Shu Zhao,  Xiao Sun
备注:Accepted by ICASSP 2025
摘要:语音情感识别(SER)在提升人机交互用户体验方面起着至关重要的作用。然而,现有方法偏重时域分析,忽略了对鲁棒情感识别同样重要的频域包络结构。为克服这一局限,我们提出了TF-Mamba,一种在时间和频率两个维度上捕获情感表达的新型多域框架。具体来说,我们提出了时频Mamba块来提取时间和频率感知的情感特征,在计算效率和模型表达能力之间取得了良好的平衡。此外,我们设计了Complex Metric-Distance Triplet(CMDT)损失,使模型能够捕捉SER所需的代表性情感线索。在IEMOCAP和MELD数据集上的大量实验表明,TF-Mamba在模型规模和延迟方面均优于现有方法,为未来的SER应用提供了更实用的解决方案。
摘要:Speech Emotion Recognition (SER) plays a critical role in enhancing user experience within human-computer interaction. However, existing methods are overwhelmed by temporal domain analysis, overlooking the valuable envelope structures of the frequency domain that are equally important for robust emotion recognition. To overcome this limitation, we propose TF-Mamba, a novel multi-domain framework that captures emotional expressions in both temporal and frequency dimensions. Concretely, we propose a temporal-frequency mamba block to extract temporal- and frequency-aware emotional features, achieving an optimal balance between computational efficiency and model expressiveness. Besides, we design a Complex Metric-Distance Triplet (CMDT) loss to enable the model to capture representative emotional clues for SER. Extensive experiments on the IEMOCAP and MELD datasets show that TF-Mamba surpasses existing methods in terms of model size and latency, providing a more practical solution for future SER applications.
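CMDT损失的具体定义未在摘要中给出,下面仅以常规的三元组损失示意"用度量距离拉近同类情感、推远异类情感"的基本形式;其中的距离度量、margin和特征来源均为本文假设,CMDT的实际形式请以论文为准。

```python
import torch
import torch.nn.functional as F

def triplet_metric_loss(anchor, positive, negative, margin=0.3):
    """基础三元组损失:同情感样本(positive)应比异情感样本(negative)离anchor更近。
    输入均为 (batch, dim) 的情感特征。"""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# 示例:批大小16、128维的时频情感特征
a, p, n = (torch.randn(16, 128) for _ in range(3))
print(triplet_metric_loss(a, p, n).item())
```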

【23】 A Multi-modal Approach to Dysarthria Detection and Severity Assessment  Using Speech and Text Information
标题: 使用语音和文本信息进行构音障碍检测和严重性评估的多模式方法
链接:https://arxiv.org/abs/2412.16874
作者: Anuprabha M,  Krishna Gurugubelli,  Kesavaraj V,  Anil Kumar Vuppala
备注:5 pages, 1 figure
摘要:构音障碍的自动检测和严重程度评估对于向患者提供有针对性的治疗干预至关重要。现有研究大多主要关注语音模态,而本研究提出了一种同时利用语音和文本两种模态的新方法。通过采用交叉注意力机制,我们的方法学习语音和文本表示之间的声学与语言相似性。该方法专门评估不同严重程度下的发音偏差,从而提高构音障碍检测和严重程度评估的准确性。所有实验均使用UA-Speech构音障碍数据库进行。在说话人相关与说话人无关、未见词与已见词的设置下,检测准确率分别提高到99.53%和93.20%,严重程度评估准确率分别达到98.12%和51.97%。这些结果表明,通过整合提供参考语言学知识的文本信息,我们为构音障碍的检测与评估构建了更鲁棒的框架,从而有望实现更有效的诊断。
摘要:Automatic detection and severity assessment of dysarthria are crucial for delivering targeted therapeutic interventions to patients. While most existing research focuses primarily on speech modality, this study introduces a novel approach that leverages both speech and text modalities. By employing cross-attention mechanism, our method learns the acoustic and linguistic similarities between speech and text representations. This approach assesses specifically the pronunciation deviations across different severity levels, thereby enhancing the accuracy of dysarthric detection and severity assessment. All the experiments have been performed using UA-Speech dysarthric database. Improved accuracies of 99.53% and 93.20% in detection, and 98.12% and 51.97% for severity assessment have been achieved when speaker-dependent and speaker-independent, unseen and seen words settings are used. These findings suggest that by integrating text information, which provides a reference linguistic knowledge, a more robust framework has been developed for dysarthric detection and assessment, thereby potentially leading to more effective diagnoses.
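下面用PyTorch自带的nn.MultiheadAttention示意"语音查询文本"的交叉注意力:语音特征作为query,文本特征作为key/value,从而度量发音与参考文本之间的对应关系。维度、池化与分类头均为本文假设的极简写法,并非论文的原始网络。

```python
import torch
import torch.nn as nn

class SpeechTextCrossAttention(nn.Module):
    """语音-文本交叉注意力(示意):speech作query,text作key/value。"""
    def __init__(self, dim=256, num_heads=4, num_classes=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)   # 例如4个严重程度等级

    def forward(self, speech, text):
        # speech: (batch, T_s, dim)  text: (batch, T_t, dim)
        fused, _ = self.attn(query=speech, key=text, value=text)
        fused = self.norm(fused + speech)               # 残差 + 归一化
        return self.classifier(fused.mean(dim=1))       # 时间平均后分类

model = SpeechTextCrossAttention()
logits = model(torch.randn(2, 120, 256), torch.randn(2, 30, 256))
print(logits.shape)   # torch.Size([2, 4])
```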

【24】 SoundLoc3D: Invisible 3D Sound Source Localization and Classification  Using a Multimodal RGB-D Acoustic Camera
标题: SoundLoc3D:使用多模态RGB-D声学相机的不可见3D声源定位与分类
链接:https://arxiv.org/abs/2412.16861
作者: Yuhang He,  Sangyun Shin,  Anoop Cherian,  Andrew Markham
备注:Accepted by WACV2025
摘要:准确定位3D声源并估计其语义标签(声源可能不可见,但假设位于场景中物体的物理表面上)具有许多实际应用,包括检测气体泄漏和机械故障。这种场景下的视听弱相关性带来了新的挑战:我们能否以及如何利用跨模态信息来解决该任务。为此,我们提出使用由针孔RGB-D相机和共面四通道麦克风阵列(Mic-Array)组成的声学相机装置。通过使用该装置从多视角记录视听信号,我们可以利用跨模态线索来估计声源的3D位置。具体来说,我们的框架SoundLoc3D将任务视为集合预测问题,集合中的每个元素对应一个潜在声源。鉴于视听弱相关性,集合表示先从单视角麦克风阵列信号中学习,然后通过主动结合多视角RGB-D图像揭示的物理表面线索进行细化。我们在大规模仿真数据集上展示了SoundLoc3D的效率和优越性,并进一步证明了其对RGB-D测量误差和环境噪声干扰的鲁棒性。
摘要:Accurately localizing 3D sound sources and estimating their semantic labels -- where the sources may not be visible, but are assumed to lie on the physical surface of objects in the scene -- have many real applications, including detecting gas leak and machinery malfunction. The audio-visual weak-correlation in such setting poses new challenges in deriving innovative methods to answer if or how we can use cross-modal information to solve the task. Towards this end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array~(Mic-Array). By using this rig to record audio-visual signals from multiviews, we can use the cross-modal cues to estimate the sound sources 3D locations. Specifically, our framework SoundLoc3D treats the task as a set prediction problem, each element in the set corresponds to a potential sound source. Given the audio-visual weak-correlation, the set representation is initially learned from a single view microphone array signal, and then refined by actively incorporating physical surface cues revealed from multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on large-scale simulated dataset, and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.

【25】 Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement
标题: Mamba-SEUNet:用于单声道语音增强的Mamba UNet
链接:https://arxiv.org/abs/2412.16626
作者: Junyu Wang,  Zizhen Lin,  Tianrui Wang,  Meng Ge,  Longbiao Wang,  Jianwu Dang
摘要:在近期的语音增强(SE)研究中,Transformer及其变体已成为主流方法。然而,自注意力机制的二次复杂度给实际部署带来了一定限制。Mamba作为一种新型状态空间模型(SSM),凭借较强的长序列建模能力和较低的计算复杂度,已在自然语言处理和计算机视觉领域得到广泛应用。在这项工作中,我们介绍了Mamba-SEUNet,一种将Mamba与U-Net结合用于SE任务的创新架构。通过利用双向Mamba在不同分辨率下对语音信号的前向和后向依赖关系进行建模,并结合跳跃连接捕获多尺度信息,我们的方法实现了最先进的(SOTA)性能。在VCTK+DEMAND数据集上的实验结果表明,Mamba-SEUNet在保持较低计算复杂度的同时取得了3.59的PESQ得分。当与感知对比拉伸(Perceptual Contrast Stretching)技术相结合时,Mamba-SEUNet进一步将PESQ得分提高到3.73。
摘要:In recent speech enhancement (SE) research, transformer and its variants have emerged as the predominant methodologies. However, the quadratic complexity of the self-attention mechanism imposes certain limitations on practical deployment. Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision due to its strong capabilities in modeling long sequences and relatively low computational complexity. In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks. By leveraging bidirectional Mamba to model forward and backward dependencies of speech signals at different resolutions, and incorporating skip connections to capture multi-scale information, our approach achieves state-of-the-art (SOTA) performance. Experimental results on the VCTK+DEMAND dataset indicate that Mamba-SEUNet attains a PESQ score of 3.59, while maintaining low computational complexity. When combined with the Perceptual Contrast Stretching technique, Mamba-SEUNet further improves the PESQ score to 3.73.

【26】 Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech  Translation
标题: 提高直接视听语音到语音翻译中的唇形同步性
链接:https://arxiv.org/abs/2412.16530
作者: Lucas Goncalves,  Prashant Mathur,  Xing Niu,  Brady Houston,  Chandrashekhar Lavania,  Srikanth Vishnubhotla,  Lijia Sun,  Anthony Ferritto
备注:Accepted at ICASSP, 4 pages
摘要:视听语音到语音翻译通常优先考虑提升翻译质量和自然度。然而,在视听内容中,唇形同步(即确保唇部动作与所说内容相匹配)同样关键,它对保持配音视频的真实感至关重要。尽管如此重要,在AVS2S模型中引入唇形同步约束在很大程度上仍被忽视。本研究通过将唇形同步损失整合到AVS2S模型的训练过程中来填补这一空白。所提出的方法显著增强了直接视听语音到语音翻译中的唇形同步,平均LSE-D得分达到10.67,在四个语言对上相比强基线将LSE-D降低了9.2%。此外,该方法在将翻译语音叠加到原始视频上时,仍能保持其自然度和高质量,且不会带来翻译质量的下降。
摘要:Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony-ensuring that the movements of the lips match the spoken content-essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
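将唇形同步约束并入AVS2S训练,最直接的方式是在原有翻译/重建目标上叠加一个加权的唇同步项。下面的片段只示意这种损失组合方式;其中lip_sync_loss的具体形式(论文中与LSE-D类同步度量相关)与权重取值均为本文假设。

```python
import torch
import torch.nn.functional as F

def lip_sync_loss(pred_lip_emb, audio_emb):
    """唇同步项(示意):拉近生成视频的唇部嵌入与语音嵌入,类似SyncNet式的度量。"""
    return 1.0 - F.cosine_similarity(pred_lip_emb, audio_emb, dim=-1).mean()

def avs2s_total_loss(translation_loss, pred_lip_emb, audio_emb, lambda_sync=0.1):
    """总损失 = 原翻译目标 + lambda_sync * 唇同步约束。"""
    return translation_loss + lambda_sync * lip_sync_loss(pred_lip_emb, audio_emb)

# 示例
trans_loss = torch.tensor(2.31)                 # 由语音到语音翻译分支给出
lip_emb, aud_emb = torch.randn(8, 512), torch.randn(8, 512)
print(avs2s_total_loss(trans_loss, lip_emb, aud_emb).item())
```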

【27】 Text2midi: Generating Symbolic Music from Captions
标题: text2midi:从文本描述生成符号音乐
链接:https://arxiv.org/abs/2412.16526
作者: Keshav Bhandari,  Abhinaba Roy,  Kyra Wang,  Geeta Puri,  Simon Colton,  Dorien Herremans
备注:9 pages, 3 figures, Accepted at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)
摘要:本文介绍了text2midi,一个从文本描述生成MIDI文件的端到端模型。顺应多模态生成方法的日益普及,text2midi充分利用了文本数据的广泛可得性和大语言模型(LLM)的成功。我们的端到端系统借助LLM的能力,以MIDI文件的形式生成符号音乐。具体来说,我们利用预训练的LLM编码器处理文本描述,并以其为条件,让自回归Transformer解码器生成准确反映所给描述的MIDI序列。这种直观且用户友好的方法允许用户通过文本提示生成音乐片段,从而显著简化了音乐创作过程。我们进行了结合自动评测与人工评测的全面实证评估,结果表明模型生成的MIDI文件质量很高,并且确实可以由文本描述控制,这些描述可以包含和弦、调性和速度等音乐理论术语。我们在演示页面(https://github.com/AMAAI-Lab/Text2midi)上发布了代码和音乐示例,供用户与text2midi交互。
摘要:This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo. We release the code and music samples on our demo page (https://github.com/AMAAI-Lab/Text2midi) for users to interact with text2midi.
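下面用PyTorch的nn.TransformerDecoder给出"文本编码器输出作为memory、自回归解码MIDI token"这一结构的极简示意;词表大小、维度,以及用随机张量代替预训练LLM编码器输出等,均为本文假设,并非text2midi的实际配置。

```python
import torch
import torch.nn as nn

class CaptionToMIDIDecoder(nn.Module):
    """以文本编码(memory)为条件、自回归生成MIDI token的Transformer解码器(示意)。"""
    def __init__(self, vocab_size=4096, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, midi_tokens, text_memory):
        # midi_tokens: (batch, T_midi)  text_memory: (batch, T_text, d_model)
        T = midi_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)   # 因果掩码
        h = self.decoder(self.tok_emb(midi_tokens), text_memory, tgt_mask=causal)
        return self.lm_head(h)      # 每个位置上"下一个MIDI token"的logits

model = CaptionToMIDIDecoder()
text_memory = torch.randn(2, 20, 512)           # 实际中来自预训练LLM编码器
midi_tokens = torch.randint(0, 4096, (2, 64))
print(model(midi_tokens, text_memory).shape)    # torch.Size([2, 64, 4096])
```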

【28】 Adapting Whisper for Code-Switching through Encoding Refining and  Language-Aware Decoding
标题: 通过编码精炼和语言感知解码使Whisper适应语码转换
链接:https://arxiv.org/abs/2412.16507
作者: Jiahui Zhao,  Hao Shi,  Chenrui Cui,  Tianrui Wang,  Hexin Liu,  Zhaoheng Ni,  Lingxuan Ye,  Longbiao Wang
摘要:由于口音、听觉相似性和无缝语言切换所导致的语言混淆,语码转换(CS)自动语音识别(ASR)面临挑战。在预训练多语言模型上进行适配已显示出对CS-ASR的良好效果。在本文中,我们从编码器和解码器两部分对大规模多语言预训练语音识别模型Whisper进行语码转换适配。首先,我们提出编码器精炼器,以提升编码器处理句内语言切换的能力。其次,我们提出使用两组带有不同语言提示嵌入的语言感知适配器,在每个解码器层获得语言特定的解码信息,随后增加一个融合模块来融合语言感知解码结果。使用SEAME数据集的实验结果表明,与基线模型相比,所提出的方法在dev_man和dev_sge测试集上分别实现了4.1%和7.2%的相对MER降低,超过了最先进的方法。通过实验,我们发现该方法显著提高了CS语音中非母语部分的识别性能,表明该方法能使Whisper更好地区分两种语言。
摘要:Code-switching (CS) automatic speech recognition (ASR) faces challenges due to the language confusion resulting from accents, auditory similarity, and seamless language switches. Adaptation on the pre-trained multi-lingual model has shown promising performance for CS-ASR. In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. First, we propose an encoder refiner to enhance the encoder's capacity of intra-sentence switching. Second, we propose using two sets of language-aware adapters with different language prompt embeddings to achieve language-specific decoding information in each decoder layer. Then, a fusion module is added to fuse the language-aware decoding. The experimental results using the SEAME dataset show that, compared with the baseline model, the proposed approach achieves a relative MER reduction of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Through experiments, we found that the proposed method significantly improves the performance on non-native language in CS speech, indicating that our approach enables Whisper to better distinguish between the two languages.
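下面示意"两组语言提示嵌入 + 语言感知适配器 + 融合"这一解码端改动的一种可能形态:每个解码层里,两个瓶颈适配器分别以各自语言的提示嵌入为条件产生分支输出,再由门控融合。结构、维度与融合方式均为本文假设,并非论文或Whisper的原始实现。

```python
import torch
import torch.nn as nn

class LanguageAwareAdapter(nn.Module):
    """单语言分支的瓶颈适配器:以该语言的提示嵌入为条件对隐状态做修正。"""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.lang_prompt = nn.Parameter(torch.randn(d_model))   # 该分支的语言提示嵌入
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h):                       # h: (batch, T, d_model)
        return self.up(torch.relu(self.down(h + self.lang_prompt)))

class DualLanguageFusion(nn.Module):
    """双语分支 + 门控融合(示意),可插入每个解码器层之后。"""
    def __init__(self, d_model=512):
        super().__init__()
        self.adapter_a = LanguageAwareAdapter(d_model)   # 例如中文分支
        self.adapter_b = LanguageAwareAdapter(d_model)   # 例如英文分支
        self.gate = nn.Linear(d_model, 1)

    def forward(self, h):
        a, b = self.adapter_a(h), self.adapter_b(h)
        g = torch.sigmoid(self.gate(h))                  # 逐帧的语言门控
        return h + g * a + (1 - g) * b                   # 残差式融合两条语言分支

fusion = DualLanguageFusion()
print(fusion(torch.randn(2, 50, 512)).shape)             # torch.Size([2, 50, 512])
```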

【29】 Transducer-Llama: Integrating LLMs into Streamable Transducer-based  Speech Recognition
标题: Transducer-Llama:将LLM集成到基于Transducer的可流式语音识别中
链接:https://arxiv.org/abs/2412.16464
作者: Keqi Deng,  Jinxi Guo,  Yingyi Ma,  Niko Moritz,  Philip C. Woodland,  Ozlem Kalinli,  Mike Seltzer
备注:Accepted by ICASSP 2025
摘要:虽然大语言模型(LLM)已被应用于自动语音识别(ASR),但使模型具备流式能力仍是一个挑战。本文提出了一种新的模型架构Transducer-Llama,它将LLM集成到因子化Transducer(FT)模型中,天然支持流式处理。此外,鉴于LLM的大词表可能导致数据稀疏问题并增加口语系统的训练成本,本文引入了一种高效的词表适配技术,使LLM与语音系统的词表对齐。结果表明,使用RNN-T损失直接优化带有强大LLM预测器的FT模型,相比较小的预训练LM预测器只能带来有限的改进。因此,本文提出了一种从弱到强的LM替换策略:在RNN-T损失训练期间使用弱LM预测器,之后将其替换为强LLM。替换LM之后,采用最小词错误率(MWER)损失来微调LLM预测器与Transducer-Llama模型的集成。在LibriSpeech和大规模多语言LibriSpeech语料库上的实验表明,所提出的流式Transducer-Llama方法相比强FT基线取得了17%的相对WER降低(WERR),相比RNN-T基线取得了32%的WERR。
摘要:While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issue and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach gave a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.

【30】 A Classification Benchmark for Artificial Intelligence Detection of  Laryngeal Cancer from Patient Speech
标题: 人工智能从患者语音检测喉癌的分类基准
链接:https://arxiv.org/abs/2412.16267
作者: Mary Paterson,  James Moor,  Luisa Cutillo
备注:24 pages, 6 figures, 7 tables
摘要:据预测,喉癌病例在未来几年将显著增加。目前的诊断路径导致许多患者被错误地转诊到紧急疑似癌症路径,给患者和医疗系统带来了不必要的压力。人工智能提供了一个很有前景的解决方案:从患者语音中无创检测喉癌,这有助于更有效地安排转诊优先级,并减少非癌症患者的不当转诊。要实现这一潜力,开放科学至关重要。该领域的一个主要障碍是缺乏开源数据集和可复现的基准,迫使研究人员从零开始。我们的工作通过引入一个基准套件来应对这一挑战,该套件包含在开源数据集上训练和评估的36个模型。这些模型可在公共仓库中获取,为未来研究提供基础。它们评估了三种不同的算法和三种音频特征集,构成了一个全面的基准测试框架。我们提出了标准化的指标和评估方法,以确保未来研究结果的一致性和可比性。所提供的模型既包括仅音频输入,也包括融合人口统计学和症状数据的多模态输入,因而可应用于包含不同患者信息的数据集。借助这些基准,未来的研究人员可以评估自己的数据集、改进模型,并将其作为更先进方法的基础。这项工作旨在为建立可复现的基准提供基线,使研究人员能够将新方法与这些标准进行比较,最终推动用于喉癌检测的人工智能工具的发展。
摘要:Cases of laryngeal cancer are predicted to rise significantly in the coming years. Current diagnostic pathways cause many patients to be incorrectly referred to urgent suspected cancer pathways, putting undue stress on both patients and the medical system.  Artificial intelligence offers a promising solution by enabling non-invasive detection of laryngeal cancer from patient speech, which could help prioritise referrals more effectively and reduce inappropriate referrals of non-cancer patients. To realise this potential, open science is crucial. A major barrier in this field is the lack of open-source datasets and reproducible benchmarks, forcing researchers to start from scratch. Our work addresses this challenge by introducing a benchmark suite comprising 36 models trained and evaluated on open-source datasets. These models are accessible in a public repository, providing a foundation for future research. They evaluate three different algorithms and three audio feature sets, offering a comprehensive benchmarking framework. We propose standardised metrics and evaluation methodologies to ensure consistent and comparable results across future studies.  The presented models include both audio-only inputs and multimodal inputs that incorporate demographic and symptom data, enabling their application to datasets with diverse patient information. By providing these benchmarks, future researchers can evaluate their datasets, refine the models, and use them as a foundation for more advanced approaches. This work aims to provide a baseline for establishing reproducible benchmarks, enabling researchers to compare new methods against these standards and ultimately advancing the development of AI tools for detecting laryngeal cancer.

【31】 Decoding Poultry Vocalizations -- Natural Language Processing and  Transformer Models for Semantic and Emotional Analysis
标题: 家禽发声解码--用于语义和情感分析的自然语言处理和Transformer模型
链接:https://arxiv.org/abs/2412.16182
作者: Venkatraman Manikandan,  Suresh Neethirajan
备注:28 Pages, 14 figures
摘要:破译鸡的声音语言为动物福利和生态信息学提供了新的机会。它们微妙的声音信号编码了健康状况、情绪状态以及生态系统内的动态交互。理解这些叫声的语义,为解释其功能性词汇、厘清每种声音在社会和环境情境中所起的特定作用提供了有价值的工具。我们应用先进的自然语言处理和基于Transformer的模型,将生物声学数据转化为有意义的洞见。我们的方法将用于原始音频特征提取的Wave2Vec 2.0与一个经过微调的BERT(Bidirectional Encoder Representations from Transformers)模型相结合,后者在大规模动物声音语料库上预训练,并针对家禽任务进行了适配。该流水线将家禽发声解码为可解释的类别,包括求救叫声、喂食信号和交配发声,揭示了传统分析经常忽略的情感细微差别。我们的方法在关键发声类型分类上达到了92%的准确率,证明了对鸡群健康与应激进行实时自动监测的可行性。通过跟踪这一功能性词汇,养殖者可以主动应对环境或行为变化,改善家禽福利,减少与应激相关的生产力损失,并支持更可持续的农场管理。在农业之外,这项研究还加深了我们对计算生态学的理解:获取动物叫声的语义基础可以反映生物多样性、环境压力源和物种间相互作用,为生态系统层面的综合决策提供信息。
摘要:Deciphering the acoustic language of chickens offers new opportunities in animal welfare and ecological informatics. Their subtle vocal signals encode health conditions, emotional states, and dynamic interactions within ecosystems. Understanding the semantics of these calls provides a valuable tool for interpreting their functional vocabulary and clarifying how each sound serves a specific purpose in social and environmental contexts. We apply advanced Natural Language Processing and transformer based models to translate bioacoustic data into meaningful insights. Our method integrates Wave2Vec 2.0 for raw audio feature extraction with a fine tuned Bidirectional Encoder Representations from Transformers model, pretrained on a broad corpus of animal sounds and adapted to poultry tasks. This pipeline decodes poultry vocalizations into interpretable categories including distress calls, feeding signals, and mating vocalizations, revealing emotional nuances often overlooked by conventional analyses. Achieving 92 percent accuracy in classifying key vocalization types, our approach demonstrates the feasibility of real time automated monitoring of flock health and stress. By tracking this functional vocabulary, farmers can respond proactively to environmental or behavioral changes, improving poultry welfare, reducing stress related productivity losses, and supporting more sustainable farm management. Beyond agriculture, this research enhances our understanding of computational ecology. Accessing the semantic foundation of animal calls may indicate biodiversity, environmental stressors, and species interactions, informing integrative ecosystem level decision making.
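下面给出用HuggingFace transformers提取Wav2Vec 2.0特征、再接一个简单分类头的极简示意,对应摘要中"原始音频特征提取 + 分类"流水线的前半部分;其中的checkpoint名称、类别数与线性分类头均为本文假设,论文实际使用的是在动物声音语料上预训练并微调的BERT类模型。

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
classifier = nn.Linear(backbone.config.hidden_size, 3)   # 假设3类:求救/喂食/交配叫声

def classify_call(waveform_16k):
    """waveform_16k: 一维numpy数组,16kHz单声道音频。返回各叫声类别的概率。"""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state    # (1, T, hidden)
    pooled = hidden.mean(dim=1)                          # 时间平均池化
    return classifier(pooled).softmax(dim=-1)

# 示例:1秒的随机波形(实际应为真实鸡舍录音)
probs = classify_call(torch.randn(16000).numpy())
print(probs.shape)   # torch.Size([1, 3])
```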

【32】 Efficient VoIP Communications through LLM-based Real-Time Speech  Reconstruction and Call Prioritization for Emergency Services
标题: 通过基于LLM的实时语音重建和呼叫优先级排序实现面向紧急服务的高效VoIP通信
链接:https://arxiv.org/abs/2412.16176
作者: Danush Venkateshperumal,  Rahman Abdul Rafi,  Shakil Ahmed,  Ashfaq Khokhar
备注:15 pages,8 figures
摘要:紧急通信系统会因VoIP系统中的丢包、带宽限制、信号质量差、延迟和抖动而中断,导致实时服务质量下降。遇险者往往因恐慌、言语障碍和背景噪音而难以传达关键信息,这进一步加大了调度员准确评估现场情况的难度。急救中心的人员短缺则加剧了协调和援助的延误。本文提出利用大语言模型(LLM)来应对这些挑战:重建不完整的语音、填补上下文空白,并根据严重程度对呼叫进行优先级排序。该系统将实时转录与检索增强生成(RAG)相结合以生成上下文响应,并使用Twilio和AssemblyAI API实现无缝部署。评估结果显示,该模型具有较高的精确率、良好的BLEU和ROUGE分数,并与现实需求保持一致,表明其在优化应急响应工作流程和有效优先处理关键案例方面具有潜力。
摘要:Emergency communication systems face disruptions due to packet loss, bandwidth constraints, poor signal quality, delays, and jitter in VoIP systems, leading to degraded real-time service quality. Victims in distress often struggle to convey critical information due to panic, speech disorders, and background noise, further complicating dispatchers' ability to assess situations accurately. Staffing shortages in emergency centers exacerbate delays in coordination and assistance. This paper proposes leveraging Large Language Models (LLMs) to address these challenges by reconstructing incomplete speech, filling contextual gaps, and prioritizing calls based on severity. The system integrates real-time transcription with Retrieval-Augmented Generation (RAG) to generate contextual responses, using Twilio and AssemblyAI APIs for seamless implementation. Evaluation shows high precision, favorable BLEU and ROUGE scores, and alignment with real-world needs, demonstrating the model's potential to optimize emergency response workflows and prioritize critical cases effectively.

机器翻译由腾讯交互翻译提供,仅供参考


语音之家
助力AI语音开发者的社区