Speech/Audio Processing Academic Digest [Nov. 5]

Digest · 2024-11-05 18:00 · Beijing
Today's collection: 13 papers in cs.SD (Speech) and 16 papers in eess.AS (Audio Processing).

Reposted with authorization from arXiv每日学术速递 (arXiv Daily).

WeChat official account: arXiv_Daily

cs.SD Speech
【1】 3D Audio-Visual Segmentation
Link: https://arxiv.org/abs/2411.02236
Authors: Artem Sokolov, Swapnil Bhosale, Xiatian Zhu
Note: Accepted at the NeurIPS 2024 Workshop on Audio Imagination
Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://surrey-uplab.github.io/research/3d-audio-visual-segmentation/

【2】 Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Link: https://arxiv.org/abs/2411.02038
Authors: Yongxin Zhu, Bocheng Li, Yifei Xin, Linli Xu
Abstract: Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, which has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by the problem of representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of latent space at the expense of model capacity, which do not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors are updated through gradient descent. To address this issue, we propose \textbf{SimVQ}, a novel method which reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the \textit{entire linear space} spanned by the codebook, rather than merely updating \textit{the code vector} selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data with different model architectures. Our code is available at \url{https://github.com/youngsheen/SimVQ}.
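To make the reparameterization concrete, below is a minimal PyTorch sketch in which the codebook is obtained by passing a fixed random basis through one trainable linear layer, so gradient steps move the entire spanned space rather than individual codes. The layer sizes, the frozen basis, and the generic VQ-VAE losses are illustrative assumptions, not the authors' exact recipe; see the linked repository for the official implementation.

```python
import torch
import torch.nn as nn

class SimVQSketch(nn.Module):
    """Rough sketch of codebook reparameterization: codes = linear(basis)."""
    def __init__(self, num_codes=1024, dim=256, beta=0.25):
        super().__init__()
        # fixed latent basis; only the linear map below is trained (assumption)
        self.register_buffer("basis", torch.randn(num_codes, dim))
        self.linear = nn.Linear(dim, dim, bias=False)  # the "one linear layer"
        self.beta = beta

    def forward(self, z):                        # z: (batch, dim) encoder outputs
        codebook = self.linear(self.basis)       # updating W moves the whole spanned space
        idx = torch.cdist(z, codebook).argmin(dim=-1)   # nearest-neighbour assignment
        z_q = codebook[idx]
        # generic VQ-VAE codebook/commitment losses (not SimVQ-specific)
        loss = ((z.detach() - z_q) ** 2).mean() + self.beta * ((z - z_q.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()             # straight-through estimator for the encoder
        return z_q, idx, loss
```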

【3】 CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching
Link: https://arxiv.org/abs/2411.02026
Authors: Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao
Note: Work in progress; 5 pages
Abstract: Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground truth recordings continues to pose great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that leverages Content-aware Timbre Ensemble modeling and Flow Matching. Specifically, CTEFM-VC disentangles utterances into linguistic content and timbre representations, subsequently utilizing a conditional flow matching model and a vocoder to reconstruct the mel-spectrogram and waveform. To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the joint utilization of linguistic and timbre features through a cross-attention module. Experiments show that our CTEFM-VC system surpasses state-of-the-art VC methods in both speaker similarity and naturalness by at least 18.5% and 7.0%.

【4】 MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
Link: https://arxiv.org/abs/2411.01805
Authors: Fuming You, Minghui Fang, Li Tang, Rongjie Huang, Yongqi Wang, Zhou Zhao
Note: NeurIPS 2024
Abstract: Motion-to-music and music-to-motion have been studied separately, each attracting substantial research interest within their respective domains. The interaction between human motion and music is a reflection of advanced human intelligence, and establishing a unified relationship between them is particularly important. However, to date, there has been no work that considers them jointly to explore the modality alignment within. To bridge this gap, we propose a novel framework, termed MoMu-Diffusion, for long-term and synchronous motion-music generation. Firstly, to mitigate the huge computational costs raised by long sequences, we propose a novel Bidirectional Contrastive Rhythmic Variational Auto-Encoder (BiCoR-VAE) that extracts the modality-aligned latent representations for both motion and music inputs. Subsequently, leveraging the aligned latent spaces, we introduce a multi-modal Transformer-based diffusion model and a cross-guidance sampling strategy to enable various generation tasks, including cross-modal, multi-modal, and variable-length generation. Extensive experiments demonstrate that MoMu-Diffusion surpasses recent state-of-the-art methods both qualitatively and quantitatively, and can synthesize realistic, diverse, long-term, and beat-matched music or motion sequences. The generated samples and codes are available at https://momu-diffusion.github.io/

【5】 SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Link: https://arxiv.org/abs/2411.01710
Authors: Dennis Fucci, Marco Gaido, Beatrice Savoldi, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Abstract: Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques failing to account for the autoregressive nature of state-of-the-art models and to provide fine-grained, phonetically meaningful explanations. We address this gap by introducing Spectrogram Perturbation for Explainable Speech-to-text Generation (SPES), a feature attribution technique applicable to sequence generation tasks with autoregressive models. SPES provides explanations for each predicted token based on both the input spectrogram and the previously generated tokens. Extensive evaluation on speech recognition and translation demonstrates that SPES generates explanations that are faithful and plausible to humans.
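For intuition, the sketch below shows a generic occlusion-style attribution over a spectrogram: each time-frequency patch is zeroed in turn and the drop in log-probability of every generated token is recorded. SPES itself uses a more refined perturbation and aggregation scheme; the `model(spec, tokens)` callable returning per-token log-probabilities is a placeholder assumption.

```python
import torch

def occlusion_saliency(model, spec, target_tokens, patch_f=8, patch_t=8):
    """Generic occlusion attribution for an autoregressive speech-to-text model.

    spec: (freq, time) spectrogram tensor.
    model(spec, target_tokens) is assumed to return the log-probability of each
    already-generated token, shape (num_tokens,). Not the SPES algorithm itself.
    """
    n_freq, n_time = spec.shape
    saliency = torch.zeros(len(target_tokens), n_freq, n_time)
    with torch.no_grad():
        base = model(spec, target_tokens)                      # (num_tokens,)
        for f0 in range(0, n_freq, patch_f):
            for t0 in range(0, n_time, patch_t):
                masked = spec.clone()
                masked[f0:f0 + patch_f, t0:t0 + patch_t] = 0.0  # occlude one patch
                drop = base - model(masked, target_tokens)      # per-token score drop
                saliency[:, f0:f0 + patch_f, t0:t0 + patch_t] = drop.view(-1, 1, 1)
    return saliency   # (num_tokens, freq, time) attribution maps
```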

【6】 Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations
Link: https://arxiv.org/abs/2411.01661
Authors: Quoc-Huy Trinh, Minh-Van Nguyen, Trong-Hieu Nguyen Mau, Khoa Tran, Thanh Do
Abstract: Singing is one of the most cherished forms of human entertainment. However, creating a beautiful song requires an accompaniment that complements the vocals and aligns well with the song instruments and genre. With advancements in deep learning, previous research has focused on generating suitable accompaniments but often lacks precise alignment with the desired instrumentation and genre. To address this, we propose a straightforward method that enables control over the accompaniment through text prompts, allowing the generation of music that complements the vocals and aligns with the song instrumental and genre requirements. Through extensive experiments, we successfully generate 10-second accompaniments using vocal input and text control.

【7】 Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Link: https://arxiv.org/abs/2411.01156
Authors: Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, Yijin Xing
Abstract: Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech - capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100\% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at \href{https://github.com/fishaudio/fish-speech}{https://github.com/fishaudio/fish-speech}.

【8】 Music Foundation Model as Generic Booster for Music Downstream Tasks
Link: https://arxiv.org/abs/2411.01135
Authors: WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco Martinez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji
Note: 41 pages with 14 figures
Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

【9】 Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO
Link: https://arxiv.org/abs/2411.00980
Authors: Macarious Hui, Jinda Zhang, Aanchan Mohan
Abstract: Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns. In healthcare settings, communication breakdowns reduce the quality of care. While building an augmentative and alternative communication (AAC) tool to enable fluid communication, we found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data. Our work looks to leverage SOTA ASR followed by domain-specific error-correction. English dysarthric ASR performance is often evaluated on the TORGO dataset. Prompt-overlap is a well-known issue with this dataset where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt-overlap. After reducing prompt-overlap, results with SOTA ASR models produce extremely high word error rates for speakers with mild and severe dysarthria. Furthermore, to improve ASR, our work looks at the impact of n-gram language models and large-language model (LLM) based multi-modal generative error-correction algorithms like Whispering-LLaMA for a second pass ASR. Our work highlights how much more needs to be done to improve ASR for atypical speakers to enable equitable healthcare access both in-person and in e-health settings.
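A simple way to remove prompt overlap, sketched below, is to drop any test utterance whose prompt text also appears among the training speakers' prompts. The utterance-dict fields are assumptions for illustration; the paper's actual algorithm may partition the data differently.

```python
def split_without_prompt_overlap(utterances, test_speakers):
    """Filter out test utterances whose prompt text also occurs in training data.

    utterances: iterable of dicts like {"speaker": "F01", "text": "pa ta ka"}
    test_speakers: set of speaker IDs held out for testing.
    Sketch of the general idea only, not the paper's exact procedure.
    """
    train = [u for u in utterances if u["speaker"] not in test_speakers]
    test = [u for u in utterances if u["speaker"] in test_speakers]
    train_prompts = {u["text"].lower().strip() for u in train}
    test_no_overlap = [u for u in test if u["text"].lower().strip() not in train_prompts]
    return train, test_no_overlap
```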

【10】 Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization
Link: https://arxiv.org/abs/2411.02165
Authors: Petr Pálka, Federico Landini, Dominik Klement, Mireia Diez, Anna Silnova, Marc Delcroix, Lukáš Burget
Abstract: In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD and OSD simultaneously and reach competitive performance at a fraction of the inference time of a standard approach. Furthermore, the joint inference leads to a simplified overall pipeline which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.
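As a rough picture of such joint training, the sketch below shares one frame encoder across a VAD head, an OSD head, and a pooled speaker-embedding head, so a single forward pass serves all three modules. The layer choices are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class JointDiarFrontendSketch(nn.Module):
    """Illustrative multi-task layout: shared encoder, three task heads."""
    def __init__(self, n_mels=80, dim=256, emb_dim=192):
        super().__init__()
        self.encoder = nn.GRU(n_mels, dim, num_layers=2, batch_first=True)
        self.vad_head = nn.Linear(dim, 1)   # frame-level speech activity
        self.osd_head = nn.Linear(dim, 1)   # frame-level overlapped speech
        self.emb_head = nn.Linear(dim, emb_dim)

    def forward(self, feats):               # feats: (batch, frames, n_mels)
        h, _ = self.encoder(feats)
        vad = torch.sigmoid(self.vad_head(h)).squeeze(-1)
        osd = torch.sigmoid(self.osd_head(h)).squeeze(-1)
        # VAD-weighted average pooling over frames for the speaker embedding
        w = vad.unsqueeze(-1)
        emb = self.emb_head((h * w).sum(1) / (w.sum(1) + 1e-6))
        return vad, osd, emb
```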

【11】 Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement
Link: https://arxiv.org/abs/2411.02019
Authors: Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Vamsi Krishna Ithapu, Shih-Chii Liu
Note: Submitted to ICASSP 2025
Abstract: Deep learning-based speech enhancement (SE) methods often face significant computational challenges when needing to meet low-latency requirements because of the increased number of frames to be processed. This paper introduces the SlowFast framework which aims to reduce computation costs specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the needed higher frame rate to match the required latency. Specifically, the fast branch employs a state space model where its state transition process is dynamically modulated by the slow branch. Experiments on a SE task with a 2 ms algorithmic latency requirement using the Voice Bank + Demand dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 60 {\mu}s (one sample point at 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and SISNR of 16.62.
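A deliberately tiny toy version of the slow/fast split is sketched below: a low-rate branch emits, once per block of samples, a decay vector that modulates the state transition of a first-order diagonal state-space recursion running at sample rate. The GRU slow branch, block size, and first-order recursion are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class SlowFastSSMSketch(nn.Module):
    """Toy slow/fast split: slow branch modulates the decay of a fast SSM."""
    def __init__(self, dim=16, block=160):          # block = slow-branch hop (hypothetical)
        super().__init__()
        self.block = block
        self.slow = nn.GRU(1, dim, batch_first=True)  # low-rate acoustic analysis
        self.to_decay = nn.Linear(dim, dim)
        self.in_proj = nn.Linear(1, dim)
        self.out_proj = nn.Linear(dim, 1)

    def forward(self, x):                            # x: (batch, time) waveform, time >= block
        b, t = x.shape
        assert t >= self.block
        # slow branch: one summary vector per block of samples
        slow_in = x.unfold(1, self.block, self.block).mean(-1, keepdim=True)
        slow_feat, _ = self.slow(slow_in)                         # (b, n_blocks, dim)
        decay = torch.sigmoid(self.to_decay(slow_feat))           # modulated state transition
        # fast branch: first-order diagonal SSM, h[n] = a[n] * h[n-1] + u[n]
        u = self.in_proj(x.unsqueeze(-1))                         # (b, t, dim)
        h = x.new_zeros(b, decay.shape[-1])
        y = []
        for n in range(t):
            a = decay[:, min(n // self.block, decay.shape[1] - 1)]
            h = a * h + u[:, n]
            y.append(self.out_proj(h))                            # (b, 1) per sample
        return torch.cat(y, dim=-1)                               # (b, t) enhanced waveform
```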

【12】 Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
Link: https://arxiv.org/abs/2411.01174
Authors: Han Yin, Yang Xiao, Jisheng Bai, Rohan Kumar Das
Note: Submitted to ICASSP 2025 Workshop
Abstract: Sound Event Detection (SED) is challenging in noisy environments where overlapping sounds obscure target events. Language-queried audio source separation (LASS) aims to isolate the target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly in noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of large language models (LLMs) to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model is applied to predict clip-wise event predictions as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs in noise-robust SED and suggests a promising direction for handling overlapping events in SED. Codes and pretrained models are available at https://github.com/apple-yinhan/Noise-robust-SED.
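The noise-augmentation step can be pictured as mixing each training clip with an (LLM-selected) noise clip at a target SNR, as in the generic sketch below; the paper's exact recipe may differ.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean clip with a noise clip at a target SNR (generic augmentation).

    clean, noise: 1-D float arrays; the noise is looped/trimmed to the clip length.
    """
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise
```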

【13】 An incremental algorithm based on multichannel non-negative matrix partial co-factorization for ambient denoising in auscultation
Link: https://arxiv.org/abs/2411.01018
Authors: Juan De La Torre Cruz, Francisco Jesus Canadas Quesada, Damian Martinez-Munoz, Nicolas Ruiz Reyes, Sebastian Garcia Galan, Julio Jose Carabias Orti
Abstract: The aim of this study is to implement a method to remove ambient noise in biomedical sounds captured in auscultation. We propose an incremental approach based on multichannel non-negative matrix partial co-factorization (NMPCF) for ambient denoising focusing on high noisy environment with a Signal-to-Noise Ratio (SNR) <= -5 dB. The first contribution applies NMPCF assuming that ambient noise can be modelled as repetitive sound events simultaneously found in two single-channel inputs captured by means of different recording devices. The second contribution proposes an incremental algorithm, based on the previous multichannel NMPCF, that refines the estimated biomedical spectrogram throughout a set of incremental stages by eliminating most of the ambient noise that was not removed in the previous stage at the expense of preserving most of the biomedical spectral content. The ambient denoising performance of the proposed method, compared to some of the most relevant state-of-the-art methods, has been evaluated using a set of recordings composed of biomedical sounds mixed with ambient noise that typically surrounds a medical consultation room to simulate high noisy environments with a SNR from -20 dB to -5 dB. Experimental results report that: (i) the performance drop suffered by the proposed method is lower compared to MSS and NLMS; (ii) unlike what happens with MSS and NLMS, the proposed method shows a stable trend of the average SDR and SIR results regardless of the type of ambient noise and the SNR level evaluated; and (iii) a remarkable advantage is the high robustness of the estimated biomedical sounds when the two single-channel inputs suffer from a delay between them.
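A single-stage NumPy sketch of the shared-basis idea is given below, using Euclidean multiplicative updates in which the ambient bases are tied across both channel spectrograms while the biomedical bases explain only channel 1. Component counts and the iteration budget are illustrative assumptions; the paper additionally wraps a factorization like this in an incremental multi-stage refinement of the biomedical estimate.

```python
import numpy as np

def nmpcf_denoise(V1, V2, k_ambient=16, k_bio=16, n_iter=200, eps=1e-9):
    """One-stage partial co-factorization sketch.

    V1: (freq, t1) magnitude spectrogram of the auscultation channel.
    V2: (freq, t2) magnitude spectrogram of the ambient reference channel.
    Model: V1 ~ Wa@Ha1 + Wb@Hb (ambient + biomedical), V2 ~ Wa@Ha2 (ambient only).
    Returns the estimated biomedical magnitude spectrogram Wb@Hb.
    """
    rng = np.random.default_rng(0)
    f, t1 = V1.shape
    _, t2 = V2.shape
    Wa, Wb = rng.random((f, k_ambient)), rng.random((f, k_bio))
    Ha1, Ha2 = rng.random((k_ambient, t1)), rng.random((k_ambient, t2))
    Hb = rng.random((k_bio, t1))
    for _ in range(n_iter):
        R1, R2 = Wa @ Ha1 + Wb @ Hb, Wa @ Ha2
        # shared ambient bases are updated from both channels
        Wa *= (V1 @ Ha1.T + V2 @ Ha2.T) / (R1 @ Ha1.T + R2 @ Ha2.T + eps)
        Wb *= (V1 @ Hb.T) / (R1 @ Hb.T + eps)
        R1, R2 = Wa @ Ha1 + Wb @ Hb, Wa @ Ha2
        Ha1 *= (Wa.T @ V1) / (Wa.T @ R1 + eps)
        Ha2 *= (Wa.T @ V2) / (Wa.T @ R2 + eps)
        Hb *= (Wb.T @ V1) / (Wb.T @ R1 + eps)
    return Wb @ Hb
```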

eess.AS Audio Processing

【1】 Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization
Link: https://arxiv.org/abs/2411.02165
Authors: Petr Pálka, Federico Landini, Dominik Klement, Mireia Diez, Anna Silnova, Marc Delcroix, Lukáš Burget
Abstract: In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD and OSD simultaneously and reach competitive performance at a fraction of the inference time of a standard approach. Furthermore, the joint inference leads to a simplified overall pipeline which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.

【2】 Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data
Link: https://arxiv.org/abs/2411.02037
Authors: Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie
Abstract: Acoustic articulatory inversion is a major processing challenge, with a wide range of applications from speech synthesis to feedback systems for language learning and rehabilitation. In recent years, deep learning methods have been applied to the inversion of less than a dozen geometrical positions corresponding to sensors glued to easily accessible articulators. It is therefore impossible to know the shape of the whole tongue from root to tip. In this work, we use high-quality real-time MRI data to track the contour of the tongue. The data used to drive the inversion are therefore the unstructured speech signal and the tongue contours. Several architectures relying on a Bi-MSTM including or not an autoencoder to reduce the dimensionality of the latent space, using or not the phonetic segmentation have been explored. The results show that the tongue contour can be recovered with a median accuracy of 2.21 mm (or 1.37 pixel) taking a context of 1 MFCC frame (static, delta and double-delta cepstral features).
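As a rough sketch of this kind of acoustic-to-articulatory regressor, the model below maps per-frame MFCC features (static, delta, double-delta) to flattened contour coordinates with a bidirectional LSTM; layer sizes and the number of contour points are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AcousticToContourSketch(nn.Module):
    """Illustrative bidirectional LSTM regressor from MFCC frames to contour points."""
    def __init__(self, n_mfcc=39, hidden=256, n_points=50):
        super().__init__()
        self.blstm = nn.LSTM(n_mfcc, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2 * n_points)   # (x, y) per contour point

    def forward(self, mfcc):              # mfcc: (batch, frames, n_mfcc)
        h, _ = self.blstm(mfcc)
        return self.head(h)               # (batch, frames, 2 * n_points)
```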

【3】 Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement
Link: https://arxiv.org/abs/2411.02019
Authors: Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Vamsi Krishna Ithapu, Shih-Chii Liu
Note: Submitted to ICASSP 2025
Abstract: Deep learning-based speech enhancement (SE) methods often face significant computational challenges when needing to meet low-latency requirements because of the increased number of frames to be processed. This paper introduces the SlowFast framework which aims to reduce computation costs specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the needed higher frame rate to match the required latency. Specifically, the fast branch employs a state space model where its state transition process is dynamically modulated by the slow branch. Experiments on a SE task with a 2 ms algorithmic latency requirement using the Voice Bank + Demand dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 60 {\mu}s (one sample point at 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and SISNR of 16.62.

【4】 Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
Link: https://arxiv.org/abs/2411.01174
Authors: Han Yin, Yang Xiao, Jisheng Bai, Rohan Kumar Das
Note: Submitted to ICASSP 2025 Workshop
Abstract: Sound Event Detection (SED) is challenging in noisy environments where overlapping sounds obscure target events. Language-queried audio source separation (LASS) aims to isolate the target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly in noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of large language models (LLMs) to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model is applied to predict clip-wise event predictions as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs in noise-robust SED and suggests a promising direction for handling overlapping events in SED. Codes and pretrained models are available at https://github.com/apple-yinhan/Noise-robust-SED.

【5】 An incremental algorithm based on multichannel non-negative matrix partial co-factorization for ambient denoising in auscultation
Link: https://arxiv.org/abs/2411.01018
Authors: Juan De La Torre Cruz, Francisco Jesus Canadas Quesada, Damian Martinez-Munoz, Nicolas Ruiz Reyes, Sebastian Garcia Galan, Julio Jose Carabias Orti
Abstract: The aim of this study is to implement a method to remove ambient noise in biomedical sounds captured in auscultation. We propose an incremental approach based on multichannel non-negative matrix partial co-factorization (NMPCF) for ambient denoising focusing on high noisy environment with a Signal-to-Noise Ratio (SNR) <= -5 dB. The first contribution applies NMPCF assuming that ambient noise can be modelled as repetitive sound events simultaneously found in two single-channel inputs captured by means of different recording devices. The second contribution proposes an incremental algorithm, based on the previous multichannel NMPCF, that refines the estimated biomedical spectrogram throughout a set of incremental stages by eliminating most of the ambient noise that was not removed in the previous stage at the expense of preserving most of the biomedical spectral content. The ambient denoising performance of the proposed method, compared to some of the most relevant state-of-the-art methods, has been evaluated using a set of recordings composed of biomedical sounds mixed with ambient noise that typically surrounds a medical consultation room to simulate high noisy environments with a SNR from -20 dB to -5 dB. Experimental results report that: (i) the performance drop suffered by the proposed method is lower compared to MSS and NLMS; (ii) unlike what happens with MSS and NLMS, the proposed method shows a stable trend of the average SDR and SIR results regardless of the type of ambient noise and the SNR level evaluated; and (iii) a remarkable advantage is the high robustness of the estimated biomedical sounds when the two single-channel inputs suffer from a delay between them.

【6】 3D Audio-Visual Segmentation
Link: https://arxiv.org/abs/2411.02236
Authors: Artem Sokolov, Swapnil Bhosale, Xiatian Zhu
Note: Accepted at the NeurIPS 2024 Workshop on Audio Imagination
Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://surrey-uplab.github.io/research/3d-audio-visual-segmentation/

【7】 Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Link: https://arxiv.org/abs/2411.02038
Authors: Yongxin Zhu, Bocheng Li, Yifei Xin, Linli Xu
Abstract: Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, which has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by the problem of representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of latent space at the expense of model capacity, which do not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors are updated through gradient descent. To address this issue, we propose \textbf{SimVQ}, a novel method which reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the \textit{entire linear space} spanned by the codebook, rather than merely updating \textit{the code vector} selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data with different model architectures. Our code is available at \url{https://github.com/youngsheen/SimVQ}.

【8】 CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching
Link: https://arxiv.org/abs/2411.02026
Authors: Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao
Note: Work in progress; 5 pages
Abstract: Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground truth recordings continues to pose great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that leverages Content-aware Timbre Ensemble modeling and Flow Matching. Specifically, CTEFM-VC disentangles utterances into linguistic content and timbre representations, subsequently utilizing a conditional flow matching model and a vocoder to reconstruct the mel-spectrogram and waveform. To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the joint utilization of linguistic and timbre features through a cross-attention module. Experiments show that our CTEFM-VC system surpasses state-of-the-art VC methods in both speaker similarity and naturalness by at least 18.5% and 7.0%.

【9】 Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Link: https://arxiv.org/abs/2411.01834
Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, Ivan Bulyko
Abstract: While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
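The preference-data step can be sketched in a few lines: sample several continuations per prompt, score them with a semantic metric, and keep the best/worst pair for DPO. The `generate` and `score` callables below are placeholders, not the paper's models or metrics.

```python
def build_dpo_pair(prompt, generate, score, n=4):
    """Construct one DPO preference pair from sampled continuations.

    generate(prompt) -> one candidate continuation (placeholder).
    score(candidate) -> scalar semantic score, higher is better (placeholder).
    """
    candidates = [generate(prompt) for _ in range(n)]
    ranked = sorted(candidates, key=score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```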

【10】 MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
Link: https://arxiv.org/abs/2411.01805
Authors: Fuming You, Minghui Fang, Li Tang, Rongjie Huang, Yongqi Wang, Zhou Zhao
Note: NeurIPS 2024
Abstract: Motion-to-music and music-to-motion have been studied separately, each attracting substantial research interest within their respective domains. The interaction between human motion and music is a reflection of advanced human intelligence, and establishing a unified relationship between them is particularly important. However, to date, there has been no work that considers them jointly to explore the modality alignment within. To bridge this gap, we propose a novel framework, termed MoMu-Diffusion, for long-term and synchronous motion-music generation. Firstly, to mitigate the huge computational costs raised by long sequences, we propose a novel Bidirectional Contrastive Rhythmic Variational Auto-Encoder (BiCoR-VAE) that extracts the modality-aligned latent representations for both motion and music inputs. Subsequently, leveraging the aligned latent spaces, we introduce a multi-modal Transformer-based diffusion model and a cross-guidance sampling strategy to enable various generation tasks, including cross-modal, multi-modal, and variable-length generation. Extensive experiments demonstrate that MoMu-Diffusion surpasses recent state-of-the-art methods both qualitatively and quantitatively, and can synthesize realistic, diverse, long-term, and beat-matched music or motion sequences. The generated samples and codes are available at https://momu-diffusion.github.io/

【11】 SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Link: https://arxiv.org/abs/2411.01710
Authors: Dennis Fucci, Marco Gaido, Beatrice Savoldi, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
Abstract: Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques failing to account for the autoregressive nature of state-of-the-art models and to provide fine-grained, phonetically meaningful explanations. We address this gap by introducing Spectrogram Perturbation for Explainable Speech-to-text Generation (SPES), a feature attribution technique applicable to sequence generation tasks with autoregressive models. SPES provides explanations for each predicted token based on both the input spectrogram and the previously generated tokens. Extensive evaluation on speech recognition and translation demonstrates that SPES generates explanations that are faithful and plausible to humans.

【12】 Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations
Link: https://arxiv.org/abs/2411.01661
Authors: Quoc-Huy Trinh, Minh-Van Nguyen, Trong-Hieu Nguyen Mau, Khoa Tran, Thanh Do
Abstract: Singing is one of the most cherished forms of human entertainment. However, creating a beautiful song requires an accompaniment that complements the vocals and aligns well with the song instruments and genre. With advancements in deep learning, previous research has focused on generating suitable accompaniments but often lacks precise alignment with the desired instrumentation and genre. To address this, we propose a straightforward method that enables control over the accompaniment through text prompts, allowing the generation of music that complements the vocals and aligns with the song instrumental and genre requirements. Through extensive experiments, we successfully generate 10-second accompaniments using vocal input and text control.

【13】 Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Link: https://arxiv.org/abs/2411.01156
Authors: Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, Yijin Xing
Abstract: Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech - capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100\% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at \href{https://github.com/fishaudio/fish-speech}{https://github.com/fishaudio/fish-speech}.

【14】 Music Foundation Model as Generic Booster for Music Downstream Tasks
Link: https://arxiv.org/abs/2411.01135
Authors: WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco Martinez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji
Note: 41 pages with 14 figures
Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

【15】 Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO
Link: https://arxiv.org/abs/2411.00980
Authors: Macarious Hui, Jinda Zhang, Aanchan Mohan
Abstract: Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns. In healthcare settings, communication breakdowns reduce the quality of care. While building an augmentative and alternative communication (AAC) tool to enable fluid communication, we found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data. Our work looks to leverage SOTA ASR followed by domain-specific error-correction. English dysarthric ASR performance is often evaluated on the TORGO dataset. Prompt-overlap is a well-known issue with this dataset where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt-overlap. After reducing prompt-overlap, results with SOTA ASR models produce extremely high word error rates for speakers with mild and severe dysarthria. Furthermore, to improve ASR, our work looks at the impact of n-gram language models and large-language model (LLM) based multi-modal generative error-correction algorithms like Whispering-LLaMA for a second pass ASR. Our work highlights how much more needs to be done to improve ASR for atypical speakers to enable equitable healthcare access both in-person and in e-health settings.

【16】 Personality Analysis from Online Short Video Platforms with Multi-domain Adaptation
Link: https://arxiv.org/abs/2411.00813
Authors: Sixu An, Xiangguo Sun, Yicong Li, Yu Yang, Guandong Xu
Abstract: Personality analysis from online short videos has gained prominence due to its applications in personalized recommendation systems, sentiment analysis, and human-computer interaction. Traditional assessment methods, such as questionnaires based on the Big Five Personality Framework, are limited by self-report biases and are impractical for large-scale or real-time analysis. Leveraging the rich, multi-modal data present in short videos offers a promising alternative for more accurate personality inference. However, integrating these diverse and asynchronous modalities poses significant challenges, particularly in aligning time-varying data and ensuring models generalize well to new domains with limited labeled data. In this paper, we propose a novel multi-modal personality analysis framework that addresses these challenges by synchronizing and integrating features from multiple modalities and enhancing model generalization through domain adaptation. We introduce a timestamp-based modality alignment mechanism that synchronizes data based on spoken word timestamps, ensuring accurate correspondence across modalities and facilitating effective feature integration. To capture temporal dependencies and inter-modal interactions, we employ Bidirectional Long Short-Term Memory networks and self-attention mechanisms, allowing the model to focus on the most informative features for personality prediction. Furthermore, we develop a gradient-based domain adaptation method that transfers knowledge from multiple source domains to improve performance in target domains with scarce labeled data. Extensive experiments on real-world datasets demonstrate that our framework significantly outperforms existing methods in personality prediction tasks, highlighting its effectiveness in capturing complex behavioral cues and robustness in adapting to new domains.
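The timestamp-based alignment can be pictured as pooling each modality's frame-level features over every spoken word's time span, as in the sketch below; the field names and mean-pooling are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def align_to_word_timestamps(words, frame_feats, frame_times):
    """Pool frame-level features over each spoken word's [start, end) interval.

    words: list of dicts like {"word": "hello", "start": 0.42, "end": 0.71} (seconds).
    frame_feats: (num_frames, feat_dim) array from one modality.
    frame_times: (num_frames,) array of frame timestamps in seconds.
    Returns a (num_words, feat_dim) array on a shared per-word time base.
    """
    aligned = []
    for w in words:
        mask = (frame_times >= w["start"]) & (frame_times < w["end"])
        pooled = frame_feats[mask].mean(axis=0) if mask.any() else np.zeros(frame_feats.shape[1])
        aligned.append(pooled)
    return np.stack(aligned)
```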

