本文经arXiv每日学术速递授权转载
标题:NatureLM-audio:面向生物声学的音频-语言基础模型
链接:https://arxiv.org/abs/2411.07186
备注:Demo page: this https URL The code will be open-sourced and available shortly
摘要:以文本和音频为提示的大型语言模型(LLM)代表了语音、音乐和通用音频等多种听觉任务的最新技术水平,并在未见过的任务上展现出涌现能力。然而,这些能力尚未在生物声学任务中得到充分验证,例如在大规模录音中检测动物发声、对稀有和濒危物种进行分类,以及标注情境与行为,这些任务对于保护、生物多样性监测和动物行为研究至关重要。在这项工作中,我们提出了NatureLM-audio,这是第一个专门为生物声学设计的音频-语言基础模型。我们精心构建的训练数据集由涵盖多种生物声学、语音和音乐数据的文本-音频对组成,旨在应对该领域标注数据有限带来的挑战。我们展示了将从音乐和语音中学到的表示成功迁移到生物声学,模型在未见过的类群和任务上表现出良好的泛化能力。重要的是,我们在一个新基准(BEANS-Zero)上测试了NatureLM-audio,它在多个生物声学任务上创造了新的最先进水平(SotA),包括对未见过物种的zero-shot分类。为推进生物声学研究,我们还开源了用于生成训练与基准数据以及训练模型的代码。
摘要:Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior - tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.
标题:对人体位置鲁棒的基于声学的3D人体姿势估计
链接:https://arxiv.org/abs/2411.07165
备注:Accepted at BMVC2024
摘要:本文研究仅利用低层声学信号进行三维人体姿态估计的问题。现有基于主动声学感知的3D人体姿态估计方法隐含地假设目标用户位于扬声器与麦克风之间的连线上。由于人体对声音的反射和衍射所引起的声学信号变化比声音遮挡造成的变化更细微,当受试者偏离这条连线时,现有模型的精度会显著下降,限制了其在真实场景中的实用性。为克服这一局限,我们提出了一种由位置判别器和抗混响模型组成的新方法。前者预测受试者的站立位置,并通过对抗学习提取与受试者位置无关的特征;后者利用估计目标时刻之前的声学信号作为参考,以增强对衍射和反射导致的声音到达时间变化的鲁棒性。我们构建了一个涵盖多种人体位置的声学姿态估计数据集,并通过实验证明所提方法优于现有方法。
摘要:This paper explores the problem of 3D human pose estimation from only low-level acoustic signals. The existing active acoustic sensing-based approach for 3D human pose estimation implicitly assumes that the target user is positioned along a line between loudspeakers and a microphone. Because reflection and diffraction of sound by the human body cause subtle acoustic signal changes compared to sound obstruction, the existing model degrades its accuracy significantly when subjects deviate from this line, limiting its practicality in real-world scenarios. To overcome this limitation, we propose a novel method composed of a position discriminator and reverberation-resistant model. The former predicts the standing positions of subjects and applies adversarial learning to extract subject position-invariant features. The latter utilizes acoustic signals before the estimation target time as references to enhance robustness against the variations in sound arrival times due to diffraction and reflection. We construct an acoustic pose estimation dataset that covers diverse human locations and demonstrate through experiments that our proposed method outperforms existing approaches.
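下面给出"位置判别器 + 对抗学习提取位置不变特征"这一思路的极简PyTorch草图:采用常见的梯度反转层(GRL)实现对抗训练。网络结构、特征维度、位置类别数和 lam 取值均为示意性假设,并非论文的实现细节。

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """梯度反转:前向恒等,反向把梯度乘以 -lam。"""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class PositionDiscriminator(nn.Module):
    """从声学特征预测站立位置(此处假设离散为 n_pos 类)。"""
    def __init__(self, feat_dim=256, n_pos=5, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_pos))

    def forward(self, feat):
        # 梯度反转使判别器学习分类位置,而上游特征提取器被推向位置不变
        return self.net(GradReverse.apply(feat, self.lam))

disc = PositionDiscriminator()
logits = disc(torch.randn(8, 256))
print(logits.shape)   # torch.Size([8, 5])
# 训练时的总损失大致为:姿态回归损失 + 位置对抗分类损失
```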
标题:建立台湾普通话口语模型:首次尝试
链接:https://arxiv.org/abs/2411.07111
备注:Work in progress
摘要:本技术报告介绍了我们为台湾普通话构建口语大语言模型(LLM)的初步尝试,该模型专门面向多轮对话中的实时语音到语音交互。我们的端到端模型采用仅解码器的Transformer架构,旨在实现无缝交互并保持对话流,包括允许同时说和听的全双工能力。报告还详细介绍了训练过程,包括利用合成对话进行数据准备以及针对实时交互所做的调整。我们还开发了一个平台,用于评估多轮对话中的会话流畅性和响应一致性。我们希望这份报告的发布能为台湾普通话口语LLM的未来发展做出贡献。
摘要:This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.
标题:基于Mamba的纯解码器、双向语音建模的语音识别方法
链接:https://arxiv.org/abs/2411.06968
备注:Accepted to SLT 2024
摘要:以Mamba为代表的选择性状态空间模型(SSM)已在包括自动语音识别(ASR)在内的多种任务中展现出计算效率和良好效果。此前Mamba被应用于基于注意力的编码器-解码器ASR框架,其中编码器与解码器之间仍保留交叉注意力机制。本文探讨Mamba作为ASR任务中仅解码器架构的能力。我们提出的基于Mamba的仅解码器方法(MADEON)由单个解码器构成,以语音token为条件、以自回归方式预测文本token。为了增强MADEON,我们进一步提出语音前缀(speech prefixing),对语音token进行双向处理,从而丰富隐藏状态中的上下文信息。实验表明,MADEON显著优于非选择性SSM;语音前缀与最近提出的Mamba-2相结合,在大规模数据集上取得了与基于Transformer的模型相当的性能。
摘要:Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). Mamba has been applied to ASR task with the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remains. This paper explores the capability of Mamba as the decoder-only architecture in ASR task. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing that performs bidirectional processing on speech tokens, which enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech prefixing and the recently proposed Mamba-2 yields comparable performance to Transformer-based models on large datasets.
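下面用一段极简PyTorch代码示意"语音前缀双向处理 + 仅解码器自回归预测"的思路。原文使用Mamba选择性SSM,这里以GRU占位,仅说明双向前缀处理与前缀拼接方式,属于假设性草图。

```python
import torch
import torch.nn as nn

class SpeechPrefixBiEncoder(nn.Module):
    """对语音前缀做双向处理的极简示意:正向与反向各跑一个序列模型后相加。
    此处用 GRU 占位,原文使用的是 Mamba 选择性 SSM。"""
    def __init__(self, dim=32):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)

    def forward(self, speech_emb):                   # (B, T_s, D)
        h_f, _ = self.fwd(speech_emb)
        h_b, _ = self.bwd(torch.flip(speech_emb, dims=[1]))
        return h_f + torch.flip(h_b, dims=[1])       # 融合双向上下文

enc = SpeechPrefixBiEncoder(dim=32)
prefix = enc(torch.randn(2, 7, 32))
print(prefix.shape)                                  # torch.Size([2, 7, 32])
# 解码时:将处理后的语音前缀与已生成文本的嵌入拼接,
# 交给仅解码器模型自回归地预测下一个文本 token
# dec_in = torch.cat([prefix, text_emb], dim=1)
```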
标题:借助声音空间谱从脑电信号中多类解码受关注说话人方向
链接:https://arxiv.org/abs/2411.06928
摘要:从听众的脑电图(EEG)信号中解码受关注说话人的方向性焦点,对于开发脑机接口、改善听障人士的生活质量至关重要。先前的工作集中于二分类的方向性焦点解码,即判断受关注的说话人位于听众的左侧还是右侧。然而,有效的语音处理需要对受关注说话人的确切方向进行更精细的解码。此外,音频空间信息尚未被有效利用,导致解码结果欠佳。在本文中,我们在最近提出的15类方向性焦点数据集上观察到,仅依赖EEG输入的模型在leave-one-subject-out和leave-one-trial-out两种场景下解码方向性焦点的准确率都显著偏低;而将声音空间谱与EEG特征相结合可以有效提高解码精度。我们采用CNN、LSM-CNN和EEG-Deformer模型,结合辅助的声音空间谱从听众的EEG信号中解码方向性焦点。所提出的Sp-Aux-Deformer模型在leave-one-subject-out和leave-one-trial-out场景中分别取得了57.48%和61.83%的15类解码准确率。
摘要:Decoding the directional focus of an attended speaker from listeners' electroencephalogram (EEG) signals is essential for developing brain-computer interfaces to improve the quality of life for individuals with hearing impairment. Previous works have concentrated on binary directional focus decoding, i.e., determining whether the attended speaker is on the left or right side of the listener. However, a more precise decoding of the exact direction of the attended speaker is necessary for effective speech processing. Additionally, audio spatial information has not been effectively leveraged, resulting in suboptimal decoding results. In this paper, we observe that, on our recently presented dataset with 15-class directional focus, models relying exclusively on EEG inputs exhibits significantly lower accuracy when decoding the directional focus in both leave-one-subject-out and leave-one-trial-out scenarios. By integrating audio spatial spectra with EEG features, the decoding accuracy can be effectively improved. We employ the CNN, LSM-CNN, and EEG-Deformer models to decode the directional focus from listeners' EEG signals with the auxiliary audio spatial spectra. The proposed Sp-Aux-Deformer model achieves notable 15-class decoding accuracies of 57.48% and 61.83% in leave-one-subject-out and leave-one-trial-out scenarios, respectively.
标题:Rosanna shuffle的时值与力度
链接:https://arxiv.org/abs/2411.06892
备注:22 pages, 12 figures
摘要:Rosanna shuffle,即Toto乐队1982年热门歌曲《Rosanna》中的鼓点,是流行音乐中辨识度最高的鼓点之一。这段由Jeff Porcaro录制的鼓点是一种半拍shuffle(half-time shuffle),在踩镲和军鼓上带有快速的三连音。在本分析中,我们研究了原始鼓轨的时值与力度,重点关注摇摆因子(swing factor)、微观时值偏差、速度漂移等节奏变化,以及踩镲模式的整体力度。我们的结果表明,《Rosanna》表现出在其流派中出人意料的明显摇摆,并伴有未使用节拍器录制的曲目所典型的显著速度漂移。此外,我们观察到微观时值偏差中存在明确的长程相关性,与先前研究一致。值得注意的是,这首歌的两小节乐句在踩镲击打的时值与力度上有独特的重复模式,增强了歌曲的乐句感。总体而言,Rosanna shuffle拥有丰富的节奏特征,巩固了它在音乐史上的重要地位。
摘要:The Rosanna shuffle, the drum pattern from Toto's 1982 hit "Rosanna", is one of the most recognized drum beats in popular music. Recorded by Jeff Porcaro, this drum beat features a half-time shuffle with rapid triplets on the hi-hat and snare drum. In this analysis, we examine the timing and dynamics of the original drum track, focusing on rhythmic variations such as swing factor, microtiming deviations, tempo drift, and the overall dynamics of the hi-hat pattern. Our findings reveal that "Rosanna" exhibits a surprisingly pronounced swing for its genre, along with notable tempo drift typical of tracks recorded without a metronome. Additionally, we observe clear long-range correlations in the microtiming deviations, consistent with previous studies. Notably, the two-bar phrases of the song feature a distinctive repeating pattern in the timing and dynamics of the hi-hat beats, which enhances the song's phrasing. Overall, the Rosanna shuffle boasts a rich array of rhythmic characteristics that solidify its significant place in music history.
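摇摆因子的一种常见量化方式是每拍内前后两个细分时值之比(三连音shuffle的理论值约为2:1)。下面给出一个基于击打时刻的计算小例子;具体定义与原文分析可能不同,仅作示意。

```python
import numpy as np

def swing_factor(onsets, beats):
    """摇摆因子的一种常见定义:每拍内前半部分时长与后半部分时长之比。
    onsets: 拍内中间击打的时刻;beats: 拍点时刻。仅为示意,非原文的精确定义。"""
    ratios = []
    for k in range(len(beats) - 1):
        t0, t1 = beats[k], beats[k + 1]
        inner = [t for t in onsets if t0 < t < t1]    # 该拍内的中间击打
        if len(inner) == 1:
            ratios.append((inner[0] - t0) / (t1 - inner[0]))  # 三连音 shuffle 理论值约为 2.0
    return float(np.mean(ratios)) if ratios else np.nan

# 例:严格的三连音 shuffle
beats = np.arange(0.0, 4.0, 0.5)
onsets = beats[:-1] + 0.5 * 2 / 3                     # 每拍第 2/3 处有一次击打
print(swing_factor(onsets, beats))                    # ≈ 2.0
```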
标题:Wavehax:集成2D卷积与谐波先验的无混叠神经波形生成,用于可靠的复数谱估计
链接:https://arxiv.org/abs/2411.06807
备注:13 pages, 5 figures, Submitted to IEEE/ACM Trans. ASLP
摘要:神经声码器常常受困于潜在特征空间中的混叠,它由时域非线性运算和重采样层引起。混叠将高频分量折叠到低频范围,使混叠分量与原始频率分量无法区分,并带来两个实际问题。第一,混叠使波形生成过程复杂化,后续层必须处理这些混叠效应,增加了计算复杂度;第二,它限制了外推性能,尤其是在处理高基频时,从而降低了生成语音波形的感知质量。本文证明:1)时域非线性运算不可避免地引入混叠,但为谐波生成提供了强归纳偏置;2)时频域处理可以实现无混叠的波形合成,但缺乏有效生成谐波所需的归纳偏置。基于这一见解,我们提出Wavehax,一种无混叠的神经波形生成器,它集成了2D卷积和谐波先验,用于可靠的复数谱估计。实验结果表明,Wavehax的语音质量与现有高保真神经声码器相当,并在需要高基频外推、混叠效应通常较为严重的场景中表现出出色的鲁棒性。此外,与HiFi-GAN V1相比,Wavehax只需不到5%的乘加运算量和模型参数,同时CPU推理速度提升超过四倍。
摘要:Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects become typically severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.
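下面用numpy/librosa粗略示意"谐波先验"的一种构造方式:由帧级F0生成谐波激励信号,再取其复数STFT作为网络输入的先验。谐波数、幅度归一化与窗口参数均为假设,并非Wavehax的实际配置。

```python
import numpy as np
import librosa

def harmonic_prior(f0, sr=24000, hop=256, n_fft=1024, n_harmonics=8):
    """根据帧级 F0 生成谐波激励信号,并取其复数谱作为先验(概念示意)。"""
    t_frames = np.arange(len(f0)) * hop / sr
    t = np.arange(len(f0) * hop) / sr
    f0_t = np.interp(t, t_frames, f0)                       # 采样点级 F0
    phase = 2 * np.pi * np.cumsum(f0_t) / sr                # 基频相位
    sig = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        sig += np.sin(k * phase) * (f0_t * k < sr / 2)      # 丢弃超过奈奎斯特频率的谐波
    sig /= n_harmonics
    return librosa.stft(sig, n_fft=n_fft, hop_length=hop)   # 复数谱 (freq, frames)

prior = harmonic_prior(np.full(100, 220.0))                 # 100 帧恒定 220 Hz
print(prior.shape)                                          # (513, 101)
```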
标题:神经脉冲响应场的声学体绘制
链接:https://arxiv.org/abs/2411.06307
备注:NeurIPS 2024 Spotlight
摘要:能够准确还原声学现象的逼真音频合成,对于在虚拟与增强现实中创造沉浸式体验至关重要。合成任意位置接收到的声音依赖于对脉冲响应(IR)的估计,IR刻画了声音在到达聆听位置之前沿场景中不同路径的传播方式。本文提出声学体绘制(AVR),一种将体绘制技术用于建模声学脉冲响应的新方法。尽管体绘制已成功用于图像辐射场和神经场景表示的建模,但IR作为时间序列信号带来了独特的挑战。为应对这些挑战,我们引入频域体绘制,并使用球面积分来拟合IR测量值。我们的方法构建了一个天然蕴含波传播原理的脉冲响应场,在为新位姿合成脉冲响应方面达到了最先进的性能。实验表明,AVR大幅超越了当前的领先方法。此外,我们开发了声学仿真平台AcoustiX,它能提供比现有仿真器更准确、更逼真的IR仿真。AVR和AcoustiX的代码可在https://zitonglan.github.io/avr获取。
摘要:Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of impulse response (IR), which characterizes how sound propagates in one scene along different paths before arriving at the listener's position. In this paper, we present Acoustic Volume Rendering (AVR), a novel approach that adapts volume rendering techniques to model acoustic impulse responses. While volume rendering has been successful in modeling radiance fields for images and neural scene representations, IRs present unique challenges as time-series signals. To address these challenges, we introduce frequency-domain volume rendering and use spherical integration to fit the IR measurements. Our method constructs an impulse response field that inherently encodes wave propagation principles and achieves state-of-the-art performance in synthesizing impulse responses for novel poses. Experiments show that AVR surpasses current leading methods by a substantial margin. Additionally, we develop an acoustic simulation platform, AcoustiX, which provides more accurate and realistic IR simulations than existing simulators. Code for AVR and AcoustiX are available at https://zitonglan.github.io/avr.
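下面用numpy给出"频域体绘制"核心思想的一个极简草图:沿一条射线把各采样点的频域发射信号按透射率加权,并乘上传播距离对应的相位因子 e^{-j2πfd/c} 后累加。函数名、采样方式与权重形式均为假设,并非AVR的完整公式。

```python
import numpy as np

def render_ray_freq(sigma, signal_f, dists, freqs, dt, c=343.0):
    """沿一条射线做频域体渲染的示意。
    sigma: (N,) 采样点密度;signal_f: (N, F) 采样点的频域发射信号;
    dists: (N,) 到接收位置的距离;freqs: (F,) 频率轴;dt: 射线上的采样间隔。"""
    alpha = 1.0 - np.exp(-sigma * dt)                               # 每段不透明度
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # 累积透射率
    delay = np.exp(-2j * np.pi * freqs[None, :] * dists[:, None] / c)  # 传播延迟的相位因子
    return np.sum((trans * alpha)[:, None] * signal_f * delay, axis=0)  # (F,) 该射线的频域贡献

N, F = 64, 257
ir_f = render_ray_freq(sigma=np.full(N, 0.5),
                       signal_f=np.random.randn(N, F) + 0j,
                       dists=np.linspace(0.5, 5.0, N),
                       freqs=np.linspace(0.0, 8000.0, F), dt=0.07)
print(ir_f.shape)   # (257,);对多条射线做球面积分后,经逆FFT即可得到时域IR
```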
标题:低频、低位深信号下故障类型与严重程度的智能故障诊断
链接:https://arxiv.org/abs/2411.06299
摘要:本研究聚焦于旋转机械的智能故障诊断(IFD),仅利用单个麦克风和数据驱动的方法,有效诊断42类故障类型与严重程度。研究利用类别不平衡的MaFaulDa数据集中的声音数据,力求在高性能与低资源消耗之间取得平衡。测试阶段涵盖多种配置,包括采样、量化、信号归一化、静音去除、维纳滤波、数据缩放、加窗、数据增强以及使用XGBoost进行分类器调优。通过对时域、频域、梅尔频率和统计特征的分析,我们在8 kHz、8位配置下仅用6棵提升树就取得了99.54%的准确率和99.52%的F-Beta分数。此外,当仅使用MFCC及其一阶、二阶差分时,准确率为97.83%,F-Beta分数为97.67%。最后,通过贪婪包裹式(wrapper)特征选择方法,我们使用50个选定特征(几乎全部为MFCC的一阶和二阶差分)取得了96.82%的准确率和98.86%的F-Beta分数。
摘要:This study focuses on Intelligent Fault Diagnosis (IFD) in rotating machinery utilizing a single microphone and a data-driven methodology, effectively diagnosing 42 classes of fault types and severities. The research leverages sound data from the imbalanced MaFaulDa dataset, aiming to strike a balance between high performance and low resource consumption. The testing phase encompassed a variety of configurations, including sampling, quantization, signal normalization, silence removal, Wiener filtering, data scaling, windowing, augmentation, and classifier tuning using XGBoost. Through the analysis of time, frequency, mel-frequency, and statistical features, we achieved an impressive accuracy of 99.54% and an F-Beta score of 99.52% with just 6 boosting trees at an 8 kHz, 8-bit configuration. Moreover, when utilizing only MFCCs along with their first- and second-order deltas, we recorded an accuracy of 97.83% and an F-Beta score of 97.67%. Lastly, by implementing a greedy wrapper approach, we obtained a remarkable accuracy of 96.82% and an F-Beta score of 98.86% using 50 selected features, nearly all of which were first- and second-order deltas of the MFCCs.
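下面给出摘要中"MFCC及其一阶、二阶差分 + XGBoost"这条流水线的简化草图(librosa + xgboost)。统计量选择、特征维度与超参数均为假设,仅用于说明特征构造方式。

```python
import numpy as np
import librosa
import xgboost as xgb

def mfcc_delta_features(y, sr=8000, n_mfcc=13):
    """提取 MFCC 及其一阶、二阶差分,再对每一维取均值/标准差得到定长向量(假设的统计量)。"""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

feat = mfcc_delta_features(np.random.randn(8000))   # 1 秒示例信号,8 kHz
print(feat.shape)                                    # (78,)

# X: (样本数, 特征数), y: 0..41 的故障类别标签
# clf = xgb.XGBClassifier(n_estimators=6, objective="multi:softprob")
# clf.fit(X_train, y_train); print(clf.score(X_test, y_test))
```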
标题:迈向音频Deepfake识别的跨学科方法
链接:https://arxiv.org/abs/2411.05969
摘要:这篇观点文章呼吁各学科的学者以人工智能方法与语言学相结合的跨学科视角,应对音频深度伪造的检测与辨识挑战。一方面,生成逼真伪造语音的工具层出不穷;另一方面,对深度伪造的检测却相对滞后。尤其阻碍音频深度伪造检测的是,目前的人工智能模型缺乏对语言固有可变性以及人类语音复杂性和独特性的充分理解。我们看到近期跨学科工作的巨大潜力:将语言学知识融入人工智能方法,为"专家在环"提供途径,并超越与专家无关的纯AI方法,实现更稳健、更全面的深度伪造检测。
摘要:This perspective calls for scholars across disciplines to address the challenge of audio deepfake detection and discernment through an interdisciplinary lens across Artificial Intelligence methods and linguistics. With an avalanche of tools for the generation of realistic-sounding fake speech on one side, the detection of deepfakes is lagging on the other. Particularly hindering audio deepfake detection is the fact that current AI models lack a full understanding of the inherent variability of language and the complexities and uniqueness of human speech. We see the promising potential in recent transdisciplinary work that incorporates linguistic knowledge into AI approaches to provide pathways for expert-in-the-loop and to move beyond expert agnostic AI-based methods for more robust and comprehensive deepfake detection.
标题:迈向多模态掌握:4.5B参数的真正多模态小语言模型
链接:https://arxiv.org/abs/2411.05903
摘要:我们提出了一种新的4.5B参数小语言模型,能够处理文本、图像、视频和音频等多种输入与输出模态。尽管规模较小,该模型在多种任务上达到了接近最先进的性能,展示了多模态模型解决复杂现实问题的潜力。我们的方法利用语言建模和多任务学习的最新进展,构建了一个通用且高性能的模型,甚至可以部署用于边缘推理。实验结果表明,该模型在多个基准上表现强劲,为多模态人工智能的进一步发展铺平了道路。
摘要:We present a novel 4.5B parameter small language model that can handle multiple input and output modalities, including text, images, videos, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advancements in language modeling and multi-task learning to create a versatile and high-performing model that can even be deployed for edge inference. Experimental results show the model's strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.
标题:阿拉伯语语音识别中的方言覆盖与泛化
链接:https://arxiv.org/abs/2411.05872
摘要:阿拉伯语以丰富的方言多样性为特点,在语音技术中常被视为低资源语言,为其开发稳健的自动语音识别(ASR)系统需要有效的策略来应对这种复杂性。本研究探讨影响ASR性能的三个关键因素:预训练中方言覆盖率的作用、针对特定方言微调相对于多方言方法的有效性,以及向未见过方言泛化的能力。通过对不同方言组合的大量实验,我们的发现为推进阿拉伯语等多中心语言的ASR系统发展提供了关键见解。
摘要:Developing robust automatic speech recognition (ASR) systems for Arabic, a language characterized by its rich dialectal diversity and often considered a low-resource language in speech technology, demands effective strategies to manage its complexity. This study explores three critical factors influencing ASR performance: the role of dialectal coverage in pre-training, the effectiveness of dialect-specific fine-tuning compared to a multi-dialectal approach, and the ability to generalize to unseen dialects. Through extensive experiments across different dialect combinations, our findings offer key insights towards advancing the development of ASR systems for pluricentric languages like Arabic.
标题:超越相关性:使用约束一致性指数评估多媒体质量模型
链接:https://arxiv.org/abs/2411.05794
摘要:本研究探讨多媒体质量模型的评估,重点关注主观平均意见分(MOS)评分因评分者不一致与偏差等因素而固有的不确定性。Pearson相关系数(PCC)、Spearman等级相关系数(SRCC)和Kendall's Tau(KTAU)等传统统计指标往往无法考虑这些不确定性,导致模型性能评估不准确。我们提出约束一致性指数(CCI),这一新指标通过考虑MOS差异的统计显著性、并排除MOS置信区间重叠的比较,来克服现有指标的局限。通过在语音和图像质量评估等多个领域的全面实验,我们证明CCI能对客观质量模型给出更稳健、更准确的评估,尤其是在样本量小、评分者群体差异大和量程受限的情况下。我们的结果表明,纳入评分者主观性并聚焦于统计显著的样本对,可以显著改进多媒体质量预测模型的评估框架。这项工作不仅揭示了主观评分不确定性中被忽视的方面,也为更可靠、更准确的质量模型评估提供了方法上的进步。
摘要:This study investigates the evaluation of multimedia quality models, focusing on the inherent uncertainties in subjective Mean Opinion Score (MOS) ratings due to factors like rater inconsistency and bias. Traditional statistical measures such as Pearson's Correlation Coefficient (PCC), Spearman's Rank Correlation Coefficient (SRCC), and Kendall's Tau (KTAU) often fail to account for these uncertainties, leading to inaccuracies in model performance assessment. We introduce the Constrained Concordance Index (CCI), a novel metric designed to overcome the limitations of existing metrics by considering the statistical significance of MOS differences and excluding comparisons where MOS confidence intervals overlap. Through comprehensive experiments across various domains including speech and image quality assessment, we demonstrate that CCI provides a more robust and accurate evaluation of instrumental quality models, especially in scenarios of low sample sizes, rater group variability, and restriction of range. Our findings suggest that incorporating rater subjectivity and focusing on statistically significant pairs can significantly enhance the evaluation framework for multimedia quality prediction models. This work not only sheds light on the overlooked aspects of subjective rating uncertainties but also proposes a methodological advancement for more reliable and accurate quality model evaluation.
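按摘要的描述,CCI只在MOS置信区间互不重叠的样本对上统计预测排序与主观排序一致的比例。下面是一个直观的实现草图;变量命名与细节为假设,具体定义请以原文为准。

```python
import numpy as np

def constrained_concordance_index(mos, ci_low, ci_high, pred):
    """CCI 的直观实现示意:只统计 MOS 置信区间互不重叠的样本对中,
    模型预测与主观 MOS 排序一致的比例。"""
    n, concordant, valid = len(mos), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            # 置信区间重叠的对被排除
            if ci_low[i] > ci_high[j] or ci_low[j] > ci_high[i]:
                valid += 1
                if (mos[i] - mos[j]) * (pred[i] - pred[j]) > 0:
                    concordant += 1
    return concordant / valid if valid else np.nan

mos  = np.array([2.0, 3.5, 4.2])
ci_l = np.array([1.8, 3.3, 4.0]); ci_h = np.array([2.2, 3.7, 4.4])
pred = np.array([2.1, 3.0, 4.5])
print(constrained_concordance_index(mos, ci_l, ci_h, pred))  # 1.0:三对均不重叠且排序一致
```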
标题:DCF-DS:真实单通道条件下面向语音识别的说话人日志与分离深度级联融合
链接:https://arxiv.org/abs/2411.06667
摘要:我们提出了一个面向后端语音识别的单通道说话人日志与语音分离深度级联融合(DCF-DS)框架,结合了神经说话人日志(NSD)和语音分离(SS)。首先,我们在联合训练框架内依次集成NSD和SS模块,使分离模块能够有效利用来自日志模块的说话人时间边界。其次,为补充DCF-DS训练,我们引入了窗级解码方案,使DCF-DS框架能够处理稀疏数据收敛不稳定(SDCI)问题。我们还探索在解码阶段使用在真实数据集上训练的NSD系统,以提供更准确的说话人边界。此外,我们在DCF-DS框架中加入了可选的多输入多输出语音增强模块(MIMO-SE),带来进一步的性能提升。最后,我们通过对DCF-DS输出重新聚类来改进日志结果,从而提高ASR准确率。借助DCF-DS方法,我们在CHiME-8 NOTSOFAR-1挑战赛的真实单通道赛道中获得第一名。我们还在公开的LibriCSS数据集上进行了评估,在单通道语音识别上取得了新的最先进性能。
摘要:We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end speech recognition, combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries during decoding. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge. We also perform the evaluation on the open LibriCSS dataset, achieving a new state-of-the-art performance on single-channel speech recognition.
标题:Diff-MSTC:面向Cubase的混音风格迁移原型
链接:https://arxiv.org/abs/2411.06576
备注:Presented at 2024 International Society for Music Information Retrieval
摘要:在我们的演示中,参与者受邀体验Diff-MSTC原型,它将Diff-MST模型集成到Steinberg的数字音频工作站(DAW)Cubase中。Diff-MST是一个用于混音风格迁移的深度学习模型,它根据参考歌曲预测各音轨的调音台参数。系统处理最多20条原始音轨以及一首参考歌曲,预测可用于生成初始混音的调音台参数。用户可以选择进一步手动调整这些参数以获得更精细的控制。与早期仅停留在研究构想的深度学习系统不同,Diff-MSTC是首个集成到DAW中的此类原型。这种集成有助于多轨混音决策,并允许用户通过参考歌曲输入上下文,随后以传统方式微调音频效果。
摘要:In our demo, participants are invited to explore the Diff-MSTC prototype, which integrates the Diff-MST model into Steinberg's digital audio workstation (DAW), Cubase. Diff-MST, a deep learning model for mixing style transfer, forecasts mixing console parameters for tracks using a reference song. The system processes up to 20 raw tracks along with a reference song to predict mixing console parameters that can be used to create an initial mix. Users have the option to manually adjust these parameters further for greater control. In contrast to earlier deep learning systems that are limited to research ideas, Diff-MSTC is a first-of-its-kind prototype integrated into a DAW. This integration facilitates mixing decisions on multitracks and lets users input context through a reference song, followed by fine-tuning of audio effects in a traditional manner.
标题:PSELDNets:在大规模合成数据集上预训练的神经网络,用于声音事件定位和检测
链接:https://arxiv.org/abs/2411.06399
备注:13 pages, 9 figures. The code is available at this https URL
摘要:基于学习的方法使声音事件定位和检测(SELD)取得了长足进展。这些系统通常在特定数据集上从头训练,已展现出相当的泛化能力。近来,在大规模数据集上训练的深度神经网络在声音事件分类(SEC)领域取得了显著成功,由此引出一个开放问题:这些进展能否扩展到构建通用的SELD模型。本文借助预训练SEC模型的能力,提出在大规模合成数据集上预训练的SELD网络(PSELDNets)。这些合成数据集通过将声音事件与模拟的空间房间脉冲响应(SRIR)卷积生成,共包含1167小时的音频片段,涵盖170个声音类别的本体。这些PSELDNets随后被迁移到下游SELD任务。在将PSELDNets适配到特定场景、尤其是低资源数据的情况下,我们引入了一种数据高效的微调方法AdapterBit。PSELDNets在一个合成测试集上进行评估,该测试集使用取自TAU空间房间脉冲响应数据库(TAU-SRIR DB)的SRIR构建,并取得了令人满意的性能。我们还通过实验验证了PSELDNets向三个公开数据集及我们自行采集的音频录音的可迁移性。结果表明,PSELDNets在所有公开数据集上均超过了最先进的系统。由于需要进行波达方向估计,SELD通常依赖足够的多通道音频片段;而借助AdapterBit,PSELDNets仅用极少的多通道甚至单通道音频片段即可更高效地适配各类任务,优于传统微调方法。
摘要:Sound event localization and detection (SELD) has seen substantial advancements through learning-based methods. These systems, typically trained from scratch on specific datasets, have shown considerable generalization capabilities. Recently, deep neural networks trained on large-scale datasets have achieved remarkable success in the sound event classification (SEC) field, prompting an open question of whether these advancements can be extended to develop general-purpose SELD models. In this paper, leveraging the power of pre-trained SEC models, we propose pre-trained SELD networks (PSELDNets) on large-scale synthetic datasets. These synthetic datasets, generated by convolving sound events with simulated spatial room impulse responses (SRIRs), contain 1,167 hours of audio clips with an ontology of 170 sound classes. These PSELDNets are transferred to downstream SELD tasks. When we adapt PSELDNets to specific scenarios, particularly in low-resource data cases, we introduce a data-efficient fine-tuning method, AdapterBit. PSELDNets are evaluated on a synthetic-test-set using collected SRIRs from TAU Spatial Room Impulse Response Database (TAU-SRIR DB) and achieve satisfactory performance. We also conduct our experiments to validate the transferability of PSELDNets to three publicly available datasets and our own collected audio recordings. Results demonstrate that PSELDNets surpass state-of-the-art systems across all publicly available datasets. Given the need for direction-of-arrival estimation, SELD generally relies on sufficient multi-channel audio clips. However, incorporating the AdapterBit, PSELDNets show more efficient adaptability to various tasks using minimal multi-channel or even just monophonic audio clips, outperforming the traditional fine-tuning approaches.
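训练数据合成的基本操作是将声音事件与SRIR卷积得到多通道空间化片段。下面用scipy给出这一步的极简示意;通道数、长度与归一化方式均为假设。

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(event, srir):
    """用空间房间脉冲响应(SRIR)对单声道声音事件做卷积,得到多通道空间化片段
    (忽略电平归一化、多事件混合等细节)。
    event: (T,) 单声道事件;srir: (C, L) 多通道脉冲响应。"""
    return np.stack([fftconvolve(event, srir[c]) for c in range(srir.shape[0])])  # (C, T+L-1)

rng = np.random.default_rng(0)
clip = spatialize(rng.standard_normal(24000), rng.standard_normal((4, 4800)) * 0.01)
print(clip.shape)   # (4, 28799)
```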
标题:Debatts:Zero-Shot辩论式文本到语音合成
链接:https://arxiv.org/abs/2411.06540
摘要:在辩论中,反驳是最关键的环节之一,发言者在此回应对方提出的论点。在这一过程中,发言者需要结合对方给出的语境,组织出具有说服力的表达。本文提出了一种新颖的zero-shot反驳文本到语音合成系统Debatts。Debatts接受两个语音提示:一个来自对方(即对手),一个来自发言者本人;对手的提示用于提供辩论风格的韵律,发言者的提示用于提供音色身份信息。具体而言,我们在真实场景(in-the-wild)数据集上预训练Debatts系统,并加入一个额外的参考编码器来接收辩论提示以提取风格。此外,我们还构建了一个辩论数据集用于开发Debatts。在此设定下,Debatts可以为任意音色生成辩论风格的反驳语音。实验结果证实,与经典的zero-shot TTS系统相比,所提系统是有效的。
摘要:In debating, rebuttal is one of the most critical stages, where a speaker addresses the arguments presented by the opposing side. During this process, the speaker synthesizes their own persuasive articulation given the context from the opposing side. This work proposes a novel zero-shot text-to-speech synthesis system for rebuttal, namely Debatts. Debatts takes two speech prompts, one from the opposing side (i.e. opponent) and one from the speaker. The prompt from the opponent is supposed to provide debating style prosody, and the prompt from the speaker provides identity information. In particular, we pretrain the Debatts system from in-the-wild dataset, and integrate an additional reference encoder to take debating prompt for style. In addition, we also create a debating dataset to develop Debatts. In this setting, Debatts can generate a debating-style speech in rebuttal for any voices. Experimental results confirm the effectiveness of the proposed system in comparison with the classic zero-shot TTS systems.
标题:CTC辅助的基于LLM的上下文ASR
链接:https://arxiv.org/abs/2411.06437
备注:SLT 2024
摘要:上下文ASR或热词定制具有很大的实用价值。尽管当前端到端(E2E)自动语音识别(ASR)系统的性能令人印象深刻,但它们在准确识别罕见词方面往往存在困难。典型的E2E上下文ASR模型通常架构和解码机制复杂,性能有限,且容易受到干扰词的影响。随着基于大语言模型(LLM)的ASR模型成为新的主流,我们提出了一种带有高效过滤算法的CTC辅助、基于LLM的上下文ASR模型。通过使用粗CTC解码结果筛选潜在相关的热词并将其纳入LLM的提示输入,我们的模型在以识别罕见长尾词为目标的Librispeech test-clean和test-other集上分别取得了1.27%/3.67%和2.72%/8.02%的WER/B-WER,相比基线LLM-ASR模型有显著改进,并大幅超越其他相关工作。更值得注意的是,在大语言模型和所提过滤算法的帮助下,即使偏置词多达2000个,我们的上下文ASR模型仍然表现良好。
摘要:Contextual ASR or hotword customization holds substantial practical value. Despite the impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) systems, they often face challenges in accurately recognizing rare words. Typical E2E contextual ASR models commonly feature complex architectures and decoding mechanisms, limited in performance and susceptible to interference from distractor words. With large language model (LLM)-based ASR models emerging as the new mainstream, we propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm. By using coarse CTC decoding results to filter potential relevant hotwords and incorporating them into LLM prompt input, our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the Librispeech test-clean and test-other sets targeting on recognizing rare long-tail words, demonstrating significant improvements compared to the baseline LLM-based ASR model, and substantially surpassing other related work. More remarkably, with the help of the large language model and proposed filtering algorithm, our contextual ASR model still performs well with 2000 biasing words.
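下面示意"用粗CTC解码结果过滤热词表、再拼入LLM提示"的思路:对每个热词在CTC假设文本上做滑窗相似度匹配,超过阈值才保留。匹配方式与阈值均为假设,并非论文的过滤算法。

```python
from difflib import SequenceMatcher

def filter_hotwords(ctc_hyp: str, hotwords: list[str], thresh: float = 0.7) -> list[str]:
    """只保留与 CTC 假设中某个同长片段相似度超过阈值的热词(示意实现)。"""
    hyp = ctc_hyp.lower()
    kept = []
    for w in hotwords:
        wl, n = w.lower(), len(w)
        best = max((SequenceMatcher(None, wl, hyp[i:i + n]).ratio()
                    for i in range(max(1, len(hyp) - n + 1))), default=0.0)
        if best >= thresh:
            kept.append(w)
    return kept

ctc_hyp = "the journy to mordor began at rivendel"          # 假设的粗解码结果
hotwords = ["Mordor", "Rivendell", "Gondor", "Shire"]
selected = filter_hotwords(ctc_hyp, hotwords)
prompt = f"可能出现的热词: {', '.join(selected)}。请转写这段语音。"  # 拼入 LLM 提示
print(selected)                                              # 预计保留 ['Mordor', 'Rivendell']
```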
标题:单耳语音增强的选择性状态空间模型
链接:https://arxiv.org/abs/2411.06217
备注:Submitted to IEEE TCE
摘要:语音用户界面(VUI)通过口头指令实现了人与机器之间的高效交互。由于真实世界的声学场景十分复杂,语音增强对于稳健的VUI起着关键作用。Transformer及其变体(如Conformer)已在语音增强中展示了最先进的结果,但二者的计算复杂度均随序列长度呈二次增长,限制了其处理长序列的能力。最近提出的状态空间模型Mamba能以线性复杂度处理长序列,为解决这一挑战提供了方案。在本文中,我们提出了一种新的卷积-Mamba混合骨干网络MambaDC用于语音增强。MambaDC结合了卷积网络建模局部交互的优势与Mamba建模长程全局依赖的能力。我们在基础框架和最先进(SoTA)的语音增强框架内,针对两种常用训练目标进行了全面实验。结果表明,MambaDC在所有训练目标上都优于Transformer、Conformer和标准Mamba;在当前先进框架之上使用MambaDC骨干,相比现有SoTA系统取得了更优的结果。这为语音增强中高效的长程全局建模奠定了基础。
摘要:Voice user interfaces (VUIs) have facilitated efficient interactions between humans and machines through spoken commands. Since real-world acoustic scenes are complex, speech enhancement plays a critical role for robust VUI. Transformer and its variants, such as Conformer, have demonstrated cutting-edge results in speech enhancement. However, both of them suffer from quadratic computational complexity with respect to the sequence length, which hampers their ability to handle long sequences. Recently a novel State Space Model called Mamba, which shows strong capability to handle long sequences with linear complexity, offers a solution to address this challenge. In this paper, we propose a novel hybrid convolution-Mamba backbone, denoted as MambaDC, for speech enhancement. Our MambaDC marries the benefits of convolutional networks to model the local interactions and Mamba's ability for modeling long-range global dependencies. We conduct comprehensive experiments within both basic and state-of-the-art (SoTA) speech enhancement frameworks, on two commonly used training targets. The results demonstrate that MambaDC outperforms Transformer, Conformer, and the standard Mamba across all training targets. Built upon the current advanced framework, the use of the MambaDC backbone showcases superior results compared to existing SoTA systems. This sets the stage for efficient long-range global modeling in speech enhancement.
标题:基于特征融合的语音精神分裂症严重程度估计
链接:https://arxiv.org/abs/2411.06033
备注:Submitted for SPADE workshop at ICASSP 2025
摘要:近年来,基于语音的精神分裂症谱系评估得到了广泛研究。在本研究中,我们开发了一个深度学习框架,通过特征融合方法从语音中估计精神分裂症严重程度评分,该方法将发音特征与从预训练音频模型中提取的多种自监督语音特征进行融合。我们还提出了一个基于自编码器的自监督表示学习框架,用于从语音中提取紧凑的发音嵌入。与此前结合语音和视频输入的模型相比,我们表现最好的、带多头注意力(MHA)的纯语音融合模型在精神分裂症严重程度估计中将平均绝对误差(MAE)降低了9.18%,均方根误差(RMSE)降低了9.36%。
摘要:Speech-based assessment of the schizophrenia spectrum has been widely researched over in the recent past. In this study, we develop a deep learning framework to estimate schizophrenia severity scores from speech using a feature fusion approach that fuses articulatory features with different self-supervised speech features extracted from pre-trained audio models. We also propose an auto-encoder-based self-supervised representation learning framework to extract compact articulatory embeddings from speech. Our top-performing speech-based fusion model with Multi-Head Attention (MHA) reduces Mean Absolute Error (MAE) by 9.18% and Root Mean Squared Error (RMSE) by 9.36% for schizophrenia severity estimation when compared with the previous models that combined speech and video inputs.
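下面给出"以发音特征为query、对自监督语音特征做多头注意力融合并回归严重程度分数"的极简PyTorch草图。各特征维度、层数与池化方式均为假设,并非原文配置。

```python
import torch
import torch.nn as nn

class FusionMHA(nn.Module):
    """特征融合的一种常见做法示意:发音特征作 query,对自监督语音特征做跨注意力,
    再接回归头输出严重程度分数。"""
    def __init__(self, artic_dim=64, ssl_dim=768, d_model=128, n_heads=4):
        super().__init__()
        self.proj_a = nn.Linear(artic_dim, d_model)
        self.proj_s = nn.Linear(ssl_dim, d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, artic, ssl):                    # (B,Ta,64), (B,Ts,768)
        q, kv = self.proj_a(artic), self.proj_s(ssl)
        fused, _ = self.mha(q, kv, kv)                # 跨模态注意力融合
        return self.head(fused.mean(dim=1))           # (B,1) 严重程度分数

score = FusionMHA()(torch.randn(2, 50, 64), torch.randn(2, 100, 768))
print(score.shape)   # torch.Size([2, 1])
```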
标题:音乐合奏同步的卡尔曼滤波模型
链接:https://arxiv.org/abs/2411.05971
备注:7 pages, 1 figure. Accepted for publication on the 25th International Society for Music Information Retrieval (ISMIR 2024)
摘要:对节奏性听觉线索作出同步的运动反应,是在多种物种中都能观察到的基本生物学现象。虽然时间对齐的重要性因情境而异,但实现精确的时间同步是音乐演奏中的一个突出目标。音乐家常常会融入富有表现力的时值变化,这需要对时值和同步进行精确控制,在合奏演出中尤其如此。这一点至关重要,因为无论是刻意的表现性处理还是偶然的时值偏差,都会影响演出的整体时间进程。由此引出一个问题:音乐家如何调整自身的时间动态以实现合奏中的同步。本文提出了一种基于卡尔曼滤波器的新型反馈校正模型,旨在加深对合奏音乐演出中人际时间协调的理解。该模型的表现与文献中其他线性校正模型相当,其优点是计算成本低,并且即使在底层速度变化的情形下也有良好表现。
摘要:The synchronization of motor responses to rhythmic auditory cues is a fundamental biological phenomenon observed across various species. While the importance of temporal alignment varies across different contexts, achieving precise temporal synchronization is a prominent goal in musical performances. Musicians often incorporate expressive timing variations, which require precise control over timing and synchronization, particularly in ensemble performance. This is crucial because both deliberate expressive nuances and accidental timing deviations can affect the overall timing of a performance. This discussion prompts the question of how musicians adjust their temporal dynamics to achieve synchronization within an ensemble. This paper introduces a novel feedback correction model based on the Kalman Filter, aimed at improving the understanding of interpersonal timing in ensemble music performances. The proposed model performs similarly to other linear correction models in the literature, with the advantage of low computational cost and good performance even in scenarios where the underlying tempo varies.
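下面给出一个用一维卡尔曼滤波跟踪合奏异步量并产生时值修正量的玩具示例,用于说明"反馈校正"的基本机制;状态定义、噪声参数与修正系数均为假设,并非论文模型本身。

```python
import numpy as np

def kalman_sync(asynchronies, q=1e-4, r=1e-3, alpha=0.5):
    """一维卡尔曼滤波跟踪演奏者与合奏参考之间的异步量,
    并按估计值的一定比例给出对下一个音符间隔(IOI)的修正量。"""
    x, p = 0.0, 1.0                     # 状态(估计异步量)及其方差
    corrections = []
    for z in asynchronies:              # z: 本次实测异步量(秒)
        p += q                          # 预测步:状态不变,方差增大
        k = p / (p + r)                 # 卡尔曼增益
        x += k * (z - x)                # 更新步
        p *= (1 - k)
        corrections.append(-alpha * x)  # 对下一个 IOI 的反馈修正
    return np.array(corrections)

print(kalman_sync([0.02, 0.015, 0.018, 0.01]))  # 逐步收敛的负向修正量
```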
标题:结合耳蜗图和Vision Transformer的异常呼吸音分类
链接:https://arxiv.org/abs/2411.05955
摘要:尽早识别呼吸异常对于改善肺部健康、降低全球死亡率至关重要。呼吸音分析在刻画呼吸系统状态和识别异常方面发挥着重要作用。本研究的主要贡献在于考察以耳蜗图(cochleogram)表示作为输入馈入Vision Transformer(ViT)架构时的性能;据我们所知,这一输入与分类器的组合是首次被用于异常呼吸音(附加音)分类。虽然ViT通过对声谱图图块施加自注意力已在音频分类任务中展现出可观的结果,我们进一步采用耳蜗图作为输入,以捕捉附加音特有的时频特征。所提方法在ICBHI数据集上进行评估。我们比较了以声谱图、梅尔频率倒谱系数、常数Q变换和耳蜗图作为输入时,ViT与其他最先进CNN方法的分类性能。结果证实了耳蜗图与ViT相结合的分类性能更优,凸显了ViT在可靠呼吸音分类中的潜力。本研究有助于推进自动智能技术的发展,以显著提升呼吸系统疾病检测的速度和效果,满足医疗领域的关键需求。
摘要:Early identification of respiratory irregularities is critical for improving lung health and reducing global mortality rates. The analysis of respiratory sounds plays a significant role in characterizing the respiratory system's condition and identifying abnormalities. The main contribution of this study is to investigate the performance when the input data, represented by cochleogram, is used to feed the Vision Transformer architecture, since this input classifier combination is the first time it has been applied to adventitious sound classification to our knowledge. Although ViT has shown promising results in audio classification tasks by applying self attention to spectrogram patches, we extend this approach by applying the cochleogram, which captures specific spectro-temporal features of adventitious sounds. The proposed methodology is evaluated on the ICBHI dataset. We compare the classification performance of ViT with other state of the art CNN approaches using spectrogram, Mel frequency cepstral coefficients, constant Q transform, and cochleogram as input data. Our results confirm the superior classification performance combining cochleogram and ViT, highlighting the potential of ViT for reliable respiratory sound classification. This study contributes to the ongoing efforts in developing automatic intelligent techniques with the aim to significantly augment the speed and effectiveness of respiratory disease detection, thereby addressing a critical need in the medical field.
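下面用scipy的gammatone滤波器粗略示意耳蜗图的一种计算方式:对每个频带滤波后取短时对数能量,得到可切块送入ViT的二维表示。频带数、中心频率分布与帧参数均为假设。

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def cochleogram(y, sr, n_bands=64, fmin=50.0, frame=0.025, hop=0.010):
    """简化的耳蜗图:一组 gammatone 滤波器输出的短时对数能量,形状为 (频带, 帧)。"""
    freqs = np.geomspace(fmin, sr / 2 * 0.9, n_bands)       # 近似对数分布的中心频率
    flen, hlen = int(frame * sr), int(hop * sr)
    n_frames = 1 + (len(y) - flen) // hlen
    out = np.zeros((n_bands, n_frames))
    for i, fc in enumerate(freqs):
        b, a = gammatone(fc, "iir", fs=sr)                   # 4 阶 gammatone IIR 滤波器
        band = lfilter(b, a, y)
        for t in range(n_frames):
            seg = band[t * hlen: t * hlen + flen]
            out[i, t] = np.log(np.mean(seg ** 2) + 1e-10)
    return out

y = np.random.randn(16000)                                   # 1 秒示例信号,16 kHz
print(cochleogram(y, 16000).shape)                           # (64, 98),可再切块送入 ViT
```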
标题:NatureLM-audio:面向生物声学的音频-语言基础模型
链接:https://arxiv.org/abs/2411.07186
备注:Demo page: this https URL The code will be open-sourced and available shortly
摘要:以文本和音频为提示的大型语言模型(LLM)代表了语音、音乐和通用音频等多种听觉任务的最新技术水平,并在未见过的任务上展现出涌现能力。然而,这些能力尚未在生物声学任务中得到充分验证,例如在大规模录音中检测动物发声、对稀有和濒危物种进行分类,以及标注情境与行为,这些任务对于保护、生物多样性监测和动物行为研究至关重要。在这项工作中,我们提出了NatureLM-audio,这是第一个专门为生物声学设计的音频-语言基础模型。我们精心构建的训练数据集由涵盖多种生物声学、语音和音乐数据的文本-音频对组成,旨在应对该领域标注数据有限带来的挑战。我们展示了将从音乐和语音中学到的表示成功迁移到生物声学,模型在未见过的类群和任务上表现出良好的泛化能力。重要的是,我们在一个新基准(BEANS-Zero)上测试了NatureLM-audio,它在多个生物声学任务上创造了新的最先进水平(SotA),包括对未见过物种的zero-shot分类。为推进生物声学研究,我们还开源了用于生成训练与基准数据以及训练模型的代码。
摘要:Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior - tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.
标题:建立台湾普通话口语模型:首次尝试
链接:https://arxiv.org/abs/2411.07111
备注:Work in progress
摘要:本技术报告介绍了我们为台湾普通话构建口语大语言模型(LLM)的初步尝试,该模型专门面向多轮对话中的实时语音到语音交互。我们的端到端模型采用仅解码器的Transformer架构,旨在实现无缝交互并保持对话流,包括允许同时说和听的全双工能力。报告还详细介绍了训练过程,包括利用合成对话进行数据准备以及针对实时交互所做的调整。我们还开发了一个平台,用于评估多轮对话中的会话流畅性和响应一致性。我们希望这份报告的发布能为台湾普通话口语LLM的未来发展做出贡献。
摘要:This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.
标题:基于Mamba的纯解码器、双向语音建模的语音识别方法
链接:https://arxiv.org/abs/2411.06968
备注:Accepted to SLT 2024
摘要:以Mamba为代表的选择性状态空间模型(SSM)已在包括自动语音识别(ASR)在内的多种任务中展现出计算效率和良好效果。此前Mamba被应用于基于注意力的编码器-解码器ASR框架,其中编码器与解码器之间仍保留交叉注意力机制。本文探讨Mamba作为ASR任务中仅解码器架构的能力。我们提出的基于Mamba的仅解码器方法(MADEON)由单个解码器构成,以语音token为条件、以自回归方式预测文本token。为了增强MADEON,我们进一步提出语音前缀(speech prefixing),对语音token进行双向处理,从而丰富隐藏状态中的上下文信息。实验表明,MADEON显著优于非选择性SSM;语音前缀与最近提出的Mamba-2相结合,在大规模数据集上取得了与基于Transformer的模型相当的性能。
摘要:Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). Mamba has been applied to ASR task with the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remains. This paper explores the capability of Mamba as the decoder-only architecture in ASR task. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing that performs bidirectional processing on speech tokens, which enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech prefixing and the recently proposed Mamba-2 yields comparable performance to Transformer-based models on large datasets.
标题:借助声音空间谱从脑电信号中多类解码受关注说话人方向
链接:https://arxiv.org/abs/2411.06928
摘要:从听众的脑电图(EEG)信号中解码受关注说话人的方向性焦点,对于开发脑机接口、改善听障人士的生活质量至关重要。先前的工作集中于二分类的方向性焦点解码,即判断受关注的说话人位于听众的左侧还是右侧。然而,有效的语音处理需要对受关注说话人的确切方向进行更精细的解码。此外,音频空间信息尚未被有效利用,导致解码结果欠佳。在本文中,我们在最近提出的15类方向性焦点数据集上观察到,仅依赖EEG输入的模型在leave-one-subject-out和leave-one-trial-out两种场景下解码方向性焦点的准确率都显著偏低;而将声音空间谱与EEG特征相结合可以有效提高解码精度。我们采用CNN、LSM-CNN和EEG-Deformer模型,结合辅助的声音空间谱从听众的EEG信号中解码方向性焦点。所提出的Sp-Aux-Deformer模型在leave-one-subject-out和leave-one-trial-out场景中分别取得了57.48%和61.83%的15类解码准确率。
摘要:Decoding the directional focus of an attended speaker from listeners' electroencephalogram (EEG) signals is essential for developing brain-computer interfaces to improve the quality of life for individuals with hearing impairment. Previous works have concentrated on binary directional focus decoding, i.e., determining whether the attended speaker is on the left or right side of the listener. However, a more precise decoding of the exact direction of the attended speaker is necessary for effective speech processing. Additionally, audio spatial information has not been effectively leveraged, resulting in suboptimal decoding results. In this paper, we observe that, on our recently presented dataset with 15-class directional focus, models relying exclusively on EEG inputs exhibits significantly lower accuracy when decoding the directional focus in both leave-one-subject-out and leave-one-trial-out scenarios. By integrating audio spatial spectra with EEG features, the decoding accuracy can be effectively improved. We employ the CNN, LSM-CNN, and EEG-Deformer models to decode the directional focus from listeners' EEG signals with the auxiliary audio spatial spectra. The proposed Sp-Aux-Deformer model achieves notable 15-class decoding accuracies of 57.48% and 61.83% in leave-one-subject-out and leave-one-trial-out scenarios, respectively.
标题:Wavehax:集成2D卷积与谐波先验的无混叠神经波形生成,用于可靠的复数谱估计
链接:https://arxiv.org/abs/2411.06807
备注:13 pages, 5 figures, Submitted to IEEE/ACM Trans. ASLP
摘要:神经声码器常常受困于潜在特征空间中的混叠,它由时域非线性运算和重采样层引起。混叠将高频分量折叠到低频范围,使混叠分量与原始频率分量无法区分,并带来两个实际问题。第一,混叠使波形生成过程复杂化,后续层必须处理这些混叠效应,增加了计算复杂度;第二,它限制了外推性能,尤其是在处理高基频时,从而降低了生成语音波形的感知质量。本文证明:1)时域非线性运算不可避免地引入混叠,但为谐波生成提供了强归纳偏置;2)时频域处理可以实现无混叠的波形合成,但缺乏有效生成谐波所需的归纳偏置。基于这一见解,我们提出Wavehax,一种无混叠的神经波形生成器,它集成了2D卷积和谐波先验,用于可靠的复数谱估计。实验结果表明,Wavehax的语音质量与现有高保真神经声码器相当,并在需要高基频外推、混叠效应通常较为严重的场景中表现出出色的鲁棒性。此外,与HiFi-GAN V1相比,Wavehax只需不到5%的乘加运算量和模型参数,同时CPU推理速度提升超过四倍。
摘要:Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects become typically severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.
标题:神经脉冲响应场的声学体绘制
链接:https://arxiv.org/abs/2411.06307
备注:NeurIPS 2024 Spotlight
摘要:能够准确还原声学现象的逼真音频合成,对于在虚拟与增强现实中创造沉浸式体验至关重要。合成任意位置接收到的声音依赖于对脉冲响应(IR)的估计,IR刻画了声音在到达聆听位置之前沿场景中不同路径的传播方式。本文提出声学体绘制(AVR),一种将体绘制技术用于建模声学脉冲响应的新方法。尽管体绘制已成功用于图像辐射场和神经场景表示的建模,但IR作为时间序列信号带来了独特的挑战。为应对这些挑战,我们引入频域体绘制,并使用球面积分来拟合IR测量值。我们的方法构建了一个天然蕴含波传播原理的脉冲响应场,在为新位姿合成脉冲响应方面达到了最先进的性能。实验表明,AVR大幅超越了当前的领先方法。此外,我们开发了声学仿真平台AcoustiX,它能提供比现有仿真器更准确、更逼真的IR仿真。AVR和AcoustiX的代码可在https://zitonglan.github.io/avr获取。
摘要:Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of impulse response (IR), which characterizes how sound propagates in one scene along different paths before arriving at the listener's position. In this paper, we present Acoustic Volume Rendering (AVR), a novel approach that adapts volume rendering techniques to model acoustic impulse responses. While volume rendering has been successful in modeling radiance fields for images and neural scene representations, IRs present unique challenges as time-series signals. To address these challenges, we introduce frequency-domain volume rendering and use spherical integration to fit the IR measurements. Our method constructs an impulse response field that inherently encodes wave propagation principles and achieves state-of-the-art performance in synthesizing impulse responses for novel poses. Experiments show that AVR surpasses current leading methods by a substantial margin. Additionally, we develop an acoustic simulation platform, AcoustiX, which provides more accurate and realistic IR simulations than existing simulators. Code for AVR and AcoustiX are available at https://zitonglan.github.io/avr.
标题:低频、低位深信号下故障类型与严重程度的智能故障诊断
链接:https://arxiv.org/abs/2411.06299
摘要:本研究聚焦于旋转机械的智能故障诊断(IFD),仅利用单个麦克风和数据驱动的方法,有效诊断42类故障类型与严重程度。研究利用类别不平衡的MaFaulDa数据集中的声音数据,力求在高性能与低资源消耗之间取得平衡。测试阶段涵盖多种配置,包括采样、量化、信号归一化、静音去除、维纳滤波、数据缩放、加窗、数据增强以及使用XGBoost进行分类器调优。通过对时域、频域、梅尔频率和统计特征的分析,我们在8 kHz、8位配置下仅用6棵提升树就取得了99.54%的准确率和99.52%的F-Beta分数。此外,当仅使用MFCC及其一阶、二阶差分时,准确率为97.83%,F-Beta分数为97.67%。最后,通过贪婪包裹式(wrapper)特征选择方法,我们使用50个选定特征(几乎全部为MFCC的一阶和二阶差分)取得了96.82%的准确率和98.86%的F-Beta分数。
摘要:This study focuses on Intelligent Fault Diagnosis (IFD) in rotating machinery utilizing a single microphone and a data-driven methodology, effectively diagnosing 42 classes of fault types and severities. The research leverages sound data from the imbalanced MaFaulDa dataset, aiming to strike a balance between high performance and low resource consumption. The testing phase encompassed a variety of configurations, including sampling, quantization, signal normalization, silence removal, Wiener filtering, data scaling, windowing, augmentation, and classifier tuning using XGBoost. Through the analysis of time, frequency, mel-frequency, and statistical features, we achieved an impressive accuracy of 99.54% and an F-Beta score of 99.52% with just 6 boosting trees at an 8 kHz, 8-bit configuration. Moreover, when utilizing only MFCCs along with their first- and second-order deltas, we recorded an accuracy of 97.83% and an F-Beta score of 97.67%. Lastly, by implementing a greedy wrapper approach, we obtained a remarkable accuracy of 96.82% and an F-Beta score of 98.86% using 50 selected features, nearly all of which were first- and second-order deltas of the MFCCs.
标题:迈向音频Deepfake识别的跨学科方法
链接:https://arxiv.org/abs/2411.05969
摘要:这篇观点文章呼吁各学科的学者以人工智能方法与语言学相结合的跨学科视角,应对音频深度伪造的检测与辨识挑战。一方面,生成逼真伪造语音的工具层出不穷;另一方面,对深度伪造的检测却相对滞后。尤其阻碍音频深度伪造检测的是,目前的人工智能模型缺乏对语言固有可变性以及人类语音复杂性和独特性的充分理解。我们看到近期跨学科工作的巨大潜力:将语言学知识融入人工智能方法,为"专家在环"提供途径,并超越与专家无关的纯AI方法,实现更稳健、更全面的深度伪造检测。
摘要:This perspective calls for scholars across disciplines to address the challenge of audio deepfake detection and discernment through an interdisciplinary lens across Artificial Intelligence methods and linguistics. With an avalanche of tools for the generation of realistic-sounding fake speech on one side, the detection of deepfakes is lagging on the other. Particularly hindering audio deepfake detection is the fact that current AI models lack a full understanding of the inherent variability of language and the complexities and uniqueness of human speech. We see the promising potential in recent transdisciplinary work that incorporates linguistic knowledge into AI approaches to provide pathways for expert-in-the-loop and to move beyond expert agnostic AI-based methods for more robust and comprehensive deepfake detection.
标题:NeKo:基于面向任务专家的识别后生成式纠错大语言模型
链接:https://arxiv.org/abs/2411.05945
备注:NeKo work has been done in June 2024. NeKo LMs will be open source on this https URL under the MIT license
摘要:构建通用的识别后纠错器提出了一个关键问题:我们如何才能最有效地在大量混合领域数据集上训练一个模型?答案在于学习各数据集特有的特征,并把这些知识消化到单个模型中。此前的方法通过为每个领域维护单独的纠错语言模型来实现这一点,导致参数量显著增加。在这项工作中,我们提出以专家混合(MoE)作为解决方案,并强调MoE远不止是一个扩展性工具。我们提出多任务纠错MoE:通过学习把每个数据集的token路由到其对应的专家,让各个专家分别成为语音转文本、语言转文本和视觉转文本数据集的"专家"。在Open ASR排行榜上的实验表明,我们取得了新的最先进性能,实现了平均5.0%的相对WER降低,并在语音和翻译任务的BLEU分数上取得大幅提升。在zero-shot评估中,NeKo在Hyporadise基准上以15.5%至27.6%的相对WER降低优于GPT-3.5和Claude-Opus。作为多任务模型,NeKo在语法纠错和OCR后纠错上也具有竞争力。
摘要:Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we explore a new state-of-the-art performance by achieving an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
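下面给出"按数据集/任务标签把token硬路由到对应专家"的极简PyTorch草图,以说明多任务纠错MoE的路由思想;专家结构、任务编号以及省略的负载均衡等细节均为假设。

```python
import torch
import torch.nn as nn

class TaskRoutedMoE(nn.Module):
    """按"数据集→专家"映射做硬路由的示意:同一 batch 中每个样本的 token
    按其任务标签送入对应的专家 FFN。"""
    def __init__(self, d_model=512, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x, task_id):              # x: (B,T,D); task_id: (B,) 0=ASR,1=翻译,2=OCR(假设)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = task_id == e
            if mask.any():
                out[mask] = expert(x[mask])     # 只处理路由到该专家的样本
        return out

moe = TaskRoutedMoE()
y = moe(torch.randn(4, 10, 512), torch.tensor([0, 1, 0, 2]))
print(y.shape)   # torch.Size([4, 10, 512])
```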
标题:迈向多模态掌握:4.5B参数的真正多模态小语言模型
链接:https://arxiv.org/abs/2411.05903
摘要:我们提出了一种新的4.5B参数小语言模型,能够处理文本、图像、视频和音频等多种输入与输出模态。尽管规模较小,该模型在多种任务上达到了接近最先进的性能,展示了多模态模型解决复杂现实问题的潜力。我们的方法利用语言建模和多任务学习的最新进展,构建了一个通用且高性能的模型,甚至可以部署用于边缘推理。实验结果表明,该模型在多个基准上表现强劲,为多模态人工智能的进一步发展铺平了道路。
摘要:We present a novel 4.5B parameter small language model that can handle multiple input and output modalities, including text, images, videos, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advancements in language modeling and multi-task learning to create a versatile and high-performing model that can even be deployed for edge inference. Experimental results show the model's strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.
标题:阿拉伯语语音识别中的方言覆盖与泛化
链接:https://arxiv.org/abs/2411.05872
摘要:阿拉伯语以丰富的方言多样性为特点,在语音技术中常被视为低资源语言,为其开发稳健的自动语音识别(ASR)系统需要有效的策略来应对这种复杂性。本研究探讨影响ASR性能的三个关键因素:预训练中方言覆盖率的作用、针对特定方言微调相对于多方言方法的有效性,以及向未见过方言泛化的能力。通过对不同方言组合的大量实验,我们的发现为推进阿拉伯语等多中心语言的ASR系统发展提供了关键见解。
摘要:Developing robust automatic speech recognition (ASR) systems for Arabic, a language characterized by its rich dialectal diversity and often considered a low-resource language in speech technology, demands effective strategies to manage its complexity. This study explores three critical factors influencing ASR performance: the role of dialectal coverage in pre-training, the effectiveness of dialect-specific fine-tuning compared to a multi-dialectal approach, and the ability to generalize to unseen dialects. Through extensive experiments across different dialect combinations, our findings offer key insights towards advancing the development of ASR systems for pluricentric languages like Arabic.
标题:超越相关性:使用约束一致性指数评估多媒体质量模型
链接:https://arxiv.org/abs/2411.05794
摘要:本研究探讨多媒体质量模型的评估,重点关注主观平均意见分(MOS)评分因评分者不一致与偏差等因素而固有的不确定性。Pearson相关系数(PCC)、Spearman等级相关系数(SRCC)和Kendall's Tau(KTAU)等传统统计指标往往无法考虑这些不确定性,导致模型性能评估不准确。我们提出约束一致性指数(CCI),这一新指标通过考虑MOS差异的统计显著性、并排除MOS置信区间重叠的比较,来克服现有指标的局限。通过在语音和图像质量评估等多个领域的全面实验,我们证明CCI能对客观质量模型给出更稳健、更准确的评估,尤其是在样本量小、评分者群体差异大和量程受限的情况下。我们的结果表明,纳入评分者主观性并聚焦于统计显著的样本对,可以显著改进多媒体质量预测模型的评估框架。这项工作不仅揭示了主观评分不确定性中被忽视的方面,也为更可靠、更准确的质量模型评估提供了方法上的进步。
摘要:This study investigates the evaluation of multimedia quality models, focusing on the inherent uncertainties in subjective Mean Opinion Score (MOS) ratings due to factors like rater inconsistency and bias. Traditional statistical measures such as Pearson's Correlation Coefficient (PCC), Spearman's Rank Correlation Coefficient (SRCC), and Kendall's Tau (KTAU) often fail to account for these uncertainties, leading to inaccuracies in model performance assessment. We introduce the Constrained Concordance Index (CCI), a novel metric designed to overcome the limitations of existing metrics by considering the statistical significance of MOS differences and excluding comparisons where MOS confidence intervals overlap. Through comprehensive experiments across various domains including speech and image quality assessment, we demonstrate that CCI provides a more robust and accurate evaluation of instrumental quality models, especially in scenarios of low sample sizes, rater group variability, and restriction of range. Our findings suggest that incorporating rater subjectivity and focusing on statistically significant pairs can significantly enhance the evaluation framework for multimedia quality prediction models. This work not only sheds light on the overlooked aspects of subjective rating uncertainties but also proposes a methodological advancement for more reliable and accurate quality model evaluation.