Reprinted with permission from arXiv每日学术速递 (arXiv Daily Academic Digest).
Title: A Neural Transformer Framework for Simultaneous Tasks of Segmentation, Classification, and Caller Identification of Marmoset Vocalization
Link: https://arxiv.org/abs/2410.23279
Abstract: Marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanism. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work using a CNN achieved a joint model for call segmentation, classification, and caller identification of marmoset vocalizations. However, CNNs have limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, uses a self-attention mechanism that efficiently segregates information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify marmoset calls and identify the caller of each vocalization.
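The joint setup described above — one encoder feeding separate outputs for segmentation, call-type classification, and caller identification — can be pictured with the PyTorch sketch below. It is purely illustrative: the frame-level formulation, layer sizes, and class counts are assumptions, not the authors' architecture.

```python
import torch.nn as nn

class JointMarmosetTransformer(nn.Module):
    """Hypothetical sketch: one Transformer encoder over frame-level acoustic
    features, with three per-frame output heads sharing the representation."""

    def __init__(self, feat_dim=80, d_model=256, n_call_types=11, n_callers=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.segment_head = nn.Linear(d_model, 2)               # call vs. non-call
        self.calltype_head = nn.Linear(d_model, n_call_types)   # call category
        self.caller_head = nn.Linear(d_model, n_callers)        # caller identity

    def forward(self, feats):                # feats: (batch, frames, feat_dim)
        h = self.encoder(self.proj(feats))   # self-attention over all frames
        return (self.segment_head(h),
                self.calltype_head(h),
                self.caller_head(h))

# A joint objective would simply sum the three per-frame cross-entropy losses.
```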
Title: Aligning Audio-Visual Joint Representations with an Agentic Workflow
Link: https://arxiv.org/abs/2410.23230
Abstract: Visual content and accompanying audio signals naturally form a joint representation that improves audio-visual (AV) related applications. While studies have developed various AV representation learning frameworks, the importance of AV data alignment is usually undermined in the pursuit of high-quality representations. We observe that an audio signal may contain background noise interference, and that non-synchronization may appear between audio and video streams. Such non-strict data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. The alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, AVAgent uses a multimodal LLM to convert the audio and visual data into language descriptions separately (i.e., tool use). AVAgent then reasons about whether the paired data is well aligned and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment the data. Moreover, a VLM evaluates how well the modified audio signal matches the visual content and provides feedback to AVAgent (i.e., reflection). The tool-use, planning, and reflection steps operate cyclically, forming an agentic workflow in which the audio signal is gradually aligned to the visual content. Existing methods can then directly leverage the AV data aligned by this workflow to improve AV joint representations. Experimental results comprehensively demonstrate the state-of-the-art performance of the proposed approach against previous baselines on diverse downstream tasks.
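The tool-use / planning / reflection cycle described in the abstract amounts to a simple control loop. The sketch below is schematic: describe, plan_edit, apply_action, and reflect are hypothetical placeholders for the multimodal LLM, the planner, the predefined audio edits, and the VLM judge, not AVAgent's actual interface.

```python
def align_audio_to_video(audio, video, describe, plan_edit, apply_action,
                         reflect, max_rounds=3):
    """Schematic tool-use / planning / reflection loop; all callables are
    hypothetical placeholders for the components named in the abstract."""
    feedback = None
    for _ in range(max_rounds):
        # Tool use: turn both modalities into language descriptions.
        audio_desc, video_desc = describe(audio), describe(video)
        # Planning: decide whether a predefined edit (noise filtering or
        # augmentation) is needed, conditioned on the last reflection.
        action = plan_edit(audio_desc, video_desc, feedback)
        if action is None:          # judged well aligned; stop early
            return audio
        audio = apply_action(audio, action)
        # Reflection: a VLM scores how well the edited audio now matches
        # the visual content; the score feeds the next planning round.
        feedback = reflect(audio, video)
    return audio
```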
Title: SoundCollage: Automated Discovery of New Classes in Audio Datasets
Link: https://arxiv.org/abs/2410.23008
Notes: 5 pages, 2 figures
Abstract: Developing new machine learning applications often requires the collection of new datasets. However, existing datasets may already contain relevant information for training models for new purposes. We propose SoundCollage, a framework to discover new classes within audio datasets by incorporating (1) an audio pre-processing pipeline to decompose different sounds in audio samples and (2) an automated model-based annotation mechanism to identify the discovered classes. Furthermore, we introduce a clarity measure to assess the coherence of the discovered classes for better training of new downstream applications. Our evaluations show that the accuracy of downstream audio classifiers on discovered-class samples and held-out datasets improves over the baseline by up to 34.7% and 4.5%, respectively, highlighting the potential of SoundCollage for making datasets reusable by labeling them with newly discovered classes. To encourage further research in this area, we open-source our code at https://github.com/nokia-bell-labs/audio-class-discovery.
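One way to picture the discovery-and-annotation idea is to embed short audio segments, cluster the embeddings into candidate classes, and score each cluster's coherence. The snippet below is an illustrative stand-in under those assumptions; in particular, the mean-cosine coherence score is only a proxy and not the paper's clarity measure.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_classes(segment_embeddings, n_candidate_classes=8, seed=0):
    """Cluster per-segment audio embeddings into candidate new classes and
    score each cluster with a simple coherence proxy (an assumption, not the
    paper's clarity measure)."""
    X = np.asarray(segment_embeddings, dtype=np.float64)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)  # unit-normalize
    labels = KMeans(n_clusters=n_candidate_classes, n_init=10,
                    random_state=seed).fit_predict(X)
    coherence = {}
    for c in range(n_candidate_classes):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        centroid = centroid / (np.linalg.norm(centroid) + 1e-9)
        # Mean cosine similarity to the centroid: higher = more coherent class.
        coherence[c] = float((members @ centroid).mean())
    return labels, coherence
```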
Title: Improving Musical Accompaniment Co-Creation via Diffusion Transformers
Link: https://arxiv.org/abs/2410.23005
Notes: 5 pages; 1 table
Abstract: Building upon Diff-A-Riff, a latent diffusion model for musical instrument accompaniment generation, we present a series of improvements targeting quality, diversity, inference speed, and text-driven control. First, we upgrade the underlying autoencoder to a stereo-capable model with superior fidelity and replace the latent U-Net with a Diffusion Transformer. Additionally, we refine text prompting by training a cross-modality predictive network to translate text-derived CLAP embeddings into audio-derived CLAP embeddings. Finally, we improve inference speed by training the latent model with a consistency framework, achieving competitive quality with fewer denoising steps. Our model is evaluated against the original Diff-A-Riff variant using objective metrics in ablation experiments, demonstrating promising advancements in all targeted areas. Sound examples are available at: https://sonycslparis.github.io/improved_dar/.
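The cross-modality predictive network mentioned above maps text-derived CLAP embeddings into the audio-derived CLAP space. A minimal sketch, assuming 512-dimensional embeddings, a plain MLP, and a cosine-style regression loss (none of which are specified in the abstract):

```python
import torch.nn as nn
import torch.nn.functional as F

class TextToAudioCLAP(nn.Module):
    """Minimal sketch: regress audio-CLAP embeddings from text-CLAP ones."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

def training_step(model, text_emb, audio_emb):
    # Cosine regression: push the prediction toward the paired audio embedding.
    pred = model(text_emb)
    return 1.0 - F.cosine_similarity(pred, audio_emb, dim=-1).mean()
```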
Title: Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising
Link: https://arxiv.org/abs/2410.22805
Notes: Accepted to APSIPA 2024
Abstract: This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is neural beamforming, which can work efficiently in an online manner. It estimates masks of clean dry speech from a noisy, echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, degrades drastically under mismatched conditions, which calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used to generate pseudo ground-truth data from a mixture. Based on this idea, a prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming that is asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR), whose joint dereverberation and denoising filter is estimated with a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
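For readers unfamiliar with WPD: the convolutional beamformer minimizes the power-weighted energy of stacked current-and-delayed observations subject to a distortionless constraint on the target steering vector. The single-frequency-bin numpy sketch below illustrates that textbook formulation; the tap and delay settings, and how the speech power weights are obtained (in the paper's setting they would come from the DNN), are assumptions rather than the authors' implementation.

```python
import numpy as np

def wpd_filter(X, steering, power, delay=3, taps=5, eps=1e-6):
    """Illustrative WPD (weighted power minimization distortionless response)
    convolutional beamformer for one frequency bin.

    X        : (channels, frames) complex STFT of the mixture
    steering : (channels,) complex steering vector of the target
    power    : (frames,) estimated time-varying power of the target speech
    """
    C, T = X.shape
    # Stack the current frame with `taps` delayed frames beyond `delay`.
    frames = [X]
    for k in range(taps):
        d = delay + k
        Xd = np.zeros_like(X)
        Xd[:, d:] = X[:, :T - d]
        frames.append(Xd)
    Xbar = np.concatenate(frames, axis=0)          # ((taps+1)*C, T)

    # Power-weighted spatio-temporal covariance matrix.
    W = 1.0 / np.maximum(power, eps)
    R = (Xbar * W) @ Xbar.conj().T / T
    R += eps * np.eye(R.shape[0])

    # Distortionless constraint on the zero-padded steering vector.
    vbar = np.concatenate([steering, np.zeros(taps * C, dtype=complex)])
    num = np.linalg.solve(R, vbar)
    w = num / (vbar.conj() @ num)
    return w.conj() @ Xbar                         # (frames,) enhanced signal
```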
Title: DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
Link: https://arxiv.org/abs/2410.22803
Notes: Accepted to APSIPA 2023
Abstract: This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones. In this task, one may train a deep neural network (DNN) using FOA data annotated with the classes and directions of arrival (DOAs) of sound events. However, the performance of this approach is severely bounded by the amount of annotated data. To overcome this limitation, we propose a novel method of pretraining the feature extraction part of the DNN in a self-supervised manner. We use spatial audio-visual recordings abundantly available as virtual reality content. Assuming that sound objects are observed concurrently by the FOA microphones and an omnidirectional camera, we jointly train audio and visual encoders with contrastive learning such that the audio and visual embeddings of the same recording and DOA are brought close. A key feature of our method is that the DOA-wise audio embeddings are jointly extracted from the raw audio data, whereas the DOA-wise visual embeddings are separately extracted from local visual crops centered on the corresponding DOAs. This encourages the latent features of the audio encoder to represent both the classes and the DOAs of sound events. An experiment using the 20-hour DCASE2022 Task 3 dataset shows that 100 hours of non-annotated audio-visual recordings reduced the SELD error score from 36.4 pts to 34.9 pts.
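The DOA-wise audio-visual matching described above is, in spirit, a paired-embedding contrastive objective. The snippet below is a generic InfoNCE-style illustration under that assumption, not the paper's exact loss; each batch row pairs the audio embedding of one (recording, DOA) with the visual embedding of the crop centered on the same DOA.

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss: matching rows are positives,
    all other rows in the batch act as negatives."""
    a = F.normalize(audio_emb, dim=-1)      # (batch, dim)
    v = F.normalize(visual_emb, dim=-1)     # (batch, dim)
    logits = a @ v.t() / temperature        # similarity of every audio/visual pair
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```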
Title: Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization
Link: https://arxiv.org/abs/2410.22350
Abstract: In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, using a series of binary classification output layers to simultaneously identify the activities of all speakers. This end-to-end framework is carefully designed to handle overlapping speech, providing accurate discrimination between speech and non-speech segments by exploiting multi-modal information. Next, we employ a quality-aware audio-visual fusion structure to address signal quality issues for both audio degradations, such as noise, reverberation, and other distortions, and video degradations, such as occlusions, off-screen speakers, or unreliable detection. Finally, a cross-attention mechanism applied to multi-speaker embeddings empowers the network to handle scenarios with varying numbers of speakers. Our experimental results, obtained on various datasets, demonstrate the robustness of the proposed techniques in diverse acoustic environments. Even in scenarios with severely degraded video quality, our system attains performance comparable to the best available audio-visual systems.
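How a stack of binary outputs plus cross-attention over speaker embeddings can yield per-frame activities for a variable number of speakers is sketched below. This is an illustrative construction with assumed dimensions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PerSpeakerActivityHead(nn.Module):
    """Illustrative sketch: per-frame, per-speaker binary activity scores via
    cross-attention between speaker embeddings and fused audio-visual frames."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)

    def forward(self, frame_feats, speaker_embs):
        # frame_feats : (batch, frames, d_model)   fused audio-visual features
        # speaker_embs: (batch, speakers, d_model) one embedding per speaker
        # Each speaker embedding attends over the frame sequence, which lets
        # the same head handle a varying number of speakers.
        attended, _ = self.cross_attn(query=speaker_embs,
                                      key=frame_feats, value=frame_feats)
        # One sigmoid activity score per (frame, speaker) pair.
        scores = torch.einsum('bfd,bsd->bfs', frame_feats, attended)
        return torch.sigmoid(scores)        # (batch, frames, speakers)
```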
Title: Audiovisual Angle and Voice Incongruence Do Not Affect Audiovisual Verbal Short-Term Memory in Virtual Reality
Link: https://arxiv.org/abs/2410.23015
Notes: Submitted to PLOS ONE; 19 pages, 6 figures
Abstract: Virtual reality (VR) environments are frequently used in auditory and cognitive research to imitate real-life scenarios, presumably enhancing state-of-the-art approaches over traditional computer screens. However, the effects of different display technologies on audiovisual processing remain underexplored. This study investigated how VR displayed with a head-mounted display (HMD) affects serial recall performance compared to a traditional computer monitor, focusing on the effects on audiovisual processing in cognitive tasks. To that end, we conducted two experiments with both an HMD and a computer monitor as display devices and two types of audiovisual incongruence: angle (Exp. 1) and voice (Exp. 2) incongruence. To quantify cognitive performance, an audiovisual verbal serial recall (avVSR) task was developed in which an embodied conversational agent (ECA) was animated to speak the target digit sequence. Even though subjective evaluations showed a higher sense of presence in the HMD condition, we found no effect of the display device on the proportion of correctly recalled digits. For the extreme angle-incongruence conditions in the computer monitor presentation, the proportion of correctly recalled digits increased marginally, presumably due to raised attention, but the effect is likely too small to be meaningful. Response times were not affected by the incongruences on either display device across both experiments. These findings suggest that the avVSR task is robust against angular and voice audiovisual incongruences, irrespective of the display device, at least for the conditions studied here. Hence, this study introduces the avVSR task in VR and contributes to the understanding of audiovisual integration.
Title: Augmenting Polish Automatic Speech Recognition System with Synthetic Data
Link: https://arxiv.org/abs/2410.22903
Abstract: This paper presents a system developed for submission to Poleval 2024, Task 3: Polish Automatic Speech Recognition Challenge. We describe a Voicebox-based speech synthesis pipeline and use it to augment Conformer and Whisper speech recognition models with synthetic data. We show that adding synthetic speech to training significantly improves the achieved results. We also present the final results achieved by our models in the competition.
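The augmentation step itself reduces to mixing synthetic (audio, transcript) pairs into the real training pool. A trivial sketch, where the mixing ratio and manifest format are assumptions rather than the paper's recipe:

```python
import random

def build_training_manifest(real_utts, synthetic_utts, synth_ratio=0.5, seed=0):
    """Mix synthetic (audio, transcript) pairs into the real training pool at
    a chosen ratio; each entry is assumed to be {"audio": path, "text": str}."""
    rng = random.Random(seed)
    n_synth = int(len(real_utts) * synth_ratio)
    sampled = rng.sample(synthetic_utts, min(n_synth, len(synthetic_utts)))
    manifest = list(real_utts) + sampled
    rng.shuffle(manifest)
    return manifest
```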
Title: APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with a Staged Training Paradigm
Link: https://arxiv.org/abs/2410.22807
Notes: Accepted by ISCSLP 2025
Abstract: This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. APCodec+ takes the audio amplitude and phase spectra as the coding objects and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder, and discriminator are jointly trained with the complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that the proposed APCodec+ at low bitrates achieves performance comparable to baseline codecs at higher bitrates, thanks to the proposed staged training paradigm.
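The two-stage schedule can be summarized as: train everything jointly with all three losses, then freeze the encoder and quantizer, re-initialize the decoder and discriminator, and train them again without the quantization loss. The sketch below illustrates only that staging; the module interfaces and loss callables are placeholders, and the adversarial update is collapsed into a single loss term for brevity (a real GAN setup alternates generator and discriminator steps).

```python
import itertools
import torch

def staged_training(encoder, quantizer, decoder, discriminator, batches,
                    spectral_loss, quantization_loss, adversarial_loss,
                    joint_steps=1000, individual_steps=1000, lr=2e-4):
    """Schematic two-stage joint-individual schedule (interfaces assumed)."""

    def run(params, use_quant_loss, n_steps):
        opt = torch.optim.Adam(params, lr=lr)
        for _, audio in zip(range(n_steps), itertools.cycle(batches)):
            codes, q_loss = quantizer(encoder(audio))      # assumed interface
            recon = decoder(codes)
            loss = (spectral_loss(recon, audio)
                    + adversarial_loss(discriminator, recon, audio))
            if use_quant_loss:
                loss = loss + q_loss
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 1: encoder, quantizer, decoder, and discriminator trained jointly
    # with spectral, quantization, and adversarial losses.
    run(itertools.chain(encoder.parameters(), quantizer.parameters(),
                        decoder.parameters(), discriminator.parameters()),
        use_quant_loss=True, n_steps=joint_steps)

    # Stage 2: encoder and quantizer are frozen and only provide inputs;
    # decoder and discriminator are re-initialized and trained from scratch
    # without the quantization loss.
    for p in itertools.chain(encoder.parameters(), quantizer.parameters()):
        p.requires_grad_(False)
    for module in (decoder, discriminator):
        for layer in module.modules():
            if hasattr(layer, "reset_parameters"):
                layer.reset_parameters()
    run(itertools.chain(decoder.parameters(), discriminator.parameters()),
        use_quant_loss=False, n_steps=individual_steps)
```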
Title: A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
Link: https://arxiv.org/abs/2410.22448
Notes: NeurIPS 2024 Audio Imagination workshop paper; demo page at this https URL
Abstract: Neural audio codecs, initially designed as a compression technique, have recently gained more attention for speech generation. Codec models represent each audio frame as a sequence of tokens, i.e., discrete embeddings. The discrete and low-frequency nature of neural codecs introduced a new way to generate speech with token-based models. As these tokens encode information at various levels of granularity, from coarse to fine, most existing works focus on how to better generate the coarse tokens. In this paper, we focus on an equally important but often overlooked question: how can we better resynthesize the waveform from coarse tokens? We point out that both the choice of learning target and the resynthesis approach have a dramatic impact on the generated audio quality. Specifically, we study two different strategies based on token prediction and regression, and introduce a new method based on the Schrödinger Bridge. We examine how different design choices affect machine and human perception.
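The two learning targets contrasted in the abstract — discrete token prediction versus continuous regression — can be pictured as two heads on the same backbone. The sketch below is illustrative; the GRU backbone, dimensions, and targets are assumptions, and the Schrödinger Bridge variant is not shown.

```python
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFine(nn.Module):
    """Illustrative sketch: from coarse-token features, either (a) classify
    the discrete fine-token indices or (b) regress a continuous target such
    as fine-level embeddings."""

    def __init__(self, d_model=256, codebook_size=1024, target_dim=128):
        super().__init__()
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.token_head = nn.Linear(d_model, codebook_size)   # prediction target
        self.regress_head = nn.Linear(d_model, target_dim)    # regression target

    def forward(self, coarse_feats):            # (batch, frames, d_model)
        h, _ = self.backbone(coarse_feats)
        return self.token_head(h), self.regress_head(h)

def losses(model, coarse_feats, fine_tokens, fine_embeddings):
    logits, pred = model(coarse_feats)
    # Cross-entropy over the codebook vs. MSE on continuous embeddings.
    token_loss = F.cross_entropy(logits.transpose(1, 2), fine_tokens)
    regress_loss = F.mse_loss(pred, fine_embeddings)
    return token_loss, regress_loss
```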