This article is reposted with permission from arXiv每日学术速递 (arXiv Daily Digest).
Title: On the Role of Noise in Audiovisual Integration: Evidence from Artificial Neural Networks that Exhibit the McGurk Effect
Link: https://arxiv.org/abs/2411.05715
Abstract: Humans are able to fuse information from both auditory and visual modalities to help with understanding speech. This is frequently demonstrated through a phenomenon known as the McGurk effect, in which a listener is presented with incongruent auditory and visual speech that fuse into the percept of an illusory intermediate phoneme. Building on a recent framework that proposes how to address developmental 'why' questions using artificial neural networks, we evaluated a set of recent artificial neural networks trained on audiovisual speech by testing them with audiovisually incongruent words designed to elicit the McGurk effect. We compared networks trained on clean speech to those trained on noisy speech, and found that training with noisy speech led to an increase in both visual responses and McGurk responses across all models. Furthermore, we observed that systematically increasing the level of auditory noise during ANN training also increased the amount of audiovisual integration up to a point, but at extreme noise levels this integration failed to develop. These results suggest that excessive noise exposure during critical periods of audiovisual learning may negatively influence the development of audiovisual speech integration. This work also demonstrates that the McGurk effect reliably emerges, without being explicitly trained for, in the behaviour of both supervised and unsupervised networks. This supports the notion that artificial neural networks may be useful models for certain aspects of perception and cognition.
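As an illustration of the training manipulation and response taxonomy the abstract describes, here is a minimal Python sketch (not the authors' code) of mixing auditory noise at a controlled SNR and labelling a model's answer on an incongruent trial as auditory, visual, McGurk (fusion), or other; the function names and the classic ba/ga/da triplet in the example are illustrative assumptions.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Mix white noise into a waveform at the requested signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def score_response(predicted: str, auditory: str, visual: str, fusion: str) -> str:
    """Label a model's answer on an incongruent trial, e.g. auditory 'ba' + visual 'ga'
    with expected fusion percept 'da' (the classic McGurk triplet)."""
    if predicted == auditory:
        return "auditory"
    if predicted == visual:
        return "visual"
    if predicted == fusion:
        return "mcgurk"
    return "other"

print(score_response("da", auditory="ba", visual="ga", fusion="da"))  # -> mcgurk
```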
Title: Tell What You Hear From What You See -- Video to Audio Generation Through Text
Link: https://arxiv.org/abs/2411.05679
Note: NeurIPS 2024
Abstract: The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio tracks and vice versa. Therefore, in the video-to-audio generation task, it is necessary to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio together with an optional textual description of that audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text that complements the context of the visual information, and ii) the model can suggest what audio to generate for a video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer mapping video features to the LLM vector space, and VATT Audio, a transformer that generates audio tokens from visual frames and the optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that, when compared to existing video-to-audio generation methods on objective metrics, VATT achieves competitive performance when no audio caption is provided. When an audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text, and can suggest text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
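The following minimal PyTorch sketch illustrates the two ideas named in the abstract -- projecting video features into an LLM embedding space and filling in discrete audio tokens with iterative parallel decoding -- under assumed shapes and with stand-in modules; it is not the released VATT implementation, which conditions a fine-tuned LLM and decodes the tokens with a pretrained neural codec.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, D_LLM, D_VIDEO = 1024, 1024, 256, 4096, 768

video_to_llm = nn.Linear(D_VIDEO, D_LLM)    # stand-in for the VATT Converter projection layer
token_head = nn.Linear(D_LLM, VOCAB + 1)    # stand-in for the VATT Audio transformer
token_embed = nn.Embedding(VOCAB + 1, D_LLM)

@torch.no_grad()
def parallel_decode(video_feats: torch.Tensor, steps: int = 8) -> torch.Tensor:
    """Iteratively fill in masked audio tokens, committing the most confident positions each step."""
    cond = video_to_llm(video_feats).mean(dim=1, keepdim=True)     # (B, 1, D_LLM) pooled video condition
    tokens = torch.full((video_feats.shape[0], SEQ_LEN), MASK_ID, dtype=torch.long)
    per_step = SEQ_LEN // steps
    for _ in range(steps):
        logits = token_head(token_embed(tokens) + cond)            # (B, T, VOCAB+1)
        probs = logits[..., :VOCAB].softmax(dim=-1)
        conf, cand = probs.max(dim=-1)                             # best token and its confidence per position
        conf = conf.masked_fill(tokens.ne(MASK_ID), -1.0)          # never re-pick committed positions
        pick = conf.topk(per_step, dim=-1).indices
        tokens.scatter_(1, pick, cand.gather(1, pick))
    return tokens  # in VATT these would be decoded to a waveform by a pretrained neural codec

audio_tokens = parallel_decode(torch.randn(1, 32, D_VIDEO))        # 32 frames of 768-d video features
```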
Title: Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2411.05141
Abstract: Current leading Text-To-Audio (TTA) generation models suffer from degraded performance in zero-shot and few-shot settings. It is often challenging to generate high-quality audio for audio events that are unseen or uncommon in the training set. Inspired by the success of Retrieval-Augmented Generation (RAG) in Large Language Model (LLM)-based knowledge-intensive tasks, we extend the TTA process with additional conditioning contexts. We propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a conditional flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution, which generates audio conditioned on text alone, we augment the conditioning input with retrieved audio samples that provide additional acoustic information for generating the target audio. Our retrieval method does not require the external database to contain labeled audio, which enables more practical use cases. To evaluate the proposed method, we curated test sets in zero-shot and few-shot settings. Our empirical results show that the proposed model can effectively leverage the retrieved audio samples and significantly improve zero-shot and few-shot TTA performance, by large margins on multiple evaluation metrics, while maintaining the ability to generate semantically aligned audio in the in-domain setting. In addition, we investigate the effect of different retrieval methods and data sources.
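A minimal sketch of the retrieval-augmented conditioning idea follows: embed the text prompt and an unlabeled audio store in a shared space (a CLAP-style joint encoder is an assumption here, not stated in the abstract), retrieve the top-k nearest clips, and pass their features to the generator alongside the text condition.

```python
import numpy as np

def top_k_retrieve(query_emb: np.ndarray, store_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k audio clips whose embeddings are closest (cosine) to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    s = store_embs / np.linalg.norm(store_embs, axis=1, keepdims=True)
    sims = s @ q
    return np.argsort(-sims)[:k]

# Hypothetical shapes: 10k unlabeled clips embedded in a 512-d shared space.
rng = np.random.default_rng(0)
store = rng.standard_normal((10_000, 512))
query = rng.standard_normal(512)

neighbors = top_k_retrieve(query, store, k=3)
# An Audiobox-style flow-matching generator would then be conditioned on both the
# text embedding and the features of the retrieved clips store[neighbors].
conditioning = np.concatenate([query[None, :], store[neighbors]], axis=0)  # (1 + k, 512)
```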
Title: A Multi-Channel Non-Negative Matrix Factorization Based Ambient Denoising Method for Wheeze Detection
Link: https://arxiv.org/abs/2411.05774
Abstract: In this paper, a parallel computing method is proposed to perform background denoising and wheeze detection from a multi-channel recording captured during the auscultation process. The proposed system is based on a non-negative matrix factorization (NMF) approach and a detection strategy. The initialization of the proposed model is based on singular value decomposition, which avoids dependence on the initial values of the NMF parameters. Additionally, novel update rules have been designed that simultaneously address multichannel denoising while preserving an orthogonality constraint to maximize source separation. The proposed system has been evaluated on the wheeze detection task and shows a significant improvement over state-of-the-art algorithms when noisy sound sources are present. Moreover, parallel and high-performance computing techniques have been used to speed up execution, showing that fast execution times are achievable, which enables deployment in real-world scenarios.
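To make the core machinery concrete, here is a minimal single-channel sketch of SVD-based initialization followed by standard multiplicative NMF updates for the Frobenius objective; the paper's multichannel, orthogonality-constrained update rules and its parallel implementation are not reproduced here.

```python
import numpy as np

def svd_init(V: np.ndarray, rank: int):
    """Initialize W, H from the leading singular vectors (absolute values keep them non-negative)."""
    U, S, Vt = np.linalg.svd(V, full_matrices=False)
    W = np.abs(U[:, :rank]) * np.sqrt(S[:rank])
    H = np.sqrt(S[:rank])[:, None] * np.abs(Vt[:rank, :])
    return W, H

def nmf(V: np.ndarray, rank: int, iters: int = 200, eps: float = 1e-9):
    """Standard multiplicative updates minimizing ||V - WH||_F."""
    W, H = svd_init(V, rank)
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy spectrogram-like matrix (frequency x time), e.g. |STFT| of one auscultation channel.
V = np.abs(np.random.default_rng(1).standard_normal((257, 400)))
W, H = nmf(V, rank=8)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```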
Title: Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Link: https://arxiv.org/abs/2411.05361
Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interaction by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interaction. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks and making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that no model performs well universally: SALMONN-13B excelled in English ASR, and WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovation to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.
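For readers unfamiliar with instruction-based speech benchmarks, the sketch below shows the general shape of such an evaluation loop: each example pairs audio with a natural-language instruction, and the metric depends on the task type; the schema and metric are illustrative assumptions, not the official Dynamic-SUPERB Phase-2 pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    audio_path: str
    instruction: str   # e.g. "What emotion is expressed in this clip?"
    reference: str

def evaluate(model: Callable[[str, str], str], examples: List[Example],
             metric: Callable[[str, str], float]) -> float:
    """Run the model on (audio, instruction) pairs and average a per-example metric."""
    scores = [metric(model(ex.audio_path, ex.instruction), ex.reference) for ex in examples]
    return sum(scores) / max(len(scores), 1)

exact_match = lambda pred, ref: float(pred.strip().lower() == ref.strip().lower())
# evaluate(my_speech_llm, emotion_task_examples, exact_match) would report accuracy for a
# classification-style task; regression or ASR tasks would swap in a different metric.
```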
Title: Improving Speech-Based Emotion Recognition with Contextual Utterance Analysis and LLMs
Link: https://arxiv.org/abs/2410.20334
Abstract: Speech Emotion Recognition (SER) focuses on identifying emotional states from spoken language. The 2024 IEEE SLT-GenSEC Challenge on Post-Automatic Speech Recognition (ASR) Emotion Recognition tasks participants with exploring the capabilities of large language models (LLMs) for emotion recognition using only text data. We propose a novel approach that first refines all available transcriptions to ensure data reliability. We then segment each complete conversation into smaller dialogues and use these dialogues as context to predict the emotion of the target utterance within the dialogue. Finally, we investigate different context lengths and prompting techniques to improve prediction accuracy. Our best submission exceeded the baseline by 20% in unweighted accuracy, achieving the best performance in the challenge. All of our experiment code, prediction results, and log files are publicly available.
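The context-window prompting idea can be sketched as follows: take the N transcribed utterances preceding the target one and fold them into an LLM prompt that asks for the target utterance's emotion; the prompt wording and label set below are illustrative assumptions, not the paper's exact prompts.

```python
from typing import List, Tuple

LABELS = ["angry", "happy", "neutral", "sad"]  # assumed label set

def build_prompt(dialogue: List[Tuple[str, str]], target_idx: int, context_len: int) -> str:
    """dialogue is a list of (speaker, transcription); ask for the emotion of dialogue[target_idx]."""
    start = max(0, target_idx - context_len)
    context = "\n".join(f"{spk}: {text}" for spk, text in dialogue[start:target_idx])
    speaker, target = dialogue[target_idx]
    return (
        "Here is part of a conversation:\n"
        f"{context}\n\n"
        f"Target utterance ({speaker}): \"{target}\"\n"
        f"Which emotion best describes the target utterance? Options: {', '.join(LABELS)}.\n"
        "Answer with a single word."
    )

dialogue = [("A", "I waited an hour and nobody called back."),
            ("B", "I'm sorry, that shouldn't have happened."),
            ("A", "This is the third time this month.")]
print(build_prompt(dialogue, target_idx=2, context_len=2))  # sweep context_len to study its effect
```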