This article is reposted with permission from arXiv每日学术速递 (arXiv Daily Academic Digest).
Title: TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Link: https://arxiv.org/abs/2412.21037
Comments: this https URL
Abstract: We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
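The preference-pair idea behind CRPO can be illustrated with a small sketch. The abstract does not spell out the procedure, so the code below assumes one plausible reading: several candidate audios are generated per prompt, each is scored with a text-audio similarity function (a stand-in for CLAP), and the highest- and lowest-scoring candidates become the chosen and rejected samples. The names `build_preference_pairs` and `toy_score`, the file names, and the scores are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def build_preference_pairs(
    candidates: Dict[str, List[str]],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples ranked by score_fn.

    Assumption: the best-scoring candidate is "chosen" and the
    worst-scoring one is "rejected"; the abstract does not confirm this.
    """
    pairs = []
    for prompt, audios in candidates.items():
        if len(audios) < 2:
            continue  # a preference pair needs at least two candidates
        ranked = sorted(audios, key=lambda a: score_fn(prompt, a), reverse=True)
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs


# Toy stand-in for a CLAP-style similarity score (made-up values).
TOY_SCORES = {
    ("dog barking", "a.wav"): 0.42,
    ("dog barking", "b.wav"): 0.71,
    ("dog barking", "c.wav"): 0.15,
}


def toy_score(prompt: str, audio: str) -> float:
    return TOY_SCORES[(prompt, audio)]


pairs = build_preference_pairs(
    {"dog barking": ["a.wav", "b.wav", "c.wav"]}, toy_score
)
print(pairs)
```

In an iterative setup such pairs would feed a preference-optimization step, after which new candidates are generated and ranked again; that loop is omitted here.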
Title: Language-based Audio Retrieval with Co-Attention Networks
Link: https://arxiv.org/abs/2412.20914
Comments: Accepted at UIC 2024 proceedings. Accepted version
Abstract: In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method achieves better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset and a 15.1% improvement on AudioCaps.
Title: Audiopedia: Audio QA with Knowledge
Link: https://arxiv.org/abs/2412.20619
Comments: Accepted to ICASSP 2025
Abstract: In this paper, we introduce Audiopedia, a novel task called Audio Question Answering with Knowledge, which requires both audio comprehension and external knowledge reasoning. Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions. We define three sub-tasks: (i) Single Audio Question Answering (s-AQA), where questions are answered based on a single audio sample, (ii) Multi-Audio Question Answering (m-AQA), which requires reasoning over multiple audio samples, and (iii) Retrieval-Augmented Audio Question Answering (r-AQA), which involves retrieving relevant audio to answer the question. We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance. To address this, we propose a generic framework that can be adapted to any LALM, equipping them with knowledge reasoning capabilities. Our framework has two components: (i) Audio Entity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model (KA2LM), which together improve performance on knowledge-intensive AQA tasks. To our knowledge, this is the first work to address advanced audio understanding via knowledge-intensive tasks like Audiopedia.
Title: Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
Link: https://arxiv.org/abs/2412.20378
Comments: Accepted at AAAI 2025
Abstract: Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce a Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.
Title: Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting
Link: https://arxiv.org/abs/2412.20155
Comments: Accepted by ICASSP 2025
Abstract: Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning to maintain the synthesis ability for prior samples and prevent overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even with limited and noisy target speech samples.
Title: ASE: Practical Acoustic Speed Estimation Beyond Doppler via Sound Diffusion Field
Link: https://arxiv.org/abs/2412.20142
Abstract: Passive human speed estimation plays a critical role in acoustic sensing. Despite extensive study, existing systems suffer from various limitations. First, previous acoustic speed estimation exploits Doppler Frequency Shifts (DFS) created by moving targets and relies on microphone arrays, making such systems only capable of sensing the radial speed within a constrained distance. Second, the channel measurement rate proves inadequate to estimate high moving speeds. To overcome these issues, we present ASE, an accurate and robust Acoustic Speed Estimation system on a single commodity microphone. We model the sound propagation from a unique perspective of the acoustic diffusion field, and infer the speed from the acoustic spatial distribution, a completely different way of thinking about speed estimation beyond prior DFS-based approaches. We then propose a novel Orthogonal Time-Delayed Multiplexing (OTDM) scheme for acoustic channel estimation at a high rate that was previously infeasible, making it possible to estimate high speeds. We further develop novel techniques for motion detection and signal enhancement to deliver a robust and practical system. We implement and evaluate ASE through extensive real-world experiments. Our results show that ASE reliably tracks walking speed, independently of target location and direction, with a mean error of 0.13 m/s, a reduction of 2.5x from DFS, and a detection rate of 97.4% for large coverage, e.g., free walking in a 4 m × 4 m room. We believe ASE pushes acoustic speed estimation beyond the conventional DFS-based paradigm and will inspire exciting research in acoustic sensing.
Title: Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition
Link: https://arxiv.org/abs/2412.19909
Abstract: Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors like speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach by focusing on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis to the more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Our research leverages the CREMA-D and MSP-IMPROV corpora as benchmarks, and it reveals valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight the potential of mouth articulatory gestures as a better constraint for improving emotion recognition across different settings or domains.
Title: Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment
Link: https://arxiv.org/abs/2412.20821
Comments: ICASSP 2025 (2025 IEEE International Conference on Acoustics, Speech and Signal Processing)
Abstract: Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performance but also fails to address the complexity and ambiguity inherent in emotional expressions. In response, this paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules. This framework enables a multi-level perception of emotional information across modalities. Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.
Title: Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment
Link: https://arxiv.org/abs/2412.20805
Abstract: User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods commonly suffer from high false alarm rates with confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it is generalizable to jointly optimize both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system, supporting different modality enrollment methods within a unified framework. Verified on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
Title: Improving Acoustic Scene Classification in Low-Resource Conditions
Link: https://arxiv.org/abs/2412.20722
Comments: Accepted by ICASSP 2025. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component
Abstract: Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections for a balance of efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression and data augmentation methods like Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, and depthwise separable convolutions reduce computational overhead without sacrificing feature representation. Experimental results show that DS-FlexiNet excels in both adaptability and performance under resource-constrained conditions.
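The claim that depthwise separable convolutions reduce computational overhead can be made concrete with a parameter count. The sketch below compares a standard convolution with a depthwise separable one (a per-channel depthwise filter followed by a 1x1 pointwise convolution). The layer sizes are illustrative assumptions, not values from DS-FlexiNet, and bias terms are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k conv plus a 1x1 pointwise conv (bias ignored)."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {reduction:.2f}x")
```

For this hypothetical layer the separable form needs roughly one eighth of the weights; the exact saving in any real model depends on its kernel sizes and channel widths.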
Title: Metadata-Enhanced Speech Emotion Recognition: Augmented Residual Integration and Co-Attention in Two-Stage Fine-Tuning
Link: https://arxiv.org/abs/2412.20707
Comments: Accepted by ICASSP 2025. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component
Abstract: Speech Emotion Recognition (SER) involves analyzing vocal expressions to determine the emotional state of speakers, where the comprehensive and thorough utilization of audio information is paramount. Therefore, we propose a novel approach for self-supervised learning (SSL) models that employs all available auxiliary information, specifically metadata, to enhance performance. Through a two-stage fine-tuning method in multi-task learning, we introduce the Augmented Residual Integration (ARI) module, which enhances transformer layers in the encoder of SSL models. The module efficiently preserves acoustic features across all different levels, thereby significantly improving the performance of metadata-related auxiliary tasks that require various levels of features. Moreover, the Co-attention module is incorporated due to its complementary nature with ARI, enabling the model to effectively utilize multidimensional information and contextual relationships from metadata-related auxiliary tasks. With pre-trained base models and a speaker-independent setup, our approach consistently surpasses state-of-the-art (SOTA) models on multiple SSL encoders for the IEMOCAP dataset.
Title: EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion
Link: https://arxiv.org/abs/2412.20359
Comments: Accepted to AAAI 2025
Abstract: Emotional Voice Conversion (EVC) aims to convert the discrete emotional state from the source emotion to the target for a given speech utterance while preserving linguistic content. In this paper, we propose regularizing emotion intensity in the diffusion-based EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels that often lead to inept style manipulations and degradations in quality. In contrast, we aim to regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified based on the given target emotion intensity and the corresponding direction vector. Furthermore, the updated embeddings can be fused in the reverse diffusion process to generate the speech with the desired emotion and intensity. In summary, this paper aims to achieve high-quality emotional intensity regularization in the diffusion-based EVC framework, which is the first work of its kind. The effectiveness of the proposed method has been demonstrated against state-of-the-art (SOTA) baselines in terms of subjective and objective evaluations for the English and Hindi languages. Demo samples are available at https://nirmesh-sony.github.io/EmoReg/.
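The idea of modifying an emotion embedding "based on the given target emotion intensity and the corresponding direction vector" can be sketched as a simple linear shift. This is only an illustration under stated assumptions: the abstract does not say how the direction vector is obtained or applied, so the code assumes the direction is the unit vector from a mean neutral embedding to a mean emotional embedding, and that intensity scales a step along it. The function names and toy values are hypothetical.

```python
import math
from typing import List


def unit_direction(neutral_mean: List[float], emotion_mean: List[float]) -> List[float]:
    """Unit vector pointing from the neutral mean to the emotion mean (assumed definition)."""
    diff = [e - n for n, e in zip(neutral_mean, emotion_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("means coincide; direction is undefined")
    return [d / norm for d in diff]


def shift_embedding(embedding: List[float], direction: List[float], intensity: float) -> List[float]:
    """Move an embedding along the direction vector by the requested intensity."""
    return [x + intensity * d for x, d in zip(embedding, direction)]


# Toy 2-D example with made-up values.
direction = unit_direction([0.0, 0.0], [3.0, 4.0])
shifted = shift_embedding([1.0, 1.0], direction, intensity=0.5)
print(direction, shifted)
```

In the described framework the updated embedding would then condition the reverse diffusion process; that stage is outside the scope of this sketch.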
Title: Bird Vocalization Embedding Extraction Using Self-Supervised Disentangled Representation Learning
Link: https://arxiv.org/abs/2412.20146
Comments: Presented at Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR 2024), this https URL
Abstract: This paper addresses the extraction of bird vocalization embeddings at the whole-song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as the Variational Autoencoder (VAE) have proven effective at extracting such low-dimensional embeddings from vocalization segments at the note or syllable level. To extend the processing level to the entire song instead of cutting it into segments, this paper regards each vocalization as consisting of a generalized part and a discriminative part and uses two encoders to learn these two parts. The proposed method is evaluated on the Great Tits dataset according to clustering performance, and the results outperform the compared pre-trained models and the vanilla VAE. Finally, this paper analyzes the informative part of the embedding, further compresses its dimension, and explains the disentangled performance of bird vocalizations.
Title: Distance Based Single-Channel Target Speech Extraction
Link: https://arxiv.org/abs/2412.20144
Comments: 5 pages, 3 figures, accepted by ICASSP 2025
Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.
Title: CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Link: https://arxiv.org/abs/2412.20048
Abstract: The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
标题:通过多粒度跨模式对齐增强多模式情绪识别
链接:https://arxiv.org/abs/2412.20821
备注:ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
摘要:利用语音和文本的多模态情感识别(MER)已成为人机交互中的一个关键领域,需要复杂的方法来实现有效的多模态集成。在这些模式中对齐特征的挑战是巨大的,大多数现有方法都采用单一的对齐策略。这种狭隘的关注不仅限制了模型的性能,而且未能解决情感表达中固有的复杂性和模糊性。作为回应,本文介绍了一个多粒度跨模态对齐(MGCMA)框架,其全面的方法,包括基于分布,基于实例和基于标记的对齐模块。该框架实现了跨模态的情感信息的多层次感知。IEMOCAP上的实验表明,我们提出的方法优于当前最先进的技术。
摘要:Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performance but also fails to address the complexity and ambiguity inherent in emotional expressions. In response, this paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules. This framework enables a multi-level perception of emotional information across modalities. Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.
标题:基于灵活注册的用户定义关键词发现的音素级对比学习
链接:https://arxiv.org/abs/2412.20805
摘要:用户定义的关键字识别(KWS)通过允许个人自定义关键字来增强用户体验。然而,在开放词汇表的情况下,大多数现有的方法通常遭受高误报率与易混淆的单词,并限于要么音频或仅文本登记。因此,在本文中,我们首先探讨该模型的鲁棒性对易混淆的话。具体来说,我们提出了音素级对比学习(PLCL),它细化和对齐查询和源特征表示在音素级。该方法通过细粒度的正反比较提高模型的消歧能力,实现更精确的匹配,并具有通用性,可同时优化音-文匹配和音-音匹配,适应多种招生模式。此外,我们保持一个上下文无关的音素记忆库,以构建易混淆的否定数据增强。在此基础上,第三类底片被专门设计用于区分硬底片。总体而言,我们开发了一个强大而灵活的KWS系统,在一个统一的框架内支持不同的模态注册方法。在LibriPhrase数据集上进行了验证,所提出的方法实现了最先进的性能。
摘要:User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods commonly suffer from high false alarm rates with confusable words and are limited to either audio-only or text-only enrollment. Therefore, in this paper, we first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it is generalizable to jointly optimize both audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Based on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system, supporting different modality enrollment methods within a unified framework. Verified on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
标题:低资源条件下改进声场景分类
链接:https://arxiv.org/abs/2412.20722
备注:accepted by ICASSP2025. \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component
摘要:声学场景分类(ASC)基于音频信号来识别环境。本文探讨了低资源条件下的ASC,并提出了一种新的模型DS-ARMNet,该模型将MobileNetV 2中的dependency可分离卷积与ResNet启发的剩余连接相结合,以平衡效率和准确性。为了解决硬件限制和设备异构性问题,DS-BSNet采用量化感知训练(QAT)进行模型压缩和数据增强方法,如自动设备脉冲响应(AIDR)和频率混合风格(FMS),以提高跨设备泛化能力。来自12个教师模型的知识蒸馏(KD)进一步增强了看不见的设备上的性能。该架构包括一个自定义的残差归一化层来处理跨设备的域差异,并且依赖可分离卷积在不牺牲特征表示的情况下降低了计算开销。实验结果表明,DS-ARMNet在资源受限的情况下具有较好的适应性和性能。
摘要:Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections for a balance of efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression and data augmentation methods like Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, and depthwise separable convolutions reduce computational overhead without sacrificing feature representation. Experimental results show that DS-FlexiNet excels in both adaptability and performance under resource-constrained conditions.
标题:元数据增强的语音情感识别:两阶段微调中的增强残留积分和共同注意力
链接:https://arxiv.org/abs/2412.20707
备注:accepted by ICASSP2025. \c{opyright}2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component
摘要:语音情感识别是通过分析说话人的语音表达来判断说话人的情感状态,其中对音频信息的全面、充分利用至关重要。因此,我们提出了一种新的自监督学习(SSL)模型,采用所有可用的辅助信息-特别是元数据-以提高性能。通过多任务学习中的两阶段微调方法,我们引入了增强残差积分(ARI)模块,该模块增强了SSL模型编码器中的Transformer层。该模块有效地保留了所有不同级别的声学特征,从而显著提高了需要各种级别特征的元数据相关辅助任务的性能。此外,由于其与ARI的互补性,将共同注意力模块合并,使模型能够有效地利用元数据相关辅助任务的多维信息和上下文关系。在预训练的基础模型和说话者独立设置下,我们的方法在IEMOCAP数据集的多个SSL编码器上始终超过最先进的(SOTA)模型。
摘要:Speech Emotion Recognition (SER) involves analyzing vocal expressions to determine the emotional state of speakers, where the comprehensive and thorough utilization of audio information is paramount. Therefore, we propose a novel approach on self-supervised learning (SSL) models that employs all available auxiliary information -- specifically metadata -- to enhance performance. Through a two-stage fine-tuning method in multi-task learning, we introduce the Augmented Residual Integration (ARI) module, which enhances transformer layers in encoder of SSL models. The module efficiently preserves acoustic features across all different levels, thereby significantly improving the performance of metadata-related auxiliary tasks that require various levels of features. Moreover, the Co-attention module is incorporated due to its complementary nature with ARI, enabling the model to effectively utilize multidimensional information and contextual relationships from metadata-related auxiliary tasks. Under pre-trained base models and speaker-independent setup, our approach consistently surpasses state-of-the-art (SOTA) models on multiple SSL encoders for the IEMOCAP dataset.
标题:SYS Reg:基于扩散的语音转换中情感强度规则化的定向潜在向量建模
链接:https://arxiv.org/abs/2412.20359
备注:Accepted to AAAI 2025
摘要:情感语音转换(EVC)的目的是在保持语言内容的同时,将离散的情感状态从源情感转换为目标情感。在本文中,我们提出了正则化情绪强度的扩散为基础的EVC框架,以产生精确的语音的目标情感。传统的方法通过情感类别概率或强度标签来控制话语中情感状态的强度,这通常导致不适当的风格操作和质量下降。相反,我们的目标是在基于扩散的框架内,在情感嵌入空间中使用基于自监督学习的特征表示和无监督方向潜在向量建模(DVM)来调节情感强度。这些情感嵌入可以基于给定的目标情感强度和对应的方向向量来修改。此外,更新后的嵌入可以在反向扩散过程中融合,以生成具有所需情感和强度的语音。总之,本文的目标是实现高质量的情绪强度正则化的扩散为基础的EVC框架,这是第一个同类工作。在英语和印地语的主观和客观评估方面,所提出的方法的有效性已经在最先进的(SOTA)基线上得到了证明\footnote{演示样本可在以下URL获得:\url{https://nirmesh-sony.github.io/nirmeshReg/}}。
摘要:The Emotional Voice Conversion (EVC) aims to convert the discrete emotional state from the source emotion to the target for a given speech utterance while preserving linguistic content. In this paper, we propose regularizing emotion intensity in the diffusion-based EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels that often lead to inept style manipulations and degradations in quality. On the contrary, we aim to regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified based on the given target emotion intensity and the corresponding direction vector. Furthermore, the updated embeddings can be fused in the reverse diffusion process to generate the speech with the desired emotion and intensity. In summary, this paper aims to achieve high-quality emotional intensity regularization in the diffusion-based EVC framework, which is the first of its kind work. The effectiveness of the proposed method has been shown across state-of-the-art (SOTA) baselines in terms of subjective and objective evaluations for the English and Hindi languages \footnote{Demo samples are available at the following URL: \url{https://nirmesh-sony.github.io/EmoReg/}}.
Title: Bird Vocalization Embedding Extraction Using Self-Supervised Disentangled Representation Learning
Link: https://arxiv.org/abs/2412.20146
Comments: Presented at Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR 2024)
Abstract: This paper addresses the extraction of bird vocalization embeddings at the whole-song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as the Variational Autoencoder (VAE) have proven effective at extracting such low-dimensional embeddings from vocalization segments at the note or syllable level. To extend the processing level to the entire song instead of cutting it into segments, this paper regards each vocalization as consisting of a generalized part and a discriminative part and uses two encoders to learn these two parts. The proposed method is evaluated on the Great Tits dataset in terms of clustering performance, and the results outperform the compared pre-trained models and a vanilla VAE. Finally, this paper analyzes the informative part of the embedding, further compresses its dimension, and explains the disentangled performance of bird vocalizations.
Title: Distance Based Single-Channel Target Speech Extraction
Link: https://arxiv.org/abs/2412.20144
Comments: 5 pages, 3 figures, accepted by ICASSP 2025
Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.
Title: CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Link: https://arxiv.org/abs/2412.20048
Abstract: The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
Title: Two-component spatiotemporal template for activation-inhibition of speech in ECoG
Link: https://arxiv.org/abs/2412.21178
Abstract: I compute the average trial-by-trial power of band-limited speech activity across epochs of multi-channel high-density electrocorticography (ECoG) recorded from multiple subjects during a consonant-vowel speaking task. I show that previously seen anti-correlations of average beta frequency activity (12-35 Hz) to high-frequency gamma activity (70-140 Hz) during speech movement are observable between individual ECoG channels in the sensorimotor cortex (SMC). With this, I fit a variance-based model using principal component analysis to the band-powers of individual channels of session-averaged ECoG data in the SMC and project SMC channels onto their lower-dimensional principal components. Spatiotemporal relationships between speech-related activity and principal components are identified by correlating the principal components of both frequency bands to individual ECoG channels over time using windowed correlation. Correlations of principal component areas to sensorimotor areas reveal a distinct two-component activation-inhibition-like representation for speech that resembles distinct local sensorimotor areas recently shown to have complex interplay in whole-body motor control, inhibition, and posture. Notably, the third principal component shows insignificant correlations across all subjects, suggesting two components of ECoG are sufficient to represent SMC activity during speech movement.
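The abstract above relies on windowed correlation to relate band-power signals over time. The sketch below shows only that one step, as a sliding-window Pearson correlation between two series. It is not the author's pipeline: the PCA projection is omitted, and the function names, window length, and toy beta and gamma traces are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def windowed_correlation(x, y, window, step=1):
    """Correlate x and y inside each sliding window of `window` samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [
        pearson(x[i:i + window], y[i:i + window])
        for i in range(0, len(x) - window + 1, step)
    ]


# Toy band-power traces (hypothetical): gamma rises exactly when beta falls.
beta_power = [5.0, 4.0, 3.0, 2.0, 3.0, 4.0, 5.0]
gamma_power = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]

corrs = windowed_correlation(beta_power, gamma_power, window=4)
```

In this toy case every window yields a correlation of -1, mirroring the beta-gamma anti-correlation the abstract reports. Real recordings would give values that vary from window to window.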
Title: Audiopedia: Audio QA with Knowledge
Link: https://arxiv.org/abs/2412.20619
Comments: Accepted to ICASSP 2025
Abstract: In this paper, we introduce Audiopedia, a novel task called Audio Question Answering with Knowledge, which requires both audio comprehension and external knowledge reasoning. Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions. We define three sub-tasks: (i) Single Audio Question Answering (s-AQA), where questions are answered based on a single audio sample, (ii) Multi-Audio Question Answering (m-AQA), which requires reasoning over multiple audio samples, and (iii) Retrieval-Augmented Audio Question Answering (r-AQA), which involves retrieving relevant audio to answer the question. We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance. To address this, we propose a generic framework that can be adapted to any LALM, equipping it with knowledge reasoning capabilities. Our framework has two components: (i) Audio Entity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model (KA2LM), which together improve performance on knowledge-intensive AQA tasks. To our knowledge, this is the first work to address advanced audio understanding via knowledge-intensive tasks like Audiopedia.
Title: Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
Link: https://arxiv.org/abs/2412.20378
Comments: AAAI 2025 Accepted
Abstract: Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.
Title: Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting
Link: https://arxiv.org/abs/2412.20155
Comments: Accepted by ICASSP 2025
Abstract: Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning to maintain the synthesis ability for prior samples and prevent overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even with limited and noisy target speech samples.
Title: ASE: Practical Acoustic Speed Estimation Beyond Doppler via Sound Diffusion Field
Link: https://arxiv.org/abs/2412.20142
Abstract: Passive human speed estimation plays a critical role in acoustic sensing. Despite extensive study, existing systems suffer from various limitations. First, previous acoustic speed estimation exploits Doppler Frequency Shifts (DFS) created by moving targets and relies on microphone arrays, making such systems only capable of sensing the radial speed within a constrained distance. Second, the channel measurement rate proves inadequate to estimate high moving speeds. To overcome these issues, we present ASE, an accurate and robust Acoustic Speed Estimation system on a single commodity microphone. We model the sound propagation from a unique perspective of the acoustic diffusion field, and infer the speed from the acoustic spatial distribution, a completely different way of thinking about speed estimation beyond prior DFS-based approaches. We then propose a novel Orthogonal Time-Delayed Multiplexing (OTDM) scheme for acoustic channel estimation at a high rate that was previously infeasible, making it possible to estimate high speeds. We further develop novel techniques for motion detection and signal enhancement to deliver a robust and practical system. We implement and evaluate ASE through extensive real-world experiments. Our results show that ASE reliably tracks walking speed, independently of target location and direction, with a mean error of 0.13 m/s, a reduction of 2.5x from DFS, and a detection rate of 97.4% for large coverage, e.g., free walking in a 4 m × 4 m room. We believe ASE pushes acoustic speed estimation beyond the conventional DFS-based paradigm and will inspire exciting research in acoustic sensing.
Title: Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition
Link: https://arxiv.org/abs/2412.19909
Abstract: Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors like speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach by focusing on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis to the more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Our research uses the CREMA-D and MSP-IMPROV corpora as benchmarks and reveals valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight the potential of mouth articulatory gestures as a better constraint for improving emotion recognition across different settings or domains.