Speech/Audio Processing arXiv Digest [11.6]

Digest   2024-11-06 18:06   Beijing
Today's collection: 14 papers in cs.SD (Sound) and 8 in eess.AS (Audio and Speech Processing).

Reprinted with permission from the arXiv Daily academic digest.

WeChat official account: arXiv_Daily

cs.SD (Sound)

【1】 pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Link: https://arxiv.org/abs/2411.03109
Authors: Ziyang Jiang, Xinquan Qian, Jiahe Lei, Zexu Pan, Wei Xue, Xu-cheng Yin
Abstract: TSE aims to extract the clean speech of the target speaker in an audio mixture, thus eliminating irrelevant background noise and speech. While prior work has explored various auxiliary cues including pre-recorded speech, visual information (e.g., lip motions and gestures), and spatial information, the acquisition and selection of such strong cues are infeasible in many practical scenarios. Unlike all existing work, in this paper, we condition the TSE algorithm on semantic cues extracted from limited and unaligned text content, such as condensed points from a presentation slide. This method is particularly useful in scenarios like meetings, poster sessions, or lecture presentations, where acquiring other cues in real-time is challenging. To this end, we design two different networks. Specifically, our proposed TPE fuses audio features with content-based semantic cues to facilitate time-frequency mask generation to filter out extraneous noise, while another proposal, namely TSR, employs the contrastive learning technique to associate blindly separated speech signals with semantic cues. The experimental results show the efficacy in accurately identifying the target speaker by utilizing semantic cues derived from limited and unaligned text, resulting in SI-SDRi of 12.16 dB, SDRi of 12.66 dB, PESQi of 0.830 and STOIi of 0.150. Dataset and source code will be publicly available. Project demo page: https://slideTSE.github.io/.
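
For readers unfamiliar with the reported metrics, the sketch below shows how SI-SDR and its improvement over the unprocessed mixture (SI-SDRi) are conventionally computed; it is a generic NumPy reference, not the authors' evaluation code.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```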

【2】 Speech Separation with Pretrained Frontend to Minimize Domain Mismatch
Link: https://arxiv.org/abs/2411.03085
Authors: Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang, Haizhou Li
Note: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Abstract: Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of target reference in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect that the speech separation skills learned from synthetic data can be effectively transferred to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.
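
The downstream setup described above, a frozen pretrained frontend feeding a separator trained on synthetic mixtures, can be sketched as follows. TinyFrontend and TinySeparator are stand-ins invented for illustration; the actual DIP frontend and separation models are described in the paper.

```python
import torch
import torch.nn as nn

class TinyFrontend(nn.Module):
    """Stand-in for the pretrained DIP frontend (the real model is not reproduced here)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
    def forward(self, wav):                      # wav: (batch, samples)
        return torch.relu(self.conv(wav.unsqueeze(1)))   # (batch, feat_dim, frames)

class TinySeparator(nn.Module):
    """Stand-in downstream separator predicting one mask per source."""
    def __init__(self, feat_dim=64, n_src=2):
        super().__init__()
        self.mask_net = nn.Conv1d(feat_dim, feat_dim * n_src, kernel_size=1)
        self.n_src = n_src
    def forward(self, feats):
        masks = torch.sigmoid(self.mask_net(feats))
        return [m * feats for m in masks.chunk(self.n_src, dim=1)]

frontend, separator = TinyFrontend(), TinySeparator()
for p in frontend.parameters():                  # freeze the pretrained frontend
    p.requires_grad = False
frontend.eval()

mixture = torch.randn(4, 16000)                  # toy batch of 1-second mixtures
with torch.no_grad():
    feats = frontend(mixture)                    # domain-invariant features, no gradient
est_sources = separator(feats)                   # only the separator would be optimized
```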

【3】 Real-Time Scream Detection and Position Estimation for Worker Safety in Construction Sites
Link: https://arxiv.org/abs/2411.03016
Authors: Bikalpa Gautam, Anmol Guragain, Sarthak Giri
Note: 12 pages, 14 figures, 1 table, submitted to AIRISE conference
Abstract: The construction industry faces high risks due to frequent accidents, often leaving workers in perilous situations where rapid response is critical. Traditional safety monitoring methods, including wearable sensors and GPS, often fail under obstructive or indoor conditions. This research introduces a novel real-time scream detection and localization system tailored for construction sites, especially in low-resource environments. The system integrates Wav2Vec2 and Enhanced ConvNet models for accurate scream detection, coupled with the GCC-PHAT algorithm for robust time delay estimation under reverberant conditions, followed by a gradient descent-based approach for precise position estimation in noisy environments. Our approach combines these components to achieve high detection accuracy and rapid localization, thereby minimizing false alarms and optimizing emergency response. Preliminary results demonstrate that the system not only accurately detects distress calls amidst construction noise but also reliably identifies the caller's location. This solution represents a substantial improvement in worker safety, with the potential for widespread application across high-risk occupational environments. The scripts used for training, evaluation of scream detection, position estimation, and the integrated framework will be released at: https://github.com/Anmol2059/construction_safety.
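
GCC-PHAT, the time-delay estimator named above, is a classical algorithm; a minimal NumPy sketch (independent of the authors' released scripts) looks like this:

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int, max_tau: float = None) -> float:
    """Estimate the time delay (seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Toy usage: a 5 ms delayed copy of white noise.
fs = 16000
x = np.random.randn(fs)
delay = int(0.005 * fs)
y = np.concatenate([np.zeros(delay), x[:-delay]])
print(gcc_phat(y, x, fs))                        # approximately 0.005
```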

【4】 Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT
Link: https://arxiv.org/abs/2411.02964
Authors: Pourya Jafarzadeh, Amir Mohammad Rostami, Padideh Choobdar
Abstract: Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.
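
A generic version of the pipeline described above, a pretrained self-supervised encoder used as a feature extractor with a small classification head, can be written with the HuggingFace transformers API; the checkpoint, pooling, and head below are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base-960h"                         # example checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
encoder = Wav2Vec2Model.from_pretrained(ckpt)
classifier = nn.Linear(encoder.config.hidden_size, 8)        # e.g. the 8 RAVDESS emotion classes

waveform = torch.randn(16000)                                # 1 s of dummy audio at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state             # (1, frames, hidden_size)
logits = classifier(hidden.mean(dim=1))                      # mean-pool over time, then classify
predicted_emotion = logits.argmax(dim=-1)
```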

【5】 Continual Audio-Visual Sound Separation
Link: https://arxiv.org/abs/2411.02860
Authors: Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
Note: NeurIPS 2024
Abstract: In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (Continual Audio-Visual Sound Separation). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024.
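
The exact CrossSDC formulation is given in the paper; the sketch below shows a generic cross-modal similarity distillation loss of the kind described, in which the audio-visual similarity structure of a frozen old model supervises the current model. Tensor shapes and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_distillation(audio_new, visual_new, audio_old, visual_old, tau=0.1):
    """Generic cross-modal similarity distillation (a sketch, not the exact CrossSDC).

    Inputs are (batch, dim) embeddings; *_old come from the frozen model trained on
    previous tasks and provide the similarity structure to be preserved.
    """
    def sim_matrix(a, v):
        a = F.normalize(a, dim=-1)
        v = F.normalize(v, dim=-1)
        return a @ v.t() / tau                       # (batch, batch) audio-visual similarities

    s_new = sim_matrix(audio_new, visual_new)
    with torch.no_grad():
        s_old = sim_matrix(audio_old, visual_old)
    # Match the new model's cross-modal similarity distribution to the old model's.
    return F.kl_div(F.log_softmax(s_new, dim=-1),
                    F.softmax(s_old, dim=-1), reduction="batchmean")

loss = similarity_distillation(torch.randn(8, 256), torch.randn(8, 256),
                               torch.randn(8, 256), torch.randn(8, 256))
```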

【6】 Adversarial multi-task underwater acoustic target recognition: towards robustness against various influential factors
Link: https://arxiv.org/abs/2411.02848
Authors: Yuan Xie, Ji Xu, Jiawei Ren, Junfeng Li
Abstract: Underwater acoustic target recognition based on passive sonar faces numerous challenges in practical maritime applications. One of the main challenges lies in the susceptibility of signal characteristics to diverse environmental conditions and data acquisition configurations, which can lead to instability in recognition systems. While significant efforts have been dedicated to addressing these influential factors in other domains of underwater acoustics, they are often neglected in the field of underwater acoustic target recognition. To overcome this limitation, this study designs auxiliary tasks that model influential factors (e.g., source range, water column depth, or wind speed) based on available annotations and adopts a multi-task framework to connect these factors to the recognition task. Furthermore, we integrate an adversarial learning mechanism into the multi-task framework to prompt the model to extract representations that are robust against influential factors. Through extensive experiments and analyses on the ShipsEar dataset, our proposed adversarial multi-task model demonstrates its capacity to effectively model the influential factors and achieve state-of-the-art performance on the 12-class recognition task.
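
The abstract does not spell out the adversarial mechanism; one standard way to obtain factor-invariant shared representations in a multi-task setup is a gradient reversal layer, sketched below as an illustration (not necessarily the authors' exact choice).

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Toy usage: a shared encoder feeds both the recognition head and an adversarial head that
# tries to predict an influential factor (e.g. a wind-speed bin). Reversing the gradient of
# the adversarial branch pushes the shared features to become invariant to that factor.
encoder = nn.Linear(40, 32)
recog_head = nn.Linear(32, 12)        # 12 ship classes, as in the ShipsEar task above
factor_head = nn.Linear(32, 5)        # hypothetical 5 wind-speed bins

feats = encoder(torch.randn(8, 40))
class_logits = recog_head(feats)
factor_logits = factor_head(grad_reverse(feats, lam=0.5))
```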

【7】 Advancing Robust Underwater Acoustic Target Recognition through Multi-task Learning and Multi-Gate Mixture-of-Experts
Link: https://arxiv.org/abs/2411.02787
Authors: Yuan Xie, Jiawei Ren, Junfeng Li, Ji Xu
Abstract: Underwater acoustic target recognition has emerged as a prominent research area within the field of underwater acoustics. However, the current availability of authentic underwater acoustic signal recordings remains limited, which hinders data-driven acoustic recognition models from learning robust patterns of targets from a limited set of intricate underwater signals, thereby compromising their stability in practical applications. To overcome these limitations, this study proposes a recognition framework called M3 (Multi-task, Multi-gate, Multi-expert) to enhance the model's ability to capture robust patterns by making it aware of the inherent properties of targets. In this framework, an auxiliary task that focuses on target properties, such as estimating target size, is designed. The auxiliary task then shares parameters with the recognition task to realize multi-task learning. This paradigm allows the model to concentrate on shared information across tasks and identify robust patterns of targets in a regularized manner, thereby enhancing the model's generalization ability. Moreover, M3 incorporates multi-expert and multi-gate mechanisms, allowing for the allocation of distinct parameter spaces to various underwater signals. This enables the model to process intricate signal patterns in a fine-grained and differentiated manner. To evaluate the effectiveness of M3, extensive experiments were implemented on the ShipsEar underwater ship-radiated noise dataset. The results substantiate that M3 has the ability to outperform the most advanced single-task recognition models, thereby achieving state-of-the-art performance.
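
The multi-gate mixture-of-experts pattern that M3 builds on can be sketched as follows; the dimensions, number of experts, and linear experts/gates are illustrative simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Minimal multi-gate mixture-of-experts layer in the spirit of M3 (not the authors' code)."""
    def __init__(self, in_dim, expert_dim, n_experts=4, n_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(in_dim, expert_dim) for _ in range(n_experts)])
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])

    def forward(self, x):                                   # x: (batch, in_dim)
        expert_out = torch.stack([torch.relu(e(x)) for e in self.experts], dim=1)  # (B, E, D)
        task_feats = []
        for gate in self.gates:                             # one softmax gate per task
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)        # (B, E, 1)
            task_feats.append((w * expert_out).sum(dim=1))          # (B, D)
        return task_feats

mmoe = MMoE(in_dim=128, expert_dim=64)
recog_feat, aux_feat = mmoe(torch.randn(16, 128))           # recognition head + size-estimation head
```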

【8】 DEMONet: Underwater Acoustic Target Recognition based on Multi-Expert Network and Cross-Temporal Variational Autoencoder
Link: https://arxiv.org/abs/2411.02758
Authors: Yuan Xie, Xiaowei Zhang, Jiawei Ren, Ji Xu
Abstract: Building a robust underwater acoustic recognition system in real-world scenarios is challenging due to the complex underwater environment and the dynamic motion states of targets. A promising optimization approach is to leverage the intrinsic physical characteristics of targets, which remain invariable regardless of environmental conditions, to provide robust insights. However, our study reveals that while physical characteristics exhibit robust properties, they may lack class-specific discriminative patterns. Consequently, directly incorporating physical characteristics into model training can potentially introduce unintended inductive biases, leading to performance degradation. To utilize the benefits of physical characteristics while mitigating possible detrimental effects, we propose DEMONet in this study, which utilizes the detection of envelope modulation on noise (DEMON) to provide robust insights into the shaft frequency or blade counts of targets. DEMONet is a multi-expert network that allocates various underwater signals to their best-matched expert layer based on DEMON spectra for fine-grained signal processing. Thereinto, DEMON spectra are solely responsible for providing implicit physical characteristics without establishing a mapping relationship with the target category. Furthermore, to mitigate noise and spurious modulation spectra in DEMON features, we introduce a cross-temporal alignment strategy and employ a variational autoencoder (VAE) to reconstruct noise-resistant DEMON spectra to replace the raw DEMON features. The effectiveness of the proposed DEMONet with cross-temporal VAE was primarily evaluated on the DeepShip dataset and our proprietary datasets. Experimental results demonstrated that our approach could achieve state-of-the-art performance on both datasets.
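
DEMON analysis itself is a textbook procedure: band-pass filtering, envelope detection, then a spectrum of the envelope. A minimal SciPy sketch of the kind of feature DEMONet consumes, with an assumed analysis band, is:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def demon_spectrum(x: np.ndarray, fs: int, band=(1000.0, 4000.0)):
    """Basic DEMON analysis: band-pass, Hilbert envelope, then the spectrum of the envelope."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    xb = sosfiltfilt(sos, x)
    env = np.abs(hilbert(xb))                 # amplitude envelope carries the modulation
    env = env - env.mean()                    # remove the DC component before the FFT
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(env.size, d=1.0 / fs)
    return freqs, spec                        # peaks indicate shaft- or blade-rate modulation

# Toy usage: broadband noise amplitude-modulated at 10 Hz, roughly what a propeller's
# blade rate does to cavitation noise; the DEMON spectrum should peak near 10 Hz.
fs = 16000
t = np.arange(0, 2.0, 1.0 / fs)
x = (1.0 + 0.5 * np.sin(2 * np.pi * 10 * t)) * np.random.randn(t.size)
freqs, spec = demon_spectrum(x, fs)
```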

【9】 Self-Supervised Multi-View Learning for Disentangled Music Audio Representations
Link: https://arxiv.org/abs/2411.02711
Authors: Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello
Note: Late Breaking Demo at ISMIR 2024. this https URL
Abstract: Self-supervised learning (SSL) offers a powerful way to learn robust, generalizable representations without labeled data. In music, where labeled data is scarce, existing SSL methods typically use generated supervision and multi-view redundancy to create pretext tasks. However, these approaches often produce entangled representations and lose view-specific information. We propose a novel self-supervised multi-view learning framework for audio designed to incentivize separation between private and shared representation spaces. A case study on audio disentanglement in a controlled setting demonstrates the effectiveness of our method.

【10】 EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
Link: https://arxiv.org/abs/2411.02625
Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Abstract: Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.
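
The conditional flow matching objective used to train such decoders can be illustrated with a toy velocity network and the standard linear-path loss; this is a generic sketch under simplified assumptions, not the EmoSphere++ decoder.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v(x_t, t, cond); the real decoder is a far larger network."""
    def __init__(self, dim=80, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))
    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Linear-path conditional flow matching: regress the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), 1)             # random time in (0, 1)
    x_t = (1 - t) * x0 + t * x1               # point on the straight path from noise to data
    return ((model(x_t, t, cond) - (x1 - x0)) ** 2).mean()

model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(8, 80), torch.randn(8, 16))   # 80-dim mel frames
```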

【11】 Estimating the Number and Locations of Boundaries in Reverberant Environments with Deep Learning
Link: https://arxiv.org/abs/2411.02609
Authors: Toros Arikan, Luca M. Chackalackal, Fatima Ahsan, Konrad Tittel, Andrew C. Singer, Gregory W. Wornell, Richard G. Baraniuk
Abstract: Underwater acoustic environment estimation is a challenging but important task for remote sensing scenarios. Current estimation methods require high signal strength and a solution to the fragile echo labeling problem to be effective. In previous publications, we proposed a general deep learning-based method for two-dimensional environment estimation which outperformed the state-of-the-art, both in simulation and in real-life experimental settings. A limitation of this method was that some prior information had to be provided by the user on the number and locations of the reflective boundaries, and that its neural networks had to be re-trained accordingly for different environments. Utilizing more advanced neural network and time delay estimation techniques, the proposed improved method no longer requires prior knowledge of the number of boundaries or their locations, and is able to estimate two-dimensional environments with one or two boundaries. Future work will extend the proposed method to more boundaries and larger-scale environments.

【12】 PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
Link: https://arxiv.org/abs/2411.02551
Authors: Hayeon Bang, Eunjin Choi, Megan Finch, Seungheon Doh, Seolhee Lee, Gyeong-Hoon Lee, Juan Nam
Note: Accepted for publication at the 3rd Workshop on NLP for Music and Audio (NLP4MusA 2024)
Abstract: While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Utilizing a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models. Among many possible tasks with the multi-modal dataset, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.

【13】 Optimal Transport Maps are Good Voice Converters
Link: https://arxiv.org/abs/2411.02402
Authors: Arip Asadulaev, Rostislav Korst, Vitalii Shutov, Alexander Korotin, Yaroslav Grebnyak, Vahe Egiazarian, Evgeny Burnaev
Abstract: Recently, neural network-based methods for computing optimal transport maps have been effectively applied to style transfer problems. However, the application of these methods to voice conversion is underexplored. In our paper, we fill this gap by investigating optimal transport as a framework for voice conversion. We present a variety of optimal transport algorithms designed for different data representations, such as mel-spectrograms and latent representations of self-supervised speech models. For the mel-spectrogram data representation, we achieve strong results in terms of Fréchet Audio Distance (FAD). This performance is consistent with our theoretical analysis, which suggests that our method provides an upper bound on the FAD between the target and generated distributions. Within the latent space of the WavLM encoder, we achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.
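
FAD, the metric quoted above, is the Fréchet distance between Gaussians fitted to embeddings of reference and generated audio (the embeddings usually come from a pretrained audio classifier); a minimal sketch of that computation is:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians fitted to embedding sets (the core of FAD)."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):              # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# In FAD the embeddings come from a pretrained audio model (commonly VGGish);
# random vectors stand in here for the reference and generated embedding sets.
ref = np.random.randn(1000, 128)
gen = np.random.randn(1000, 128)
fad = frechet_distance(ref.mean(0), np.cov(ref, rowvar=False),
                       gen.mean(0), np.cov(gen, rowvar=False))
```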

【14】 Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features
Link: https://arxiv.org/abs/2411.03172
Authors: Hanyu Meng, Jeroen Breebaart, Jeremy Stoddard, Vidhyasaharan Sethu, Eliathamby Ambikairajah
Note: Submitted to ICASSP2025
Abstract: Estimating frequency-varying acoustic parameters is essential for enhancing immersive perception in realistic spatial audio creation. In this paper, we propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands using first-order Ambisonics (FOA) speech recordings as inputs. The proposed framework utilizes a novel feature named Spectro-Spatial Covariance Vector (SSCV), efficiently representing temporal, spectral as well as spatial information of the FOA signal. Our models significantly outperform existing single-channel methods with only spectral information, reducing estimation errors by more than half for all three acoustic parameters. Additionally, we introduce FOA-Conv3D, a novel back-end network for effectively utilising the SSCV feature with a 3D convolutional encoder. FOA-Conv3D outperforms the convolutional neural network (CNN) and recurrent convolutional neural network (CRNN) backends, achieving lower estimation errors and accounting for a higher proportion of variance (PoV) for all 3 acoustic parameters.
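
For reference, the parameters being estimated here are conventionally defined from a measured room impulse response as below; this is the standard ground-truth computation, not the blind estimator proposed in the paper.

```python
import numpy as np

def clarity_c50(rir: np.ndarray, fs: int) -> float:
    """C50 in dB: early (< 50 ms after the direct path) vs. late energy of an impulse response."""
    onset = int(np.argmax(np.abs(rir)))           # crude direct-path detection
    split = onset + int(0.05 * fs)
    early = np.sum(rir[onset:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / (late + 1e-12))

def t60_schroeder(rir: np.ndarray, fs: int) -> float:
    """T60 via Schroeder backward integration, extrapolated from the -5 to -25 dB decay range."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(edc_db.size) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # decay rate in dB per second
    return -60.0 / slope

# Toy usage: exponentially decaying noise with a decay constant chosen for T60 of about 0.5 s.
fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
rir = np.random.randn(t.size) * np.exp(-6.91 * t / 0.5)
print(clarity_c50(rir, fs), t60_schroeder(rir, fs))
```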

eess.AS (Audio and Speech Processing)

【1】 Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features
Link: https://arxiv.org/abs/2411.03172
Authors: Hanyu Meng, Jeroen Breebaart, Jeremy Stoddard, Vidhyasaharan Sethu, Eliathamby Ambikairajah
Note: Submitted to ICASSP2025
Abstract: Estimating frequency-varying acoustic parameters is essential for enhancing immersive perception in realistic spatial audio creation. In this paper, we propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands using first-order Ambisonics (FOA) speech recordings as inputs. The proposed framework utilizes a novel feature named Spectro-Spatial Covariance Vector (SSCV), efficiently representing temporal, spectral as well as spatial information of the FOA signal. Our models significantly outperform existing single-channel methods with only spectral information, reducing estimation errors by more than half for all three acoustic parameters. Additionally, we introduce FOA-Conv3D, a novel back-end network for effectively utilising the SSCV feature with a 3D convolutional encoder. FOA-Conv3D outperforms the convolutional neural network (CNN) and recurrent convolutional neural network (CRNN) backends, achieving lower estimation errors and accounting for a higher proportion of variance (PoV) for all 3 acoustic parameters.

【2】 Reference Microphone Selection for the Weighted Prediction Error Algorithm using the Normalized L-p Norm
Link: https://arxiv.org/abs/2411.03168
Authors: Anselm Lohmann, Toon van Waterschoot, Joerg Bitzer, Simon Doclo
Abstract: Reverberation may severely degrade the quality of speech signals recorded using microphones in a room. For compact microphone arrays, the choice of the reference microphone for multi-microphone dereverberation typically does not have a large influence on the dereverberation performance. In contrast, when the microphones are spatially distributed, the choice of the reference microphone may significantly contribute to the dereverberation performance. In this paper, we propose to perform reference microphone selection for the weighted prediction error (WPE) dereverberation algorithm based on the normalized $\ell_p$-norm of the dereverberated output signal. Experimental results for different source positions in a reverberant laboratory show that the proposed method yields a better dereverberation performance than reference microphone selection based on the early-to-late reverberation ratio or signal power.
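
A rough sketch of the selection idea: score the dereverberated output obtained with each candidate reference microphone by a length-normalized ℓp measure and keep the best-scoring reference. The exact normalization, the value of p, and whether the score is maximized or minimized follow the paper; `wpe_dereverb` below is a hypothetical placeholder for any multi-channel WPE implementation.

```python
import numpy as np

def normalized_lp(x: np.ndarray, p: float = 0.5) -> float:
    """Length-normalized l_p measure of a signal: (mean |x|^p) ** (1/p)."""
    return float(np.mean(np.abs(x) ** p) ** (1.0 / p))

# Hypothetical selection loop (wpe_dereverb is a placeholder, not a real API):
# scores = [normalized_lp(wpe_dereverb(mics, ref=m)) for m in range(n_mics)]
# best_ref = int(np.argmax(scores))
```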

【3】 Noise-Robust Hearing Aid Voice Control
Link: https://arxiv.org/abs/2411.03150
Authors: Iván López-Espejo, Eros Roselló, Amin Edraki, Naomi Harte, Jesper Jensen
Note: Submitted to IEEE Signal Processing Letters
Abstract: Advancing the design of robust hearing aid (HA) voice control is crucial to increase the HA use rate among hard of hearing people as well as to improve HA users' experience. In this work, we contribute towards this goal by, first, presenting a novel HA speech dataset consisting of noisy own voice captured by 2 behind-the-ear (BTE) and 1 in-ear-canal (IEC) microphones. Second, we provide baseline HA voice control results from the evaluation of light, state-of-the-art keyword spotting models utilizing different combinations of HA microphone signals. Experimental results show the benefits of exploiting bandwidth-limited bone-conducted speech (BCS) from the IEC microphone to achieve noise-robust HA voice control. Furthermore, results also demonstrate that voice control performance can be boosted by assisting BCS by the broader-bandwidth BTE microphone signals. Aiming at setting a baseline upon which the scientific community can continue to progress, the HA noisy speech dataset has been made publicly available.

【4】 Unsupervised detection and classification of heartbeats using the dissimilarity matrix in PCG signals
Link: https://arxiv.org/abs/2411.03061
Authors: J. Torre-Cruz, D. Martinez-Munoz, N. Ruiz-Reyes, A.J. Munoz-Montoro, M. Puentes-Chiachio, F.J. Canadas-Quesada
Abstract: The proposed system consists of a two-stage cascade. The first stage performs a rough heartbeat detection while the second stage refines the previous one, improving the temporal localization and also classifying the heartbeats into types S1 and S2. The first contribution is a novel approach that combines the dissimilarity matrix with the frame-level spectral divergence to locate heartbeats using the repetitiveness shown by the heart sounds and the temporal relationships between the intervals defined by the events S1/S2 and non-S1/S2 (systole and diastole). The second contribution is a verification-correction-classification process based on a sliding window that allows the preservation of the temporal structure of the cardiac cycle in order to be applied in the heart sound classification. The proposed method has been assessed using the open access databases PASCAL, CirCor DigiScope Phonocardiogram and an additional sound mixing procedure considering both Additive White Gaussian Noise (AWGN) and different kinds of clinical ambient noises from a commercial database. The proposed method provides the best detection/classification performance in realistic scenarios where the presence of cardiac anomalies as well as different types of clinical environmental noises are active in the PCG signal. Of note, the promising modelling of the temporal structures of the heart provided by the dissimilarity matrix together with the frame-level spectral divergence, as well as the removal of a significant number of spurious heart events and recovery of missing heart events, both corrected by the proposed verification-correction-classification algorithm, suggest that our proposal is a successful tool to be applied in heart segmentation.
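
A generic frame-level spectral dissimilarity matrix of the kind the first stage relies on can be computed as follows; the window length and the cosine distance are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import stft

def dissimilarity_matrix(x: np.ndarray, fs: int, win_s: float = 0.02):
    """Frame-level spectral dissimilarity matrix of a PCG signal (a generic sketch).

    Repetitive events such as the S1/S2 heart sounds appear as periodic low-dissimilarity
    stripes; the paper combines this structure with a frame-level spectral divergence.
    """
    nper = int(win_s * fs)
    _, _, Z = stft(x, fs=fs, nperseg=nper, noverlap=nper // 2)
    S = np.abs(Z)                                            # (freq, frames) magnitude spectra
    S = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-12)
    return 1.0 - S.T @ S                                     # pairwise cosine dissimilarity

D = dissimilarity_matrix(np.random.randn(4000), fs=2000)     # 2 s of dummy PCG at 2 kHz
```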

【5】 Real-Time Scream Detection and Position Estimation for Worker Safety in Construction Sites
Link: https://arxiv.org/abs/2411.03016
Authors: Bikalpa Gautam, Anmol Guragain, Sarthak Giri
Note: 12 pages, 14 figures, 1 table, submitted to AIRISE conference
Abstract: The construction industry faces high risks due to frequent accidents, often leaving workers in perilous situations where rapid response is critical. Traditional safety monitoring methods, including wearable sensors and GPS, often fail under obstructive or indoor conditions. This research introduces a novel real-time scream detection and localization system tailored for construction sites, especially in low-resource environments. The system integrates Wav2Vec2 and Enhanced ConvNet models for accurate scream detection, coupled with the GCC-PHAT algorithm for robust time delay estimation under reverberant conditions, followed by a gradient descent-based approach for precise position estimation in noisy environments. Our approach combines these components to achieve high detection accuracy and rapid localization, thereby minimizing false alarms and optimizing emergency response. Preliminary results demonstrate that the system not only accurately detects distress calls amidst construction noise but also reliably identifies the caller's location. This solution represents a substantial improvement in worker safety, with the potential for widespread application across high-risk occupational environments. The scripts used for training, evaluation of scream detection, position estimation, and the integrated framework will be released at: https://github.com/Anmol2059/construction_safety.

【6】 EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
Link: https://arxiv.org/abs/2411.02625
Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Abstract: Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.

【7】 Estimating the Number and Locations of Boundaries in Reverberant Environments with Deep Learning
Link: https://arxiv.org/abs/2411.02609
Authors: Toros Arikan, Luca M. Chackalackal, Fatima Ahsan, Konrad Tittel, Andrew C. Singer, Gregory W. Wornell, Richard G. Baraniuk
Abstract: Underwater acoustic environment estimation is a challenging but important task for remote sensing scenarios. Current estimation methods require high signal strength and a solution to the fragile echo labeling problem to be effective. In previous publications, we proposed a general deep learning-based method for two-dimensional environment estimation which outperformed the state-of-the-art, both in simulation and in real-life experimental settings. A limitation of this method was that some prior information had to be provided by the user on the number and locations of the reflective boundaries, and that its neural networks had to be re-trained accordingly for different environments. Utilizing more advanced neural network and time delay estimation techniques, the proposed improved method no longer requires prior knowledge of the number of boundaries or their locations, and is able to estimate two-dimensional environments with one or two boundaries. Future work will extend the proposed method to more boundaries and larger-scale environments.

【8】 Optimal Transport Maps are Good Voice Converters
Link: https://arxiv.org/abs/2411.02402
Authors: Arip Asadulaev, Rostislav Korst, Vitalii Shutov, Alexander Korotin, Yaroslav Grebnyak, Vahe Egiazarian, Evgeny Burnaev
Abstract: Recently, neural network-based methods for computing optimal transport maps have been effectively applied to style transfer problems. However, the application of these methods to voice conversion is underexplored. In our paper, we fill this gap by investigating optimal transport as a framework for voice conversion. We present a variety of optimal transport algorithms designed for different data representations, such as mel-spectrograms and latent representations of self-supervised speech models. For the mel-spectrogram data representation, we achieve strong results in terms of Fréchet Audio Distance (FAD). This performance is consistent with our theoretical analysis, which suggests that our method provides an upper bound on the FAD between the target and generated distributions. Within the latent space of the WavLM encoder, we achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.

