Speech/Audio Processing Academic Digest [1.6]

Digest   2025-01-06 18:01   Hebei

Today's collection: 10 papers in cs.SD (speech) and 10 in eess.AS (audio processing).

Reposted with permission from arXiv Daily Academic Digest (arXiv每日学术速递)

WeChat official account: arXiv_Daily

cs.SD Speech

【1】 VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Link: https://arxiv.org/abs/2501.01957
Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Comments: this https URL
Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, making near real-time vision and speech interaction.
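The abstract's key idea is a staged curriculum in which visual and speech understanding are added progressively. For a concrete picture, here is a minimal sketch of the generic staged-freezing pattern such training often uses; the module names (vision_encoder, speech_encoder, adapters, llm) and the stage split are illustrative assumptions, not the actual VITA-1.5 recipe.

```python
# Illustrative sketch of staged multimodal training: open up modules stage by stage.
# Module names and the stage boundaries are assumptions for illustration only.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    """Stage 1: train the vision adapter; stage 2: add the speech adapter; stage 3: unfreeze the LLM."""
    set_trainable(model.llm, False)
    set_trainable(model.vision_encoder, False)
    set_trainable(model.speech_encoder, False)
    if stage >= 1:
        set_trainable(model.vision_adapter, True)   # align vision features to the LLM
    if stage >= 2:
        set_trainable(model.speech_adapter, True)   # then align speech features
    if stage >= 3:
        set_trainable(model.llm, True)              # finally fine-tune end to end
```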

【2】 Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) for Passive Sonar Classification
Link: https://arxiv.org/abs/2501.01921
Authors: Jarin Ritu, Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples
Comments: 13 pages, 6 figures, submitted for review
Abstract: Knowledge distillation has been successfully applied to various audio tasks, but its potential in underwater passive sonar target classification remains relatively unexplored. Existing methods often focus on high-level contextual information while overlooking essential low-level audio texture features needed to capture local patterns in sonar data. To address this gap, the Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework is proposed for passive sonar target classification. SSATKD combines high-level contextual information with low-level audio textures by utilizing an Edge Detection Module for structural texture extraction and a Statistical Knowledge Extractor Module to capture signal variability and distribution. Experimental results confirm that SSATKD improves classification accuracy while optimizing memory and computational resources, making it well-suited for resource-constrained environments.
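To make the idea of structural plus statistical texture distillation concrete, here is a minimal sketch of a combined loss: standard logit distillation plus an edge-based structural term and a mean/variance statistical term on intermediate features. The Sobel operator, the chosen statistics, and the loss weights are assumptions for illustration, not the exact modules defined in the paper.

```python
# Sketch of a distillation loss with structural (edge) and statistical (mean/var) texture terms.
# Sobel edges, the statistics used, and the weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def sobel_edges(spec: torch.Tensor) -> torch.Tensor:
    """Edge map of a (batch, 1, freq, time) spectrogram-like feature map."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(spec, kx.to(spec), padding=1)
    gy = F.conv2d(spec, ky.to(spec), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def texture_kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                    labels, T=4.0, w_kd=1.0, w_struct=0.5, w_stat=0.5):
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T              # classic logit distillation
    struct = F.l1_loss(sobel_edges(student_feat), sobel_edges(teacher_feat))
    stat = F.mse_loss(student_feat.mean(dim=(2, 3)), teacher_feat.mean(dim=(2, 3))) \
         + F.mse_loss(student_feat.var(dim=(2, 3)), teacher_feat.var(dim=(2, 3)))
    return ce + w_kd * kd + w_struct * struct + w_stat * stat
```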

【3】 CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation
Link: https://arxiv.org/abs/2501.01861
Authors: Ziqi Liang, Xulong Zhang, Chang Liu, Xiaoyang Qu, Weifeng Zhao, Jianzong Wang
Comments: Accepted by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2025)
Abstract: Voice Conversion (VC) aims to convert the style of a source speaker, such as timbre and pitch, to the style of any target speaker while preserving the linguistic content. However, the ground truth of the converted speech does not exist in a non-parallel VC scenario, which induces the train-inference mismatch problem. Moreover, existing methods still have an inaccurate pitch and low speaker adaptation quality, there is a significant disparity in pitch between the source and target speaker style domains. As a result, the models tend to generate speech with hoarseness, posing challenges in achieving high-quality voice conversion. In this study, we propose CycleFlow, a novel VC approach that leverages cycle consistency in conditional flow matching (CFM) for speaker timbre adaptation training on non-parallel data. Furthermore, we design a Dual-CFM based on VoiceCFM and PitchCFM to generate speech and improve speaker pitch adaptation quality. Experiments show that our method can significantly improve speaker similarity, generating natural and higher-quality speech.
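The method builds on conditional flow matching (CFM). As background, here is the generic CFM training objective with a straight-line interpolation path and a constant velocity target; the velocity-network signature and speaker-embedding conditioning are assumptions, and CycleFlow's cycle-consistency and Dual-CFM components are not reproduced.

```python
# Generic conditional flow matching (CFM) training step with a linear interpolation path.
# The velocity_net signature and the conditioning input are illustrative assumptions.
import torch

def cfm_loss(velocity_net, x1, cond):
    """x1: target features (batch, ...); cond: conditioning, e.g. a speaker embedding."""
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                            # point on the straight path
    v_target = x1 - x0                                      # constant velocity of that path
    v_pred = velocity_net(xt, t.flatten(), cond)            # predict the velocity field
    return ((v_pred - v_target) ** 2).mean()
```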

【4】 MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling
Link: https://arxiv.org/abs/2501.01757
Authors: Simon Rouard, Robin San Roman, Yossi Adi, Axel Roebel
Comments: 5 pages, 3 figures, accepted to ICASSP 2025
Abstract: While most music generation models generate a mixture of stems (in mono or stereo), we propose to train a multi-stem generative model with 3 stems (bass, drums and other) that learn the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. Then, we leverage recent improvements in the task of music source separation to train a multi-stream text-to-music language model on a large dataset. Finally, thanks to a particular conditioning method, our model is able to edit bass, drums or other stems on existing or generated songs as well as doing iterative composition (e.g. generating bass on top of existing drums). This gives more flexibility in music generation algorithms and it is to the best of our knowledge the first open-source multi-stem autoregressive music generation model that can perform good quality generation and coherent source editing. Code and model weights will be released and samples are available on https://simonrouard.github.io/musicgenstem/.
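To illustrate "parallel streams of tokens", here is a minimal sketch of one causal language model over three stem token streams: per-stem embeddings are summed at the input and per-stem heads predict the next token of each stream. The embedding-sum fusion, the per-stem heads, and the omission of positional encodings and audio tokenizers are simplifications, not the actual MusicGen-Stem interleaving or conditioning scheme.

```python
# Sketch of a multi-stream autoregressive LM over three stem token streams.
# Input fusion by embedding sum and per-stem output heads are illustrative assumptions;
# positional encodings and the per-stem audio tokenizers are omitted for brevity.
import torch
import torch.nn as nn

class MultiStemLM(nn.Module):
    def __init__(self, vocab=2048, dim=512, stems=("bass", "drums", "other")):
        super().__init__()
        self.stems = stems
        self.embed = nn.ModuleDict({s: nn.Embedding(vocab, dim) for s in stems})
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.heads = nn.ModuleDict({s: nn.Linear(dim, vocab) for s in stems})

    def forward(self, tokens):
        # tokens: dict stem -> (batch, time) integer token ids, one parallel stream per stem
        x = sum(self.embed[s](tokens[s]) for s in self.stems)          # fuse the streams
        T = x.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.backbone(x, mask=mask)                                # causal self-attention
        return {s: self.heads[s](h) for s in self.stems}               # next-token logits per stem
```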

【5】 Controlling your Attributes in Voice
Link: https://arxiv.org/abs/2501.01674
Authors: Xuyuan Li, Zengqiang Shang, Li Wang, Pengyuan Zhang
Comments: 5 pages, 3 figures
Abstract: Attribute control in generative tasks aims to modify personal attributes, such as age and gender while preserving the identity information in the source sample. Although significant progress has been made in controlling facial attributes in image generation, similar approaches for speech generation remain largely unexplored. This letter proposes a novel method for controlling speaker attributes in speech without parallel data. Our approach consists of two main components: a GAN-based speaker representation variational autoencoder that extracts speaker identity and attributes from speaker vector, and a two-stage voice conversion model that captures the natural expression of speaker attributes in speech. Experimental results show that our proposed method not only achieves attribute control at the speaker representation level but also enables manipulation of the speaker age and gender at the speech level while preserving speech quality and speaker identity.
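The core idea is factoring a speaker vector into an identity latent and controllable attribute latents. Here is a minimal VAE-style sketch of such a split; all dimensions, the attribute-swapping interface, and the plain MLP decoder are assumptions, and the paper's GAN training and two-stage voice conversion model are omitted.

```python
# Minimal sketch: factor a speaker embedding into identity and attribute latents.
# Dimensions and the editing interface are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerFactorVAE(nn.Module):
    def __init__(self, spk_dim=256, id_dim=64, attr_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(spk_dim, 256), nn.ReLU())
        self.to_id = nn.Linear(256, 2 * id_dim)        # mean and log-variance
        self.to_attr = nn.Linear(256, 2 * attr_dim)
        self.dec = nn.Sequential(nn.Linear(id_dim + attr_dim, 256), nn.ReLU(),
                                 nn.Linear(256, spk_dim))

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, spk_vec, new_attr=None):
        h = self.enc(spk_vec)
        z_id = self.sample(self.to_id(h))
        z_attr = self.sample(self.to_attr(h))
        if new_attr is not None:        # attribute editing: swap in a target attribute code
            z_attr = new_attr
        return self.dec(torch.cat([z_id, z_attr], dim=-1))
```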

【6】 Improved Feature Extraction Network for Neuro-Oriented Target Speaker Extraction
Link: https://arxiv.org/abs/2501.01673
Authors: Cunhang Fan, Youdian Gao, Zexu Pan, Jingjing Zhang, Hongyu Zhang, Jie Zhang, Zhao Lv
Comments: accepted by ICASSP2025
Abstract: The recent rapid development of auditory attention decoding (AAD) offers the possibility of using electroencephalography (EEG) as auxiliary information for target speaker extraction. However, effectively modeling long sequences of speech and resolving the identity of the target speaker from EEG signals remains a major challenge. In this paper, an improved feature extraction network (IFENet) is proposed for neuro-oriented target speaker extraction, which mainly consists of a speech encoder with dual-path Mamba and an EEG encoder with Kolmogorov-Arnold Networks (KAN). We propose SpeechBiMamba, which makes use of dual-path Mamba in modeling local and global speech sequences to extract speech features. In addition, we propose EEGKAN to effectively extract EEG features that are closely related to the auditory stimuli and locate the target speaker through the subject's attention information. Experiments on the KUL and AVED datasets show that IFENet outperforms the state-of-the-art model, achieving 36% and 29% relative improvements in terms of scale-invariant signal-to-distortion ratio (SI-SDR) under an open evaluation condition.
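The reported gains are in scale-invariant signal-to-distortion ratio (SI-SDR). For reference, here is the standard SI-SDR computation (zero-mean signals, projection of the estimate onto the reference); this is the common definition of the metric, not code from the paper.

```python
# Scale-invariant signal-to-distortion ratio (SI-SDR), the metric reported above.
# Standard definition with zero-mean normalization; inputs are 1-D numpy arrays.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref   # projection onto the reference
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))
```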

【7】 An efficient light-weighted signal reconstruction method consists of Fast Fourier Transform and Convolutional-based Autoencoder
Link: https://arxiv.org/abs/2501.01650
Authors: Pu-Yun Kow, Pu-Zhao Kow
Comments: 13 pages
Abstract: The main theme of this paper is to reconstruct audio signal from interrupted measurements. We present a light-weighted model only consisting discrete Fourier transform and Convolutional-based Autoencoder model (ConvAE), called the FFT-ConvAE model for the Helsinki Speech Challenge 2024. The FFT-ConvAE model is light-weighted (in terms of real-time factor) and efficient (in terms of character error rate), which was verified by the organizers. Furthermore, the FFT-ConvAE is a general-purpose model capable of handling all tasks with a unified configuration.
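The pipeline amounts to Fourier-domain features passed through a small convolutional autoencoder and transformed back. Here is a minimal sketch of that shape of pipeline; the frame sizes, channel widths, and magnitude-only processing with the original phase are assumptions, not the exact challenge submission.

```python
# Minimal sketch of an FFT + convolutional autoencoder reconstruction pipeline.
# Frame/channel sizes and magnitude-only processing are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self, bins=257):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(bins, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=5, padding=2), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, bins, kernel_size=5, padding=2))

    def forward(self, mag):                      # mag: (batch, bins, frames)
        return self.decoder(self.encoder(mag))

def reconstruct(wave: torch.Tensor, model: ConvAE, n_fft=512, hop=128) -> torch.Tensor:
    """STFT -> clean up the magnitude with the autoencoder -> inverse STFT with the original phase."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window, return_complex=True)
    mag, phase = spec.abs(), torch.angle(spec)
    mag_hat = model(mag.unsqueeze(0)).squeeze(0).clamp(min=0.0)
    return torch.istft(torch.polar(mag_hat, phase), n_fft, hop_length=hop, window=window)
```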

【8】 Whisphone: Whispering Input Earbuds
Link: https://arxiv.org/abs/2501.01636
Authors: Masaaki Fukumoto
Comments: 10 pages, 5 figures. This is the English version of the paper: FUKUMOTO Masaaki. Whisphone: Whispering Input Earbuds. In Proceedings of WISS2024. pp.30-37 (2024) (in Japanese). Original paper (in Japanese): this https URL . Demo video: this https URL
Abstract: Whisphone is a novel earbud device designed for speech input via whispering. Utilizing canal-type earbuds with a unique microphone placement at the tip of the earplug, it effectively captures whispered voices radiated in the ear canal through bone conduction. This design can boost whispered voice volume with ear canal occlusion effect while simultaneously blocking external noise by sealing the ear hole. By incorporating Active Noise Canceling (ANC), Whisphone can effectively detect subtle whispers, even in noisy environments of up to 80dB(A). Its compact and comfortable design ensures discreet wearability, allowing users to interact with AI assistants hands-free without disturbing others in various daily situations such as offices, homes, or urban public spaces.

【9】 Disentangling Hierarchical Features for Anomalous Sound Detection Under Domain Shift
Link: https://arxiv.org/abs/2501.01604
Authors: Jian Guan, Jiantong Tian, Qiaoxi Zhu, Feiyang Xiao, Hejing Zhang, Xubo Liu
Comments: Accepted by ICASSP 2025
Abstract: Anomalous sound detection (ASD) encounters difficulties with domain shift, where the sounds of machines in target domains differ significantly from those in source domains due to varying operating conditions. Existing methods typically employ domain classifiers to enhance detection performance, but they often overlook the influence of domain-unrelated information. This oversight can hinder the model's ability to clearly distinguish between domains, thereby weakening its capacity to differentiate normal from abnormal sounds. In this paper, we propose a Gradient Reversal-based Hierarchical feature Disentanglement (GRHD) method to address the above challenge. GRHD uses gradient reversal to separate domain-related features from domain-unrelated ones, resulting in more robust feature representations. Additionally, the method employs a hierarchical structure to guide the learning of fine-grained, domain-specific features by leveraging available metadata, such as section IDs and machine sound attributes. Experimental results on the DCASE 2022 Challenge Task 2 dataset demonstrate that the proposed method significantly improves ASD performance under domain shift.
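The mechanism named in the title, gradient reversal, is the standard layer from domain-adversarial training: identity in the forward pass, negated and scaled gradient in the backward pass. Below is the usual PyTorch implementation of that layer; how GRHD wires it into its hierarchical disentanglement is not shown here.

```python
# Standard gradient reversal layer (identity forward, negated gradient backward).
# Typically placed between a shared encoder and a domain classifier so that the
# encoder learns domain-invariant features while the classifier learns to separate domains.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # reverse and scale the gradient; no grad for lambd

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)
```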

【10】 Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Link: https://arxiv.org/abs/2501.01518
Authors: Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
Comments: None
Abstract: The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and, (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
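Contribution (i) is a Transformer that fuses modalities for waveform-domain separation. A common way to realize such fusion is cross-attention, where audio frames attend to conditioning tokens (lip-movement or text embeddings); the sketch below shows one such block, with dimensions and depth as assumptions rather than the architecture used in the paper.

```python
# Minimal cross-attention fusion block: audio frames attend to conditioning tokens.
# Dimensions and the single-block depth are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_feat, cond_feat):
        """audio_feat: (batch, frames, dim); cond_feat: (batch, tokens, dim)."""
        fused, _ = self.attn(query=audio_feat, key=cond_feat, value=cond_feat)
        x = self.norm1(audio_feat + fused)          # residual + norm after cross-attention
        return self.norm2(x + self.ffn(x))          # residual + norm after the feed-forward block
```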

eess.AS Audio Processing

All ten eess.AS papers today are cross-listings of the cs.SD papers above (abstracts in the cs.SD section):

【1】 Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation (https://arxiv.org/abs/2501.01518), see cs.SD 【10】
【2】 VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction (https://arxiv.org/abs/2501.01957), see cs.SD 【1】
【3】 Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) for Passive Sonar Classification (https://arxiv.org/abs/2501.01921), see cs.SD 【2】
【4】 CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation (https://arxiv.org/abs/2501.01861), see cs.SD 【3】
【5】 MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling (https://arxiv.org/abs/2501.01757), see cs.SD 【4】
【6】 Controlling your Attributes in Voice (https://arxiv.org/abs/2501.01674), see cs.SD 【5】
【7】 Improved Feature Extraction Network for Neuro-Oriented Target Speaker Extraction (https://arxiv.org/abs/2501.01673), see cs.SD 【6】
【8】 An efficient light-weighted signal reconstruction method consists of Fast Fourier Transform and Convolutional-based Autoencoder (https://arxiv.org/abs/2501.01650), see cs.SD 【7】
【9】 Whisphone: Whispering Input Earbuds (https://arxiv.org/abs/2501.01636), see cs.SD 【8】
【10】 Disentangling Hierarchical Features for Anomalous Sound Detection Under Domain Shift (https://arxiv.org/abs/2501.01604), see cs.SD 【9】


SpeechHome (语音之家): a community for AI speech developers