This article is reprinted with permission from arXiv每日学术速递 (arXiv Daily Digest).
Title: Long-Form Speech Generation with Spoken Language Models
Link: https://arxiv.org/abs/2412.18603
Abstract: We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/
Link: https://arxiv.org/abs/2412.18495
Note: Accepted at TACL
Abstract: Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker's speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends; and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.
Title: Detecting and Predicting Parkinson's Disease Progression from Speech Signal Features Using a Multilayer Perceptron and LSTM
Link: https://arxiv.org/abs/2412.18248
Abstract: Accurate diagnosis of Parkinson's disease, especially in its early stages, can be a challenging task. The application of machine learning techniques helps improve the diagnostic accuracy of Parkinson's disease detection, but only a few studies have presented work on predicting disease progression. In this work, a Long Short-Term Memory (LSTM) network was trained on diagnostic features extracted from the speech signals of Parkinson's patients to predict disease progression, while a Multilayer Perceptron (MLP) was trained on the same diagnostic features to detect the disease. Diagnostic features selected using two well-known feature selection methods, Relief-F and Sequential Forward Selection, and applied to the LSTM and MLP have been shown to accurately predict disease progression to stages 2 and 3 and to detect its presence, respectively.
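As a rough, hypothetical sketch of the detection half of this pipeline (not the authors' code): sequential forward selection wrapped around an MLP classifier, with synthetic data standing in for the speech-derived diagnostic features; the LSTM progression model and Relief-F selection are omitted.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for speech-derived diagnostic features and PD / healthy labels.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Sequential Forward Selection wrapped around the detector itself.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
sfs = SequentialFeatureSelector(mlp, n_features_to_select=8, direction="forward").fit(X_tr, y_tr)

# Train the MLP detector on the selected feature subset and report held-out accuracy.
mlp.fit(sfs.transform(X_tr), y_tr)
print("detection accuracy:", mlp.score(sfs.transform(X_te), y_te))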
Title: U-Mamba-Net: An Efficient Mamba-based U-Net-Style Network for Noisy and Reverberant Speech Separation
Link: https://arxiv.org/abs/2412.18217
Abstract: The topic of speech separation involves separating mixed speech with multiple overlapping speakers into several streams, with each stream containing speech from only one speaker. Many highly effective models have emerged and proliferated rapidly over time. However, the size and computational load of these models have also increased accordingly. This is a disaster for the community, as researchers need more time and computational resources to reproduce and compare existing models. In this paper, we propose U-Mamba-Net: a lightweight Mamba-based U-style model for speech separation in complex environments. Mamba is a state space sequence model that incorporates feature selection capabilities. A U-style network is a fully convolutional neural network whose symmetric contracting and expansive paths are able to learn multi-resolution features. In our work, Mamba serves as a feature filter, alternating with U-Net. We test the proposed model on Libri2mix. The results show that U-Mamba-Net achieves improved performance with quite low computational cost.
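A schematic of the "Mamba as a feature filter alternating with U-Net" idea, assuming the mamba_ssm package (whose Mamba block requires a CUDA GPU at forward time); the channel sizes and the convolutional block are illustrative guesses, not the paper's architecture.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumption: the mamba-ssm package is installed

class MambaFilteredBlock(nn.Module):
    """One stage: a Mamba layer filters the feature sequence, then a conv block refines it."""
    def __init__(self, channels: int):
        super().__init__()
        self.mamba = Mamba(d_model=channels, d_state=16, d_conv=4, expand=2)
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):            # x: (batch, time, channels)
        x = self.mamba(x)            # sequence-level feature selection
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local refinement, as one U-Net stage might
        return x

In the full model, several such stages would sit on the contracting and expansive paths of the U-shaped network, with down- and up-sampling between them.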
Title: Explaining Speaker and Spoof Embeddings via Probing
Link: https://arxiv.org/abs/2412.18191
Note: To appear in IEEE ICASSP 2025
Abstract: This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.
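The probing recipe described above is simple to reproduce in outline: fit a small classifier that predicts a speaker attribute from the embeddings and read recoverability off held-out accuracy. A minimal sketch with random stand-in data (the real inputs would be speaker or spoof embeddings plus ASVspoof 2019 LA metadata):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Random stand-ins: 'emb' plays the role of speaker or spoof embeddings,
# 'gender' the metadata-based target attribute.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 192))
gender = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(emb, gender, test_size=0.2, random_state=0)
probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_tr, y_tr)

# Held-out accuracy well above chance would indicate the attribute is preserved
# in the embedding; with these random stand-ins it stays near 0.5.
print("probe accuracy:", probe.score(X_te, y_te))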
Title: Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
Link: https://arxiv.org/abs/2412.18157
Abstract: The video-to-audio (V2A) generation task has drawn attention in the field of multimedia due to its practicality in producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate sound events and temporal occurrence. Recent studies on synthesizing immersive and synchronized audio face challenges on videos with moving visual presence: the temporal condition is not accurate enough, while the low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model that takes semantic guidance from the textual label across the generation process to enhance both semantic and temporal alignment in audio. Two adapters are trained to leverage pre-trained text-to-audio generation models. A frame adapter integrates high-resolution frame-wise video features, while a temporal adapter integrates temporal conditions obtained from similarities of visual frames and textual labels. The incorporation of semantic guidance from textual labels achieves precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley performs better than existing models on both continuous sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws.
Title: Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
Link: https://arxiv.org/abs/2412.18061
Abstract: Turn-taking prediction is the task of anticipating when the speaker in a conversation will yield their turn to another speaker to begin speaking. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying TRPs in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.
Title: Are Audio DeepFake Detection Models Multilingual?
Link: https://arxiv.org/abs/2412.17924
Note: Keywords: Audio DeepFakes, DeepFake detection, multilingual audio DeepFakes
Abstract: Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as intra-linguistic (same-language) and cross-linguistic adaptation approaches. Our results indicate considerable variations in detection efficacy, highlighting the difficulties of multilingual settings. We show that limiting the dataset to English negatively impacts the efficacy, while stressing the importance of the data in the target language.
Title: A Multimodal Emotion Recognition System: Integrating Facial Expressions, Body Movement, Speech, and Spoken Language
Link: https://arxiv.org/abs/2412.17907
Note: 10 pages, 6 figures, 3 tables
Abstract: Traditional psychological evaluations rely heavily on human observation and interpretation, which are prone to subjectivity, bias, fatigue, and inconsistency. To address these limitations, this work presents a multimodal emotion recognition system that provides a standardised, objective, and data-driven tool to support evaluators, such as psychologists, psychiatrists, and clinicians. The system integrates recognition of facial expressions, speech, spoken language, and body movement analysis to capture subtle emotional cues that are often overlooked in human evaluations. By combining these modalities, the system provides a more robust and comprehensive emotional state assessment, reducing the risk of mis- and overdiagnosis. Preliminary testing in a simulated real-world condition demonstrates the system's potential to provide reliable emotional insights and improve diagnostic accuracy. This work highlights the promise of automated multimodal analysis as a valuable complement to traditional psychological evaluation practices, with applications in clinical and therapeutic settings.
Title: Neural Directed Speech Enhancement with a Dual Microphone Array in High-Noise Environments
Link: https://arxiv.org/abs/2412.18141
Note: Accepted by ICASSP 2025
Abstract: In multi-speaker scenarios, leveraging spatial features is essential for enhancing target speech. However, with limited microphone arrays, developing a compact multi-channel speech enhancement system remains challenging, especially in extremely low signal-to-noise ratio (SNR) conditions. To tackle this issue, we propose a triple-steering spatial selection method, a flexible framework that uses three steering vectors to guide enhancement and determine the enhancement range. Specifically, we introduce a causal-directed U-Net (CDUNet) model, which takes raw multi-channel speech and the desired enhancement width as inputs. This enables dynamic adjustment of steering vectors based on the target direction and fine-tuning of the enhancement region according to the angular separation between the target and interference signals. Our model, with only a dual microphone array, excels in both speech quality and downstream task performance. It operates in real time with minimal parameters, making it ideal for low-latency, on-device streaming applications.
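For intuition on the steering vectors that guide the enhancement, a far-field steering vector for a two-microphone array at a given direction of arrival can be computed as below. This is the generic textbook formulation, with the microphone spacing and frequency grid chosen arbitrarily, not the paper's exact parameterization.

import numpy as np

def steering_vector(freqs_hz, doa_deg, spacing_m=0.05, c=343.0):
    """Far-field steering vector of a 2-mic array; DOA measured from the array axis."""
    tau = spacing_m * np.cos(np.deg2rad(doa_deg)) / c       # inter-mic time difference of arrival (s)
    delays = np.array([0.0, tau])                           # per-microphone delays
    return np.exp(-2j * np.pi * np.outer(freqs_hz, delays))  # shape: (n_freqs, n_mics)

freqs = np.linspace(0.0, 8000.0, 257)            # e.g. STFT bin centres at 16 kHz
a_target = steering_vector(freqs, 60.0)          # steer towards the target direction
a_lo = steering_vector(freqs, 45.0)              # edges of the desired enhancement width
a_hi = steering_vector(freqs, 75.0)

Presumably the three vectors of the triple-steering method correspond to the target direction plus the two edges of the desired width, which together define the region to enhance.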
Title: SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training
Link: https://arxiv.org/abs/2412.18107
Note: Extended version of paper accepted to AAAI 2025
Abstract: Lyric-to-melody generation aims to automatically create melodies based on given lyrics, requiring the capture of complex and subtle correlations between them. However, previous works usually suffer from two main challenges: 1) lyric-melody alignment modeling, which is often simplified to one-syllable/word-to-one-note alignment, while other approaches suffer from low alignment accuracy; 2) lyric-melody harmony modeling, which usually relies heavily on intermediates or strict rules, limiting the model's capabilities and generative diversity. In this paper, we propose SongGLM, a lyric-to-melody generation system that leverages 2D alignment encoding and multi-task pre-training based on the General Language Model (GLM) to guarantee the alignment and harmony between lyrics and melodies. Specifically, 1) we introduce a unified symbolic song representation for lyrics and melodies with word-level and phrase-level (2D) alignment encoding to capture the lyric-melody alignment; 2) we design a multi-task pre-training framework with hierarchical blank infilling objectives (n-gram, phrase, and long span), and incorporate lyric-melody relationships into the extraction of harmonized n-grams to ensure the lyric-melody harmony. We also construct a large-scale lyric-melody paired dataset comprising over 200,000 English song pieces for pre-training and fine-tuning. The objective and subjective results indicate that SongGLM can generate melodies from lyrics with significant improvements in both alignment and harmony, outperforming all previous baseline methods.
Title: Noisereduce: Domain-General Noise Reduction for Time-Series Signals
Link: https://arxiv.org/abs/2412.17851
Note: Python library: this https URL or `pip install noisereduce`
Abstract: Extracting signals from noisy backgrounds is a fundamental problem in signal processing across a variety of domains. In this paper, we introduce Noisereduce, an algorithm for minimizing noise across a variety of domains, including speech, bioacoustics, neurophysiology, and seismology. Noisereduce uses spectral gating to estimate a frequency-domain mask that effectively separates signals from noise. It is fast, lightweight, requires no training data, and handles both stationary and non-stationary noise, making it both a versatile tool and a convenient baseline for comparison with domain-specific applications. We provide a detailed overview of Noisereduce and evaluate its performance on a variety of time-domain signals.
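Since the library is published on PyPI (see the note above), a typical call looks like the following, assuming the current noisereduce 2.x API; the WAV file name is a placeholder.

import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("noisy_recording.wav")             # placeholder input file
# Spectral gating; stationary=False enables the non-stationary noise estimate.
reduced = nr.reduce_noise(y=data, sr=rate, stationary=False)
wavfile.write("denoised.wav", rate, reduced.astype(data.dtype))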
Title: A Zero-Shot Physics-Informed Dictionary Learning Approach for Sound Field Reconstruction
Link: https://arxiv.org/abs/2412.18348
Note: Accepted for publication at ICASSP 2025
Abstract: Sound field reconstruction aims to estimate pressure fields in areas lacking direct measurements. Existing techniques often rely on strong assumptions or face challenges related to data availability or the explicit modeling of physical properties. To bridge these gaps, this study introduces a zero-shot, physics-informed dictionary learning approach to perform sound field reconstruction. Our method relies only on a few sparse measurements to learn a dictionary, without the need for additional training data. Moreover, by enforcing the Helmholtz equation during the optimization process, the proposed approach ensures that the reconstructed sound field is represented as a linear combination of a few physically meaningful atoms. Evaluations on real-world data show that our approach achieves comparable performance to state-of-the-art dictionary learning techniques, with the advantage of requiring only a few observations of the sound field and no training on a dataset.
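One way to read the "physically meaningful atoms" constraint is that every atom satisfies the homogeneous Helmholtz equation, as plane waves do, and the pressures at the sparse microphones are fit by a sparse combination of such atoms. The sketch below illustrates that reading with a fixed 2-D plane-wave dictionary and a Lasso fit on toy data; the paper's method instead learns the dictionary from the measurements, so treat this purely as an assumed simplification.

import numpy as np
from sklearn.linear_model import Lasso

f, c = 500.0, 343.0                               # frequency (Hz) and speed of sound (m/s)
k = 2 * np.pi * f / c                             # wavenumber

# Plane-wave atoms: each direction d gives exp(j k d . r), an exact Helmholtz solution.
thetas = np.linspace(0.0, 2 * np.pi, 128, endpoint=False)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)         # (n_atoms, 2)

rng = np.random.default_rng(0)
mic_pos = rng.uniform(size=(10, 2))                               # sparse measurement points (toy)
p_mic = rng.standard_normal(10) + 1j * rng.standard_normal(10)    # stand-in measured pressures

D_mic = np.exp(1j * k * mic_pos @ dirs.T)                         # dictionary evaluated at the mics

# Sparse coefficients from the few measurements (real/imag stacked for the real-valued solver).
A = np.vstack([D_mic.real, D_mic.imag])
b = np.concatenate([p_mic.real, p_mic.imag])
coef = Lasso(alpha=1e-3, max_iter=10000).fit(A, b).coef_

grid = np.stack(np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20)), -1).reshape(-1, 2)
p_rec = np.exp(1j * k * grid @ dirs.T) @ coef                     # reconstructed pressure field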
Title: Text-Aware Adapter for Few-Shot Keyword Spotting
Link: https://arxiv.org/abs/2412.18142
Note: 5 pages, 3 figures, Accepted by ICASSP 2025
Abstract: Recent advances in flexible keyword spotting (KWS) with text enrollment allow users to personalize keywords without uttering them during enrollment. However, there is still room for improvement in target keyword performance. In this work, we propose a novel few-shot transfer learning method, called the text-aware adapter (TA-adapter), designed to enhance a pre-trained flexible KWS model for specific keywords with limited speech samples. To adapt the acoustic encoder, we leverage a jointly pre-trained text encoder to generate a text embedding that acts as a representative vector for the keyword. By fine-tuning only a small portion of the network while keeping the core components' weights intact, the TA-adapter proves highly efficient for few-shot KWS, enabling a seamless return to the original pre-trained model. In our experiments, the TA-adapter demonstrated significant performance improvements across 35 distinct keywords from the Google Speech Commands V2 dataset, with only a 0.14% increase in the total number of parameters.
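The "fine-tune only a small portion while keeping the core weights intact" recipe looks roughly like the following in PyTorch; the module shapes, the bottleneck size, and the GRU stand-in for the pre-trained acoustic encoder are all assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class TextAwareAdapter(nn.Module):
    """Bottleneck adapter conditioned on a text embedding of the enrolled keyword."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(2 * dim, bottleneck)     # concat(acoustic, text) -> bottleneck
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, acoustic_feat, text_emb):        # (B, T, dim), (B, 1, dim)
        h = torch.cat([acoustic_feat, text_emb.expand_as(acoustic_feat)], dim=-1)
        # Residual form: dropping the adapter restores the original pre-trained model.
        return acoustic_feat + self.up(torch.relu(self.down(h)))

encoder = nn.GRU(40, 256, batch_first=True)            # stand-in for the frozen acoustic encoder
for p in encoder.parameters():
    p.requires_grad = False                            # core weights stay intact

adapter = TextAwareAdapter(dim=256)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)   # only the adapter is trained

feats, _ = encoder(torch.randn(4, 100, 40))            # a few-shot batch of keyword utterances
text_emb = torch.randn(4, 1, 256)                      # keyword embedding from the text encoder
adapted = adapter(feats, text_emb)                     # fed to the (frozen) KWS head downstream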
Title: Zero-Resource Speech Translation and Recognition with LLMs
Link: https://arxiv.org/abs/2412.18566
Note: ICASSP 2025, 5 pages, 2 figures, 2 tables
Abstract: Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable of achieving BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
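The lightweight adaptation module is described only as mapping audio representations into the LLM's token-embedding space; a generic sketch of such a projector follows, where the dimensions, downsampling factor, and module shape are assumptions rather than the paper's design.

import torch
import torch.nn as nn

class AudioToLLMProjector(nn.Module):
    """Downsamples speech-encoder frames and projects them into the LLM token-embedding space."""
    def __init__(self, audio_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.downsample = nn.Conv1d(audio_dim, audio_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats):                    # (batch, frames, audio_dim)
        x = self.downsample(audio_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                            # (batch, frames // stride, llm_dim)

projector = AudioToLLMProjector()
audio_tokens = projector(torch.randn(1, 128, 1024))   # would be prepended to the text token embeddings
print(audio_tokens.shape)                              # torch.Size([1, 32, 4096])

Per the abstract, training such a bridge on languages that do have paired audio-text data is what allows the system to transfer to languages it has never seen paired data for.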