本文经arXiv每日学术速递授权转载
【1】 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
标题:Freeze-Omni:具有Frozen LLM的智能、低延迟的语音到语音对话模型
链接:https://arxiv.org/abs/2411.00774
作者:Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, Long Ma
备注:Project Page: this https URL
摘要:大语言模型的快速发展带来了许多新的智能应用,特别是GPT-4o中出色的多模态人机交互,给用户带来了令人印象深刻的体验。在此背景下,近年来研究者们提出了许多能够实现语音到语音对话的多模态LLM。在本文中,我们提出了一种名为Freeze-Omni的语音-文本多模态LLM架构。我们的主要贡献是:语音输入和输出模态可以连接到LLM,同时在整个训练过程中保持LLM冻结。我们为语音输入和输出的建模分别设计了三阶段训练策略,使Freeze-Omni仅使用文本-语音配对数据(如ASR和TTS数据)和60,000条多轮文本问答数据,就能在8个GPU上获得语音到语音对话能力。此外,我们能够有效地保证Freeze-Omni在语音模态下的智能水平与其骨干LLM在文本模态下的智能水平相当,同时口语回复的端到端延迟保持在较低水平。我们还设计了一种通过多任务训练实现双工对话能力的方法,使Freeze-Omni与用户之间的对话风格更加自然。Freeze-Omni主要为研究者提供了在LLM冻结条件下构建多模态LLM的可能性,避免了在数据和训练资源较少时因LLM灾难性遗忘而带来的各种影响。
摘要:The rapid development of large language models has brought many new smart applications, especially the excellent multimodal human-computer interaction in GPT-4o has brought impressive experience to users. In this background, researchers have proposed many multimodal LLMs that can achieve speech-to-speech dialogue recently. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is the speech input and output modalities can connected to the LLM while keeping the LLM frozen throughout the training process. We designed 3-stage training strategies both for the modeling of speech input and output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs. Moreover, we can effectively ensure that the intelligence of the Freeze-Omni in the speech modality is at the same level compared with that in the text modality of its backbone LLM, while the end-to-end latency of the spoken response achieves a low level. In addition, we also designed a method to achieve duplex dialogue ability through multi-task training, making Freeze-Omni have a more natural style of dialogue ability between the users. Freeze-Omni mainly provides a possibility for researchers to conduct multimodal LLM under the condition of a frozen LLM, avoiding various impacts caused by the catastrophic forgetting of LLM caused by fewer data and training resources.
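下面给出一个概念性的极简代码示意(非论文官方实现),说明"冻结骨干模型、只训练语音适配器"这一思路:用一个小型Transformer占位代替真实LLM并冻结其全部参数,只让将语音特征映射到骨干隐藏维度的适配器参与反向传播;其中的维度、层数和损失均为假设值。

import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """用小型 TransformerEncoder 占位代替真实 LLM,参数全程冻结。"""
    def __init__(self, d_model=256, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.parameters():          # 训练全程保持冻结
            p.requires_grad_(False)

    def forward(self, x):
        return self.encoder(x)

class SpeechAdapter(nn.Module):
    """把语音特征(如 80 维 fbank)映射到骨干模型的隐藏维度,仅此部分参与训练。"""
    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, d_model), nn.ReLU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, feats):
        return self.proj(feats)

backbone, adapter = FrozenBackbone(), SpeechAdapter()
optim = torch.optim.Adam(adapter.parameters(), lr=1e-4)   # 只优化适配器参数
feats = torch.randn(2, 50, 80)                             # (batch, 帧数, 特征维),随机占位数据
hidden = backbone(adapter(feats))
loss = hidden.pow(2).mean()                                # 占位损失,仅示意梯度只流向适配器
loss.backward()
optim.step()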
标题:用于动物福利评估中奶牛叫声解码的声学与语言数据多模态信息融合
链接:https://arxiv.org/abs/2411.00477
备注:31 pages, 22 figures, 2 tables
摘要:通过多源数据融合理解动物发声,对于评估情绪状态和提高精准畜牧业中的动物福利至关重要。本研究旨在采用多模态数据融合技术解码奶牛的接触呼叫,整合转录、语义分析、上下文与情感评估以及声学特征提取。我们利用自然语言处理模型将奶牛发声的音频记录转录成文本形式。通过将频率、持续时间和强度等多种声学特征与转录文本数据融合,我们构建了奶牛发声的综合表示。借助定制开发的本体内的数据融合,我们将发声分为与痛苦或唤醒相关的高频呼叫,以及与满足或平静相关的低频呼叫。通过分析融合后的多维数据,我们识别出指示情绪困扰的焦虑相关特征,包括特定的频率测量值和声谱分析结果。对20头奶牛发声的情感和声学特征进行评估,使我们能够确定呼叫模式和情绪状态上的差异。我们采用随机森林、支持向量机和循环神经网络等先进的机器学习算法,有效地处理并融合多源数据以对奶牛发声进行分类。这些模型经过优化,以应对实际农场环境中固有的计算需求和数据质量挑战。我们的研究结果证明了多源数据融合和智能处理技术在动物福利监测中的有效性。这项研究代表了动物福利评估的重大进展,突出了创新融合技术在理解和改善奶牛情感健康方面的作用。
摘要:Understanding animal vocalizations through multi-source data fusion is crucial for assessing emotional states and enhancing animal welfare in precision livestock farming. This study aims to decode dairy cow contact calls by employing multi-modal data fusion techniques, integrating transcription, semantic analysis, contextual and emotional assessment, and acoustic feature extraction. We utilized the Natural Language Processing model to transcribe audio recordings of cow vocalizations into written form. By fusing multiple acoustic features frequency, duration, and intensity with transcribed textual data, we developed a comprehensive representation of cow vocalizations. Utilizing data fusion within a custom-developed ontology, we categorized vocalizations into high frequency calls associated with distress or arousal, and low frequency calls linked to contentment or calmness. Analyzing the fused multi dimensional data, we identified anxiety related features indicative of emotional distress, including specific frequency measurements and sound spectrum results. Assessing the sentiment and acoustic features of vocalizations from 20 individual cows allowed us to determine differences in calling patterns and emotional states. Employing advanced machine learning algorithms, Random Forest, Support Vector Machine, and Recurrent Neural Networks, we effectively processed and fused multi-source data to classify cow vocalizations. These models were optimized to handle computational demands and data quality challenges inherent in practical farm environments. Our findings demonstrate the effectiveness of multi-source data fusion and intelligent processing techniques in animal welfare monitoring. This study represents a significant advancement in animal welfare assessment, highlighting the role of innovative fusion technologies in understanding and improving the emotional wellbeing of dairy cows.
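下面是一个示意性的多源特征融合与分类草图(特征与标签均为随机生成的假设数据,并非论文数据或代码),展示将声学特征与文本情感嵌入拼接后用随机森林分类的基本流程:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 200
acoustic = rng.normal(size=(n, 3))          # 假设:基频、时长、强度三个声学特征
text_emb = rng.normal(size=(n, 8))          # 假设:转录文本的情感/语义嵌入
X = np.hstack([acoustic, text_emb])         # 早期融合:直接拼接多源特征
y = (acoustic[:, 0] > 0).astype(int)        # 假设标签:高频呼叫=1,低频呼叫=0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))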
标题:MIRFLEX:用于特征提取的音乐信息检索特征库
链接:https://arxiv.org/abs/2411.00469
备注:2 pages, 4 tables, submitted to Extended Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024
摘要:本文介绍了一个可扩展的模块化系统,它汇集了一系列音乐特征提取模型,以辅助音乐信息检索研究。这些特征既包括调性、强拍和流派等音乐元素,也包括乐器识别、人声/器乐分类和人声性别检测等音频特征。所集成的模型均为最先进或最新的开源模型。这些特征可以被提取为潜在表示或经过后处理的标签,从而能够集成到音乐生成、推荐和播放列表生成等音乐应用中。模块化设计使新开发的系统易于集成,使其成为一个很好的基准测试和比较工具。这个多功能的工具包通过提供具体的音乐特征,支持研究社区开发创新的解决方案。
摘要:This paper introduces an extendable modular system that compiles a range of music feature extraction models to aid music information retrieval research. The features include musical elements like key, downbeats, and genre, as well as audio characteristics like instrument recognition, vocals/instrumental classification, and vocals gender detection. The integrated models are state-of-the-art or latest open-source. The features can be extracted as latent or post-processed labels, enabling integration into music applications such as generative music, recommendation, and playlist generation. The modular design allows easy integration of newly developed systems, making it a good benchmarking and comparison tool. This versatile toolkit supports the research community in developing innovative solutions by providing concrete musical features.
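下面用一个极简的"提取器注册表"示意模块化、可插拔的特征提取设计思路(并非MIRFLEX的真实API,函数名均为假设),新提取器只需注册即可被统一调用:

from typing import Callable, Dict
import numpy as np

EXTRACTORS: Dict[str, Callable[[np.ndarray, int], dict]] = {}

def register(name: str):
    def deco(fn):
        EXTRACTORS[name] = fn
        return fn
    return deco

@register("rms_energy")
def rms_energy(audio: np.ndarray, sr: int) -> dict:
    return {"rms": float(np.sqrt(np.mean(audio ** 2)))}

@register("duration")
def duration(audio: np.ndarray, sr: int) -> dict:
    return {"seconds": len(audio) / sr}

def extract_all(audio: np.ndarray, sr: int) -> dict:
    # 依次调用所有已注册的提取器;接入新模型只需增加一个 @register 装饰的函数
    return {name: fn(audio, sr) for name, fn in EXTRACTORS.items()}

print(extract_all(np.random.randn(16000), 16000))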
标题:MDCTCodec:一种面向高采样率和低比特率场景的轻量级基于MDCT的神经音频编解码器
链接:https://arxiv.org/abs/2411.00464
备注:Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT2024)
摘要:在本文中,我们提出了MDCTCodec,一种基于改进离散余弦变换(MDCT)的高效轻量级端到端神经音频编解码器。编码器将音频的MDCT频谱作为输入,将其编码为连续潜码,再由残差矢量量化器(RVQ)将其离散化。随后,解码器从量化后的潜码中解码出MDCT频谱,并通过逆MDCT重建音频。在训练阶段,采用一种新颖的基于多分辨率MDCT的判别器(MR-MDCTD)来区分自然的和解码得到的MDCT频谱,用于对抗训练。实验结果表明,在高采样率和低比特率的场景中,与基线编解码器相比,MDCTCodec具有更高的解码音频质量、更高的训练和生成效率以及更紧凑的模型大小。具体而言,MDCTCodec在公共VCTK语料库上以48 kHz采样率和6 kbps比特率实现了4.18的ViSQOL分数。
摘要:In this paper, we propose MDCTCodec, an efficient lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of audio as input, encoding it into a continuous latent code which is then discretized by a residual vector quantizer (RVQ). Subsequently, the decoder decodes the MDCT spectrum from the quantized latent code and reconstructs audio via inverse MDCT. During the training phase, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to discriminate the natural or decoded MDCT spectrum for adversarial training. Experimental results confirm that, in scenarios with high sampling rates and low bitrates, the MDCTCodec exhibited high decoded audio quality, improved training and generation efficiency, and compact model size compared to baseline codecs. Specifically, the MDCTCodec achieved a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus.
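下面是残差矢量量化(RVQ)编码过程的极简numpy示意(码本为随机初始化、维度为假设值,仅演示"逐级量化残差"的流程,并非论文实现):

import numpy as np

def rvq_encode(x, codebooks):
    """x: (dim,) 连续潜码;codebooks: 若干个 (K, dim) 码本。返回各级码字索引与重建值。"""
    residual, indices = x.copy(), []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)   # 与当前码本中每个码字的距离
        k = int(np.argmin(dists))
        indices.append(k)
        residual = residual - cb[k]                     # 下一级继续量化剩余残差
    return indices, x - residual                        # 重建 = 各级选中码字之和

rng = np.random.default_rng(0)
dim, n_stages, K = 16, 4, 256
codebooks = [rng.normal(size=(K, dim)) for _ in range(n_stages)]
latent = rng.normal(size=dim)
idx, recon = rvq_encode(latent, codebooks)
print(idx, float(np.linalg.norm(latent - recon)))       # 索引序列与重建误差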
标题:MACE:利用音频评估音频字幕系统
链接:https://arxiv.org/abs/2411.00321
摘要:自动音频字幕(AAC)任务旨在使用自然语言描述音频信号。为了评估机器生成的字幕,评价指标应当考虑音频事件、声学场景、副语言信息、信号特征和其他音频信息。传统的AAC评估依赖于ROUGE和BLEU等自然语言生成指标、SPICE和CIDEr等图像字幕指标,或Sentence-BERT嵌入相似度。然而,这些指标仅将生成的字幕与人工参考进行比较,而忽略了音频信号本身。在这项工作中,我们提出了MACE(多模态音频字幕评价),这是一种同时整合音频和参考字幕、用于全面评估音频字幕的新指标。MACE既利用来自音频本身的信息,也利用预测字幕和参考字幕的信息,并使用流畅度惩罚对得分进行加权。实验表明,与传统指标相比,MACE在预测人类质量判断方面表现更优。具体而言,在AudioCaps-Eval和Clotho-Eval数据集上,MACE相对于FENSE指标的相对准确率分别提高了3.28%和4.36%。此外,在音频字幕评估任务上,它显著优于所有先前的指标。该指标已在https://github.com/satvik-dixit/mace开源。
摘要:The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE incorporates audio information from audio as well as predicted and reference captions and weights it with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets respectively. Moreover, it significantly outperforms all the previous metrics on the audio captioning evaluation task. The metric is opensourced at https://github.com/satvik-dixit/mace
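下面给出一个概念性草图,示意"音频-字幕相似度 + 参考-生成字幕相似度 + 流畅度惩罚"的组合思路;其中的权重、惩罚形式和嵌入(这里用随机向量代替)均为假设,并非MACE论文给出的公式:

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def caption_score(audio_emb, cand_emb, ref_emb, fluency_error_prob, alpha=0.5, penalty=0.9):
    audio_term = cosine(audio_emb, cand_emb)      # 生成字幕与音频本身的对齐程度
    text_term = cosine(ref_emb, cand_emb)         # 生成字幕与人工参考字幕的相似度
    score = alpha * audio_term + (1 - alpha) * text_term
    if fluency_error_prob > 0.5:                  # 假设:检测到不流畅则对得分打折
        score *= penalty
    return score

rng = np.random.default_rng(0)
a, c, r = rng.normal(size=64), rng.normal(size=64), rng.normal(size=64)
print(caption_score(a, c, r, fluency_error_prob=0.1))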
标题:利用先进的机器学习技术改进乐器分类
链接:https://arxiv.org/abs/2411.00275
备注:43 pages, 35 figures, 14 tables
摘要:乐器分类是音乐信息检索中的一个关键领域,由于其在教育、数字音乐制作和消费媒体中的应用而引起了人们的极大兴趣。机器学习尤其是深度学习的最新进展,增强了从音频信号中识别和分类乐器的能力。这项研究应用了多种机器学习方法,包括朴素贝叶斯、支持向量机、随机森林,AdaBoost和XGBoost等Boosting技术,以及卷积神经网络和人工神经网络等深度学习模型。这些方法的有效性在NSynth数据集(一个大型带注释的音乐声音库)上进行了评估。通过比较这些方法,本分析旨在展示每种方法的优点和局限性,为开发更准确、更高效的分类系统提供指导。此外,还包含了混合模型的测试与讨论。本研究旨在通过提出新的方法和未来的研究方向,为乐器分类的进一步研究提供支持。
摘要:Musical instrument classification, a key area in Music Information Retrieval, has gained considerable interest due to its applications in education, digital music production, and consumer media. Recent advances in machine learning, specifically deep learning, have enhanced the capability to identify and classify musical instruments from audio signals. This study applies various machine learning methods, including Naive Bayes, Support Vector Machines, Random Forests, Boosting techniques like AdaBoost and XGBoost, as well as deep learning models such as Convolutional Neural Networks and Artificial Neural Networks. The effectiveness of these methods is evaluated on the NSynth dataset, a large repository of annotated musical sounds. By comparing these approaches, the analysis aims to showcase the advantages and limitations of each method, providing guidance for developing more accurate and efficient classification systems. Additionally, hybrid model testing and discussion are included. This research aims to support further studies in instrument classification by proposing new approaches and future research directions.
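下面是一个在同一组特征上对比多种分类器的示意(特征与标签为随机生成,仅演示文中"多方法对比"的流程,并非论文实验代码):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                      # 假设:每条样本 20 维音频特征(如 MFCC 统计量)
y = rng.integers(0, 4, size=300)                    # 假设:4 类乐器标签

for name, clf in [("NaiveBayes", GaussianNB()),
                  ("SVM", SVC(kernel="rbf")),
                  ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()   # 五折交叉验证平均准确率
    print(f"{name}: {acc:.3f}")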
标题:使用MFCC、色度、频谱对比度和时间特征工程的基于音频的内容评估的机器学习框架
链接:https://arxiv.org/abs/2411.00195
备注:6 pages, 6 figures
摘要:这项研究提出了一个机器学习框架,用于评估音频内容之间的相似性并预测情感得分。我们构建了一个数据集,其中包含来自YouTube上音乐翻唱的音频样本以及原始歌曲的音频,并以来自用户评论的情感评分作为内容质量的代理标签。我们的方法包括大量预处理:将音频信号分割为30秒的窗口,并通过梅尔频率倒谱系数(MFCC)、色度、频谱对比度和时间特征提取高维特征表示。利用这些特征,我们训练回归模型来预测0-100范围内的情感得分,分别取得3.420、5.482、2.783和4.212的均方根误差(RMSE)。相对于基于绝对差值度量的基线模型,我们观察到了改进。这些结果证明了机器学习在捕捉音频中的情感和相似性方面的潜力,为媒体分析中的AI应用提供了一个适应性强的框架。
摘要:This study presents a machine learning framework for assessing similarity between audio content and predicting sentiment score. We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments, serving as proxy labels for content quality. Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations through Mel-Frequency Cepstral Coefficients (MFCC), Chroma, Spectral Contrast, and Temporal characteristics. Leveraging these features, we train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212, respectively. Improvements over a baseline model based on absolute difference metrics are observed. These results demonstrate the potential of machine learning to capture sentiment and similarity in audio, offering an adaptable framework for AI applications in media analysis.
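下面用librosa给出特征提取与回归流程的极简示意(用随机噪声代替真实音频、随机数代替评论情感分,片段长度等均为假设,并非论文实现):

import numpy as np
import librosa
from sklearn.linear_model import Ridge

sr = 22050

def window_features(y):
    # 每个窗口提取 MFCC / 色度 / 频谱对比度,并在时间轴上取均值,得到一个特征向量
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1)
    return np.concatenate([mfcc, chroma, contrast])

rng = np.random.default_rng(0)
X = np.stack([window_features(rng.standard_normal(sr * 5)) for _ in range(40)])  # 用 5 秒噪声片段代替 30 秒窗口
y = rng.uniform(0, 100, size=40)                        # 假设的 0-100 情感得分标签
model = Ridge().fit(X, y)                               # 用岭回归代替文中具体的回归模型
print(model.predict(X[:3]))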
标题:音频分类的角距离分布损失
链接:https://arxiv.org/abs/2411.00153
摘要:分类是深度学习中的一项关键任务,不仅因为其本身的重要性,还因为它可以为其他任务提供具有理想属性的嵌入。为了优化这些属性,人们已经提出了各种各样的损失函数,试图在嵌入空间中最小化类内距离并最大化类间距离。在本文中,我们认为,除这两点之外,消除类内层次结构和类间层次结构是分类嵌入的另外两个理想属性。此外,我们提出了角距离分布(ADD)损失,旨在联合增强上述四个属性。为此,它对嵌入之间角距离的一阶和二阶统计矩施加约束。最后,我们的实验表明,该损失函数改善了全部四个属性,因而在音频分类任务中比其他损失函数表现更好。
摘要:Classification is a pivotal task in deep learning not only because of its intrinsic importance, but also for providing embeddings with desirable properties in other tasks. To optimize these properties, a wide variety of loss functions have been proposed that attempt to minimize the intra-class distance and maximize the inter-class distance in the embeddings space. In this paper we argue that, in addition to these two, eliminating hierarchies within and among classes are two other desirable properties for classification embeddings. Furthermore, we propose the Angular Distance Distribution (ADD) Loss, which aims to enhance the four previous properties jointly. For this purpose, it imposes conditions on the first and second order statistical moments of the angular distance between embeddings. Finally, we perform experiments showing that our loss function improves all four properties and, consequently, performs better than other loss functions in audio classification tasks.
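下面是对ADD损失思想的一个极简PyTorch示意(并非论文原始定义):对同类与异类嵌入间角距离的均值(一阶矩)和方差(二阶矩)同时施加约束,其中类间目标角度等超参数为假设值:

import math
import torch
import torch.nn.functional as F

def angular_distance(z):
    z = F.normalize(z, dim=1)
    cos = (z @ z.t()).clamp(-1 + 1e-6, 1 - 1e-6)
    return torch.acos(cos)                            # 两两角距离矩阵(弧度)

def add_like_loss(z, labels, target_inter=math.pi / 2):
    d = angular_distance(z)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    intra, inter = d[same & ~eye], d[~same]
    loss = intra.mean() + intra.var()                 # 类内:角距离均值尽量小、分布尽量集中
    loss = loss + (inter.mean() - target_inter) ** 2 + inter.var()  # 类间:均值推向目标角度、方差尽量小
    return loss

z = torch.randn(8, 16)                                # 8 个 16 维嵌入(随机占位)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(add_like_loss(z, labels))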
标题:我能听到你:Deepfake音频检测的选择性稳健训练
链接:https://arxiv.org/abs/2411.00121
摘要:人工智能生成语音的最新进展加剧了检测deepfake音频的挑战,为诈骗和虚假信息的传播带来了风险。为了解决这个问题,我们构建了迄今为止最大的公共语音数据集DeepFakeVox-HQ,包含130万个样本,其中包括来自14个不同来源的27万个高质量deepfake样本。尽管此前报道的准确率很高,但现有的deepfake语音检测器在我们多样化收集的数据集上表现不佳,并且在真实失真和对抗攻击下,它们的检测成功率进一步下降。我们对增强模型鲁棒性的因素进行了全面研究,并表明采用多样化的语音增强是有益的。此外,我们发现最好的检测模型通常依赖于高频特征,这些特征对人类来说不可感知,且很容易被攻击者操纵。为了解决这个问题,我们提出了F-SAT:一种专注于高频分量的频率选择性对抗训练方法。实证结果表明,使用我们的训练数据集可将基线模型(未经鲁棒训练)的性能提高33%;在最先进的RawNet3模型基础上,我们的鲁棒训练进一步将干净样本上的准确率提高7.7%,将失真和受攻击样本上的准确率提高29.3%。
摘要:Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
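下面是对"频率选择性对抗扰动"思路的极简示意(并非F-SAT的原始实现):只在高频频段上施加FGSM式的一步扰动,低频成分保持不变;其中的玩具模型、截止频率和步长均为假设:

import torch
import torch.nn as nn

def high_freq_perturb(x, grad, sr=16000, cutoff_hz=4000, eps=1e-3):
    X = torch.fft.rfft(grad)                          # 对输入梯度做实数 FFT
    freqs = torch.fft.rfftfreq(grad.shape[-1], d=1.0 / sr)
    X[..., freqs < cutoff_hz] = 0                     # 只保留高频部分的梯度分量
    hf_grad = torch.fft.irfft(X, n=grad.shape[-1])
    return x + eps * hf_grad.sign()                   # FGSM 风格的一步扰动,仅作用于高频

model = nn.Sequential(nn.Linear(16000, 2))            # 玩具检测器:波形 -> 真/假 logits
x = torch.randn(1, 16000, requires_grad=True)         # 1 秒 16 kHz 随机"波形"
loss = nn.functional.cross_entropy(model(x), torch.tensor([1]))
loss.backward()
x_adv = high_freq_perturb(x.detach(), x.grad)
print(x_adv.shape)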
标题:ISCSLP 2024年对话语音克隆(CoVoC)挑战:任务、结果和发现
链接:https://arxiv.org/abs/2411.00064
备注:accepted by ISCSLP 2024
摘要:ISCSLP 2024会话语音克隆(CoVoC)挑战赛旨在对zero-shot自发风格语音克隆进行基准测试并推动其发展,特别关注在会话语音中生成自发行为。该挑战包括两个赛道:不限制数据和模型使用的非受限赛道,以及仅允许使用指定开源数据集的受限赛道。挑战赛还提供了一个100小时的高质量会话语音数据集。本文详细介绍了数据、赛道、提交的系统、评估结果和主要发现。
摘要:The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge aims to benchmark and advance zero-shot spontaneous style voice cloning, particularly focusing on generating spontaneous behaviors in conversational speech. The challenge comprises two tracks: an unconstrained track without limitation on data and model usage, and a constrained track only allowing the use of constrained open-source datasets. A 100-hour high-quality conversational speech dataset is also made available with the challenge. This paper details the data, tracks, submitted systems, evaluation results, and findings.
标题:使用大型语言模型进行后续对话的设备定向语音检测
链接:https://arxiv.org/abs/2411.00023
摘要:与虚拟助理(VA)的后续对话使用户能够与VA无缝交互,而无需在第一次查询之后使用关键词重复唤醒它。因此,从后续查询中准确地进行设备定向语音检测(DDSD)对于实现自然的用户体验至关重要。为此,我们探索利用大型语言模型(LLM),在对后续话语进行推断时(基于ASR解码的文本),通过提示预训练的LLM,或在LLM之上适配一个二元分类器,来同时对第一个查询进行建模。与此同时,我们在设计LLM提示时还利用了ASR的不确定性。我们在真实的后续对话数据集上表明,与仅对后续话语单独建模相比,由于联合建模了先前的语音上下文和ASR不确定性,该方法带来了很大的收益(在10%的固定误拒率下,误报警减少了20-40%)。
摘要:Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and model the first query when making inference about the follow-ups (based on the ASR-decoded text), via prompting of a pretrained LLM, or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on the real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
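下面示意如何把ASR的n-best假设及其置信度写入LLM提示,用于判断后续话语是否面向设备;提示模板与call_llm均为假设的占位接口,并非论文所用系统:

def build_ddsd_prompt(first_query: str, nbest: list) -> str:
    # nbest: [(文本, 置信度), ...],把 ASR 不确定性显式写进提示
    hyps = "\n".join(f"- \"{text}\" (confidence {conf:.2f})" for text, conf in nbest)
    return (
        "You are deciding whether a follow-up utterance is directed at the voice assistant.\n"
        f"First query: \"{first_query}\"\n"
        "ASR n-best hypotheses for the follow-up:\n"
        f"{hyps}\n"
        "Answer with exactly one word: YES or NO."
    )

def call_llm(prompt: str) -> str:
    # 占位函数:真实系统中应在此调用预训练 LLM
    return "YES"

prompt = build_ddsd_prompt("set a timer for 10 minutes",
                           [("and add pasta to my list", 0.82),
                            ("hand add pasta to my list", 0.11)])
print(prompt)
print("device-directed:", call_llm(prompt) == "YES")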
标题:SAN-PSZ:用于头部跟踪个人声音区域的空间自适应神经网络
链接:https://arxiv.org/abs/2411.00772
备注:This work has been submitted to the IEEE for possible publication
摘要:本文提出了一种用于动态渲染带头部跟踪的个人声音区域(PSZ)的深度学习框架,其利用空间自适应神经网络(SANN),以收听者的头部坐标为输入,输出PSZ滤波器系数。SANN模型既可以使用带数据增强的模拟声学传递函数(ATF)进行训练,以在不确定环境中获得鲁棒性;也可以使用模拟和实测ATF的混合进行训练,以针对已知条件进行定制。结果发现,与增广系统缺陷相比,在训练数据中增广房间反射能更有效地提高模型鲁棒性;而在损失函数中加入滤波器紧凑性等约束并不会显著影响模型性能。将性能最佳的模型与传统滤波器设计方法进行比较表明,在没有实测ATF可用时,该模型在实际房间环境中可获得相当或更高的隔离度,且滤波器伪影更少。此外,与传统方法相比,该模型实现了显著的数据压缩(100倍)和计算效率提升(10倍),使其适用于随听者头部运动自适应的PSZ实时渲染。
摘要:A deep learning framework for dynamically rendering personal sound zones (PSZs) with head tracking is presented, utilizing a spatially adaptive neural network (SANN) that inputs listeners' head coordinates and outputs PSZ filter coefficients. The SANN model is trained using either simulated acoustic transfer functions (ATFs) with data augmentation for robustness in uncertain environments or a mix of simulated and measured ATFs for customization under known conditions. It is found that augmenting room reflections in the training data can more effectively improve the model robustness than augmenting the system imperfections, and that adding constraints such as filter compactness to the loss function does not significantly affect the model's performance. Comparisons of the best-performing model with traditional filter design methods show that, when no measured ATFs are available, the model yields equal or higher isolation in an actual room environment with fewer filter artifacts. Furthermore, the model achieves significant data compression (100x) and computational efficiency (10x) compared to the traditional methods, making it suitable for real-time rendering of PSZs that adapt to the listeners' head movements.
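下面是一个概念性示意(并非论文的网络结构):用一个小型MLP把听者头部坐标映射为每只扬声器的FIR滤波器系数,扬声器数量和滤波器长度均为假设值:

import torch
import torch.nn as nn

n_speakers, filter_len = 4, 128

net = nn.Sequential(
    nn.Linear(3, 256), nn.ReLU(),            # 输入:头部坐标 (x, y, z)
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n_speakers * filter_len)  # 输出:展平的各扬声器滤波器系数
)

head_xyz = torch.tensor([[0.10, -0.05, 1.20]])           # 假设的头部位置(米)
filters = net(head_xyz).view(1, n_speakers, filter_len)  # 还原为 (batch, 扬声器, 抽头数)
print(filters.shape)                                     # 实时渲染时随头部跟踪结果反复前向推理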
标题:使用矢量量化优化上下文语音识别以实现高效检索
链接:https://arxiv.org/abs/2411.00664
备注:14 pages, 7 figures, submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
摘要:神经上下文偏置允许语音识别模型利用与上下文相关的信息,从而提高转录准确率。然而,偏置机制通常基于音频与偏置条目目录之间的交叉注意力模块,这意味着计算复杂度会对偏置目录的规模造成严重的实际限制,进而限制准确率的提升。本文提出了一种基于矢量量化的交叉注意力打分近似方法,使大型偏置目录的使用在计算和内存上都更加高效。我们建议将该技术与基于检索的上下文偏置方法结合使用。首先,我们使用高效的量化检索模块,通过将偏置条目与音频对齐来筛选出候选条目;然后使用检索到的条目进行偏置。由于所提出的方法与具体偏置方式无关,我们研究了使用完整交叉注意力、LLM提示以及两者组合的方案。我们表明,基于检索的候选筛选使系统能够有效利用包含数千个条目的偏置目录,使个人实体识别的相对错误率最多降低71%。同时,与标准点积交叉注意力相比,对于多达一百万个条目的列表,所提出的近似算法将计算时间减少20%,内存使用减少85-95%。
摘要:Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
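下面用numpy给出"量化检索筛选候选偏置条目"思路的极简示意(码本、维度与条目数均为随机假设,并非论文实现):先把条目分配到少量码字,用音频与码字的得分筛选入围条目,再只对入围条目做精确打分:

import numpy as np

rng = np.random.default_rng(0)
dim, n_entries, n_codes = 64, 50_000, 256
entries = rng.normal(size=(n_entries, dim))            # 偏置条目(如联系人名)的嵌入
codebook = rng.normal(size=(n_codes, dim))             # 码本(实际中可由 k-means 等训练得到)
entry_code = np.argmax(entries @ codebook.T, axis=1)   # 每个条目分配到得分最高的码字

audio = rng.normal(size=(20, dim))                     # 20 帧音频表示
code_scores = (audio @ codebook.T).max(axis=0)         # 每个码字与音频的最大相关度
top_codes = np.argsort(code_scores)[-8:]               # 只保留得分最高的少数码字
shortlist = np.flatnonzero(np.isin(entry_code, top_codes))
exact = audio @ entries[shortlist].T                   # 仅对入围条目做精确(点积)打分
print(len(shortlist), exact.shape)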