本文经arXiv每日学术速递授权转载
标题: ETTA:阐明文本到音频模型的设计空间
链接:https://arxiv.org/abs/2412.19351
摘要:近年来,文本到音频(TTA)合成取得了重大进展,使用户能够利用由自然语言提示生成的合成音频来丰富其创作流程。尽管如此,数据、模型架构、训练目标函数和采样策略对目标基准的影响仍未被很好地理解。为了全面理解TTA模型的设计空间,我们开展了一项聚焦于扩散模型与流匹配模型的大规模实证实验。我们的贡献包括:1)AF-Synthetic,一个由音频理解模型生成的高质量合成音频描述(caption)大型数据集;2)对TTA模型不同架构、训练和推理设计选择的系统比较;3)对采样方法及其在生成质量与推理速度之间帕累托曲线的分析。我们利用这项广泛分析获得的知识,提出了我们的最佳模型,称为Elucidated Text-To-Audio(ETTA)。在AudioCaps和MusicCaps上评估时,ETTA相比在公开数据上训练的基线有所提升,同时可与在专有数据上训练的模型相竞争。最后,我们展示了ETTA根据复杂且富有想象力的描述生成创意音频的更强能力,而这是一项比当前基准更具挑战性的任务。
摘要:Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.
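上文提到该工作系统比较了扩散与流匹配模型。作为背景,下面给出一个条件流匹配(conditional flow matching)训练目标的极简示意;其中的网络结构、维度和变量名均为示例假设,并非论文的官方实现:

```python
import torch
import torch.nn as nn

# 极简的条件流匹配(CFM)训练目标示意:
# 在噪声 x0 与数据 x1 之间做线性插值,网络回归目标速度 v = x1 - x0。
class VelocityModel(nn.Module):
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        # 将时间 t 与条件(如文本嵌入)拼接进输入
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    """x1: 干净音频的潜变量 [B, D]; cond: 文本条件嵌入 [B, C]"""
    x0 = torch.randn_like(x1)                  # 先验噪声
    t = torch.rand(x1.size(0), 1)              # 随机时间 t ~ U(0,1)
    x_t = (1 - t) * x0 + t * x1                # 线性插值路径
    v_target = x1 - x0                         # 该路径的目标速度场
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = VelocityModel()
x1, cond = torch.randn(8, 64), torch.randn(8, 32)
loss = cfm_loss(model, x1, cond)
loss.backward()
```

实际的TTA系统通常在音频潜空间(例如VAE潜变量)上、以文本嵌入为条件执行此类目标,此处仅展示损失的一般形式。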
标题: 提高人工智能合成语音检测的通用性
链接:https://arxiv.org/abs/2412.19279
备注:AAAI25
摘要:人工智能合成语音技术有潜力为有益的应用创造逼真的人类声音,但也可能被滥用于恶意目的。现有的人工智能合成语音检测模型虽然在域内评估中表现出色,但在跨不同领域泛化时面临挑战,并且随着新语音生成器的出现可能会过时。当前的解决方案使用多样化的数据和先进的机器学习技术(例如域不变表示、自监督学习),但仍受限于预定义的声码器,并对背景噪声和说话人身份等因素敏感。在这项工作中,我们提出了一个创新的解耦框架,旨在提取与声码器相关、但与域无关的伪迹特征。利用这些特征,我们在平坦的损失地形中增强模型学习,从而能够避开次优解并提高泛化能力。在基准上的大量实验表明,我们的方法优于最先进的方法,在域内和跨域评估的等错误率指标上分别取得高达5.12%和7.59%的改进。
摘要:AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited by predefined vocoders and sensitivity to factors like background noise and speaker identity. In this work, we introduce an innovative disentanglement framework aimed at extracting domain-agnostic artifact features related to vocoders. Utilizing these features, we enhance model learning in a flat loss landscape, enabling escape from suboptimal solutions and improving generalization. Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.
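摘要提到"在平坦的损失地形中增强模型学习"。下面给出一个Sharpness-Aware Minimization(SAM)风格的两步更新示意,仅用于说明"偏向平坦极小值"这类训练思路;论文实际采用的优化方法可能不同,模型结构与超参数均为假设:

```python
import torch
import torch.nn as nn

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """SAM 风格的两步更新:先沿梯度方向扰动权重到邻域内的"最坏点",
    再用该点的梯度更新原始权重,从而偏向平坦极小值。"""
    # 第一步:计算当前梯度并做归一化扰动
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)                      # 爬升到扰动点 w + eps
    model.zero_grad()

    # 第二步:在扰动点求梯度,恢复权重后用该梯度更新
    loss_perturbed = loss_fn(model(x), y)
    loss_perturbed.backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                      # 撤销扰动,回到 w
    base_opt.step()                        # 用扰动点梯度更新 w
    base_opt.zero_grad()
    return loss.item()

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(16, 128), torch.randint(0, 2, (16,))
sam_step(model, nn.CrossEntropyLoss(), x, y, opt)
```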
标题: 基于双尺度注意力的元学习的个性化动态音乐情感识别
链接:https://arxiv.org/abs/2412.19200
备注:Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)
摘要:动态音乐情感识别(DMER)旨在预测音乐中不同时刻的情感,在音乐信息检索中起着至关重要的作用。现有的DMER方法在处理序列数据时难以捕获长期依赖,限制了其性能。此外,这些方法往往忽略个体差异对情感感知的影响,而现实中每个人都有自己个性化的情感感知。基于这些问题,我们探索更有效的序列处理方法,并提出个性化DMER(PDMER)问题,要求模型预测与个性化感知一致的情感。具体而言,我们提出了一种基于双尺度注意力的元学习(DSAML)方法。该方法融合双尺度特征提取器的特征,并使用双尺度注意力Transformer捕获短期和长期依赖,从而提升传统DMER的性能。为了实现PDMER,我们设计了一种新的任务构造策略,按标注者划分任务:同一任务中的样本由同一标注者标注,以确保感知一致。结合该策略与元学习,DSAML仅需一个个性化标注样本即可预测个性化的情感感知。客观和主观实验均表明,我们的方法在传统DMER和PDMER上都能达到最先进的性能。
摘要:Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.
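摘要中"按标注者划分任务、同一任务内样本由同一标注者标注"的任务构造策略,可以用下面的示意代码表达;其中的数据结构与字段名为假设,仅演示划分逻辑:

```python
import random
from collections import defaultdict

def build_annotator_tasks(samples, k_support=1, n_query=4):
    """samples: [{"annotator_id": ..., "features": ..., "va_labels": ...}, ...]
    每个元学习任务只包含同一标注者的样本:k_support 个支持样本 + n_query 个查询样本,
    使模型可以从单个个性化标注样本适应该标注者的情感感知。"""
    by_annotator = defaultdict(list)
    for s in samples:
        by_annotator[s["annotator_id"]].append(s)

    tasks = []
    for annotator, items in by_annotator.items():
        if len(items) < k_support + n_query:
            continue  # 样本不足以构成一个任务
        random.shuffle(items)
        tasks.append({
            "annotator": annotator,
            "support": items[:k_support],
            "query": items[k_support:k_support + n_query],
        })
    return tasks
```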
标题: CoheDancers:通过音乐驱动的凝聚力分解增强交互式群舞生成
链接:https://arxiv.org/abs/2412.19123
摘要:舞蹈生成至关重要且极具挑战性,尤其是在舞蹈表演和虚拟游戏等领域。现有文献中,大多数方法聚焦于单人Music2Dance。虽然也有针对群体Music2Dance的工作,但这些工作往往缺乏连贯性,导致舞蹈表演在美学上欠佳。因此,我们提出CoheDancers,一个音乐驱动的交互式群舞生成新框架。CoheDancers旨在通过将群舞连贯性分解为三个关键方面来加以增强:同步性、自然性和流畅性。相应地,我们开发了基于循环一致性的舞蹈同步策略以促进音乐与舞蹈的对应,基于自回归的曝光偏差校正策略以增强生成舞蹈的流畅性,以及用于提升群舞输出自然性的对抗训练策略。总体而言,这些策略使CoheDancers能够生成高度连贯且高质量的群舞。此外,为了给群体Music2Dance建立更好的基准,我们构建了迄今最多样、最全面的开源数据集I-Dancers,包含丰富的舞者交互,并建立了全面的评估指标。在I-Dancers和其他现有数据集上的实验评估证实,CoheDancers取得了前所未有的最先进性能。代码将公开发布。
摘要:Dance generation is crucial and challenging, particularly in domains like dance performance and virtual gaming. In the current body of literature, most methodologies focus on Solo Music2Dance. While there are efforts directed towards Group Music2Dance, these often suffer from a lack of coherence, resulting in aesthetically poor dance performances. Thus, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. CoheDancers aims to enhance group dance generation coherence by decomposing it into three key aspects: synchronization, naturalness, and fluidity. Correspondingly, we develop a Cycle Consistency based Dance Synchronization strategy to foster music-dance correspondences, an Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity of the generated dances, and an Adversarial Training Strategy to augment the naturalness of the group dance output. Collectively, these strategies enable CoheDancers to produce highly coherent group dances with superior quality. Furthermore, to establish better benchmarks for Group Music2Dance, we construct the most diverse and comprehensive open-source dataset to date, I-Dancers, featuring rich dancer interactions, and create comprehensive evaluation metrics. Experimental evaluations on I-Dancers and other extant datasets substantiate that CoheDancers achieves unprecedented state-of-the-art performance. Code will be released.
标题: BSDB-Net:具有选择性状态空间机制的频带分割双分支网络,用于单通道语音增强
链接:https://arxiv.org/abs/2412.19099
备注:Accepted by AAAI 2025
摘要:尽管基于复数谱的语音增强(SE)方法已取得显著性能,但幅度与相位的耦合会导致补偿效应:为补偿相位而牺牲幅度信息,这对SE是有害的。此外,为进一步提升SE性能,许多模块被堆叠到SE模型中,导致模型复杂度增加,限制了SE的应用。为了解决这些问题,我们提出了一种基于Mamba、在压缩频率维上工作的双路径网络。首先,通过并行的双分支分别提取幅度和相位信息。该方法利用结构化复数谱隐式地捕获相位信息,并通过解耦幅度和相位来解决补偿效应;网络中加入交互模块,以抑制不必要的部分并从另一分支恢复丢失的分量。其次,为降低网络复杂度,网络引入频带分割策略来压缩频率维。为了在保持良好性能的同时进一步降低复杂度,我们设计了一个基于Mamba的模块,以线性复杂度对时间和频率维度进行建模。最后,与基线相比,我们的模型在保持更优性能的同时,平均将计算复杂度降低了8.3倍;与基于Transformer的模型相比,复杂度降低了25倍。
摘要:Although the complex spectrum-based speech enhancement(SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that limits the application of SE. To address these problems, we proposed a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and solves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we designed a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models.
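摘要提到用频带分割(band-split)策略压缩频率维。下面给出一个把STFT频点按子带分组并逐带线性投影的极简示意;频带划分与维度均为假设,并非BSDB-Net的实际实现:

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """将 F 个频点切成若干子带,每个子带(实部+虚部展平)投影到固定维度 D,
    从而把 [B, T, F] 的复数谱压缩为 [B, T, num_bands, D]。"""
    def __init__(self, band_widths=(32, 32, 64, 129), dim=64):
        super().__init__()
        self.band_widths = band_widths
        self.proj = nn.ModuleList(
            nn.Linear(2 * w, dim) for w in band_widths  # 2x: 实部与虚部
        )

    def forward(self, spec):                 # spec: [B, T, F] 复数张量
        feats, start = [], 0
        for w, proj in zip(self.band_widths, self.proj):
            band = spec[..., start:start + w]
            band = torch.cat([band.real, band.imag], dim=-1)
            feats.append(proj(band))
            start += w
        return torch.stack(feats, dim=2)     # [B, T, num_bands, D]

spec = torch.randn(2, 100, 257, dtype=torch.complex64)  # 假设 n_fft=512
out = BandSplit()(spec)                                  # -> [2, 100, 4, 64]
```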
标题: 利用多语言STEN-TTS和BERT语种识别(LID)的印度尼西亚语-英语语码转换语音合成器
链接:https://arxiv.org/abs/2412.19043
备注:Accepted at O-COCOSDA 2024
摘要:多语言文本到语音(TTS)系统可将多种语言的文本转换为语音。在许多情况下,文本句子可能包含不同语言的片段,这种现象称为语码转换(code-switching)。这在印度尼西亚尤其常见,特别是印尼语与英语之间。尽管其重要性显著,此前尚无研究开发出能够处理这两种语言之间语码转换的多语言TTS系统。本研究在STEN-TTS中解决印尼语-英语语码转换问题。关键修改包括:在文本到音素转换中加入语言识别组件,使用微调的BERT进行逐词语言识别,并移除基础模型中的语言嵌入。实验结果表明,与印尼语和英语基线STEN-TTS模型相比,语码转换模型取得了更优的自然度和更好的语音可懂度。
摘要:Multilingual text-to-speech systems convert text into speech across multiple languages. In many cases, text sentences may contain segments in different languages, a phenomenon known as code-switching. This is particularly common in Indonesia, especially between Indonesian and English. Despite its significance, no research has yet developed a multilingual TTS system capable of handling code-switching between these two languages. This study addresses Indonesian-English code-switching in STEN-TTS. Key modifications include adding a language identification component to the text-to-phoneme conversion using finetuned BERT for per-word language identification, as well as removing language embedding from the base model. Experimental results demonstrate that the code-switching model achieves superior naturalness and improved speech intelligibility compared to the Indonesian and English baseline STEN-TTS models.
标题: Leave-One-EquiVariant:减轻对比音乐表示中与不变性相关的信息损失
链接:https://arxiv.org/abs/2412.18955
摘要:对比学习已被证明在自监督音乐表示学习中行之有效,尤其是在音乐信息检索(MIR)任务上。然而,当不同的下游任务需要对某些音乐属性保持敏感时,依赖增强链来生成对比视图以及由此学到的不变性会带来挑战。为此,我们提出了Leave One EquiVariant(LOEV)框架,与以往工作相比,它通过选择性地保留与特定增强相关的信息,引入了一种灵活的、任务自适应的方法,使模型能够保持与任务相关的等变性。我们证明,LOEV缓解了与学习到的不变性相关的信息损失,在与增强相关的任务和检索上提升了性能,同时不牺牲通用表示质量。此外,我们引入了LOEV的变体LOEV++,它在设计上以自监督方式构建解耦的潜在空间,并支持基于增强相关属性的有针对性检索。
摘要:Contrastive learning has proven effective in self-supervised musical representation learning, particularly for Music Information Retrieval (MIR) tasks. However, reliance on augmentation chains for contrastive view generation and the resulting learnt invariances pose challenges when different downstream tasks require sensitivity to certain musical attributes. To address this, we propose the Leave One EquiVariant (LOEV) framework, which introduces a flexible, task-adaptive approach compared to previous work by selectively preserving information about specific augmentations, allowing the model to maintain task-relevant equivariances. We demonstrate that LOEV alleviates information loss related to learned invariances, improving performance on augmentation related tasks and retrieval without sacrificing general representation quality. Furthermore, we introduce a variant of LOEV, LOEV++, which builds a disentangled latent space by design in a self-supervised manner, and enables targeted retrieval based on augmentation related attributes.
标题: 稳健的目标说话人到达方向估计
链接:https://arxiv.org/abs/2412.18913
摘要:在多说话人环境中,目标说话人的波达方向(DOA)对于提高语音清晰度和提取目标说话人语音至关重要。然而,传统的DOA估计方法在噪声、混响存在时,尤其是存在竞争说话人时,往往表现不佳。为应对这些挑战,我们提出了RTS-DOA,一个鲁棒的实时DOA估计系统。该系统创新性地使用目标说话人的注册语音作为参考,并利用来自麦克风阵列的全带和子带频谱信息来估计目标说话人语音的DOA。具体而言,该系统包括用于初步改善语音质量的语音增强模块、用于学习空间信息的空间模块,以及用于提取声纹特征的说话人模块。在LibriSpeech数据集上的实验结果表明,我们的RTS-DOA系统能有效处理多说话人场景,并建立了新的最优基准。
摘要:In multi-speaker environments the direction of arrival (DOA) of a target speaker is key for improving speech clarity and extracting target speaker's voice. However, traditional DOA estimation methods often struggle in the presence of noise, reverberation, and particularly when competing speakers are present. To address these challenges, we propose RTS-DOA, a robust real-time DOA estimation system. This system innovatively uses the registered speech of the target speaker as a reference and leverages full-band and sub-band spectral information from a microphone array to estimate the DOA of the target speaker's voice. Specifically, the system comprises a speech enhancement module for initially improving speech quality, a spatial module for learning spatial information, and a speaker module for extracting voiceprint features. Experimental results on the LibriSpeech dataset demonstrate that our RTS-DOA system effectively tackles multi-speaker scenarios and established new optimal benchmarks.
标题: 用于声学回声消除的注意力增强短时维纳解
链接:https://arxiv.org/abs/2412.18851
摘要:声学回声消除(AEC)是一项重要的语音信号处理技术,它消除麦克风输入中的回声,以实现自然的全双工通信。目前,基于深度学习的AEC方法主要集中在改进模型架构上,常常忽略传统滤波器理论的知识。本文通过引入注意力增强的短时维纳解,提出了一种新的AEC方法。我们的方法有策略地利用注意力机制来减轻双讲干扰的影响,从而优化知识利用的效率。短时维纳解的推导将经典维纳解适配到有限输入因果性的情形,把滤波器理论中的成熟见解融入该方法。实验结果证实了所提方法的有效性,在性能和泛化能力上均超过其他基线模型。官方代码可在https://github.com/ZhaoF-i/ASTWS-AEC上获得。
摘要:Acoustic Echo Cancellation (AEC) is an essential speech signal processing technology that removes echoes from microphone inputs to facilitate natural-sounding full-duplex communication. Currently, deep learning-based AEC methods primarily focus on refining model architectures, frequently neglecting the incorporation of knowledge from traditional filter theory. This paper presents an innovative approach to AEC by introducing an attention-enhanced short-time Wiener solution. Our method strategically harnesses attention mechanisms to mitigate the impact of double-talk interference, thereby optimizing the efficiency of knowledge utilization. The derivation of the short-term Wiener solution, which adapts classical Wiener solutions to finite input causality, integrates established insights from filter theory into this method. The experimental outcomes corroborate the effectiveness of our proposed approach, surpassing other baseline models in performance and generalization. The official code is available at https://github.com/ZhaoF-i/ASTWS-AEC
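作为参考,频域维纳解的经典形式是逐频点的互功率谱与自功率谱之比。下面用numpy给出一个在STFT域估计回声并相减的极简示意;帧长、平滑方式与玩具回声场景均为假设,不代表论文中注意力增强的短时维纳方案:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=-1)   # [T, F]

def wiener_aec(far_end, mic, eps=1e-8):
    """逐频点维纳解: H(f) = E[X*(f) Y(f)] / E[|X(f)|^2],
    用远端参考信号 X 估计麦克风信号 Y 中的回声并相减。"""
    X, Y = stft(far_end), stft(mic)
    cross = np.mean(np.conj(X) * Y, axis=0)            # 互功率谱
    power = np.mean(np.abs(X) ** 2, axis=0) + eps      # 远端自功率谱
    H = cross / power                                  # 回声路径的维纳估计
    echo_est = H[None, :] * X
    return Y - echo_est                                # 回声消除后的 STFT

fs = 16000
far = np.random.randn(fs)
mic = 0.5 * far + 0.01 * np.random.randn(fs)           # 玩具回声场景(无延迟)
err = wiener_aec(far, mic)
```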
标题: 利用新型方法和MultiNAM数据集推进NAM到语音的转换
链接:https://arxiv.org/abs/2412.18839
备注:Accepted at IEEE ICASSP 2025
摘要:当前的非可听杂音(NAM)到语音技术依赖语音克隆,从成对的耳语模拟真实(ground-truth)语音。然而,模拟得到的语音往往可懂度不足,且难以在不同说话人之间泛化。为解决这一问题,我们专注于从成对的耳语和文本中学习音素级对齐,并采用文本到语音(TTS)系统来模拟真实语音。为了减少对耳语的依赖,我们直接从NAM中学习音素对齐,尽管其质量受限于可用的训练数据。为进一步降低真实语音模拟对NAM/耳语数据的依赖,我们提出引入唇部模态来推断语音,并提出一种新的基于扩散的方法,利用唇到语音技术的最新进展。此外,我们发布了MultiNAM数据集,包含来自两名说话人、总计超过7.96小时的配对NAM、耳语、视频和文本数据,并在该数据集上对所有方法进行了基准测试。语音样本和数据集可在 https://diff-nam.github.io/DiffNAM/ 获取。
摘要:Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over $7.96$ hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at \url{https://diff-nam.github.io/DiffNAM/}
标题: MRI2Speech:基于实时MRI记录的发音器官运动的语音合成
链接:https://arxiv.org/abs/2412.18836
备注:Accepted at IEEE ICASSP 2025
摘要:以往基于实时MRI(rtMRI)的语音合成模型严重依赖含噪的真实语音。直接在真实梅尔频谱图上施加损失会使语音内容与MRI噪声纠缠在一起,导致可懂度较差。我们提出一种新方法,将多模态自监督的AV-HuBERT模型改造用于从rtMRI预测文本,并引入一种新的基于流的时长预测器以实现说话人相关的对齐。随后,语音解码器利用预测的文本和时长,以任意新的音色合成对齐的语音。我们在两个数据集上进行了充分的实验,证明了方法对未见说话人的泛化能力。我们通过掩蔽rtMRI视频的部分区域来评估不同发音器官对文本预测的影响。我们的方法在USC-TIMIT MRI语料库上取得了15.18%的词错误率(WER),相比当前最先进技术有大幅改进。语音样本可在 https://mri2speech.github.io/MRI2Speech/ 获取。
摘要:Previous real-time MRI (rtMRI)-based speech synthesis models depend heavily on noisy ground-truth speech. Applying loss directly over ground truth mel-spectrograms entangles speech content with MRI noise, resulting in poor intelligibility. We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI and incorporates a new flow-based duration predictor for speaker-specific alignment. The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice. We conduct thorough experiments on two datasets and demonstrate our method's generalization ability to unseen speakers. We assess our framework's performance by masking parts of the rtMRI video to evaluate the impact of different articulators on text prediction. Our method achieves a $15.18\%$ Word Error Rate (WER) on the USC-TIMIT MRI corpus, marking a huge improvement over the current state-of-the-art. Speech samples are available at \url{https://mri2speech.github.io/MRI2Speech/}
标题: 通过多尺度多模态上下文交互实现具有表达力的视频配音
链接:https://arxiv.org/abs/2412.18748
备注:Accepted by ICASSP 2025
摘要:自动视频配音(AVD)从脚本生成与唇部运动和面部情绪一致的语音。近期研究主要集中在建模多模态上下文以增强韵律表现力,但忽略了两个关键问题:1)上下文中的多尺度韵律表达属性会影响当前句子的韵律;2)上下文中的韵律线索与当前句子相互作用,影响最终的韵律表现力。为应对这些挑战,我们提出了M2CI-Dubber,一种用于AVD的多尺度多模态上下文交互方案。该方案包括两个共享的M2CI编码器,用于建模多尺度多模态上下文,并促进其与当前句子的深度交互。通过提取上下文中每种模态的全局和局部特征、利用基于注意力的聚合与交互机制,并采用基于交互的图注意力网络进行融合,所提方法增强了当前句子合成语音的韵律表现力。在Chem数据集上的实验表明,我们的模型在配音表现力方面优于基线。代码和演示可在 https://github.com/AI-S2-Lab/M2CI-Dubber 获取。
摘要:Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at \textcolor[rgb]{0.93,0.0,0.47}{https://github.com/AI-S2-Lab/M2CI-Dubber}.
标题: 用于对话语音合成的模态内和模态间上下文交互建模
链接:https://arxiv.org/abs/2412.18733
备注:Accepted by ICASSP 2025
摘要:对话语音合成(CSS)旨在有效利用多模态对话历史(MDH),为目标话语生成具有恰当对话韵律的语音。CSS的关键挑战在于建模MDH与目标话语之间的交互。需要注意的是,MDH中的文本模态和语音模态各有其独特的影响,二者相辅相成,共同对目标话语产生综合影响。以往的工作没有显式地对这种模态内和模态间交互进行建模。为解决这一问题,我们提出了一个新的基于模态内和模态间上下文交互方案的CSS系统,称为III-CSS。具体而言,在训练阶段,我们将MDH与目标话语中的文本和语音模态组合,得到四种模态组合:历史文本-下一文本、历史语音-下一语音、历史文本-下一语音和历史语音-下一文本。然后,我们设计了两个基于对比学习的模态内交互模块和两个模态间交互模块,以深入学习模态内和模态间的上下文交互。在推理阶段,我们利用MDH和训练好的交互模块,充分推断目标话语文本内容对应的语音韵律。在DailyTalk数据集上的主观和客观实验表明,III-CSS在韵律表现力方面优于先进的基线。代码和语音样本可在https://github.com/AI-S2-Lab/I3CSS上获得。
摘要:Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that text and speech modalities in MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take MDH and adopt trained interaction modules to fully infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.
标题: Simi-SFX:一种基于相似度的可控音效合成方法
链接:https://arxiv.org/abs/2412.18710
摘要:生成具有可控变化的音效是一项具有挑战性的任务,传统上通过复杂的物理模型来解决,而这需要对信号处理参数和算法的深入了解。在生成式模型和大型语言模型的时代,文本已成为控制声音合成的一种常见且人类可解释的接口。然而,语言token的离散性和定性特征使其难以捕捉不同声音之间细微的音色变化。在本研究中,我们提出了一种新的基于相似度的声音合成条件化方法,利用可微数字信号处理(DDSP)。该方法将用于学习和控制音频音色的潜在空间与一个直观的引导向量相结合,该向量归一化到[0,1]范围内,用于编码类别声学信息。通过利用预训练的音频表示模型,我们的方法实现了富有表现力的细粒度音色控制。为了对方法进行基准测试,我们引入了两个音效数据集Footstep-set(脚步声)和Impact-set(撞击声),用于评估可控性和音质。回归分析表明,所提出的相似度得分能有效控制音色变化,并支持诸如离散类别之间音色插值等创造性应用。我们的工作为音效合成提供了一个稳健且通用的框架,弥合了传统信号处理与现代机器学习技术之间的差距。
摘要:Generating sound effects with controllable variations is a challenging task, traditionally addressed using sophisticated physical models that require in-depth knowledge of signal processing parameters and algorithms. In the era of generative and large language models, text has emerged as a common, human-interpretable interface for controlling sound synthesis. However, the discrete and qualitative nature of language tokens makes it difficult to capture subtle timbral variations across different sounds. In this research, we propose a novel similarity-based conditioning method for sound synthesis, leveraging differentiable digital signal processing (DDSP). This approach combines the use of latent space for learning and controlling audio timbre with an intuitive guiding vector, normalized within the range [0,1], to encode categorical acoustic information. By utilizing pre-trained audio representation models, our method achieves expressive and fine-grained timbre control. To benchmark our approach, we introduce two sound effect datasets--Footstep-set and Impact-set--designed to evaluate both controllability and sound quality. Regression analysis demonstrates that the proposed similarity score effectively controls timbre variations and enables creative applications such as timbre interpolation between discrete classes. Our work provides a robust and versatile framework for sound effect synthesis, bridging the gap between traditional signal processing and modern machine learning techniques.
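该工作的核心思想是用归一化到[0,1]的相似度向量作为合成条件。下面的示意展示了如何用预训练音频嵌入计算目标声音与各参考类别之间的余弦相似度并归一化;其中embed_audio是占位的假设函数,实际可替换为任意预训练音频表示模型,此处仅作演示:

```python
import numpy as np

def embed_audio(waveform):
    """假设的音频嵌入函数:实际中可替换为任意预训练音频表示模型的输出。"""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2 ** 32))
    return rng.standard_normal(512)

def similarity_condition(target_wav, class_reference_wavs):
    """返回归一化到 [0,1] 的相似度向量,可作为 DDSP 合成器的引导向量。"""
    t = embed_audio(target_wav)
    sims = []
    for ref in class_reference_wavs:
        r = embed_audio(ref)
        cos = np.dot(t, r) / (np.linalg.norm(t) * np.linalg.norm(r) + 1e-8)
        sims.append(0.5 * (cos + 1.0))       # 把 [-1,1] 映射到 [0,1]
    return np.asarray(sims)

target = np.random.randn(16000).astype(np.float32)
refs = [np.random.randn(16000).astype(np.float32) for _ in range(3)]
guide_vec = similarity_condition(target, refs)   # 例如用于两个类别之间的音色插值
```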
标题: 基于元学习的无延迟子带自适应滤波器:利用复自注意力进行有源噪声控制
链接:https://arxiv.org/abs/2412.19471
备注:31 pages, 8 figures
摘要:有源噪声控制通常采用自适应滤波来产生次级噪声,其中最小均方(LMS)算法使用最为广泛。然而,传统的更新规则是线性的,在应对非线性环境和非平稳噪声时效果有限。为应对这一挑战,我们将有源噪声控制问题重新表述为元学习问题,并提出了一种基于元学习、结合深度神经网络的无延迟子带自适应滤波器。其核心思想是将神经网络用作自适应算法,使其能够适应不同的环境和噪声类型。神经网络在含噪观测下训练,这意味着它可以在没有真实标签的情况下学到优化的更新规则。我们设计了一种带可学习特征嵌入的单头注意力递归神经网络,用于高效更新自适应滤波器权重,从而能够精确计算次级源以衰减不需要的初级噪声。为了放宽更新自适应滤波器权重的时间约束,采用了无延迟子带架构,使系统可以随着下采样因子的增大而降低更新频率;同时,无延迟子带架构不会给有源噪声控制系统引入额外的时间延迟。我们还引入跳过更新策略以进一步降低更新频率,使资源有限的设备更有可能部署我们基于元学习的模型。广泛的多条件训练确保了对各类噪声和环境的泛化能力与鲁棒性。仿真结果表明,与传统方法相比,我们基于元学习的模型取得了更优的降噪性能。
摘要:Active noise control typically employs adaptive filtering to generate secondary noise, where the least mean square algorithm is the most widely used. However, traditional updating rules are linear and exhibit limited effectiveness in addressing nonlinear environments and nonstationary noise. To tackle this challenge, we reformulate the active noise control problem as a meta-learning problem and propose a meta-learning-based delayless subband adaptive filter with deep neural networks. The core idea is to utilize a neural network as an adaptive algorithm that can adapt to different environments and types of noise. The neural network will train under noisy observations, implying that it recognizes the optimized updating rule without true labels. A single-headed attention recurrent neural network is devised with learnable feature embedding to update the adaptive filter weight efficiently, enabling accurate computation of the secondary source to attenuate the unwanted primary noise. In order to relax the time constraint on updating the adaptive filter weights, the delayless subband architecture is employed, which will allow the system to be updated less frequently as the downsampling factor increases. In addition, the delayless subband architecture does not introduce additional time delays in active noise control systems. A skip updating strategy is introduced to decrease the updating frequency further so that machines with limited resources have more possibility to board our meta-learning-based model. Extensive multi-condition training ensures generalization and robustness against various types of noise and environments. Simulation results demonstrate that our meta-learning-based model achieves superior noise reduction performance compared to traditional methods.
标题: VoiceDiT:用于环境感知语音合成的双条件扩散Transformer
链接:https://arxiv.org/abs/2412.19259
备注:Accepted to ICASSP 2025
摘要:我们提出了VoiceDiT,一个多模态生成模型,用于根据文本和视觉提示生成环境感知的语音与音频。虽然语音与文本的对齐对可懂语音至关重要,但在嘈杂条件下实现这种对齐仍是该领域一个重大且尚未充分探索的挑战。为此,我们提出了一种名为VoiceDiT的新型音频生成流水线。该流水线包括三个关键组成部分:(1)构建用于预训练的大规模合成语音数据集和用于微调的精炼真实语音数据集;(2)Dual-DiT,一种旨在高效保留对齐语音信息、同时准确反映环境条件的模型;(3)基于扩散的图像到音频转换器,它使模型能够弥合音频与图像之间的差距,促进生成与多模态提示相符的环境声音。大量实验结果表明,VoiceDiT在真实世界数据集上优于以往模型,在音频质量和模态整合方面均有显著改进。
摘要:We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.
标题: 基于量化自监督学习特征进行语义预测的因果语音增强
链接:https://arxiv.org/abs/2412.19248
备注:Accepted for ICASSP 2025, 5 pages
摘要:实时语音增强(SE)对在线语音通信至关重要。因果SE模型只能使用过去的上下文,而预测未来信息(如音素的延续)可能有助于因果SE。音素信息通常通过对自监督学习(SSL)模型的潜在特征进行量化来表示。这项工作首次将具有因果性的SSL特征引入SE模型。因果SSL特征经过编码后,使用特征线性调制(FiLM)与频谱图特征相结合,以估计用于增强含噪输入语音的掩模。同时,我们对因果SSL特征进行矢量量化,将音素特性表示为语义token。模型不仅对SSL特征进行编码,还在多任务学习(MTL)中预测未来的语义token。在VoiceBank+DEMAND数据集上的实验结果表明,我们提出的方法(尤其是加入语义预测MTL时)PESQ达到2.88,我们由此证实语义预测在因果SE中发挥了重要作用。
摘要:Real-time speech enhancement (SE) is essential to online speech communication. Causal SE models use only the previous context while predicting future information, such as phoneme continuation, may help performing causal SE. The phonetic information is often represented by quantizing latent features of self-supervised learning (SSL) models. This work is the first to incorporate SSL features with causality into an SE model. The causal SSL features are encoded and combined with spectrogram features using feature-wise linear modulation to estimate a mask for enhancing the noisy input speech. Simultaneously, we quantize the causal SSL features using vector quantization to represent phonetic characteristics as semantic tokens. The model not only encodes SSL features but also predicts the future semantic tokens in multi-task learning (MTL). The experimental results using VoiceBank + DEMAND dataset show that our proposed method achieves 2.88 in PESQ, especially with semantic prediction MTL, in which we confirm that the semantic prediction played an important role in causal SE.
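摘要描述了两个关键部件:用特征线性调制(FiLM)把因果SSL特征与频谱特征结合,以及对SSL特征做矢量量化得到语义token。下面给出这两个部件的极简示意;维度与码本大小均为假设,并非论文的实际配置:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """由 SSL 特征生成 (gamma, beta),对频谱特征做逐特征仿射调制。"""
    def __init__(self, ssl_dim=768, spec_dim=257):
        super().__init__()
        self.to_gamma_beta = nn.Linear(ssl_dim, 2 * spec_dim)

    def forward(self, spec_feat, ssl_feat):          # [B, T, 257], [B, T, 768]
        gamma, beta = self.to_gamma_beta(ssl_feat).chunk(2, dim=-1)
        return gamma * spec_feat + beta

def vector_quantize(ssl_feat, codebook):
    """最近邻码本查找:把每帧 SSL 特征映射为离散语义 token。"""
    B, T, D = ssl_feat.shape
    flat = ssl_feat.reshape(B * T, D)
    dists = torch.cdist(flat, codebook)               # [B*T, K]
    tokens = dists.argmin(dim=-1).reshape(B, T)       # [B, T]
    quantized = codebook[tokens]                      # [B, T, D]
    return quantized, tokens

B, T = 2, 100
spec, ssl = torch.randn(B, T, 257), torch.randn(B, T, 768)
codebook = torch.randn(512, 768)                      # 假设 512 个码字
fused = FiLM()(spec, ssl)                             # 用于估计增强掩模的融合特征
_, semantic_tokens = vector_quantize(ssl, codebook)   # 作为 MTL 中预测的语义目标
```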
标题: 利用增强特征和说话人身份差异攻击语音匿名化系统
链接:https://arxiv.org/abs/2412.19068
备注:2 pages, submitted to ICASSP 2025 GC-7: The First VoicePrivacy Attacker Challenge (by invitation)
摘要:本研究聚焦于ICASSP 2025信号处理大挑战赛中的第一届语音隐私攻击者挑战赛(First VoicePrivacy Attacker Challenge),其目标是开发能够判断两段匿名化语音信号是否来自同一说话人的说话人验证系统。然而,原始语音与匿名化语音之间特征分布的差异使这一任务变得复杂。为应对这一挑战,我们提出了一个称为DA-SID的攻击系统,它结合数据增强(DA)增强的特征表示与说话人身份差异(SID)增强的分类器来提升验证性能。具体而言,利用数据增强策略(即数据融合和SpecAugment)来缩小特征分布差距,并采用概率线性判别分析(PLDA)进一步增强说话人身份差异。我们的系统显著优于基线,对各种语音匿名化系统均表现出出色的有效性和鲁棒性,最终在挑战赛中取得前5名的成绩。
摘要:This study focuses on the First VoicePrivacy Attacker Challenge within the ICASSP 2025 Signal Processing Grand Challenge, which aims to develop speaker verification systems capable of determining whether two anonymized speech signals are from the same speaker. However, differences between feature distributions of original and anonymized speech complicate this task. To address this challenge, we propose an attacker system that combines Data Augmentation enhanced feature representation and Speaker Identity Difference enhanced classifier to improve verification performance, termed DA-SID. Specifically, data augmentation strategies (i.e., data fusion and SpecAugment) are utilized to mitigate feature distribution gaps, while probabilistic linear discriminant analysis (PLDA) is employed to further enhance speaker identity difference. Our system significantly outperforms the baseline, demonstrating exceptional effectiveness and robustness against various voice anonymization systems, ultimately securing a top-5 ranking in the challenge.
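摘要中提到的数据增强策略之一是SpecAugment。下面给出一个在对数梅尔谱上进行时间/频率遮掩的极简实现示意;遮掩数量与宽度取常见经验值,属于假设,并非该系统的确切配置:

```python
import numpy as np

def spec_augment(mel, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20):
    """mel: [n_mels, n_frames] 的对数梅尔谱。
    随机遮掩若干频带与时间段,用于缓解特征分布差异并提升鲁棒性。"""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    for _ in range(num_freq_masks):
        w = np.random.randint(0, freq_width + 1)
        f0 = np.random.randint(0, max(1, n_mels - w))
        mel[f0:f0 + w, :] = mel.mean()
    for _ in range(num_time_masks):
        w = np.random.randint(0, time_width + 1)
        t0 = np.random.randint(0, max(1, n_frames - w))
        mel[:, t0:t0 + w] = mel.mean()
    return mel

mel = np.random.randn(80, 300)      # 80 维梅尔谱、300 帧的示例输入
augmented = spec_augment(mel)
```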
标题: 用于构音障碍和老年人语音识别的基础模型的结构化说话人-缺陷适应
链接:https://arxiv.org/abs/2412.18832
摘要:语音基础模型(SFM)在稀缺且多样的构音障碍和老年人语音上进行数据密集型微调,会导致数据偏差,并且对未见说话人的泛化能力较差。本文针对此类数据,为SSL预训练的SFM提出了新的结构化说话人-缺陷适应方法。在有监督的自适应微调阶段构建对说话人和语音缺陷不变的SFM,以减少对训练数据中说话人的过度偏差,并作为测试时无监督适应的更中立、更鲁棒的起点。归因于说话人身份与语音障碍严重程度(或衰老引起的神经认知衰退)的语音变异分别用独立的适配器建模,这些适配器可以组合起来,以建模任何已见或未见的说话人。在UASpeech构音障碍和DementiaBank Pitt老年人语音语料库上的实验表明,HuBERT和Wav2vec2-conformer模型的结构化说话人-缺陷适应始终优于下列基线SFM:a)不使用适配器;b)所有说话人共享的全局适配器;或c)仅对说话人或缺陷标签单独建模的单属性适配器;在两项任务上分别取得最高3.01%和1.50%绝对值(10.86%和6.94%相对值)的统计显著WER降低。在包含16名构音障碍说话人的UASpeech测试集上取得了已发表的最低WER 19.45%(极低可懂度子集为49.34%,未见词为33.17%)。
摘要:Data-intensive fine-tuning of speech foundation models (SFMs) to scarce and diverse dysarthric and elderly speech leads to data bias and poor generalization to unseen speakers. This paper proposes novel structured speaker-deficiency adaptation approaches for SSL pre-trained SFMs on such data. Speaker and speech deficiency invariant SFMs were constructed in their supervised adaptive fine-tuning stage to reduce undue bias to training data speakers, and serves as a more neutral and robust starting point for test time unsupervised adaptation. Speech variability attributed to speaker identity and speech impairment severity, or aging induced neurocognitive decline, are modelled using separate adapters that can be combined together to model any seen or unseen speaker. Experiments on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest structured speaker-deficiency adaptation of HuBERT and Wav2vec2-conformer models consistently outperforms baseline SFMs using either: a) no adapters; b) global adapters shared among all speakers; or c) single attribute adapters modelling speaker or deficiency labels alone by statistically significant WER reductions up to 3.01% and 1.50% absolute (10.86% and 6.94% relative) on the two tasks respectively. The lowest published WER of 19.45% (49.34% on very low intelligibility, 33.17% on unseen words) is obtained on the UASpeech test set of 16 dysarthric speakers.
标题: 埃塞俄比亚东正教Tewahedo教会圣歌中Yaredawi YeZema Silt的计算分析
链接:https://arxiv.org/abs/2412.18788
备注:6 pages
摘要:尽管具有重要的音乐学、文化和宗教意义,埃塞俄比亚东正教Tewahedo教会(EOTC)圣歌在音乐研究中的代表性相对不足。包括手稿、研究论文和口头传统在内的历史记录证实,圣雅列(Saint Yared)在6世纪创立了三种规范的EOTC诵经模式。本文尝试利用音乐信息检索(MIR)技术研究EOTC圣歌。在有关EOTC圣歌分析与理解的研究问题中,Yaredawi YeZema Silt,即遵循圣雅列标准的诵经模式,是最为重要的。因此,我们研究EOTC圣歌中的Yaredawi YeZema Silt分类任务,为此引入了一个新数据集,并展示了针对该任务的一系列分类实验。结果表明,将稳定音高轮廓的分布作为特征表示,配合一个简单的基于神经网络的分类器,即是一种有效的解决方案。我们还通过与先前关于EOTC圣歌的民族音乐学文献进行比较研究,进一步讨论了这些结果的音乐学含义与洞见。通过公开这一数据集,我们旨在促进对EOTC圣歌的后续探索与分析,并指出进一步研究的潜在方向,从而推动对这一独特精神和文化遗产的更深入理解与保护。
摘要:Despite its musicological, cultural, and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chant is relatively underrepresented in music research. Historical records, including manuscripts, research papers, and oral traditions, confirm Saint Yared's establishment of three canonical EOTC chanting modes during the 6th century. This paper attempts to investigate the EOTC chants using music information retrieval (MIR) techniques. Among the research questions regarding the analysis and understanding of EOTC chants, Yaredawi YeZema Silt, namely the mode of chanting adhering to Saint Yared's standards, is of primary importance. Therefore, we consider the task of Yaredawi YeZema Silt classification in EOTC chants by introducing a new dataset and showcasing a series of classification experiments for this task. Results show that using the distribution of stabilized pitch contours as the feature representation on a simple neural network-based classifier becomes an effective solution. The musicological implications and insights of such results are further discussed through a comparative study with the previous ethnomusicology literature on EOTC chants. By making this dataset publicly accessible, we aim to promote future exploration and analysis of EOTC chants and highlight potential directions for further research, thereby fostering a deeper understanding and preservation of this unique spiritual and cultural heritage.
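摘要的结论是"以稳定音高轮廓的分布作为特征,配合简单的神经网络分类器即可有效分类"。下面的示意将音高序列折叠到一个八度内、统计直方图后送入小型MLP;音高提取假设已在上游完成,类别数与分箱数均为示例:

```python
import numpy as np
import torch
import torch.nn as nn

def pitch_histogram(f0_hz, n_bins=120, ref_hz=55.0):
    """把以 Hz 表示的稳定音高轮廓折叠到一个八度内(以音分为单位),
    统计归一化直方图作为该段圣歌的特征表示。"""
    f0 = np.asarray([f for f in f0_hz if f > 0])         # 去掉清音/无音高帧
    cents = 1200.0 * np.log2(f0 / ref_hz)
    folded = np.mod(cents, 1200.0)                        # 折叠到一个八度
    hist, _ = np.histogram(folded, bins=n_bins, range=(0, 1200))
    return hist / max(hist.sum(), 1)

classifier = nn.Sequential(                               # 简单的 MLP 分类器
    nn.Linear(120, 64), nn.ReLU(), nn.Linear(64, 3)       # 假设 3 种诵经模式类别
)

f0_track = 220.0 + 10.0 * np.random.randn(500)            # 示例音高轨迹(Hz)
feat = torch.tensor(pitch_histogram(f0_track), dtype=torch.float32)
logits = classifier(feat.unsqueeze(0))
```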
标题: 构建一个可推广到言语障碍语音的单一ASR模型
链接:https://arxiv.org/abs/2412.19315
备注:Accepted at ICASSP 2025
摘要:本研究考察了将约1,000小时的言语障碍语音录音数据集纳入一个接近最先进水平的ASR基线系统微调过程的影响。与人们的预期相反,尽管这些数据不到ASR系统训练数据的1%,我们发现言语障碍语音的识别准确率有了相当大的提升。具体而言,在提示朗读语音上我们观察到33%的改进,在新收集的自发会话式言语障碍语音数据集上观察到26%的改进。重要的是,在标准语音识别基准上没有出现显著的性能下降。此外,我们观察到所提出的调优策略将基线系统与个性化模型之间的差距缩小了64%,这既凸显了显著进展,也表明仍有改进空间。鉴于这些发现带来的巨大收益,本实验表明,从公平性的角度看,将一小部分高质量的言语障碍语音数据纳入训练配方,是让语音技术更易于被言语障碍用户使用的一个简单步骤。
摘要:This study investigates the impact of integrating a dataset of disordered speech recordings ($\sim$1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. Contrary to what one might expect, despite the data being less than 1% of the training data of the ASR system, we find a considerable improvement in disordered speech recognition accuracy. Specifically, we observe a 33% improvement on prompted speech, and a 26% improvement on a newly gathered spontaneous, conversational dataset of disordered speech. Importantly, there is no significant performance decline on standard speech recognition benchmarks. Further, we observe that the proposed tuning strategy helps close the gap between the baseline system and personalized models by 64% highlighting the significant progress as well as the room for improvement. Given the substantial benefits of our findings, this experiment suggests that from a fairness perspective, incorporating a small fraction of high quality disordered speech data in a training recipe is an easy step that could be done to make speech technology more accessible for users with speech disabilities.
标题: 利用预训练模型进行图增强的双流特征融合,用于声学交通监测
链接:https://arxiv.org/abs/2412.19078
备注:Shitong Fan and Feiyang Xiao contributed equally. Accepted by the IEEE International Conference on Acoustics, Speech, and Signal Processing(ICASSP)2025
摘要:麦克风阵列技术广泛应用于声源定位和基于声学的智慧城市交通监测,但由于带标注的真实交通音频数据稀缺以及应用场景的复杂性和多样性,这些应用面临重大挑战。DCASE挑战赛的任务10聚焦于利用多通道音频信号对车辆(小汽车或商用车)进行计数,并识别其行驶方向(从左到右或反之)。本文提出了一种用于声学交通监测的图增强双流特征融合网络(GEDF-Net),它同时考虑车辆类型与方向以提升检测性能。我们提出的图增强双流特征融合策略由车辆类型特征提取(VTFE)分支、车辆方向特征提取(VDFE)分支以及一个帧级特征融合模块组成,后者将类型与方向特征相结合以增强性能。在VTFE分支中使用预训练模型(PANNs)来缓解数据稀缺并增强类型特征,随后利用图注意力机制挖掘时间关系并突出这些特征中重要的音频事件。方向与类型特征的帧级融合实现了细粒度的特征表示,从而带来更好的检测性能。实验证明了所提方法的有效性。GEDF-Net是我们在DCASE 2024挑战赛任务10中获得第一名的提交方案。
摘要:Microphone array techniques are widely used in sound source localization and smart city acoustic-based traffic monitoring, but these applications face significant challenges due to the scarcity of labeled real-world traffic audio data and the complexity and diversity of application scenarios. The DCASE Challenge's Task 10 focuses on using multi-channel audio signals to count vehicles (cars or commercial vehicles) and identify their directions (left-to-right or vice versa). In this paper, we propose a graph-enhanced dual-stream feature fusion network (GEDF-Net) for acoustic traffic monitoring, which simultaneously considers vehicle type and direction to improve detection. We propose a graph-enhanced dual-stream feature fusion strategy which consists of a vehicle type feature extraction (VTFE) branch, a vehicle direction feature extraction (VDFE) branch, and a frame-level feature fusion module to combine the type and direction feature for enhanced performance. A pre-trained model (PANNs) is used in the VTFE branch to mitigate data scarcity and enhance the type features, followed by a graph attention mechanism to exploit temporal relationships and highlight important audio events within these features. The frame-level fusion of direction and type features enables fine-grained feature representation, resulting in better detection performance. Experiments demonstrate the effectiveness of our proposed method. GEDF-Net is our submission that achieved 1st place in the DCASE 2024 Challenge Task 10.
标题: 用于抑郁症筛查的稳健语音和自然语言处理模型
链接:https://arxiv.org/abs/2412.19072
摘要:抑郁症是一个全球性的健康问题,迫切需要扩大患者筛查。语音技术在远程筛查方面具有优势,但必须对不同患者都能稳健地工作。我们描述了为此目的开发的两个深度学习模型:一个基于声学特征,另一个基于自然语言处理,二者均采用迁移学习。所用数据来自一个带抑郁标注的语料库,其中11,000名不同用户通过会话语音与一个人机应用交互。二分类抑郁检测的结果显示,在与训练集无说话人重叠的未见数据上,两个模型的AUC均达到或超过0.80。我们进一步按测试子集特征分析了性能,发现这些模型对说话人和会话变量总体上是鲁棒的。我们的结论是,基于这些方法的模型有望用于通用的自动抑郁症筛查。
摘要:Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We have described two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. Data from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech is used. Results on binary depression classification have shown that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.
标题: 通过双焦点偏好优化增强视听语音识别
链接:https://arxiv.org/abs/2412.19005
备注:Accepted by AAAI 2025
摘要:视听自动语音识别(AV-ASR)旨在利用视觉信号提高语音识别准确率。由于嘈杂的声学环境、自发语音以及视觉信息使用上的不确定性,这一任务在各领域不受约束的真实场景中尤其具有挑战性。以往大多数工作在视听数据集上微调纯音频ASR模型,并针对传统ASR目标进行优化;然而,它们往往忽略视觉特征以及无约束视频场景中的常见错误。本文提出使用偏好优化策略来提升真实视频上的语音识别准确率。首先,我们从两个焦点出发,通过模拟AV-ASR中出现的常见错误来构造偏好数据:操纵音频或视觉输入,以及改写输出转录。其次,我们提出BPO-AVASR,一种双焦点偏好优化方法,通过同时利用输入侧和输出侧偏好来改进AV-ASR模型。大量实验表明,我们的方法在多个领域显著提升了语音识别准确率,在真实世界视频语音识别上优于以往最先进的模型。
摘要:Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.
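BPO-AVASR基于"更优转录 vs. 含常见错误的转录"这类偏好对进行优化。下面给出一个DPO风格偏好损失的示意,仅说明此类目标的一般形式,并不等同于论文定义的双焦点损失;序列对数似然的计算被抽象为输入:

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO 风格的偏好损失:
    鼓励策略模型相对参考模型更偏好"被选中"的转录而非"被拒绝"的转录。
    四个输入均为整条输出序列的对数似然,形状 [B]。"""
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    logits = beta * (policy_margin - reference_margin)
    return -F.logsigmoid(logits).mean()

# 玩具示例:假设已用 AV-ASR 模型与冻结的参考模型算出序列对数似然
logp_c, logp_r = torch.tensor([-12.3]), torch.tensor([-15.7])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.9])
loss = preference_loss(logp_c, logp_r, ref_c, ref_r)
```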
标题: 防止有源噪声控制中的输出饱和:一种带输出约束的卡尔曼滤波方法
链接:https://arxiv.org/abs/2412.18887
摘要:与最小均方(LMS)方法相比,基于卡尔曼滤波器(KF)的有源噪声控制(ANC)系统具有更好的跟踪能力和更快的收敛速度,尤其是在动态噪声消除场景中。然而,在噪声水平极高的环境中,由于硬件限制,控制信号的功率可能超过系统的额定输出功率,导致输出饱和并进而产生非线性。为缓解这一问题,本文提出了一种带输出约束的改进KF。在该方法中,被当作测量量的干扰会按一个约束因子重新缩放,该因子由系统的额定功率、次级路径增益和干扰功率共同决定。由此,系统的输出功率(即控制信号)被间接约束在系统的最大输出之内,从而保证稳定性。仿真结果表明,所提算法不仅能快速抑制动态噪声,还能有效防止输出饱和引起的非线性,具有实际意义。
摘要:The Kalman filter (KF)-based active noise control (ANC) system demonstrates superior tracking and faster convergence compared to the least mean square (LMS) method, particularly in dynamic noise cancellation scenarios. However, in environments with extremely high noise levels, the power of the control signal can exceed the system's rated output power due to hardware limitations, leading to output saturation and subsequent non-linearity. To mitigate this issue, a modified KF with an output constraint is proposed. In this approach, the disturbance treated as an measurement is re-scaled by a constraint factor, which is determined by the system's rated power, the secondary path gain, and the disturbance power. As a result, the output power of the system, i.e. the control signal, is indirectly constrained within the maximum output of the system, ensuring stability. Simulation results indicate that the proposed algorithm not only achieves rapid suppression of dynamic noise but also effectively prevents non-linearity due to output saturation, highlighting its practical significance.
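摘要只说明约束因子由额定功率、次级路径增益和干扰功率决定,并未给出具体公式。下面的示意按一种直观假设实现:当估计所需的控制功率超过额定功率时,按比例缩小作为"测量量"的干扰,属于说明性草图而非论文算法:

```python
import numpy as np

def constraint_factor(disturbance, sec_path_gain, rated_power):
    """假设形式:所需控制信号功率约为 干扰功率 / 次级路径增益^2,
    若超过额定功率,则按 sqrt(额定/所需) 缩放干扰,使输出功率间接受限。"""
    dist_power = np.mean(disturbance ** 2)
    required_power = dist_power / (sec_path_gain ** 2 + 1e-12)
    if required_power <= rated_power:
        return 1.0
    return np.sqrt(rated_power / required_power)

def rescale_disturbance(disturbance, sec_path_gain, rated_power):
    c = constraint_factor(disturbance, sec_path_gain, rated_power)
    return c * disturbance       # 再作为测量量送入卡尔曼滤波器

d = 3.0 * np.random.randn(1024)                    # 高强度干扰示例
d_constrained = rescale_disturbance(d, sec_path_gain=0.8, rated_power=1.0)
```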
标题: Zema数据集:以时课(Horologium)圣歌为重点的Yaredawi Zema全面研究
链接:https://arxiv.org/abs/2412.18784
备注:6 pages
摘要:计算音乐研究在推动全球各种音乐风格的音乐制作、传播和理解方面发挥着至关重要的作用。尽管具有巨大的文化和宗教意义,埃塞俄比亚东正教Tewahedo教会(EOTC)圣歌(又称Yaredawi Zema)在计算音乐研究中的代表性相对不足。本文为该领域贡献了一个专门用于分析EOTC圣歌的新数据集。本文全面介绍了这个包含369个实例、时长10小时的数据集及其创建与整理过程,包括严格的质量保证措施。我们的数据集为音频提供了详细的词级时间边界和诵读音调标注,以及相应的诵经模式标签。此外,我们还通过相应的标注,识别了手稿中与多种诵经符号相关联的诵经选项。我们公开此数据集的目的,是鼓励对EOTC圣歌的更多研究与学习,包括歌词转写、歌词-音频对齐以及音乐生成等任务。此类研究将推进对这一独特礼仪音乐(埃塞俄比亚人民的无价文化遗产)的认识与保护工作。
摘要:Computational music research plays a critical role in advancing music production, distribution, and understanding across various musical styles worldwide. Despite the immense cultural and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chants are relatively underrepresented in computational music research. This paper contributes to this field by introducing a new dataset specifically tailored for analyzing EOTC chants, also known as Yaredawi Zema. This work provides a comprehensive overview of a 10-hour dataset, 369 instances, creation, and curation process, including rigorous quality assurance measures. Our dataset has a detailed word-level temporal boundary and reading tone annotation along with the corresponding chanting mode label of audios. Moreover, we have also identified the chanting options associated with multiple chanting notations in the manuscript by annotating them accordingly. Our goal in making this dataset available to the public is to encourage more research and study of EOTC chants, including lyrics transcription, lyric-to-audio alignment, and music generation tasks. Such research work will advance knowledge and efforts to preserve this distinctive liturgical music, a priceless cultural artifact for the Ethiopian people.
标题: 研究用于自动抑郁检测的声学-文本情感不一致信息
链接:https://arxiv.org/abs/2412.18614
摘要:已有研究表明,来自单一声学情感标签的情感特征可以提高抑郁症诊断的准确性。此外,根据情绪情境不敏感(Emotion Context-Insensitivity)理论以及我们的预实验,抑郁症患者可能会以出人意料的平静方式表达负面情绪内容,在自然对话中表现出高度的情绪表达不一致。迄今为止,很少有研究识别并利用这种情绪表达不一致性来进行抑郁检测。本文提出了一种多模态交叉注意力方法来捕获声学-文本情感不一致(ATEI)信息。这通过分析声学域和文本域中情绪表达的复杂局部与长期依赖关系,以及两个域情感内容之间的不匹配来实现。随后,我们提出一种基于Transformer的模型,将ATEI信息与多种融合策略相结合以检测抑郁症。此外,在融合过程中采用缩放技术来调整ATEI特征的程度,从而提高模型区分不同严重程度抑郁患者的能力。据我们所知,这是首个将情绪表达不一致信息引入抑郁检测的工作。在一个心理咨询对话数据集上的实验结果表明了我们方法的有效性。
摘要:Previous studies have demonstrated that emotional features from a single acoustic sentiment label can enhance depression diagnosis accuracy. Additionally, according to the Emotion Context-Insensitivity theory and our pilot study, individuals with depression might convey negative emotional content in an unexpectedly calm manner, showing a high degree of inconsistency in emotional expressions during natural conversations. So far, few studies have recognized and leveraged the emotional expression inconsistency for depression detection. In this paper, a multimodal cross-attention method is presented to capture the Acoustic-Textual Emotional Inconsistency (ATEI) information. This is achieved by analyzing the intricate local and long-term dependencies of emotional expressions across acoustic and textual domains, as well as the mismatch between the emotional content within both domains. A Transformer-based model is then proposed to integrate this ATEI information with various fusion strategies for detecting depression. Furthermore, a scaling technique is employed to adjust the ATEI feature degree during the fusion process, thereby enhancing the model's ability to discern patients with depression across varying levels of severity. To best of our knowledge, this work is the first to incorporate emotional expression inconsistency information into depression detection. Experimental results on a counseling conversational dataset illustrate the effectiveness of our method.
标题: 通过提示调整和分词优化提高Whisper在印度语言上的准确性和速度
链接:https://arxiv.org/abs/2412.19785
备注:Accepted at ICASSP 2025, 5 pages, 1 figure, 5 tables
摘要:自动语音识别最近凭借Whisper等大型基础模型取得了重大进展。然而,这些模型在印度语言等低资源语言上往往表现不佳。本文探讨了两种新方法,以提升Whisper在印度语言上的多语种语音识别性能。首先,我们提出结合语系信息的提示调整(prompt-tuning),提高了Whisper在语言学上相近的语言中的准确性。其次,我们引入了一种新的分词器,它减少了生成的标记数量,从而加快了Whisper的推理速度。大量实验表明,该分词器显著缩短了推理时间,而提示调整则在Small、Medium和Large等各种规模的Whisper模型上提高了准确率。两者结合,在最优WER与推理速度之间取得了平衡。
摘要:Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques achieve a balance between optimal WER and inference speed.
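下面给出一个结合语系信息进行软提示调整(soft prompt-tuning)的概念性草图:为每个语系学习一组可训练的提示向量并拼接到解码器的输入嵌入之前,主干ASR模型保持冻结。语系分组、提示长度与维度均为假设,并非论文的具体实现:

```python
import torch
import torch.nn as nn

# Assumed grouping of a few Indian languages into families for illustration.
LANG_TO_FAMILY = {"hi": 0, "mr": 0, "bn": 0,   # Indo-Aryan
                  "ta": 1, "te": 1, "kn": 1}   # Dravidian

class FamilyPrompt(nn.Module):
    """One learnable prompt block per language family, prepended to the
    decoder's token embeddings; only these parameters would be trained."""
    def __init__(self, n_families=2, prompt_len=8, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_families, prompt_len, d_model) * 0.02)

    def forward(self, token_embeds, lang):
        # token_embeds: (B, T, D); prepend the family's prompt block
        fam = LANG_TO_FAMILY[lang]
        p = self.prompt[fam].unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([p, token_embeds], dim=1)

prompts = FamilyPrompt()
out = prompts(torch.randn(4, 20, 768), lang="ta")
print(out.shape)  # torch.Size([4, 28, 768]) -- prompt tokens + original tokens
```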
标题: CoheDancers:通过音乐驱动的连贯性分解增强交互式群舞生成
链接:https://arxiv.org/abs/2412.19123
摘要:舞蹈生成至关重要且具有挑战性,特别是在舞蹈表演和虚拟游戏等领域。目前的文献中,大多数方法集中在单人音乐到舞蹈(Solo Music2Dance)上。虽然已有面向群体音乐到舞蹈(Group Music2Dance)的尝试,但这些工作往往缺乏连贯性,导致舞蹈表演在美学上欠佳。为此,我们提出CoheDancers,一个用于音乐驱动的交互式群舞生成的新框架。CoheDancers旨在通过将连贯性分解为三个关键方面来增强群舞生成:同步性、自然性和流畅性。相应地,我们开发了基于循环一致性的舞蹈同步策略以促进音乐与舞蹈的对应,基于自回归的曝光偏差校正策略以增强生成舞蹈的流畅性,以及对抗训练策略以增强群舞输出的自然性。总的来说,这些策略使CoheDancers能够生成高度连贯、质量优越的群舞。此外,为了给Group Music2Dance建立更好的基准,我们构建了迄今为止最多样化、最全面的开源数据集I-Dancers,其中包含丰富的舞者交互,并设计了全面的评估指标。在I-Dancers和其他现有数据集上的实验评估证实,CoheDancers实现了前所未有的最先进性能。代码将公开发布。
摘要:Dance generation is crucial and challenging, particularly in domains like dance performance and virtual gaming. In the current body of literature, most methodologies focus on Solo Music2Dance. While there are efforts directed towards Group Music2Dance, these often suffer from a lack of coherence, resulting in aesthetically poor dance performances. Thus, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. CoheDancers aims to enhance group dance generation coherence by decomposing it into three key aspects: synchronization, naturalness, and fluidity. Correspondingly, we develop a Cycle Consistency based Dance Synchronization strategy to foster music-dance correspondences, an Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity of the generated dances, and an Adversarial Training Strategy to augment the naturalness of the group dance output. Collectively, these strategies enable CoheDancers to produce highly coherent group dances with superior quality. Furthermore, to establish better benchmarks for Group Music2Dance, we construct the most diverse and comprehensive open-source dataset to date, I-Dancers, featuring rich dancer interactions, and create comprehensive evaluation metrics. Experimental evaluations on I-Dancers and other extant datasets substantiate that CoheDancers achieves unprecedented state-of-the-art performance. Code will be released.
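下面是音乐-舞蹈循环一致性目标的一个极简草图(编码器/解码器均为示意用的玩具网络,并非论文的实际模型),说明"音乐→舞蹈→音乐"的重建损失如何用于约束同步性:

```python
import torch
import torch.nn as nn

class Music2Dance(nn.Module):
    """Toy music-to-motion mapper (stand-in for the paper's generator)."""
    def __init__(self, d_music=64, d_dance=72, d_hid=128):
        super().__init__()
        self.net = nn.GRU(d_music, d_hid, batch_first=True)
        self.head = nn.Linear(d_hid, d_dance)
    def forward(self, music):              # (B, T, d_music) -> (B, T, d_dance)
        h, _ = self.net(music)
        return self.head(h)

class Dance2Music(nn.Module):
    """Toy inverse mapper used only to close the cycle."""
    def __init__(self, d_dance=72, d_music=64, d_hid=128):
        super().__init__()
        self.net = nn.GRU(d_dance, d_hid, batch_first=True)
        self.head = nn.Linear(d_hid, d_music)
    def forward(self, dance):
        h, _ = self.net(dance)
        return self.head(h)

m2d, d2m = Music2Dance(), Dance2Music()
music = torch.randn(2, 100, 64)            # (batch, frames, music features)
dance_hat = m2d(music)
music_cycled = d2m(dance_hat)
cycle_loss = nn.functional.l1_loss(music_cycled, music)  # music -> dance -> music
print(cycle_loss.item())
```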
标题: BSDB-Net:用于单通道语音增强的具有选择性状态空间机制的频带分割双分支网络
链接:https://arxiv.org/abs/2412.19099
备注:Accepted by AAAI 2025
摘要:虽然基于复数谱的语音增强(SE)方法已经取得了显著的性能,但幅度和相位的耦合会导致补偿效应,即牺牲幅度信息来补偿对SE有害的相位。此外,为了进一步提高SE的性能,许多模块被堆叠到SE模型上,导致模型复杂度增加,限制了SE的应用。为了解决这些问题,我们提出了一种基于Mamba的压缩频率双路径网络。首先,通过并行双分支提取幅度和相位信息:该方法利用结构化复数谱隐式地捕获相位信息,并通过解耦幅度和相位来解决补偿效应;网络中还加入了交互模块,以抑制不必要的成分并从另一分支恢复丢失的分量。其次,为了降低网络复杂度,网络引入了频带分割策略来压缩频率维度。为了在保持良好性能的同时进一步降低复杂度,我们设计了一个基于Mamba的模块,在线性复杂度下对时间和频率维度进行建模。最后,与基线相比,我们的模型在保持卓越性能的同时,将计算复杂度平均降低了8.3倍;与基于Transformer的模型相比,复杂度降低了25倍。
摘要:Although the complex spectrum-based speech enhancement (SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that limits the application of SE. To address these problems, we propose a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and solves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we designed a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models.
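下面给出频带分割(band-split)压缩频率维度的示意性实现;子带宽度与嵌入维度为假设值,并非论文的具体配置,Mamba模块本身不在此示例范围内:

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Split the frequency axis into sub-bands and project each band to a
    fixed-size embedding, compressing the frequency dimension."""
    def __init__(self, band_widths=(16, 16, 32, 64, 129), d_emb=32):
        super().__init__()
        self.band_widths = band_widths    # widths sum to 257 frequency bins
        self.proj = nn.ModuleList([nn.Linear(w, d_emb) for w in band_widths])

    def forward(self, spec):
        # spec: (B, T, F) spectrogram features with F frequency bins
        bands, start = [], 0
        for w, proj in zip(self.band_widths, self.proj):
            bands.append(proj(spec[..., start:start + w]))  # (B, T, d_emb)
            start += w
        return torch.stack(bands, dim=2)                    # (B, T, n_bands, d_emb)

bs = BandSplit()
x = torch.randn(2, 200, 257)     # e.g. 512-point STFT -> 257 frequency bins
print(bs(x).shape)               # torch.Size([2, 200, 5, 32])
```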
标题: 基于多语言STEN-TTS和BERT语种识别的印度尼西亚语-英语语码转换语音合成器
链接:https://arxiv.org/abs/2412.19043
备注:Accepted at O-COCOSDA 2024
摘要:多语言文本到语音系统可将文本转换为多种语言的语音。在许多情况下,文本句子可能包含不同语言的片段,这种现象称为语码转换。这在印度尼西亚尤其常见,特别是在印尼语和英语之间。尽管意义重大,但尚未有研究开发出能够处理这两种语言之间语码转换的多语言TTS系统。本研究在STEN-TTS中解决印尼语-英语语码转换问题。关键修改包括在文本到音素转换中加入语言识别组件,使用微调的BERT进行逐词语种识别,并从基础模型中移除语言嵌入。实验结果表明,与印尼语和英语的基线STEN-TTS模型相比,语码转换模型实现了更好的自然度和语音可懂度。
摘要:Multilingual text-to-speech systems convert text into speech across multiple languages. In many cases, text sentences may contain segments in different languages, a phenomenon known as code-switching. This is particularly common in Indonesia, especially between Indonesian and English. Despite its significance, no research has yet developed a multilingual TTS system capable of handling code-switching between these two languages. This study addresses Indonesian-English code-switching in STEN-TTS. Key modifications include adding a language identification component to the text-to-phoneme conversion using finetuned BERT for per-word language identification, as well as removing language embedding from the base model. Experimental results demonstrate that the code-switching model achieves superior naturalness and improved speech intelligibility compared to the Indonesian and English baseline STEN-TTS models.
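下面是将逐词语种识别建模为token分类的一个草图;此处以bert-base-multilingual-cased作为示意用的主干,分类头为随机初始化,需要先在带标注的语码转换数据上微调,其预测才有实际意义:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["id", "en"]  # Indonesian vs. English
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))  # head is untrained here

words = ["Saya", "suka", "machine", "learning", "banget"]
enc = tok(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]            # (num_subwords, num_labels)

# Take the first sub-word of each word as that word's language prediction
word_ids = enc.word_ids()
preds, seen = [], set()
for i, wid in enumerate(word_ids):
    if wid is not None and wid not in seen:
        seen.add(wid)
        preds.append(LABELS[logits[i].argmax().item()])
print(list(zip(words, preds)))
```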
标题: Leave-One-EquiVariant:减轻对比音乐表示中与不变性相关的信息损失
链接:https://arxiv.org/abs/2412.18955
摘要:对比学习已被证明在自监督音乐表征学习中十分有效,特别是对于音乐信息检索(MIR)任务。然而,当不同的下游任务需要对某些音乐属性保持敏感时,依赖增强链生成对比视图以及由此学到的不变性就会带来挑战。为了解决这个问题,我们提出了Leave One EquiVariant(LOEV)框架。与以往工作相比,该框架通过选择性地保留特定增强的信息,引入了一种灵活的、任务自适应的方法,使模型能够保持与任务相关的等变性。我们证明了LOEV缓解了与所学不变性相关的信息损失,在不牺牲通用表示质量的前提下,提高了与增强相关的任务和检索的性能。此外,我们提出了LOEV的一个变体LOEV++,它以自监督方式按设计构建解耦的潜在空间,并支持基于增强相关属性的定向检索。
摘要:Contrastive learning has proven effective in self-supervised musical representation learning, particularly for Music Information Retrieval (MIR) tasks. However, reliance on augmentation chains for contrastive view generation and the resulting learnt invariances pose challenges when different downstream tasks require sensitivity to certain musical attributes. To address this, we propose the Leave One EquiVariant (LOEV) framework, which introduces a flexible, task-adaptive approach compared to previous work by selectively preserving information about specific augmentations, allowing the model to maintain task-relevant equivariances. We demonstrate that LOEV alleviates information loss related to learned invariances, improving performance on augmentation related tasks and retrieval without sacrificing general representation quality. Furthermore, we introduce a variant of LOEV, LOEV++, which builds a disentangled latent space by design in a self-supervised manner, and enables targeted retrieval based on augmentation related attributes.
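下面是"留一增强"视图构造的概念性草图(增强操作为示意用的张量变换,并非论文的具体增强链或训练流程):被留出的增强在两个视图中共享同一参数,因此对比损失不会迫使表示对它不变,其参数还可进一步作为等变信息的监督目标:

```python
import random
import torch

# Toy tensor ops standing in for audio augmentations.
AUGS = {
    "gain":  lambda x, p: x * p,                    # p: amplitude scale
    "shift": lambda x, p: torch.roll(x, int(p)),    # p: time shift in samples
    "noise": lambda x, p: x + p * torch.randn_like(x),
}

def sample_params():
    return {"gain": random.uniform(0.5, 1.5),
            "shift": random.randint(-100, 100),
            "noise": random.uniform(0.0, 0.05)}

def loev_views(x, leave_out="gain"):
    p1, p2 = sample_params(), sample_params()
    # Shared parameter for the held-out augmentation: the contrastive loss
    # then does not enforce invariance to it, preserving that information.
    p2[leave_out] = p1[leave_out]
    v1, v2 = x.clone(), x.clone()
    for name, fn in AUGS.items():
        v1, v2 = fn(v1, p1[name]), fn(v2, p2[name])
    return v1, v2, p1[leave_out]          # parameter kept as an equivariance target

x = torch.randn(16000)                    # 1 s of audio at 16 kHz
v1, v2, gain = loev_views(x, leave_out="gain")
print(v1.shape, v2.shape, round(gain, 3))
```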
标题: 稳健的目标说话人到达方向估计
链接:https://arxiv.org/abs/2412.18913
摘要:在多说话人环境中,目标说话人的波达方向(DOA)对于提高语音清晰度和提取目标说话人语音至关重要。然而,传统的DOA估计方法在存在噪声、混响,特别是存在竞争说话人时往往会遇到困难。为了应对这些挑战,我们提出了RTS-DOA,一个鲁棒的实时DOA估计系统。该系统创新性地使用目标说话人的注册语音作为参考,并利用来自麦克风阵列的全带和子带频谱信息来估计目标说话人语音的DOA。具体而言,该系统包括用于初步改善语音质量的语音增强模块、用于学习空间信息的空间模块,以及用于提取声纹特征的说话人模块。在LibriSpeech数据集上的实验结果表明,我们的RTS-DOA系统能够有效处理多说话人场景,并建立了新的最优基准。
摘要:In multi-speaker environments, the direction of arrival (DOA) of a target speaker is key for improving speech clarity and extracting the target speaker's voice. However, traditional DOA estimation methods often struggle in the presence of noise, reverberation, and particularly when competing speakers are present. To address these challenges, we propose RTS-DOA, a robust real-time DOA estimation system. This system innovatively uses the registered speech of the target speaker as a reference and leverages full-band and sub-band spectral information from a microphone array to estimate the DOA of the target speaker's voice. Specifically, the system comprises a speech enhancement module for initially improving speech quality, a spatial module for learning spatial information, and a speaker module for extracting voiceprint features. Experimental results on the LibriSpeech dataset demonstrate that our RTS-DOA system effectively tackles multi-speaker scenarios and establishes new optimal benchmarks.
标题: 利用新型方法和MultiNAM数据集推进NAM到语音的转换
链接:https://arxiv.org/abs/2412.18839
备注:Accepted at IEEE ICASSP 2025
摘要:目前的非可听杂音(NAM)到语音转换技术依赖语音克隆,从成对的耳语中模拟真实参考语音。然而,模拟出的语音往往缺乏可懂度,并且难以在不同说话人之间良好泛化。为了解决这个问题,我们专注于从成对的耳语和文本中学习音素级对齐,并采用文本到语音(TTS)系统来模拟真实参考语音。为了减少对耳语的依赖,我们直接从NAM中学习音素对齐,尽管其质量受限于可用的训练数据。为了进一步减轻对NAM/耳语数据的依赖,我们建议引入唇部模态来推断语音,并提出一种新的基于扩散的方法,利用唇到语音技术的最新进展。此外,我们发布了MultiNAM数据集,其中包含来自两位说话人、总时长超过7.96小时的配对NAM、耳语、视频和文本数据,并在该数据集上对所有方法进行了基准测试。语音样本和数据集见 https://diff-nam.github.io/DiffNAM/。
摘要:Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/.
标题: MRI2Speech:基于实时MRI记录的发音器官运动的语音合成
链接:https://arxiv.org/abs/2412.18836
备注:Accepted at IEEE ICASSP 2025
摘要:以往基于实时MRI(rtMRI)的语音合成模型严重依赖含噪的真实参考语音。直接在真实梅尔频谱图上施加损失会使语音内容与MRI噪声纠缠在一起,导致可懂度差。我们提出一种新方法,采用多模态自监督AV-HuBERT模型从rtMRI预测文本,并引入一种新的基于流的时长预测器进行特定说话人的对齐。随后,语音解码器利用预测的文本和时长,以任意新音色合成对齐的语音。我们在两个数据集上进行了充分的实验,证明了该方法对未见说话人的泛化能力。我们通过遮蔽rtMRI视频的部分区域来评估框架性能,以考察不同发音器官对文本预测的影响。我们的方法在USC-TIMIT MRI语料库上实现了15.18%的词错误率(WER),相比当前最先进技术有大幅改进。语音样本见 https://mri2speech.github.io/MRI2Speech/。
摘要:Previous real-time MRI (rtMRI)-based speech synthesis models depend heavily on noisy ground-truth speech. Applying loss directly over ground truth mel-spectrograms entangles speech content with MRI noise, resulting in poor intelligibility. We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI and incorporates a new flow-based duration predictor for speaker-specific alignment. The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice. We conduct thorough experiments on two datasets and demonstrate our method's generalization ability to unseen speakers. We assess our framework's performance by masking parts of the rtMRI video to evaluate the impact of different articulators on text prediction. Our method achieves a 15.18% Word Error Rate (WER) on the USC-TIMIT MRI corpus, marking a huge improvement over the current state-of-the-art. Speech samples are available at https://mri2speech.github.io/MRI2Speech/.
标题: 通过多尺度多模态上下文交互实现富有表现力的视频配音
链接:https://arxiv.org/abs/2412.18748
备注:Accepted by ICASSP 2025
摘要:自动视频配音(AVD)根据脚本生成与唇部运动和面部情绪一致的语音。近年来的研究主要集中在对多模态上下文进行建模以增强韵律表现力,但忽略了两个关键问题:1)上下文中多尺度的韵律表达属性会影响当前句子的韵律;2)上下文中的韵律线索与当前句子相互作用,影响最终的韵律表现力。为应对这些挑战,我们提出了M2CI-Dubber,一种面向AVD的多尺度多模态上下文交互方案。该方案包括两个共享的M2CI编码器,用于对多尺度多模态上下文进行建模,并促进其与当前句子的深度交互。通过提取上下文中每种模态的全局和局部特征、利用基于注意力的聚合与交互机制,并采用基于交互的图注意力网络进行融合,所提方法增强了当前句子合成语音的韵律表现力。在Chem数据集上的实验表明,我们的模型在配音表现力方面优于各基线。代码和演示见 https://github.com/AI-S2-Lab/M2CI-Dubber。
摘要:Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.
标题: 用于对话语音合成的模态内和模态间上下文交互建模
链接:https://arxiv.org/abs/2412.18733
备注:Accepted by ICASSP 2025
摘要:对话语音合成(CSS)旨在有效利用多模态对话历史(MDH),为目标话语生成具有恰当对话韵律的语音。CSS的关键挑战是对MDH与目标话语之间的交互进行建模。值得注意的是,MDH中的文本模态和语音模态各有其独特的影响,二者相辅相成,对目标话语产生综合影响。以往的工作没有显式地建模这种模态内和模态间的交互。为解决这一问题,我们提出了一种基于模态内与模态间上下文交互机制的新型CSS系统,称为III-CSS。具体而言,在训练阶段,我们将MDH与目标话语中的文本和语音模态组合,得到四种模态组合,包括历史文本-下一文本、历史语音-下一语音、历史文本-下一语音和历史语音-下一文本。然后,我们设计了两个基于对比学习的模态内交互模块和两个模态间交互模块,以深入学习模态内和模态间的上下文交互。在推理阶段,我们利用MDH并采用训练好的交互模块,从目标话语的文本内容充分推断其语音韵律。在DailyTalk数据集上的主观和客观实验表明,III-CSS在韵律表现力方面优于先进的基线。代码和语音示例见 https://github.com/AI-S2-Lab/I3CSS。
摘要:Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that text and speech modalities in MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take MDH and adopt trained interaction modules to fully infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.
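下面给出上述四种模态组合之一(例如历史语音-下一文本)可采用的对比学习(InfoNCE)损失的极简草图;投影网络、批构造与具体损失形式均为假设,仅示意"配对的对话历史与目标话语互为正样本"的思路:

```python
import torch
import torch.nn.functional as F

def info_nce(hist_emb, next_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (history, next-utterance) pairs lie on the
    diagonal of the similarity matrix; all other pairs act as negatives."""
    hist = F.normalize(hist_emb, dim=-1)   # (B, D) e.g. historical speech embeddings
    nxt = F.normalize(next_emb, dim=-1)    # (B, D) e.g. next-utterance text embeddings
    logits = hist @ nxt.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(hist.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```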
标题: Simi-SFX:一种基于相似度的可控音效合成方法
链接:https://arxiv.org/abs/2412.18710
摘要:生成具有可控变化的音效是一项具有挑战性的任务,传统上通过复杂的物理模型来解决,这需要对信号处理参数和算法有深入了解。在生成模型和大型语言模型的时代,文本已成为控制声音合成的一种通用的、人类可解释的接口。然而,语言标记的离散性和定性本质使其难以捕捉不同声音之间细微的音色变化。在这项研究中,我们提出了一种新的基于相似度的条件化声音合成方法,利用可微数字信号处理(DDSP)。该方法将用于学习和控制音频音色的潜在空间,与一个归一化到[0,1]范围、用于编码类别声学信息的直观引导向量相结合。通过利用预训练的音频表示模型,我们的方法实现了富有表现力的细粒度音色控制。为了对该方法进行基准测试,我们引入了两个音效数据集:Footstep-set和Impact-set,用于评估可控性和声音质量。回归分析表明,所提出的相似度得分能有效控制音色变化,并支持离散类别之间的音色插值等创造性应用。我们的工作为音效合成提供了一个稳健而通用的框架,弥合了传统信号处理与现代机器学习技术之间的差距。
摘要:Generating sound effects with controllable variations is a challenging task, traditionally addressed using sophisticated physical models that require in-depth knowledge of signal processing parameters and algorithms. In the era of generative and large language models, text has emerged as a common, human-interpretable interface for controlling sound synthesis. However, the discrete and qualitative nature of language tokens makes it difficult to capture subtle timbral variations across different sounds. In this research, we propose a novel similarity-based conditioning method for sound synthesis, leveraging differentiable digital signal processing (DDSP). This approach combines the use of latent space for learning and controlling audio timbre with an intuitive guiding vector, normalized within the range [0,1], to encode categorical acoustic information. By utilizing pre-trained audio representation models, our method achieves expressive and fine-grained timbre control. To benchmark our approach, we introduce two sound effect datasets--Footstep-set and Impact-set--designed to evaluate both controllability and sound quality. Regression analysis demonstrates that the proposed similarity score effectively controls timbre variations and enables creative applications such as timbre interpolation between discrete classes. Our work provides a robust and versatile framework for sound effect synthesis, bridging the gap between traditional signal processing and modern machine learning techniques.
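下面示意如何用预训练音频嵌入之间的余弦相似度构造归一化到[0,1]的引导向量;嵌入模型、类别原型和归一化方式均为假设,并非论文的实际实现:

```python
import torch
import torch.nn.functional as F

def guiding_vector(target_emb, class_embs):
    """Map cosine similarities between a reference-sound embedding and
    per-class prototype embeddings into a [0, 1] conditioning vector."""
    # target_emb: (D,) from a pretrained audio model; class_embs: (C, D)
    sims = F.cosine_similarity(target_emb.unsqueeze(0), class_embs, dim=-1)
    return (sims + 1.0) / 2.0      # rescale cosine range [-1, 1] into [0, 1]

torch.manual_seed(0)
class_protos = F.normalize(torch.randn(4, 128), dim=-1)   # e.g. 4 impact categories
target = F.normalize(torch.randn(128), dim=-1)
g = guiding_vector(target, class_protos)                   # (4,) values in [0, 1]
print(g)
# g could then be fed to the DDSP decoder as its conditioning input; linearly
# interpolating between two prototypes would yield intermediate timbres.
```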
标题: 面向多模态智能的下一标记预测:全面综述
链接:https://arxiv.org/abs/2412.18619
备注:69 pages, 18 figures, repo at this https URL
摘要:基于自然语言处理中语言建模的基础,下一标记预测(NTP)已发展成为适用于多种模态机器学习任务的通用训练目标,并取得了相当大的成功。随着大型语言模型(LLM)在文本模态中统一理解与生成任务,最近的研究表明,不同模态的任务也可以有效地纳入NTP框架:将多模态信息转换为标记,并在给定上下文的情况下预测下一个标记。本综述提出了一个全面的分类体系,从NTP的视角统一多模态学习中的理解与生成。该分类体系涵盖五个关键方面:多模态标记化、MMNTP模型架构、统一任务表示、数据集与评估,以及开放挑战。这一新的分类体系旨在帮助研究人员探索多模态智能。收集最新论文和代码库的相关GitHub存储库见 https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction。
摘要:Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming the multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: Multimodal tokenization, MMNTP model architectures, unified task representation, datasets & evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
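下面用一个玩具解码器写出NTP的共同训练目标:无论离散标记来自哪种模态(文本、量化音频、图像patch等),都是在因果掩码下用交叉熵预测下一个标记。模型规模与词表大小为示意值:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 128
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 32))   # (B, T) multimodal token ids
T = tokens.size(1)
causal_mask = nn.Transformer.generate_square_subsequent_mask(T - 1)

hidden = decoder(embed(tokens[:, :-1]), src_mask=causal_mask)  # causal self-attention
logits = lm_head(hidden)                                       # (B, T-1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)  # shifted next-token targets
)
print(loss.item())
```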