Speech/Audio Processing Academic Express [12.23]

Digest | 2024-12-23 18:02 | Beijing
Today's roundup: 8 papers in cs.SD (Sound/Speech) and 13 in eess.AS (Audio and Speech Processing).

Reposted with permission from the arXiv Daily Academic Express (arXiv_Daily).

WeChat official account: arXiv_Daily

cs.SD (Sound/Speech)
【1】 Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling
Link: https://arxiv.org/abs/2412.15995
Authors: Maximillian Chen, Ruoxi Sun, Sercan Ö. Arık
Comments: 22 pages, 6 figures, 14 tables
Abstract: Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs. Code and data forthcoming.
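To make the multi-task idea concrete, here is a minimal sketch of a weighted multi-task objective in PyTorch. The task names (spoken QA, ASR, prosody prediction) and the weights are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a weighted multi-task objective: a primary spoken-QA loss
# combined with auxiliary losses. Task names and weights are illustrative only.
import torch

def multitask_loss(losses: dict[str, torch.Tensor],
                   weights: dict[str, float]) -> torch.Tensor:
    """Combine per-task losses into one training objective."""
    total = torch.zeros(())
    for name, value in losses.items():
        total = total + weights.get(name, 1.0) * value
    return total

# Example usage with dummy scalar losses.
losses = {
    "spoken_qa": torch.tensor(2.3),   # primary task
    "asr": torch.tensor(1.1),         # auxiliary: transcribe the audio
    "prosody": torch.tensor(0.7),     # auxiliary: predict pitch/rate labels
}
weights = {"spoken_qa": 1.0, "asr": 0.3, "prosody": 0.1}
print(multitask_loss(losses, weights))
```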

【2】 RiTTA: Modeling Event Relations in Text-to-Audio Generation
Link: https://arxiv.org/abs/2412.15922
Authors: Yuhang He, Yash Jain, Xubo Liu, Andrew Markham, Vibhav Vineet
Comments: Audio Events Relation Modeling in TTA Generative Model. Code: this https URL
Abstract: Despite significant advancements in Text-to-Audio (TTA) generation models achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. However, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA

【3】 Music Genre Classification: Ensemble Learning with Subcomponents-level Attention
Link: https://arxiv.org/abs/2412.15602
Authors: Yichen Liu, Abhijit Dasgupta, Qiwei He
Abstract: Music Genre Classification is one of the most popular topics in the fields of Music Information Retrieval (MIR) and digital signal processing. Deep Learning has emerged as the top performer for classifying music genres among various methods. The letter introduces a novel approach by combining ensemble learning with attention to sub-components, aiming to enhance the accuracy of identifying music genres. The core innovation of our work is the proposal to classify the sub-components of the music pieces separately, allowing our model to capture distinct characteristics from those sub-components. By applying ensemble learning techniques to these individual classifications, we make the final classification decision on the genre of the music. The proposed method has superior advantages in terms of accuracy compared to the other state-of-the-art techniques trained and tested on the GTZAN dataset.
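As a rough illustration of subcomponent-level ensembling, the sketch below splits a track into fixed-length segments (used here as stand-ins for the paper's subcomponents), scores each with a placeholder classifier, and soft-votes over the segment predictions. The segment length, genre set, and dummy classifier are all assumptions.

```python
# Minimal sketch: split a track into segments, score each with a (placeholder)
# classifier, then combine the per-segment predictions by soft voting.
import numpy as np

GENRES = ["blues", "classical", "jazz", "rock"]

def split_segments(waveform: np.ndarray, sr: int, seg_sec: float = 3.0):
    hop = int(sr * seg_sec)
    return [waveform[i:i + hop] for i in range(0, len(waveform) - hop + 1, hop)]

def dummy_classifier(segment: np.ndarray) -> np.ndarray:
    """Stand-in for a trained per-segment model; returns genre probabilities."""
    rng = np.random.default_rng(abs(int(segment.sum() * 1e6)) % (2**32))
    p = rng.random(len(GENRES))
    return p / p.sum()

def predict_genre(waveform: np.ndarray, sr: int) -> str:
    probs = [dummy_classifier(seg) for seg in split_segments(waveform, sr)]
    return GENRES[int(np.mean(probs, axis=0).argmax())]  # soft voting

sr = 22050
track = np.sin(np.linspace(0, 1000, sr * 30))  # 30 s of synthetic audio
print(predict_genre(track, sr))
```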

【4】 Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Link: https://arxiv.org/abs/2412.15322
Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji
Comments: Project page: this https URL
Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
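The flow matching objective mentioned in the abstract can be illustrated in a few lines of PyTorch: sample a noise point, interpolate linearly toward the clean latent, and regress the model's predicted velocity onto the ground-truth velocity. The toy network, latent shapes, and conditioning below are placeholders, not MMAudio's actual architecture.

```python
# Minimal sketch of a conditional flow-matching loss of the kind the abstract
# mentions; shapes, the model, and the conditioning are illustrative.
import torch

def flow_matching_loss(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """x1: clean audio latents, cond: frame-aligned conditioning features."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.shape[0], 1, 1)         # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # linear interpolation path
    target_velocity = x1 - x0                 # ground-truth vector field
    pred_velocity = model(xt, t, cond)
    return torch.mean((pred_velocity - target_velocity) ** 2)

class ToyVelocityNet(torch.nn.Module):
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = torch.nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, xt, t, cond):
        t_feat = t.expand(-1, xt.shape[1], 1)              # broadcast time per frame
        return self.net(torch.cat([xt, cond, t_feat], dim=-1))

model = ToyVelocityNet()
x1 = torch.randn(4, 10, 64)      # (batch, frames, latent_dim)
cond = torch.randn(4, 10, 32)    # frame-aligned conditioning
print(flow_matching_loss(model, x1, cond))
```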

【5】 LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration
Link: https://arxiv.org/abs/2412.15299
Authors: Sangmin Lee, Woo-Jin Chung, Hong-Goo Kang
Abstract: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules, yet it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.
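The two-step structure of the pipeline can be sketched as follows; both stages are stubbed with toy functions (the real system uses trained sequence models), so this shows only the data flow, not LAMA-UT itself.

```python
# Minimal sketch of the two-step pipeline: a universal generator produces a
# Romanized transcription, then a converter maps it into language-specific
# orthography. Both functions are stubs standing in for trained models.
def universal_transcriber(audio) -> str:
    """Stub: would return a Romanized transcription shared across languages."""
    return "annyeonghaseyo"

def language_converter(romanized: str, lang: str) -> str:
    """Stub: would map Romanized text into the target language's script."""
    lexicon = {("annyeonghaseyo", "ko"): "안녕하세요"}  # illustrative entry only
    return lexicon.get((romanized, lang), romanized)

def transcribe_two_stage(audio, lang: str) -> str:
    return language_converter(universal_transcriber(audio), lang)

print(transcribe_two_stage(audio=None, lang="ko"))
```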

【6】 Early Dementia Detection Using Multiple Spontaneous Speech Prompts: The PROCESS Challenge
Link: https://arxiv.org/abs/2412.15230
Authors: Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Sophie Young, Yao Xiao, Hend Elghazaly, Fritz Peters, Caitlin Illingworth, Dorota Braun, Ronan O'Malley, Simon Bell, Daniel Blackburn, Fasih Haider, Saturnino Luz, Heidi Christensen
Comments: 2 pages, no figures, conference
Abstract: Dementia is associated with various cognitive impairments and typically manifests only after significant progression, making intervention at this stage often ineffective. To address this issue, the Prediction and Recognition of Cognitive Decline through Spontaneous Speech (PROCESS) Signal Processing Grand Challenge invites participants to focus on early-stage dementia detection. We provide a new spontaneous speech corpus for this challenge. This corpus includes answers to three prompts designed by neurologists to better capture the cognition of speakers. Our baseline models achieved an F1-score of 55.0% on the classification task and an RMSE of 2.98 on the regression task.
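For reference, the two reported numbers correspond to standard metrics that can be computed as below; the toy labels and the choice of macro-averaged F1 are assumptions, since the challenge's exact evaluation protocol is not given in the abstract.

```python
# Minimal sketch of the two reported metrics: F1 for the classification task
# and RMSE for the regression task, on made-up toy values.
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Classification task (toy diagnostic class labels).
y_true_cls = [0, 1, 2, 1, 0, 2]
y_pred_cls = [0, 1, 1, 1, 0, 2]
print("F1 (macro):", f1_score(y_true_cls, y_pred_cls, average="macro"))

# Regression task (toy cognitive test scores).
y_true_reg = np.array([26.0, 29.0, 22.0, 30.0])
y_pred_reg = np.array([24.5, 28.0, 25.0, 29.0])
print("RMSE:", np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))
```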

【7】 SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Link: https://arxiv.org/abs/2412.15220
Authors: Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D. Plumbley, Yangyang Shi, Vikas Chandra
Abstract: Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi-stage training strategy that separates video and audio learning before joint fine-tuning. Our empirical evaluations demonstrate that SyncFlow produces audio and video outputs that are more correlated than baseline methods, with significantly enhanced audio quality and audio-visual correspondence. Moreover, we demonstrate strong zero-shot capabilities of SyncFlow, including zero-shot video-to-audio generation and adaptation to novel video resolutions without further training.

【8】 Predicting Artificial Neural Network Representations to Learn Recognition Model for Music Identification from Brain Recordings
Link: https://arxiv.org/abs/2412.15560
Authors: Taketo Akama, Zhuohao Zhang, Pengcheng Li, Kotaro Hongo, Hiroaki Kitano, Shun Minamikawa, Natalia Polouliakh
Comments: 18 pages, 10 figures
Abstract: Recent studies have demonstrated that the representations of artificial neural networks (ANNs) can exhibit notable similarities to cortical representations when subjected to identical auditory sensory inputs. In these studies, the ability to predict cortical representations is probed by regressing from ANN representations to cortical representations. Building upon this concept, our approach reverses the direction of prediction: we utilize ANN representations as a supervisory signal to train recognition models using noisy brain recordings obtained through non-invasive measurements. Specifically, we focus on constructing a recognition model for music identification, where electroencephalography (EEG) brain recordings collected during music listening serve as input. By training an EEG recognition model to predict ANN representations (representations associated with music identification), we observed a substantial improvement in classification accuracy. This study introduces a novel approach to developing recognition models for brain recordings in response to external auditory stimuli. It holds promise for advancing brain-computer interfaces (BCI), neural decoding techniques, and our understanding of music cognition. Furthermore, it provides new insights into the relationship between auditory brain activity and ANN representations.
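A minimal sketch of the reversed prediction direction: an EEG encoder is trained to regress ANN embeddings (the supervisory signal), and a track is then identified by nearest-neighbour search in that embedding space. The encoder, feature shapes, and cosine-similarity matching are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: regress ANN embeddings from EEG, then identify the music by
# matching the predicted embedding against candidate track embeddings.
import torch

class EEGEncoder(torch.nn.Module):
    def __init__(self, eeg_dim=64, emb_dim=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(eeg_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, emb_dim),
        )

    def forward(self, eeg):            # eeg: (batch, eeg_dim), e.g. time-averaged features
        return self.net(eeg)

def train_step(model, opt, eeg, ann_emb):
    """Regress ANN embeddings from EEG (the supervisory signal)."""
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(eeg), ann_emb)
    loss.backward()
    opt.step()
    return loss.item()

def identify(model, eeg, track_embs):
    """Pick the track whose ANN embedding is closest to the predicted one."""
    pred = model(eeg)                                            # (1, emb_dim)
    sims = torch.nn.functional.cosine_similarity(pred, track_embs)
    return int(sims.argmax())

model = EEGEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
eeg, ann_emb = torch.randn(8, 64), torch.randn(8, 128)           # toy batch
train_step(model, opt, eeg, ann_emb)
print(identify(model, torch.randn(1, 64), torch.randn(10, 128)))
```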

eess.AS (Audio and Speech Processing)

【1】 Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers
Link: https://arxiv.org/abs/2412.16102
Authors: Yifan Yang, Ziyang Ma, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ruiyang Xu, Yuxuan Hu, Yan Lu, Rui Zhao, Xie Chen
Abstract: This paper introduces the Interleaved Speech-Text Language Model (IST-LM) for streaming zero-shot Text-to-Speech (TTS). Unlike many previous approaches, IST-LM is directly trained on interleaved sequences of text and speech tokens with a fixed ratio, eliminating the need for additional efforts in duration prediction and grapheme-to-phoneme alignment. The ratio of text chunk size to speech chunk size is crucial for the performance of IST-LM. To explore this, we conducted a comprehensive series of statistical analyses on the training data and performed correlation analysis with the final performance, uncovering several key factors: 1) the distance between speech tokens and their corresponding text tokens, 2) the number of future text tokens accessible to each speech token, and 3) the frequency with which speech tokens precede their corresponding text tokens. Experimental results demonstrate how to achieve an optimal streaming TTS system without complicated engineering, which we show has a limited gap with the non-streaming system. IST-LM is conceptually simple and empirically powerful, paving the way for streaming TTS with minimal overhead while largely maintaining performance.
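The fixed-ratio interleaving itself is easy to picture; the sketch below merges text and speech tokens using a hypothetical 1:3 chunk ratio (the paper studies which ratio works best, so the numbers here are only for illustration).

```python
# Minimal sketch of fixed-ratio interleaving of text and speech tokens into a
# single training sequence. The 1:3 chunk ratio and token values are illustrative.
def interleave(text_tokens, speech_tokens, text_chunk=1, speech_chunk=3):
    seq, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        seq.extend(text_tokens[ti:ti + text_chunk]); ti += text_chunk
        seq.extend(speech_tokens[si:si + speech_chunk]); si += speech_chunk
    return seq

text = ["T0", "T1", "T2", "T3"]
speech = [f"S{i}" for i in range(12)]
print(interleave(text, speech))
# ['T0', 'S0', 'S1', 'S2', 'T1', 'S3', 'S4', 'S5', 'T2', ...]
```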

【2】 SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Link: https://arxiv.org/abs/2412.15649
Authors: Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen
Abstract: Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
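Grouping speech semantic tokens can be sketched in a few lines: predicting one group per decoding step shortens the autoregressive sequence by the group size. The group size of 3 and the padding value below are assumptions, not SLAM-Omni's actual settings.

```python
# Minimal sketch of token grouping and ungrouping; one group is predicted per
# decoding step, shortening the sequence by the group size.
def group_tokens(tokens, group_size=3, pad=0):
    padded = tokens + [pad] * (-len(tokens) % group_size)
    return [tuple(padded[i:i + group_size]) for i in range(0, len(padded), group_size)]

def ungroup_tokens(groups):
    return [tok for group in groups for tok in group]

semantic_tokens = [101, 87, 54, 23, 99, 13, 42]
groups = group_tokens(semantic_tokens)
print(groups)                 # [(101, 87, 54), (23, 99, 13), (42, 0, 0)] -> 3 steps, not 7
print(ungroup_tokens(groups))
```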

【3】 TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch
Link: https://arxiv.org/abs/2412.15622
Authors: Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng
Comments: Technical Report
Abstract: Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during the training process. However, such models can merely be deployed on high-compute cloud platforms and are only capable of performing speech recognition tasks. This leads to high costs and restricted capabilities. In this report, we first propose the elastic mixture of experts (eMoE) model. This model can be trained just once and then be elastically scaled in accordance with deployment requirements. Secondly, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO test sets from 4.98% to 2.45%. Thirdly, our model is not only competent in Mandarin speech recognition but also proficient in multilingual, multi-dialect, emotion, gender, and sound event perception. We refer to this as Automatic Speech Perception (ASP), and the perception results are presented in the experimental section.
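One way to picture "train once, scale elastically" is a mixture-of-experts layer that can be evaluated with only a subset of its experts at deployment time. The sketch below is a speculative illustration of that idea; the routing rule, expert sizes, and keep-first-k slicing are assumptions and not eMoE's actual design.

```python
# Speculative sketch of an "elastic" MoE layer: at inference, only the first k
# experts are kept and the router is re-normalized over them.
import torch

class ElasticMoE(torch.nn.Module):
    def __init__(self, dim=64, num_experts=8):
        super().__init__()
        self.router = torch.nn.Linear(dim, num_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x, active_experts=None):
        k = active_experts or len(self.experts)
        gates = torch.softmax(self.router(x)[..., :k], dim=-1)        # route over k experts
        outs = torch.stack([self.experts[i](x) for i in range(k)], dim=-1)
        return (outs * gates.unsqueeze(-2)).sum(-1)

layer = ElasticMoE()
x = torch.randn(2, 10, 64)
print(layer(x, active_experts=8).shape, layer(x, active_experts=2).shape)
```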

【4】 Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
Link: https://arxiv.org/abs/2412.15415
Authors: Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, Zeeshan Ahmed, Frank Seide, Christian Fuegen
Comments: Submitted to ICASSP 2025
Abstract: We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate the superior performance of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.
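The title's fast/slow split can be sketched as a cascaded encoder with two heads and a weighted joint objective. Assigning ASR to the fast branch and ST to the slower cascaded branch, the GRU layers, and the loss weighting are assumptions for illustration; the actual JSTAR model is transducer-based.

```python
# Minimal sketch of a fast-slow cascaded encoder with two output heads and a
# joint multi-objective loss. Branch assignment and sizes are illustrative.
import torch

class FastSlowCascade(torch.nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_asr=500, vocab_st=500):
        super().__init__()
        self.fast = torch.nn.GRU(feat_dim, hidden, batch_first=True)   # low latency
        self.slow = torch.nn.GRU(hidden, hidden, batch_first=True)     # more context
        self.asr_head = torch.nn.Linear(hidden, vocab_asr)
        self.st_head = torch.nn.Linear(hidden, vocab_st)

    def forward(self, feats):
        fast_out, _ = self.fast(feats)
        slow_out, _ = self.slow(fast_out)        # cascaded on top of the fast output
        return self.asr_head(fast_out), self.st_head(slow_out)

def joint_loss(asr_loss, st_loss, alpha=0.5):
    return alpha * asr_loss + (1.0 - alpha) * st_loss

model = FastSlowCascade()
feats = torch.randn(2, 100, 80)                  # (batch, frames, filterbank dim)
asr_logits, st_logits = model(feats)
print(asr_logits.shape, st_logits.shape)
```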

【5】 Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling
Link: https://arxiv.org/abs/2412.15995
Authors: Maximillian Chen, Ruoxi Sun, Sercan Ö. Arık
Comments: 22 pages, 6 figures, 14 tables
Abstract: Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs. Code and data forthcoming.

【6】 RiTTA: Modeling Event Relations in Text-to-Audio Generation
Link: https://arxiv.org/abs/2412.15922
Authors: Yuhang He, Yash Jain, Xubo Liu, Andrew Markham, Vibhav Vineet
Comments: Audio Events Relation Modeling in TTA Generative Model. Code: this https URL
Abstract: Despite significant advancements in Text-to-Audio (TTA) generation models achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. However, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA

【7】 Fine-tuning Whisper on Low-Resource Languages for Real-World Applications
Link: https://arxiv.org/abs/2412.15726
Authors: Vincenzo Timmel, Claudio Paonessa, Reza Kakooee, Manfred Vogel, Daniel Perruchoud
Abstract: This paper presents a new approach to fine-tuning OpenAI's Whisper model for low-resource languages by introducing a novel data generation method that converts sentence-level data into a long-form corpus, using Swiss German as a case study. Non-sentence-level data, which could improve performance on long-form audio, is difficult to obtain and often restricted by copyright laws. Our method bridges this gap by transforming more accessible sentence-level data into a format that preserves the model's ability to handle long-form audio and perform segmentation without requiring non-sentence-level data. Our data generation process improves performance in several real-world applications and leads to the development of a new state-of-the-art speech-to-text (STT) model for Swiss German. We compare our model with a non-fine-tuned Whisper and our previous state-of-the-art Swiss German STT models, where our new model achieves higher BLEU scores. Our results also indicate that the proposed method is adaptable to other low-resource languages, supported by written guidance and code that allow the creation of fine-tuned Whisper models which retain segmentation capabilities and can transcribe longer audio files using only high-quality sentence-level data.
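The general idea of the data generation method (stitching sentence-level pairs into long-form training examples with segment timestamps) can be sketched as below; the pause length, 30-second budget, and timestamp format are assumptions, and the paper's actual procedure has more detail.

```python
# Minimal sketch of building a long-form training example from sentence-level
# (audio, transcript) pairs, keeping per-segment timestamps for segmentation.
import numpy as np

SR = 16000

def build_long_form(sentences, max_sec=30.0, pause_sec=0.3):
    """sentences: list of (waveform: np.ndarray, transcript: str)."""
    audio, segments, cursor = [], [], 0.0
    pause = np.zeros(int(SR * pause_sec), dtype=np.float32)
    for wav, text in sentences:
        dur = len(wav) / SR
        if cursor + dur > max_sec:        # stay within the assumed 30 s window
            break
        segments.append({"start": round(cursor, 2),
                         "end": round(cursor + dur, 2),
                         "text": text})
        audio.extend([wav, pause])        # sentence audio followed by a short pause
        cursor += dur + pause_sec
    return np.concatenate(audio), segments

toy = [(np.zeros(SR * 2, dtype=np.float32), "first sentence"),
       (np.zeros(SR * 3, dtype=np.float32), "second sentence")]
long_audio, segs = build_long_form(toy)
print(len(long_audio) / SR, segs)
```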

【8】 Music Genre Classification: Ensemble Learning with Subcomponents-level Attention
Link: https://arxiv.org/abs/2412.15602
Authors: Yichen Liu, Abhijit Dasgupta, Qiwei He
Abstract: Music Genre Classification is one of the most popular topics in the fields of Music Information Retrieval (MIR) and digital signal processing. Deep Learning has emerged as the top performer for classifying music genres among various methods. The letter introduces a novel approach by combining ensemble learning with attention to sub-components, aiming to enhance the accuracy of identifying music genres. The core innovation of our work is the proposal to classify the sub-components of the music pieces separately, allowing our model to capture distinct characteristics from those sub-components. By applying ensemble learning techniques to these individual classifications, we make the final classification decision on the genre of the music. The proposed method has superior advantages in terms of accuracy compared to the other state-of-the-art techniques trained and tested on the GTZAN dataset.

【9】 Predicting Artificial Neural Network Representations to Learn Recognition Model for Music Identification from Brain Recordings
Link: https://arxiv.org/abs/2412.15560
Authors: Taketo Akama, Zhuohao Zhang, Pengcheng Li, Kotaro Hongo, Hiroaki Kitano, Shun Minamikawa, Natalia Polouliakh
Comments: 18 pages, 10 figures
Abstract: Recent studies have demonstrated that the representations of artificial neural networks (ANNs) can exhibit notable similarities to cortical representations when subjected to identical auditory sensory inputs. In these studies, the ability to predict cortical representations is probed by regressing from ANN representations to cortical representations. Building upon this concept, our approach reverses the direction of prediction: we utilize ANN representations as a supervisory signal to train recognition models using noisy brain recordings obtained through non-invasive measurements. Specifically, we focus on constructing a recognition model for music identification, where electroencephalography (EEG) brain recordings collected during music listening serve as input. By training an EEG recognition model to predict ANN representations (representations associated with music identification), we observed a substantial improvement in classification accuracy. This study introduces a novel approach to developing recognition models for brain recordings in response to external auditory stimuli. It holds promise for advancing brain-computer interfaces (BCI), neural decoding techniques, and our understanding of music cognition. Furthermore, it provides new insights into the relationship between auditory brain activity and ANN representations.

【10】 Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Link: https://arxiv.org/abs/2412.15322
Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji
Comments: Project page: this https URL
Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio

【11】 LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration
Link: https://arxiv.org/abs/2412.15299
Authors: Sangmin Lee, Woo-Jin Chung, Hong-Goo Kang
Abstract: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, our pipeline does not rely on any language-specific modules, yet it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.

【12】 Early Dementia Detection Using Multiple Spontaneous Speech Prompts: The PROCESS Challenge
Link: https://arxiv.org/abs/2412.15230
Authors: Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Sophie Young, Yao Xiao, Hend Elghazaly, Fritz Peters, Caitlin Illingworth, Dorota Braun, Ronan O'Malley, Simon Bell, Daniel Blackburn, Fasih Haider, Saturnino Luz, Heidi Christensen
Comments: 2 pages, no figures, conference
Abstract: Dementia is associated with various cognitive impairments and typically manifests only after significant progression, making intervention at this stage often ineffective. To address this issue, the Prediction and Recognition of Cognitive Decline through Spontaneous Speech (PROCESS) Signal Processing Grand Challenge invites participants to focus on early-stage dementia detection. We provide a new spontaneous speech corpus for this challenge. This corpus includes answers to three prompts designed by neurologists to better capture the cognition of speakers. Our baseline models achieved an F1-score of 55.0% on the classification task and an RMSE of 2.98 on the regression task.

【13】 SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Link: https://arxiv.org/abs/2412.15220
Authors: Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D. Plumbley, Yangyang Shi, Vikas Chandra
Abstract: Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi-stage training strategy that separates video and audio learning before joint fine-tuning. Our empirical evaluations demonstrate that SyncFlow produces audio and video outputs that are more correlated than baseline methods, with significantly enhanced audio quality and audio-visual correspondence. Moreover, we demonstrate strong zero-shot capabilities of SyncFlow, including zero-shot video-to-audio generation and adaptation to novel video resolutions without further training.


Ongoing perk: direct resume referrals
Send resumes to: join@speechhome.com

语音之家 (SpeechHome)
A community for AI speech developers