Speech/Audio Processing Academic Digest [12.19]

Digest   2024-12-19 18:05   Beijing
Today's paper collection: 9 papers in cs.SD (Speech) and 9 in eess.AS (Audio Processing).

This article is reposted with permission from arXiv每日学术速递 (arXiv Daily Academic Digest).

WeChat official account: arXiv_Daily

cs.SD (Speech)
【1】 NeckCare: Preventing Tech Neck using Hearable-based Multimodal Sensing
链接 Link: https://arxiv.org/abs/2412.13579
作者 Authors: Bhawana Chhaglani, Alan Seefeldt
Abstract: Tech neck is a modern epidemic caused by prolonged device usage, and it can lead to significant neck strain and discomfort. This paper addresses the challenge of detecting and preventing tech neck syndrome using non-invasive ubiquitous sensing techniques. We present NeckCare, a novel system leveraging hearable sensors, including IMUs and microphones, to monitor tech neck postures and estimate distance from the screen in real time. By analyzing pitch, displacement, and acoustic ranging data from 15 participants, we achieve posture classification accuracy of 96% using IMU data alone and 99% when combined with audio data. Our distance estimation technique is millimeter-level accurate even in noisy conditions. NeckCare provides immediate feedback to users, promoting healthier posture and reducing neck strain. Future work will explore personalizing alerts, predicting muscle strain, integrating neck exercise detection, and enhancing digital eye strain prediction.
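
The IMU half of this sensing pipeline can be pictured with a small example. The sketch below is not the authors' implementation; computing a pitch angle from a single accelerometer sample is standard practice, while the 20-degree threshold and the 80%-of-window rule are illustrative assumptions.

```python
# Illustrative sketch (not the NeckCare pipeline): estimate head pitch from a
# hearable's 3-axis accelerometer and flag sustained forward tilt as "tech neck".
import numpy as np

def pitch_angle_deg(acc_xyz: np.ndarray) -> float:
    """Forward-tilt angle, in degrees, from one accelerometer sample (x, y, z)."""
    ax, ay, az = acc_xyz
    # Angle between the device's forward axis and the gravity vector.
    return np.degrees(np.arctan2(ax, np.sqrt(ay**2 + az**2)))

def classify_posture(acc_samples: np.ndarray, tilt_threshold_deg: float = 20.0,
                     min_fraction: float = 0.8) -> str:
    """Label a window of samples (N x 3) as 'tech neck' if most samples exceed
    the tilt threshold; both thresholds here are made-up illustrative values."""
    angles = np.array([pitch_angle_deg(s) for s in acc_samples])
    return "tech neck" if np.mean(angles > tilt_threshold_deg) >= min_fraction else "neutral"

# Toy usage: a window of samples with roughly 30 degrees of forward tilt.
rng = np.random.default_rng(0)
tilted = np.column_stack([
    np.full(50, 4.9),            # forward component of gravity at ~30 degrees
    rng.normal(0.0, 0.1, 50),    # lateral noise
    np.full(50, 8.5),            # vertical component of gravity
])
print(classify_posture(tilted))  # -> "tech neck"
```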

【2】 Tuning Music Education: AI-Powered Personalization in Learning Music
链接 Link: https://arxiv.org/abs/2412.13514
作者 Authors: Mayank Sanganeria, Rohan Gala
Note: 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Creative AI Track
Abstract: Recent AI-driven step-function advances in several longstanding problems in music technology are opening up new avenues for creating the next generation of music education tools. Creating personalized, engaging, and effective learning experiences is a continuously evolving challenge in music education. Here we present two case studies that use these advances in music technology to address such challenges. In our first case study we showcase an application that uses Automatic Chord Recognition to generate personalized exercises from audio tracks, connecting traditional ear training with real-world musical contexts. In the second case study we prototype adaptive piano method books that use Automatic Music Transcription to generate exercises at different skill levels while retaining a close connection to musical interests. These applications demonstrate how recent AI developments can democratize access to high-quality music education and promote rich interaction with music in the age of generative AI. We hope this work inspires other efforts in the community aimed at removing barriers to access to high-quality music education and fostering human participation in musical expression.
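
As a rough idea of what Automatic Chord Recognition involves under the hood, the sketch below implements a classic template-matching baseline on librosa chroma features. It is an assumption-laden stand-in, not the recognizer used in the application described above, and `audio_path` is a placeholder for any local audio file.

```python
# Minimal template-matching chord recogniser (a classic baseline, shown only to
# illustrate the building block) using librosa chroma features.
import numpy as np
import librosa

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Unit-norm binary chroma templates for the 12 major and 12 minor triads."""
    templates, labels = [], []
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates.append(t / np.linalg.norm(t))
            labels.append(NOTES[root] + suffix)
    return np.stack(templates), labels

def recognise_chords(audio_path: str, sr: int = 22050):
    """Return one chord label per chroma frame for a (hypothetical) audio file."""
    y, sr = librosa.load(audio_path, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)                      # (12, frames)
    chroma = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-9)
    templates, labels = chord_templates()
    scores = templates @ chroma                                          # (24, frames)
    return [labels[i] for i in scores.argmax(axis=0)]

# Frame-wise labels could then be pooled per beat or bar before being turned
# into ear-training exercises.
```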

【3】 SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation
链接 Link: https://arxiv.org/abs/2412.13462
作者 Authors: Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji
Note: 5 pages, 3 figures
Abstract: This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking Spatially Aligned Audio-Video Generation (SAVG). We propose three key components for the benchmark: a dataset, a baseline, and metrics. We introduce a spatially aligned audio-visual dataset, derived from an audio-visual dataset consisting of multichannel audio, video, and spatiotemporal annotations of sound events. We propose a baseline audio-visual diffusion model focused on stereo audio-visual joint learning to accommodate spatial sound. Finally, we present metrics to evaluate video and spatial audio quality, including a new spatial audio-visual alignment metric. Our experimental results demonstrate that gaps exist between the baseline model and the ground truth in terms of video and audio quality, as well as in the spatial alignment between the two modalities.
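
To make the notion of spatial audio-visual alignment concrete, here is a toy check, not the SAVGBench metric: it correlates the stereo laterality of the audio with the sound source's normalized horizontal position on screen. The frame size and the laterality measure are illustrative assumptions.

```python
# Toy illustration of spatial audio-visual alignment (not the benchmark's metric):
# correlate a per-frame stereo laterality index with the source's on-screen position.
import numpy as np

def laterality(stereo: np.ndarray, hop: int) -> np.ndarray:
    """Per-frame lateral index in [-1, 1]: (R - L) / (R + L) of frame RMS, input (2, T)."""
    frames = stereo[:, : stereo.shape[1] // hop * hop].reshape(2, -1, hop)
    rms = np.sqrt((frames ** 2).mean(axis=2) + 1e-12)    # (2, n_frames)
    return (rms[1] - rms[0]) / (rms[1] + rms[0])

def alignment_score(stereo: np.ndarray, object_x: np.ndarray, hop: int) -> float:
    """Pearson correlation between audio laterality and the source's horizontal
    on-screen position (object_x in [-1, 1], one value per video frame)."""
    lat = laterality(stereo, hop)
    n = min(len(lat), len(object_x))
    return float(np.corrcoef(lat[:n], object_x[:n])[0, 1])

# Toy usage: a tone panned left-to-right while its bounding box moves the same way.
t = np.linspace(0, 1, 48000, endpoint=False)
pan = np.linspace(0, 1, t.size)                       # 0 = hard left, 1 = hard right
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([(1 - pan) * tone, pan * tone])     # (2, 48000)
object_x = np.linspace(-1, 1, 100)                    # 100 video frames
print(alignment_score(stereo, object_x, hop=480))     # close to 1.0
```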

【4】 Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks
链接 Link: https://arxiv.org/abs/2412.13421
作者 Authors: Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller
Abstract: Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions. Consequently, MGM detection (MGMD) is crucial for preserving the integrity of these fields. Despite its significance, the MGMD domain lacks the comprehensive benchmark results necessary to drive meaningful progress. To address this gap, we conduct experiments on existing large-scale datasets using a range of foundational models for audio processing, establishing benchmark results tailored to the MGMD task. Our selection includes traditional machine learning models, deep neural networks, Transformer-based architectures, and State Space Models (SSMs). Recognising the inherently multimodal nature of music, which integrates both melody and lyrics, we also explore fundamental multimodal models in our experiments. Beyond providing basic binary classification outcomes, we delve deeper into model behaviour using multiple explainable Artificial Intelligence (XAI) tools, offering insights into their decision-making processes. Our analysis reveals that ResNet18 performs best in both in-domain and out-of-domain tests. By providing a comprehensive comparison of benchmark results and their interpretability, we propose several directions to inspire future research toward more robust and effective detection methods for MGM.
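
Since ResNet18 emerges as the strongest detector in these benchmarks, a minimal sketch of such a baseline is shown below: a torchvision ResNet18 adapted to single-channel log-mel spectrograms for binary human-versus-machine classification. The preprocessing settings and clip length are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a spectrogram ResNet18 baseline of the kind benchmarked above.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import resnet18

class MGMDetector(nn.Module):
    def __init__(self, sample_rate: int = 16000, n_mels: int = 128):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        backbone = resnet18(weights=None, num_classes=2)
        # Accept one spectrogram channel instead of three RGB image channels.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = backbone

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> logits over {human, machine-generated}.
        spec = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        return self.backbone(spec)

model = MGMDetector()
logits = model(torch.randn(2, 16000 * 5))   # two 5-second clips of dummy audio
print(logits.shape)                          # torch.Size([2, 2])
```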

【5】 Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge
链接 Link: https://arxiv.org/abs/2412.13279
作者 Authors: Mahieyin Rahmun, Rafat Hasan Khan, Tanjim Taharat Aurpa, Sadia Khan, Zulker Nayeen Nahiyan, Mir Sayad Bin Almas, Rakibul Hasan Rajib, Syeda Sakira Hassan
Abstract: The aim of this project is to implement and design a robust synthetic speech classifier for the IEEE Signal Processing Cup 2022 challenge. Here, we learn a synthetic speech attribution model using speech generated from various text-to-speech (TTS) algorithms as well as unknown TTS algorithms. We experiment with both classical machine learning methods, such as support vector machines and Gaussian mixture models, and deep learning based methods, such as ResNet, VGG16, and two shallow end-to-end networks. We observe that the deep learning based methods operating on raw data demonstrate the best performance.
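
As an illustration of the classical side of this comparison, the sketch below fits one Gaussian mixture model per TTS class on MFCC frames and attributes a clip to the class with the highest average log-likelihood. The feature and model settings are illustrative choices, not the team's submission.

```python
# Minimal GMM-based speech attribution baseline (illustrative only).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Frame-level MFCC features, shape (n_frames, 20)."""
    return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20).T

class GMMSpeechAttributor:
    def __init__(self, n_components: int = 8):
        self.n_components = n_components
        self.models = {}

    def fit(self, waveforms_by_class: dict, sr: int):
        for label, waveforms in waveforms_by_class.items():
            feats = np.vstack([mfcc_frames(w, sr) for w in waveforms])
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag", random_state=0)
            self.models[label] = gmm.fit(feats)

    def predict(self, waveform: np.ndarray, sr: int) -> str:
        feats = mfcc_frames(waveform, sr)
        scores = {label: gmm.score(feats) for label, gmm in self.models.items()}
        return max(scores, key=scores.get)

# Toy usage with synthetic noise standing in for real TTS outputs.
sr = 16000
rng = np.random.default_rng(0)
train = {"tts_a": [rng.normal(0, 0.1, sr) for _ in range(3)],
         "tts_b": [rng.normal(0, 0.5, sr) for _ in range(3)]}
clf = GMMSpeechAttributor()
clf.fit(train, sr)
print(clf.predict(rng.normal(0, 0.5, sr), sr))   # likely "tts_b"
```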

【6】 Investigating the Effects of Diffusion-based Conditional Generative Speech Models Used for Speech Enhancement on Dysarthric Speech
链接 Link: https://arxiv.org/abs/2412.13933
作者 Authors: Joanna Reszka, Parvaneh Janbakhshi, Tilak Purohit, Sadegh Mohammadi
Note: Accepted at ICASSP 2025 Satellite Workshop: Workshop on Speech Pathology Analysis and DEtection (SPADE)
Abstract: In this study, we aim to explore, for the first time, the effect of pre-trained conditional generative speech models on dysarthric speech due to Parkinson's disease recorded in an ideal/non-noisy condition. Considering one category of generative models, namely diffusion-based speech enhancement, these models were previously trained to learn the distribution of clean (i.e., recorded in a noise-free environment) typical speech signals. We therefore hypothesized that, when exposed to dysarthric speech, they might remove the unseen atypical paralinguistic cues during the enhancement process. Using the automatic dysarthric speech detection task, we show experimentally that during enhancement of dysarthric speech data recorded in an ideal non-noisy environment, some of the acoustic dysarthric speech cues are lost. Such pre-trained models are therefore not yet suitable in the context of dysarthric speech enhancement, since they manipulate the pathological speech cues when they process clean dysarthric speech. Furthermore, we show that the acoustic cues removed by the enhancement models, captured in the form of a residue speech signal, can provide complementary dysarthric cues when fused with the original input speech signal in the feature space.
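
The residue-fusion idea lends itself to a short sketch: the residue is simply the part of the signal the enhancer removed, and its features are concatenated with those of the original input. The feature choice (MFCC statistics) is an assumption for illustration, and the detection model itself is omitted.

```python
# Sketch of residue computation and feature-space fusion (illustrative settings).
import numpy as np
import librosa

def utterance_features(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Utterance-level descriptor: mean and std of MFCC frames (illustrative choice)."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def fused_features(original: np.ndarray, enhanced: np.ndarray, sr: int) -> np.ndarray:
    """Concatenate features of the original speech and of the residue signal
    (original minus the enhancer's output), trimmed to the same length."""
    n = min(len(original), len(enhanced))
    residue = original[:n] - enhanced[:n]
    return np.concatenate([utterance_features(original[:n], sr),
                           utterance_features(residue, sr)])

# Toy usage with placeholder signals; in practice `enhanced` would come from a
# pre-trained diffusion-based speech enhancement model.
sr = 16000
original = np.random.default_rng(0).normal(0, 0.1, sr * 2)
enhanced = 0.9 * original                             # stand-in for an enhancer's output
print(fused_features(original, enhanced, sr).shape)   # (52,)
```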

【7】 Speech Watermarking with Discrete Intermediate Representations
链接 Link: https://arxiv.org/abs/2412.13917
作者 Authors: Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao
Note: Accepted by AAAI 2025
Abstract: Speech watermarking techniques can proactively mitigate the potential harmful consequences of instant voice cloning techniques. These techniques involve the insertion of signals into speech that are imperceptible to humans but can be detected by algorithms. Previous approaches typically embed watermark messages into a continuous space. Intuitively, however, embedding watermark information into a robust discrete latent space can significantly improve the robustness of watermarking systems. In this paper, we propose DiscreteWM, a novel speech watermarking framework that injects watermarks into the discrete intermediate representations of speech. Specifically, we map speech into a discrete latent space with a vector-quantized autoencoder and inject watermarks by changing the modular arithmetic relation of the discrete IDs. To ensure the imperceptibility of the watermarks, we also propose a manipulator model that selects the candidate tokens for watermark embedding. Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, demonstrating its encoding capacity. Audio samples are available at https://DiscreteWM.github.io/discrete_wm.
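
A toy version of the modular-arithmetic idea is easy to write down. The sketch below is not the DiscreteWM algorithm (the paper uses a vector-quantized autoencoder to produce the IDs and a learned manipulator model to pick candidate tokens); it merely shows how a bit can be embedded in, and recovered from, the parity of a discrete token ID.

```python
# Toy modular-arithmetic watermark over discrete token IDs (illustration only).
import numpy as np

def embed_bits(token_ids: np.ndarray, bits: np.ndarray, codebook_size: int) -> np.ndarray:
    """Force token_ids[i] % 2 == bits[i] for the first len(bits) tokens."""
    out = token_ids.copy()
    for i, bit in enumerate(bits):
        if out[i] % 2 != bit:
            # Naively move to a neighbouring ID with the right parity; DiscreteWM
            # instead lets a manipulator model choose a perceptually safe candidate.
            out[i] = out[i] + 1 if out[i] + 1 < codebook_size else out[i] - 1
    return out

def extract_bits(token_ids: np.ndarray, n_bits: int) -> np.ndarray:
    """Recover the embedded bits from the parity of the first n_bits IDs."""
    return token_ids[:n_bits] % 2

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=50)     # e.g. VQ codebook indices for 1 s of speech
message = rng.integers(0, 2, size=32)       # a 32-bit watermark payload
watermarked = embed_bits(tokens, message, codebook_size=1024)
assert np.array_equal(extract_bits(watermarked, 32), message)
print("payload recovered:", extract_bits(watermarked, 32)[:8])
```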

【8】 SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
链接 Link: https://arxiv.org/abs/2412.13786
作者 Authors: Chenyu Yang, Shuai Wang, Hangting Chen, Jianwei Yu, Wei Tan, Rongzhi Gu, Yaoxun Xu, Yizhi Zhou, Haina Zhu, Haizhou Li
Note: Accepted by AAAI 2025
Abstract: The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models can synthesize vocals and accompaniment tracks up to several minutes long concurrently, research on partial adjustment or editing of existing songs, which would allow for more flexible and effective production, remains underexplored. In this paper, we present SongEditor, the first song-editing paradigm that introduces editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as to synthesize songs from scratch. Its core components include a music tokenizer, an autoregressive language model, and a diffusion generator, enabling the generation of an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics. Audio samples are available at https://cypress-yang.github.io/SongEditor_demo/.
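
Segment-wise editing ultimately reduces to an infilling problem over discrete music tokens. The sketch below shows one generic way to lay out such an infilling prompt (context, a mask for the edited span, and the new lyrics); it is not SongEditor's actual tokenization or prompting scheme, and all token names are placeholders.

```python
# Generic sketch of how a segment-wise edit could be posed as token infilling.
from dataclasses import dataclass
from typing import List

MASK, SEP = "<mask>", "<sep>"

@dataclass
class EditRequest:
    song_tokens: List[str]   # discrete music tokens for the whole song
    start: int               # first token index of the segment to rewrite
    end: int                 # one past the last token index of that segment
    new_lyrics: str          # lyrics that the regenerated segment should follow

def build_infilling_prompt(req: EditRequest) -> List[str]:
    """Lay out context + mask + lyric conditioning for an autoregressive model,
    which would then generate the tokens that replace the masked span."""
    prefix = req.song_tokens[: req.start]
    suffix = req.song_tokens[req.end :]
    lyric_tokens = req.new_lyrics.lower().split()
    return prefix + [MASK] + suffix + [SEP] + lyric_tokens + [SEP]

# Toy usage: rewrite tokens 4..8 of a 12-token "song".
song = [f"tok{i}" for i in range(12)]
print(build_infilling_prompt(EditRequest(song, 4, 8, "new chorus line")))
```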

【9】 Deep Speech Synthesis from Multimodal Articulatory Representations
链接 Link: https://arxiv.org/abs/2412.13387
作者 Authors: Peter Wu, Bohan Yu, Kevin Scheck, Alan W Black, Aditi S. Krishnapriyan, Irene Y. Chen, Tanja Schultz, Shinji Watanabe, Gopala K. Anumanchipalli
Abstract: The amount of articulatory data available for training deep learning models is much smaller than the amount of acoustic speech data. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks with real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
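
The transfer recipe implied here, pre-train on one articulatory modality and then reuse the shared decoder when fine-tuning on another, can be sketched as follows. The architecture, layer sizes, and modality dimensions are assumptions for illustration, not the paper's models.

```python
# Hedged sketch of cross-modality transfer for articulatory-to-acoustic synthesis.
import torch
import torch.nn as nn

class ArticulatoryToSpeech(nn.Module):
    def __init__(self, input_dim: int, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, input_dim) -> predicted mels: (batch, frames, n_mels)
        h, _ = self.decoder(self.encoder(feats))
        return self.head(h)

# Pretrain on a richer articulatory modality (e.g. 24-dim trajectories)...
pretrained = ArticulatoryToSpeech(input_dim=24)
# ...then transfer: keep decoder/head weights, swap in an encoder for 64-dim
# features from another modality (e.g. rtMRI-derived features).
target = ArticulatoryToSpeech(input_dim=64)
target.decoder.load_state_dict(pretrained.decoder.state_dict())
target.head.load_state_dict(pretrained.head.state_dict())
mels = target(torch.randn(2, 100, 64))
print(mels.shape)    # torch.Size([2, 100, 80])
```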

eess.AS (Audio Processing)

【1】 Investigating the Effects of Diffusion-based Conditional Generative Speech Models Used for Speech Enhancement on Dysarthric Speech
链接 Link: https://arxiv.org/abs/2412.13933
作者 Authors: Joanna Reszka, Parvaneh Janbakhshi, Tilak Purohit, Sadegh Mohammadi
Note: Accepted at ICASSP 2025 Satellite Workshop: Workshop on Speech Pathology Analysis and DEtection (SPADE)
Abstract: In this study, we aim to explore, for the first time, the effect of pre-trained conditional generative speech models on dysarthric speech due to Parkinson's disease recorded in an ideal/non-noisy condition. Considering one category of generative models, namely diffusion-based speech enhancement, these models were previously trained to learn the distribution of clean (i.e., recorded in a noise-free environment) typical speech signals. We therefore hypothesized that, when exposed to dysarthric speech, they might remove the unseen atypical paralinguistic cues during the enhancement process. Using the automatic dysarthric speech detection task, we show experimentally that during enhancement of dysarthric speech data recorded in an ideal non-noisy environment, some of the acoustic dysarthric speech cues are lost. Such pre-trained models are therefore not yet suitable in the context of dysarthric speech enhancement, since they manipulate the pathological speech cues when they process clean dysarthric speech. Furthermore, we show that the acoustic cues removed by the enhancement models, captured in the form of a residue speech signal, can provide complementary dysarthric cues when fused with the original input speech signal in the feature space.

【2】 Speech Watermarking with Discrete Intermediate Representations
链接 Link: https://arxiv.org/abs/2412.13917
作者 Authors: Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao
Note: Accepted by AAAI 2025
Abstract: Speech watermarking techniques can proactively mitigate the potential harmful consequences of instant voice cloning techniques. These techniques involve the insertion of signals into speech that are imperceptible to humans but can be detected by algorithms. Previous approaches typically embed watermark messages into a continuous space. Intuitively, however, embedding watermark information into a robust discrete latent space can significantly improve the robustness of watermarking systems. In this paper, we propose DiscreteWM, a novel speech watermarking framework that injects watermarks into the discrete intermediate representations of speech. Specifically, we map speech into a discrete latent space with a vector-quantized autoencoder and inject watermarks by changing the modular arithmetic relation of the discrete IDs. To ensure the imperceptibility of the watermarks, we also propose a manipulator model that selects the candidate tokens for watermark embedding. Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, demonstrating its encoding capacity. Audio samples are available at https://DiscreteWM.github.io/discrete_wm.

【3】 SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
链接 Link: https://arxiv.org/abs/2412.13786
作者 Authors: Chenyu Yang, Shuai Wang, Hangting Chen, Jianwei Yu, Wei Tan, Rongzhi Gu, Yaoxun Xu, Yizhi Zhou, Haina Zhu, Haizhou Li
Note: Accepted by AAAI 2025
Abstract: The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models can synthesize vocals and accompaniment tracks up to several minutes long concurrently, research on partial adjustment or editing of existing songs, which would allow for more flexible and effective production, remains underexplored. In this paper, we present SongEditor, the first song-editing paradigm that introduces editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as to synthesize songs from scratch. Its core components include a music tokenizer, an autoregressive language model, and a diffusion generator, enabling the generation of an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics. Audio samples are available at https://cypress-yang.github.io/SongEditor_demo/.

【4】 Deep Speech Synthesis from Multimodal Articulatory Representations
链接 Link: https://arxiv.org/abs/2412.13387
作者 Authors: Peter Wu, Bohan Yu, Kevin Scheck, Alan W Black, Aditi S. Krishnapriyan, Irene Y. Chen, Tanja Schultz, Shinji Watanabe, Gopala K. Anumanchipalli
Abstract: The amount of articulatory data available for training deep learning models is much smaller than the amount of acoustic speech data. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks with real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.

【5】 NeckCare: Preventing Tech Neck using Hearable-based Multimodal Sensing
链接 Link: https://arxiv.org/abs/2412.13579
作者 Authors: Bhawana Chhaglani, Alan Seefeldt
Abstract: Tech neck is a modern epidemic caused by prolonged device usage, and it can lead to significant neck strain and discomfort. This paper addresses the challenge of detecting and preventing tech neck syndrome using non-invasive ubiquitous sensing techniques. We present NeckCare, a novel system leveraging hearable sensors, including IMUs and microphones, to monitor tech neck postures and estimate distance from the screen in real time. By analyzing pitch, displacement, and acoustic ranging data from 15 participants, we achieve posture classification accuracy of 96% using IMU data alone and 99% when combined with audio data. Our distance estimation technique is millimeter-level accurate even in noisy conditions. NeckCare provides immediate feedback to users, promoting healthier posture and reducing neck strain. Future work will explore personalizing alerts, predicting muscle strain, integrating neck exercise detection, and enhancing digital eye strain prediction.

【6】 Tuning Music Education: AI-Powered Personalization in Learning Music
链接 Link: https://arxiv.org/abs/2412.13514
作者 Authors: Mayank Sanganeria, Rohan Gala
Note: 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Creative AI Track
Abstract: Recent AI-driven step-function advances in several longstanding problems in music technology are opening up new avenues for creating the next generation of music education tools. Creating personalized, engaging, and effective learning experiences is a continuously evolving challenge in music education. Here we present two case studies that use these advances in music technology to address such challenges. In our first case study we showcase an application that uses Automatic Chord Recognition to generate personalized exercises from audio tracks, connecting traditional ear training with real-world musical contexts. In the second case study we prototype adaptive piano method books that use Automatic Music Transcription to generate exercises at different skill levels while retaining a close connection to musical interests. These applications demonstrate how recent AI developments can democratize access to high-quality music education and promote rich interaction with music in the age of generative AI. We hope this work inspires other efforts in the community aimed at removing barriers to access to high-quality music education and fostering human participation in musical expression.

【7】 SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation
链接 Link: https://arxiv.org/abs/2412.13462
作者 Authors: Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji
Note: 5 pages, 3 figures
Abstract: This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking Spatially Aligned Audio-Video Generation (SAVG). We propose three key components for the benchmark: a dataset, a baseline, and metrics. We introduce a spatially aligned audio-visual dataset, derived from an audio-visual dataset consisting of multichannel audio, video, and spatiotemporal annotations of sound events. We propose a baseline audio-visual diffusion model focused on stereo audio-visual joint learning to accommodate spatial sound. Finally, we present metrics to evaluate video and spatial audio quality, including a new spatial audio-visual alignment metric. Our experimental results demonstrate that gaps exist between the baseline model and the ground truth in terms of video and audio quality, as well as in the spatial alignment between the two modalities.

【8】 Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks
链接 Link: https://arxiv.org/abs/2412.13421
作者 Authors: Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller
Abstract: Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions. Consequently, MGM detection (MGMD) is crucial for preserving the integrity of these fields. Despite its significance, the MGMD domain lacks the comprehensive benchmark results necessary to drive meaningful progress. To address this gap, we conduct experiments on existing large-scale datasets using a range of foundational models for audio processing, establishing benchmark results tailored to the MGMD task. Our selection includes traditional machine learning models, deep neural networks, Transformer-based architectures, and State Space Models (SSMs). Recognising the inherently multimodal nature of music, which integrates both melody and lyrics, we also explore fundamental multimodal models in our experiments. Beyond providing basic binary classification outcomes, we delve deeper into model behaviour using multiple explainable Artificial Intelligence (XAI) tools, offering insights into their decision-making processes. Our analysis reveals that ResNet18 performs best in both in-domain and out-of-domain tests. By providing a comprehensive comparison of benchmark results and their interpretability, we propose several directions to inspire future research toward more robust and effective detection methods for MGM.

【9】 Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge
链接 Link: https://arxiv.org/abs/2412.13279
作者 Authors: Mahieyin Rahmun, Rafat Hasan Khan, Tanjim Taharat Aurpa, Sadia Khan, Zulker Nayeen Nahiyan, Mir Sayad Bin Almas, Rakibul Hasan Rajib, Syeda Sakira Hassan
Abstract: The aim of this project is to implement and design a robust synthetic speech classifier for the IEEE Signal Processing Cup 2022 challenge. Here, we learn a synthetic speech attribution model using speech generated from various text-to-speech (TTS) algorithms as well as unknown TTS algorithms. We experiment with both classical machine learning methods, such as support vector machines and Gaussian mixture models, and deep learning based methods, such as ResNet, VGG16, and two shallow end-to-end networks. We observe that the deep learning based methods operating on raw data demonstrate the best performance.



Submit your résumé directly: join@speechhome.com

语音之家 (SpeechHome): a community supporting AI speech developers