语音/音频处理学术速递[10.29]

今日论文合集:cs.SD语音19篇,eess.AS音频处理22篇。

本文经arXiv每日学术速递授权转载

微信公众号:arXiv_Daily

cs.SD语音

【1】 GPT-4o System Card

标题:GPT-4o系统卡
链接:https://arxiv.org/abs/2410.21276
作者:OpenAI:  Aaron Hurst,  Adam Lerer,  Adam P. Goucher,  Adam Perelman,  Aditya Ramesh,  Aidan Clark,  AJ Ostrow,  Akila Welihinda,  Alan Hayes,  Alec Radford,  Aleksander Mądry,  Alex Baker-Whitcomb,  Alex Beutel,  Alex Borzunov,  Alex Carney,  Alex Chow,  Alex Kirillov,  Alex Nichol,  Alex Paino,  Alex Renzin,  Alex Tachard Passos,  Alexander Kirillov,  Alexi Christakis,  Alexis Conneau,  Ali Kamali,  Allan Jabri,  Allison Moyer,  Allison Tam,  Amadou Crookes,  Amin Tootoochian,  Amin Tootoonchian,  Ananya Kumar,  Andrea Vallone,  Andrej Karpathy,  Andrew Braunstein,  Andrew Cann,  Andrew Codispoti,  Andrew Galu,  Andrew Kondrich,  Andrew Tulloch,  Andrey Mishchenko,  Angela Baek,  Angela Jiang,  Antoine Pelisse,  Antonia Woodford,  Anuj Gosalia,  Arka Dhar,  Ashley Pantuliano,  Avi Nayak,  Avital Oliver,  Barret Zoph,  Behrooz Ghorbani,  et al. (365 additional authors not shown)
摘要:GPT-4o是一个自回归全模态(omni)模型,它接受文本、音频、图像和视频的任意组合作为输入,并生成文本、音频和图像输出的任意组合。它在文本、视觉和音频上进行端到端训练,这意味着所有输入和输出都由同一个神经网络处理。GPT-4o可以在短至232毫秒内对音频输入做出响应,平均为320毫秒,与人类在对话中的响应时间相近。它在英文文本和代码上与GPT-4 Turbo性能相当,在非英文文本上有显著改进,同时在API中也更快,且便宜50%。与现有模型相比,GPT-4o在视觉和音频理解方面尤其出色。根据我们对安全构建人工智能的承诺,以及我们对白宫的自愿承诺,我们正在分享GPT-4o系统卡,其中包括我们的准备框架(Preparedness Framework)评估。在本系统卡中,我们详细介绍了GPT-4o的能力、局限性和多个类别的安全评估,重点关注语音到语音,同时也评估了文本和图像能力,以及我们为确保模型安全和对齐而实施的措施。我们还包括对危险能力的第三方评估,以及对GPT-4o的文本和视觉能力的潜在社会影响的讨论。
摘要:GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

【2】 OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
标题:OmniSep:基于Query-Mixup的统一全模态声音分离
链接:https://arxiv.org/abs/2410.21269
作者:Xize Cheng,  Siqi Zheng,  Zehan Wang,  Minghui Fang,  Ziang Zhang,  Rongjie Huang,  Ziyang Ma,  Shengpeng Ji,  Jialong Zuo,  Tao Jin,  Zhou Zhao
备注:Working in progress
摘要:近年来,规模的扩大在视觉和语言领域取得了巨大的成功。然而,当涉及到音频时,研究人员在扩大训练数据方面遇到了一个重大挑战,因为大多数自然音频都包含各种干扰信号。为了解决这一限制,我们引入了全模态声音分离(OmniSep),这是一种能够基于全模态查询(包括单模态和多模态组合查询)分离出干净音轨的新框架。具体来说,我们引入了Query-Mixup策略,该策略在训练过程中混合来自不同模态的查询特征。这使得OmniSep能够同时优化多个模态,有效地将所有模态纳入统一的声音分离框架。我们进一步增强了这种灵活性,允许查询对声音分离产生正向或负向影响,从而按需保留或去除特定声音。最后,OmniSep采用了一种名为Query-Aug的检索增强方法,实现开放词汇的声音分离。在MUSIC、VGGSOUND-CLEAN+和MUSIC-CLEAN+数据集上的实验评估证明了OmniSep的有效性,其在文本、图像和音频查询的声音分离任务中均达到了最先进的性能。有关示例和更多信息,请访问演示页面:\url{https://omnisep.github.io/}。
摘要:The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at \url{https://omnisep.github.io/}.
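下面给出Query-Mixup思路的一个最小示意(非论文官方实现):训练时将来自不同模态的查询嵌入按随机凸组合混合。其中`text_emb`、`image_emb`、`audio_emb`等变量名均为本文为说明而假设的占位。

```python
import torch

def query_mixup(query_embs):
    """将来自不同模态的查询嵌入按随机凸组合混合(示意)。
    query_embs: 若干形状为 (batch, dim) 的模态查询嵌入,例如 [text_emb, image_emb, audio_emb]。"""
    weights = torch.rand(len(query_embs))          # 随机采样混合权重
    weights = weights / weights.sum()              # 归一化为凸组合
    return sum(w * e for w, e in zip(weights, query_embs))

# 用法示意:三个模态的查询嵌入(占位数据)
text_emb = torch.randn(8, 512)
image_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
mixed_query = query_mixup([text_emb, image_emb, audio_emb])  # 形状 (8, 512)
```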

【3】 ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time  Optimization
标题:ST-ITO:通过推理时优化控制音频效果以实现风格迁移
链接:https://arxiv.org/abs/2410.21233
作者:Christian J. Steinmetz,  Shubhr Singh,  Marco Comunità,  Ilias Ibnyahya,  Shanxin Yuan,  Emmanouil Benetos,  Joshua D. Reiss
备注:Accepted to ISMIR 2024. Code available this https URL
摘要:音频制作风格迁移是指对输入进行处理,使其带上参考录音中的风格元素。现有的方法通常训练神经网络来估计一组音频效果器的控制参数。然而,这些方法的局限在于它们只能控制一组固定的效果器,且这些效果器必须可微分,或需要采用专门的训练技术。在这项工作中,我们介绍了ST-ITO(Style Transfer with Inference-Time Optimization),一种在推理时搜索音频效果链参数空间的方法。该方法能够控制任意音频效果链,包括未见过的和不可微分的效果器。我们的方法采用一个通过简单且可扩展的自监督预训练策略学习得到的音频制作风格度量,并结合无梯度优化器。鉴于现有的音频制作风格迁移评估方法有限,我们引入了一个多部分基准,用于评估音频制作风格度量和风格迁移系统。该评估表明,我们的音频表示能更好地捕捉与音频制作相关的属性,并通过控制任意音频效果实现富有表现力的风格迁移。
摘要:Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects.
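下面是在推理时对音频效果链参数做无梯度搜索的一个最小示意,仅用随机搜索代替论文中实际采用的优化器;`apply_effect_chain`与`style_embed`均为假设的占位函数,并非论文或某个库的真实API。

```python
import numpy as np

def random_search(apply_effect_chain, style_embed, x, ref, n_params, iters=200, seed=0):
    """无梯度搜索示意:在效果链参数空间中寻找使输出风格嵌入
    最接近参考录音风格嵌入的参数。"""
    rng = np.random.default_rng(seed)
    ref_emb = style_embed(ref)
    best_params, best_dist = None, np.inf
    for _ in range(iters):
        params = rng.uniform(0.0, 1.0, size=n_params)      # 参数归一化到 [0, 1]
        out = apply_effect_chain(x, params)                 # 任意(可不可微)效果链
        dist = np.linalg.norm(style_embed(out) - ref_emb)   # 风格嵌入距离
        if dist < best_dist:
            best_params, best_dist = params, dist
    return best_params, best_dist
```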

【4】 SepMamba: State-space models for speaker separation using Mamba
标题:SepMamba:使用Mamba进行说话人分离的状态空间模型
链接:https://arxiv.org/abs/2410.20997
作者:Thor Højhus Avenstrup,  Boldizsár Elek,  István László Mádi,  András Bence Schin,  Morten Mørup,  Bjørn Sand Jensen,  Kenny Falkær Olsen
摘要:近年来,基于深度学习的单通道说话人分离取得了显著进步,这主要归功于基于transformer的注意力机制的引入。然而,这些改进是以巨大的计算需求为代价的,使其难以用于许多实际应用。Mamba最近被提出,作为一种具有类似建模能力、但计算效率更高的替代方案。我们提出SepMamba,一种主要由双向Mamba层组成的基于U-Net的架构。我们发现,我们的方法在WSJ0 2-speaker数据集上优于规模相近的主流模型(包括基于transformer的模型),同时显著降低了计算成本、内存使用量和前向传播时间。我们还报告了SepMamba因果变体的强劲结果。对于深度语音分离,我们的方法为基于transformer的架构提供了一种在计算上更有利的替代方案。
摘要:Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.
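下面用一个通用的"双向序列层"包装器示意双向建模的思路:对输入正向与时间翻转后各跑一次因果序列块再拼接。此处以`nn.GRU`作为任意因果序列块的占位,论文中实际使用的是Mamba层(其库API此处不作假设)。

```python
import torch
import torch.nn as nn

class BiDirectionalWrapper(nn.Module):
    """双向序列建模示意:正向序列与时间翻转后的序列各经过一个因果块,再拼接投影。"""
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # 占位:论文中为Mamba块
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)             # 拼接后投影回原维度

    def forward(self, x):
        # x: (batch, time, dim)
        y_f, _ = self.fwd(x)
        y_b, _ = self.bwd(torch.flip(x, dims=[1]))
        y_b = torch.flip(y_b, dims=[1])                 # 反向输出翻回原时间顺序
        return self.proj(torch.cat([y_f, y_b], dim=-1))

x = torch.randn(2, 100, 64)
print(BiDirectionalWrapper(64)(x).shape)                # torch.Size([2, 100, 64])
```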

【5】 Atrial Fibrillation Detection System via Acoustic Sensing for Mobile  Phones
标题:通过声学传感的手机心房颤动检测系统
链接:https://arxiv.org/abs/2410.20852
作者:Xuanyu Liu,  Jiao Li,  Haoxian Liu,  Zongqi Yang,  Yi Huang,  Jin Zhang
备注:This paper has been submitted to ACM Transactions on Sensor Networks (TOSN)
摘要:心房颤动(AF)的特征是起源于心房的不规则电脉冲,可导致严重的并发症甚至死亡。由于AF的间歇性,早期和及时的AF监测对于防止患者病情进一步恶化至关重要。尽管动态心电图(Holter)监测仪可以提供准确的监测,但这些设备的高成本阻碍了它们的广泛采用。目前基于手机的AF检测系统提供了便携式解决方案,然而这些系统存在各种适用性问题,例如容易受环境因素影响,并且需要用户付出大量努力。为了克服上述限制,我们提出了MobileAF,一种使用扬声器和麦克风的新型智能手机AF检测系统。为了捕捉微小的心脏活动,我们提出了一种多通道脉搏波探测方法。此外,我们通过引入三级脉搏波净化流水线来提高信号质量。更重要的是,我们构建了一个基于ResNet的网络模型,以实现准确可靠的AF检测。我们使用智能手机上的数据收集应用收集了23名参与者的数据。大量实验结果表明,我们的系统性能优越,准确率为97.9%,精确率为96.8%,召回率为97.2%,特异性为98.3%,F1分数为97.0%。
摘要:Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of the AF, early and timely monitoring of AF is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution, however, these systems have various applicability issues such as being easily affected by environmental factors and requiring significant user effort. To overcome the above limitations, we present MobileAF, a novel smartphone-based AF detection system using speakers and microphones. In order to capture minute cardiac activities, we propose a multi-channel pulse wave probing method. In addition, we enhance the signal quality by introducing a three-stage pulse wave purification pipeline. What's more, a ResNet-based network model is built to implement accurate and reliable AF detection. We collect data from 23 participants utilizing our data collection application on the smartphone. Extensive experimental results demonstrate the superior performance of our system, with 97.9% accuracy, 96.8% precision, 97.2% recall, 98.3% specificity, and 97.0% F1 score.
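下面是摘要中所报告指标(准确率、精确率、召回率、特异性、F1)的计算方式示意,与论文实现无关,仅作定义层面的说明。

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """由混淆矩阵计算准确率、精确率、召回率、特异性与F1(示意,1=AF)。"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # 即灵敏度
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }
```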

【6】 Data-Efficient Low-Complexity Acoustic Scene Classification via  Distilling and Progressive Pruning
标题:通过蒸馏和渐进式剪枝实现数据高效、低复杂度的声学场景分类
链接:https://arxiv.org/abs/2410.20775
作者:Bing Han,  Wen Huang,  Zhengyang Chen,  Anbai Jiang,  Pingyi Fan,  Cheng Lu,  Zhiqiang Lv,  Jia Liu,  Wei-Qiang Zhang,  Yanmin Qian
备注:submitted to ICASSP 2025
摘要:声学场景分类(ASC)任务的目标是将录音分类到预定义的声学场景类别之一。然而,在现实场景中,ASC系统经常遇到诸如录音设备不匹配、低复杂度约束以及标注数据有限等挑战。为了缓解这些问题,本文以新的模型架构和更好的训练策略构建了一个数据高效且低复杂度的ASC系统。具体来说,我们首先设计了一种名为Rep-Mobile的新型低复杂度架构,它集成了可在推理时重参数化的多卷积分支。与其他模型相比,它实现了更好的性能和更低的计算复杂度。然后,我们应用知识蒸馏策略,并比较了不同架构的教师模型在数据效率上的差异。最后,我们提出了一种渐进式剪枝策略,即分多次、每次少量地对模型进行剪枝,与单步剪枝相比可获得更好的性能。实验在TAU数据集上进行。借助Rep-Mobile和这些训练策略,我们提出的ASC系统取得了迄今最先进(SOTA)的结果,并在DCASE2024挑战赛中以明显优势获得第一名。
摘要:The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording device mismatch, low-complexity constraints, and the limited availability of labeled data. To alleviate these issues, in this paper, a data-efficient and low-complexity ASC system is built with a new model architecture and better training strategies. Specifically, we firstly design a new low-complexity architecture named Rep-Mobile by integrating multi-convolution branches which can be reparameterized at inference. Compared to other models, it achieves better performance and less computational complexity. Then we apply the knowledge distillation strategy and provide a comparison of the data efficiency of the teacher model with different architectures. Finally, we propose a progressive pruning strategy, which involves pruning the model multiple times in small amounts, resulting in better performance compared to a single step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves the state-of-the-art (SOTA) results so far, while also winning the first place with a significant advantage over others in the DCASE2024 Challenge.
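下面分别示意知识蒸馏损失与"分多次、每次少量"的渐进式剪枝(基于PyTorch的`torch.nn.utils.prune`);温度、比例等超参数均为假设值,并非论文设置,每步剪枝之间的再微调此处省略。

```python
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """知识蒸馏损失示意:交叉熵 + 温度软化后的KL散度。"""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd

def progressive_prune(model, steps=5, amount_per_step=0.1):
    """渐进式剪枝示意:分多次、每次按较小比例对卷积/线性层做L1非结构化剪枝。"""
    for _ in range(steps):
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                prune.l1_unstructured(module, name="weight", amount=amount_per_step)
        # 每一步之间通常需要再微调若干个epoch(此处省略)
```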

【7】 An Ensemble Approach to Music Source Separation: A Comparative Analysis  of Conventional and Hierarchical Stem Separation
标题:音乐源分离的集成方法:传统与分层音轨(stem)分离的比较分析
链接:https://arxiv.org/abs/2410.20773
作者:Saarth Vardhan,  Pavani R Acharya,  Samarth S Rao,  Oorjitha Ratna Jasthi,  S Natarajan
摘要:音乐源分离(MSS)是从混合音频信号中分离出单个声源(即音轨,stem)的任务。本文提出了一种MSS的集成方法,结合了几种最先进的架构,在传统的人声、鼓、贝斯(VDB)音轨上实现卓越的分离性能,并扩展到底鼓(kick)、军鼓(snare)、主唱和背景人声等子音轨的第二级分层分离。我们的方法通过利用各种模型的互补优势,解决了依赖单一模型的局限性,从而在各个音轨上获得更平衡的结果。在音轨选择上,我们使用了信噪比(SNR)和信号失真比(SDR)的调和平均值,确保极端值不会扭曲结果,并且两个指标都得到有效加权。除了在VDB音轨上保持一贯的高性能外,我们还探索了第二级分层分离,揭示了关于MSS复杂性、以及流派和乐器编制等因素如何影响模型性能的重要见解。虽然第二级分离结果仍有改进空间,但分离子音轨的能力标志着一个重大进步。我们的研究结果为MSS的进一步研究铺平了道路,特别是在将模型能力扩展到VDB之外,以及改进吉他和钢琴等小众音轨的分离方面。
摘要:Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals. This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance across traditional Vocal, Drum, and Bass (VDB) stems, as well as expanding into second-level hierarchical separation for sub-stems like kick, snare, lead vocals, and background vocals. Our method addresses the limitations of relying on a single model by utilising the complementary strengths of various models, leading to more balanced results across stems. For stem selection, we used the harmonic mean of Signal-to-Noise Ratio (SNR) and Signal-to-Distortion Ratio (SDR), ensuring that extreme values do not skew the results and that both metrics are weighted effectively. In addition to consistently high performance across the VDB stems, we also explored second-level hierarchical separation, revealing important insights into the complexities of MSS and how factors like genre and instrumentation can influence model performance. While the second-level separation results show room for improvement, the ability to isolate sub-stems marks a significant advancement. Our findings pave the way for further research in MSS, particularly in expanding model capabilities beyond VDB and improving niche stem separations such as guitar and piano.
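下面示意按SNR与SDR的调和平均为某条音轨选择模型的做法;其中的模型名称与分数均为虚构示例。

```python
def harmonic_mean(snr, sdr):
    """SNR与SDR的调和平均,避免单一指标的极端值主导选择。"""
    return 2 * snr * sdr / (snr + sdr)

# 假设的各模型在某一音轨(stem)上的 (SNR, SDR) 评分
scores = {
    "model_a": (7.2, 6.8),
    "model_b": (9.1, 5.0),
    "model_c": (7.8, 7.5),
}
best = max(scores, key=lambda m: harmonic_mean(*scores[m]))
print(best)   # 调和平均最高的模型被选为该音轨的输出
```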

【8】 Mitigating Unauthorized Speech Synthesis for Voice Protection
标题:缓解未经授权的语音合成以实现语音保护
链接:https://arxiv.org/abs/2410.20742
作者:Zhisheng Zhang,  Qianyi Yang,  Derui Wang,  Pengyang Huang,  Yuxin Cao,  Kai Ye,  Jie Hao
备注:Accepted to ACM CCS Workshop (LAMPS) 2024
摘要:近年来,只需几个语音样本就可以近乎完美地复制说话者的声音,而恶意的语音利用(例如为获取非法经济利益而实施的电信诈骗)给我们的日常生活带来了巨大危害。因此,保护包含敏感信息(例如个人声纹)的可公开访问的语音数据至关重要。以往的防御方法大多着眼于在音色相似度上欺骗说话人验证系统,但合成的deepfake语音仍然具有很高的质量。为了应对不断上升的风险,我们设计了一种有效、可迁移且鲁棒的主动保护技术,称为关键目标扰动(Pivotal Objective Perturbation, POP),它对原始语音样本施加不可感知的误差最小化噪声,防止它们被文本到语音(TTS)合成模型有效学习,从而无法生成高质量的deepfake语音。我们在最先进(SOTA)的TTS模型上进行了广泛实验,利用客观和主观指标全面评估我们提出的方法。实验结果表明该方法在各种模型之间具有出色的有效性和可迁移性。与在无保护样本上训练的语音合成器21.94%的语音不清晰度分数相比,POP保护的样本将其显著提高到127.31%。此外,我们的方法对降噪和数据增强技术表现出鲁棒性,从而大大降低了潜在危害。
摘要:With just a few speech samples, it is possible to perfectly replicate a speaker's voice in recent years, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards in our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples to prevent them from being effectively learned for text-to-speech (TTS) synthesis models so that high-quality deepfake speeches cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.
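下面是"误差最小化噪声"这一思路的概念性示意:在不可感知的扰动范围内优化噪声,使代理TTS训练损失尽量小,从而使受保护样本"无信息可学"。`tts_loss`为假设的占位函数,具体目标函数与论文的POP方法可能不同。

```python
import torch

def error_minimizing_noise(wave, text, tts_loss, eps=0.002, steps=50, lr=1e-3):
    """误差最小化扰动示意:在 L∞ 球内优化 delta,
    最小化代理损失 tts_loss(wave + delta, text)。"""
    delta = torch.zeros_like(wave, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = tts_loss(wave + delta, text)   # 注意是最小化而非最大化该损失
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)           # 投影回不可感知的扰动范围
    return delta.detach()
```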

【9】 Using Confidence Scores to Improve Eyes-free Detection of Speech  Recognition Errors
标题:使用置信度分数改进语音识别错误的免视(eyes-free)检测
链接:https://arxiv.org/abs/2410.20564
作者:Sadia Nowrin,  Keith Vertanen
摘要:会话系统严重依赖语音识别来解释和响应用户的命令和查询。然而,识别错误可能会发生,并显著影响此类系统的性能。虽然视觉反馈有助于检测错误,但它并不总是可行,特别是对盲人或低视力人群而言。在这项研究中,我们探讨如何根据识别器对其结果的置信度来调整转录文本的音频输出,从而改进错误检测。我们的研究结果表明,与均匀放慢音频相比,在识别器表现出不确定性时选择性地放慢音频,使参与者的错误检测能力相对提高了12%。
摘要:Conversational systems rely heavily on speech recognition to interpret and respond to user commands and queries. Nevertheless, recognition errors may occur, which can significantly affect the performance of such systems. While visual feedback can help detect errors, it may not always be practical, especially for people who are blind or low-vision. In this study, we investigate ways to improve error detection by manipulating the audio output of the transcribed text based on the recognizer's confidence level in its result. Our findings show that selectively slowing down the audio when the recognizer exhibited uncertainty led to a relative increase of 12% in participants' error detection ability compared to uniformly slowing down the audio.
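下面用librosa示意"按识别置信度选择性放慢音频"的思路;单词边界与置信度的输入格式、阈值和放慢倍率均为本文假设,并非原研究的实现。

```python
import numpy as np
import librosa

def selective_slowdown(y, sr, words, threshold=0.6, slow_rate=0.75):
    """按置信度选择性放慢音频(示意)。
    words: [(start_sec, end_sec, confidence), ...],为假设的输入格式。"""
    out = []
    for start, end, conf in words:
        seg = y[int(start * sr):int(end * sr)]
        if conf < threshold:                                       # 低置信度片段放慢
            seg = librosa.effects.time_stretch(seg, rate=slow_rate)
        out.append(seg)
    return np.concatenate(out)
```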

【10】 Automatic Estimation of Singing Voice Musical Dynamics
标题:歌声音乐力度的自动估计
链接:https://arxiv.org/abs/2410.20540
作者:Jyoti Narang,  Nazif Can Tamer,  Viviana De La Vega,  Xavier Serra
备注:To be published in ISMIR 2024, 6 pages
摘要:音乐力度(dynamics)是富有表现力的歌声演唱的核心组成部分。然而,针对歌声的音乐力度自动分析受到的关注有限,部分原因是缺乏合适的数据集和明确的评估框架。为应对这一挑战,我们提出了一种数据集整理方法。采用所提出的方法,并利用最先进的源分离和对齐技术,我们编制了一个数据集,包含509段带音乐力度标注的歌声演唱,并与163个乐谱文件对齐。这些乐谱来自OpenScore Lieder浪漫主义时期作品语料库,以其丰富的表情标注而闻名。利用整理好的数据集,我们训练了一个具有不同窗口大小的基于多头注意力的CNN模型,以评估音乐力度估计的有效性。我们探索了两种不同的、受感知启发的输入表示用于模型训练:对数梅尔(log-Mel)频谱和基于Bark刻度的特征。在测试方面,我们与一位专业歌手合作,人工整理了另一个包含25段带音乐力度标注演唱的数据集。我们通过实验得出结论:在歌声力度预测任务上,基于Bark刻度的特征优于log-Mel特征。数据集与代码已公开共享,以供该主题的进一步研究。
摘要:Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of romantic-era compositions, widely known for its wealth of expressive annotations. Utilizing the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explored two distinct perceptually motivated input representations for the model training: log-Mel spectrum and bark-scale based features. For testing, we manually curate another dataset of 25 musical dynamics annotated performances in collaboration with a professional vocalist. We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction. The dataset along with the code is shared publicly for further research on the topic.
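下面示意如何用librosa计算摘要中提到的log-Mel输入表示;文件名和参数均为假设值,基于Bark刻度的特征可用类似流程换用Bark滤波器组得到(此处未展示)。

```python
import numpy as np
import librosa

# 计算 log-Mel 频谱作为模型输入表示(示意,"vocals.wav" 为占位文件名)
y, sr = librosa.load("vocals.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)   # 形状: (n_mels, 帧数)
```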

【11】 MidiTok Visualizer: a tool for visualization and analysis of tokenized  MIDI symbolic music
标题:MidiTok Visualizer:用于可视化和分析token化MIDI符号音乐的工具
链接:https://arxiv.org/abs/2410.20518
作者:Michał Wiszenko,  Kacper Stefański,  Piotr Malesa,  Łukasz Pokorzyński,  Mateusz Modrzejewski
备注:in Extended Abstracts for the Late-Breaking Demo Sessionof the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024
摘要:符号音乐研究在与音乐相关的机器学习中起着至关重要的作用,但对于没有音乐专业知识的人来说,MIDI数据可能很复杂。为了解决这个问题,我们提出了MidiTok Visualizer,一个旨在帮助探索和可视化MidiTok Python包中各种MIDI标记化(tokenization)方法的Web应用。MidiTok Visualizer提供了许多可自定义的参数,使用户能够上传MIDI文件,在交互式钢琴卷帘旁边可视化标记化后的数据。
摘要:Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll.

【12】 Symbotunes: unified hub for symbolic music generative models
标题:Symbotunes:符号音乐生成模型的统一中心
链接:https://arxiv.org/abs/2410.20515
作者:Paweł Skierś,  Maksymilian Łazarski,  Michał Kopeć,  Mateusz Modrzejewski
摘要:流行的符号音乐生成模型的实现通常在所使用的库和整体项目结构方面存在显著差异。因此,直接比较这些方法或熟悉它们可能会带来挑战。为了缓解这个问题,我们引入了Symbotunes,这是一个用于符号音乐生成模型的开源统一中心。Symbotunes包含用于符号音乐生成的著名方法的现代Python实现,以及用于生成和训练的统一流水线。
摘要:Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.

【13】 MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
标题:MusicFlow:用于文本引导音乐生成的级联流匹配
链接:https://arxiv.org/abs/2410.20478
作者:K R Prajwal,  Bowen Shi,  Matthew Lee,  Apoorv Vyas,  Andros Tjandra,  Mahi Luthra,  Baishan Guo,  Huiyu Wang,  Triantafyllos Afouras,  David Kant,  Wei-Ning Hsu
备注:ICML 2024
摘要:我们介绍了MusicFlow,一个基于流匹配的级联文本到音乐生成模型。基于用于桥接文本描述和音乐音频的自监督表示,我们构建了两个流匹配网络,分别对语义特征和声学特征的条件分布建模。此外,我们利用掩码预测作为训练目标,使模型能够以zero-shot方式泛化到音乐填充(infilling)和续写等其他任务。MusicCaps上的实验表明,尽管模型规模小2~5倍、所需迭代步数少5倍,MusicFlow生成的音乐仍具有更优的质量和文本一致性。同时,该模型可以执行其他音乐生成任务,并在音乐填充和续写方面取得了有竞争力的性能。我们的代码和模型将公开提供。
摘要:We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over $2\sim5$ times smaller and requiring $5$ times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.
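下面给出(条件)流匹配训练目标的一个最小示意,采用直线插值的常见形式;`v_theta`为假设的速度场网络,与论文的级联结构和具体特征无关。

```python
import torch

def flow_matching_loss(v_theta, x1, cond):
    """条件流匹配训练目标示意(直线插值形式)。
    v_theta(x_t, t, cond) 为假设的速度场网络;x1 为数据特征。"""
    x0 = torch.randn_like(x1)                                # 噪声端点
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))     # 每个样本采样一个 t
    x_t = (1 - t) * x0 + t * x1                              # 直线路径上的插值点
    target_v = x1 - x0                                       # 该路径的恒定速度
    pred_v = v_theta(x_t, t, cond)
    return torch.mean((pred_v - target_v) ** 2)
```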

【14】 Conditional GAN for Enhancing Diffusion Models in Efficient and  Authentic Global Gesture Generation from Audios
标题:用条件GAN增强扩散模型,实现从音频高效且逼真的全局手势生成
链接:https://arxiv.org/abs/2410.20359
作者:Yongkang Cheng,  Mingjiang Liang,  Shaoli Huang,  Gaoge Han,  Jifeng Ning,  Wei Liu
备注:Accepted by WACV 2025 (Round 1)
摘要:音频驱动的同步手势生成对于人机通信、AI游戏和电影制作至关重要。虽然以前的研究已经显示出希望,但仍然存在局限性。基于VAE的方法伴随着局部抖动和全局不稳定性的问题,而基于扩散模型的方法受到低生成效率的阻碍。这是因为后者中的DDPM的去噪过程依赖于这样的假设,即在每一步添加的噪声是从单峰分布中采样的,并且噪声值很小。DDIM借用了欧拉方法求解微分方程的思想,打乱了马尔可夫链过程,并增加了噪声步长,以减少去噪步骤的数量,从而加速生成。然而,在逐步去噪过程中简单地增加步长会导致结果逐渐偏离原始数据分布,导致生成的动作质量显著下降,并出现不自然的伪影。本文突破了DDPM的假设条件,在去噪速度和保真度方面取得了突破性进展。具体来说,我们引入了一个条件GAN来捕获音频控制信号,并在同一采样步骤内隐式地匹配扩散和去噪步骤之间的多模态去噪分布,旨在采样更大的噪声值并应用更少的去噪步骤以实现高速生成。
摘要:Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution, and the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieves breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation.

【15】 An approach to hummed-tune and song sequences matching
标题:一种哼唱曲调与歌曲序列匹配的方法
链接:https://arxiv.org/abs/2410.20352
作者:Loc Bao Pham,  Huong Hoang Luong,  Phu Thien Tran,  Phuc Hoang Ngo,  Vi Hoang Nguyen,  Thinh Nguyen
备注:None
摘要:旋律卡在脑海里,也被称为"耳虫",很难摆脱,除非你再听一遍或大声唱出来。但如果你找不到这首歌的名字呢?那一定是一种难以忍受的感觉。根据哼唱声识别歌曲名称对人类来说并非易事,应该交给机器完成。然而,目前还没有关于哼唱曲调识别的研究论文发表。本文改编自Hum2Song Zalo AI Challenge 2021——一个根据用户哼唱的曲调查询歌曲名称的比赛,类似于Google的Hum to Search。本文详细介绍了如何将原始数据(mp3)预处理为可用于训练和推理的形式。在训练用于特征提取阶段的嵌入模型时,我们用一些最先进的模型进行了实验,例如ResNet、VGG、AlexNet、MobileNetV2。在推理阶段,我们使用Faiss模块来高效检索与哼唱序列相匹配的歌曲。结果在公共测试集上的MRR@10指标接近94%,并在公共排行榜上排名第一。
摘要:Melody stuck in your head, also known as "earworm", is tough to get rid of, unless you listen to it again or sing it out loud. But what if you can not find the name of that song? It must be an intolerable feeling. Recognizing a song name base on humming sound is not an easy task for a human being and should be done by machines. However, there is no research paper published about hum tune recognition. Adapting from Hum2Song Zalo AI Challenge 2021 - a competition about querying the name of a song by user's giving humming tune, which is similar to Google's Hum to Search. This paper covers details about the pre-processed data from the original type (mp3) to usable form for training and inference. In training an embedding model for the feature extraction phase, we ran experiments with some states of the art, such as ResNet, VGG, AlexNet, MobileNetV2. And for the inference phase, we use the Faiss module to effectively search for a song that matched the sequence of humming sound. The result comes at nearly 94\% in MRR@10 metric on the public test set, along with the top 1 result on the public leaderboard.
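下面示意用Faiss做最近邻歌曲检索并计算MRR@10的流程;嵌入维度、库大小与数据均为占位,仅说明检索与指标的基本用法。

```python
import numpy as np
import faiss

d = 256                                                      # 嵌入维度(假设)
song_embs = np.random.rand(10000, d).astype("float32")       # 歌曲库嵌入(占位)
hum_embs = np.random.rand(100, d).astype("float32")          # 哼唱查询嵌入(占位)
true_ids = np.random.randint(0, 10000, size=100)             # 每条查询的正确歌曲id(占位)

index = faiss.IndexFlatL2(d)                                 # 精确L2检索索引
index.add(song_embs)
_, topk = index.search(hum_embs, 10)                         # 每条查询返回前10个候选

# MRR@10:正确答案出现在第 r 位得 1/r,未出现得 0,再取平均
rr = []
for ids, gt in zip(topk, true_ids):
    hits = np.where(ids == gt)[0]
    rr.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
print("MRR@10:", np.mean(rr))
```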

【16】 Get Large Language Models Ready to Speak: A Late-fusion Approach for  Speech Generation
标题:让大型语言模型准备好说话:语音生成的后期融合方法
链接:https://arxiv.org/abs/2410.20336
作者:Maohao Shen,  Shun Zhang,  Jilong Wu,  Zhiping Xiu,  Ehab AlBadawy,  Yiting Lu,  Mike Seltzer,  Qing He
摘要:大型语言模型(LLM)以其在各种基于文本的任务中令人印象深刻的性能,彻底改变了自然语言处理(NLP)。然而,将以文本为主的LLM扩展到语音生成任务仍未得到充分探索。在这项工作中,我们介绍了一个由微调Llama模型驱动的文本到语音(TTS)系统,名为TTS-Llama,它实现了最先进的语音合成性能。在TTS-Llama的基础上,我们进一步提出了MoLE-Llama,这是一种通过纯后期融合的参数高效微调(PEFT)和混合专家(mixture-of-experts)架构开发的文本与语音多模态LLM。大量实证结果表明,MoLE-Llama在纯文本问答(QA)和TTS任务上均具有有竞争力的性能,并缓解了任一模态中的灾难性遗忘问题。最后,我们进一步在文本输入、语音输出的QA任务上探索MoLE-Llama,展示了其作为具备语音生成能力的多模态对话系统的巨大潜力。
摘要:Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to with speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
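下面示意用peft库对Llama类模型做LoRA参数高效微调(PEFT)的常见写法;模型名与目标模块均为假设示例,并非论文所用配置,也未涉及其后期融合与混合专家结构。

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# 参数高效微调示意;模型名与 target_modules 仅为假设
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # 只在注意力投影上注入低秩适配器
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # 仅少量适配器参数可训练
```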

【17】 Do Discrete Self-Supervised Representations of Speech Capture Tone  Distinctions?
标题:语音的离散自监督表示能否捕获声调差异?
链接:https://arxiv.org/abs/2410.19935
作者:Opeyemi Osakuade,  Simon King
备注:Submitted to ICASSP 2025
摘要:从自监督学习(SSL)基础模型获得的语音离散表示被广泛使用,特别是在下游任务数据有限的情况下,例如低资源语言。通常,将语音离散化为符号序列是通过对SSL模型的潜在表示进行无监督聚类实现的。我们的研究评估了使用k-means得到的离散符号是否充分捕捉了两种示例语言(普通话和约鲁巴语)中的声调。我们比较了来自HuBERT base、MandarinHuBERT或XLS-R的潜在向量与离散符号在元音和声调分类上的表现。我们发现,即使对于针对特定语言的SSL模型,使用离散符号也会导致大量声调信息的损失。我们认为,离散化需要具备任务感知,特别是对于依赖声调的下游任务。
摘要:Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks.
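下面示意"对SSL帧级特征做k-means聚类、再逐帧指派聚类中心得到离散符号序列"的常见流程;特征此处以随机数组占位,实际应替换为HuBERT等模型抽取的特征。

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# 假设 feats 是由 HuBERT 等SSL模型抽取的帧级特征,形状 (总帧数, 768);此处用随机数占位
feats = np.random.randn(50000, 768).astype("float32")

kmeans = MiniBatchKMeans(n_clusters=100, random_state=0)
kmeans.fit(feats)                                           # 无监督聚类得到码本

# 对一条新语音的帧级特征,逐帧指派最近的聚类中心即得到离散符号序列
utt_feats = np.random.randn(300, 768).astype("float32")
symbols = kmeans.predict(utt_feats)                         # 形状 (300,) 的整数序列
```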

【18】 Meta-Learning Approaches for Improving Detection of Unseen Speech  Deepfakes
标题:用于改进未见过语音Deepfake检测的元学习方法
链接:https://arxiv.org/abs/2410.20578
作者:Ivan Kukanov,  Janne Laakkonen,  Tomi Kinnunen,  Ville Hautamäki
备注:6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
摘要:目前的语音深度伪造检测方法对已知的攻击者表现令人满意;然而,对未见过攻击的泛化仍然是一个开放的挑战。社交媒体上语音深度伪造的泛滥凸显了对能够泛化到训练中未观察到的未见过攻击的系统的需求。我们从元学习的角度来解决这个问题,旨在学习攻击不变的特征,以便仅凭极少量样本就能适应未见过的攻击。由于生成大规模训练数据集通常代价高昂或不可行,这种方法很有前景。我们的实验表明,仅使用来自未见过数据集的96个样本,InTheWild数据集上的等错误率(EER)就从21.67%改善到10.42%。持续的少样本(few-shot)自适应可确保系统保持最新。
摘要:Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to unseen attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn attack-invariant features to adapt to unseen attacks with very few samples available. This approach is promising since generating of a high-scale training dataset is often expensive or infeasible. Our experiments demonstrated an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few-shot adaptation ensures that the system remains up-to-date.
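下面示意等错误率(EER)的常见计算方式(取FPR与FNR最接近处);标签与分数均为占位数据。

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """等错误率(EER)示意:FPR 与 FNR(=1-TPR) 相等处的错误率。"""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # 取二者最接近的阈值点
    return float((fpr[idx] + fnr[idx]) / 2)

labels = np.array([1, 1, 0, 0, 1, 0])            # 1=深度伪造, 0=真实(占位)
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.8])
print(compute_eer(labels, scores))
```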

【19】 Single-word Auditory Attention Decoding Using Deep Learning Model
标题:使用深度学习模型的单字听觉注意力解码
链接:https://arxiv.org/abs/2410.19793
作者:Nhan Duc Thanh Nguyen,  Huy Phan,  Kaare Mikkelsen,  Preben Kidmose
备注:5 pages, 3 figures
摘要:通过比较听觉刺激和相应的大脑反应来识别听觉注意,被称为听觉注意解码(AAD)。大多数AAD算法利用所谓的包络夹带(entrainment)机制,即通过听觉流的包络如何驱动脑电图(EEG)信号的变化来识别听觉注意。然而,神经处理也可以基于内源性认知反应来解码,在本文中即由对语音流中特定单词的注意所诱发的神经反应。这种方法在AAD领域基本上尚未被探索,但它引出了单词级听觉注意解码问题,即将与特定单词对齐的EEG信号时段(epoch)标记为被注意或未被注意。本文提出了一种基于EEGNet的深度学习方法来应对这一挑战。我们在一个基于事件的AAD数据集上进行了与被试无关(subject-independent)的评估,该数据集包含三种不同的范式:词类别oddball、带竞争说话人的词类别,以及带目标词的竞争语音流。结果表明,经过适配的模型能够利用与认知相关的时空EEG特征,在最接近真实场景的竞争范式下对未见过的被试达到至少58%的准确率。据我们所知,这是第一个处理该问题的研究。
摘要:Identifying auditory attention by comparing auditory stimuli and corresponding brain responses, is known as auditory attention decoding (AAD). The majority of AAD algorithms utilize the so-called envelope entrainment mechanism, whereby auditory attention is identified by how the envelope of the auditory stream drives variation in the electroencephalography (EEG) signal. However, neural processing can also be decoded based on endogenous cognitive responses, in this case, neural responses evoked by attention to specific words in a speech stream. This approach is largely unexplored in the field of AAD but leads to a single-word auditory attention decoding problem in which an epoch of an EEG signal timed to a specific word is labeled as attended or unattended. This paper presents a deep learning approach, based on EEGNet, to address this challenge. We conducted a subject-independent evaluation on an event-based AAD dataset with three different paradigms: word category oddball, word category with competing speakers, and competing speech streams with targets. The results demonstrate that the adapted model is capable of exploiting cognitive-related spatiotemporal EEG features and achieving at least 58% accuracy on the most realistic competing paradigm for the unseen subjects. To our knowledge, this is the first study dealing with this problem.
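下面示意"把与特定单词对齐的EEG时段切出并标注被注意/未被注意"这一数据组织方式;采样率、时间窗与事件格式均为本文假设。

```python
import numpy as np

def extract_word_epochs(eeg, fs, events, tmin=-0.2, tmax=0.8):
    """将连续EEG切成与单词对齐的时段(epoch)并打上注意/未注意标签(示意)。
    eeg: (通道数, 采样点数);events: [(单词起始秒, 是否被注意0/1), ...](假设格式)。"""
    epochs, labels = [], []
    for onset, attended in events:
        start = int((onset + tmin) * fs)
        stop = int((onset + tmax) * fs)
        if start >= 0 and stop <= eeg.shape[1]:
            epochs.append(eeg[:, start:stop])
            labels.append(attended)
    return np.stack(epochs), np.array(labels)    # (N, 通道, 时间), (N,)
```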

eess.AS音频处理

【1】 Meta-Learning Approaches for Improving Detection of Unseen Speech  Deepfakes
标题:用于改进未见过语音Deepfake检测的元学习方法
链接:https://arxiv.org/abs/2410.20578
作者:Ivan Kukanov,  Janne Laakkonen,  Tomi Kinnunen,  Ville Hautamäki
备注:6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
摘要:目前的语音深度伪造检测方法对已知的攻击者表现令人满意;然而,对未见过攻击的泛化仍然是一个开放的挑战。社交媒体上语音深度伪造的泛滥凸显了对能够泛化到训练中未观察到的未见过攻击的系统的需求。我们从元学习的角度来解决这个问题,旨在学习攻击不变的特征,以便仅凭极少量样本就能适应未见过的攻击。由于生成大规模训练数据集通常代价高昂或不可行,这种方法很有前景。我们的实验表明,仅使用来自未见过数据集的96个样本,InTheWild数据集上的等错误率(EER)就从21.67%改善到10.42%。持续的少样本(few-shot)自适应可确保系统保持最新。
摘要:Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to unseen attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn attack-invariant features to adapt to unseen attacks with very few samples available. This approach is promising since generating of a high-scale training dataset is often expensive or infeasible. Our experiments demonstrated an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few-shot adaptation ensures that the system remains up-to-date.

【2】 Analyzing long-term rhythm variations in Mising and Assamese using  frequency domain correlates
标题:利用频域相关量分析米辛语和阿萨姆语的长期节律变化
链接:https://arxiv.org/abs/2410.20095
作者:Parismita Gogoi,  Priyankoo Sarmah,  S. R. M. Prasanna
备注:Submitted to International Journal of Asian Language Processing (IJALP)
摘要:本研究探索长时语音节律变化,用以区分米辛语和阿萨姆语这两种来自印度东北部阿萨姆邦的低资源语言。我们研究了嵌入在由幅度调制(AM)和频率调制(FM)包络导出的低频(LF)谱图中的语音节律时间信息。这种节律的定量频域分析以节律共振峰分析(RFA)的思想为基础,该思想最初由Gibbon [1]提出。我们尝试提取由前六个节律共振峰轨迹导出的特征,以及AM和FM低频谱图的二维离散余弦变换表征,以此展开研究。导出的特征被输入机器学习工具,用于对比阿萨姆语和米辛语的节律。通过这种方式,我们为印度东北部的两种低资源语言展示了一种改进的方法,可在无需预先对语音信号的较大单元进行标注的情况下,对节律变化结构进行实证研究。
摘要:The current work explores long-term speech rhythm variations to classify Mising and Assamese, two low-resourced languages from Assam, Northeast India. We study the temporal information of speech rhythm embedded in low-frequency (LF) spectrograms derived from amplitude (AM) and frequency modulation (FM) envelopes. This quantitative frequency domain analysis of rhythm is supported by the idea of rhythm formant analysis (RFA), originally proposed by Gibbon [1]. We attempt to make the investigation by extracting features derived from trajectories of first six rhythm formants along with two-dimensional discrete cosine transform-based characterizations of the AM and FM LF spectrograms. The derived features are fed as input to a machine learning tool to contrast rhythms of Assamese and Mising. In this way, an improved methodology for empirically investigating rhythm variation structure without prior annotation of the larger unit of the speech signal is illustrated for two low-resourced languages of Northeast India.
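下面示意摘要中几个要素的常见实现:用希尔伯特变换取AM包络、取包络的低频谱(节律相关能量集中于约10 Hz以下),以及对低频谱图做二维DCT得到紧凑特征;窗口与截止频率等均为假设值。

```python
import numpy as np
from scipy.signal import hilbert, stft
from scipy.fft import rfft, rfftfreq, dctn

fs = 16000
y = np.random.randn(fs * 10)                       # 占位语音信号,10秒

# AM包络:解析信号的幅值
am_env = np.abs(hilbert(y))

# 包络的低频谱:节律相关能量集中在约 0-10 Hz,"节律共振峰"可从其峰值读出
spec = np.abs(rfft(am_env - am_env.mean()))
freqs = rfftfreq(len(am_env), d=1 / fs)
lf_spec = spec[freqs <= 10]

# 低频谱图及其二维DCT表征(示意)
f, t, Z = stft(am_env, fs=fs, nperseg=fs)          # 1秒窗的包络谱图
lf_spectrogram = np.abs(Z[f <= 10, :])
dct_feats = dctn(lf_spectrogram, norm="ortho")[:4, :4].ravel()   # 取低阶DCT系数作特征
```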

【3】 Single-word Auditory Attention Decoding Using Deep Learning Model
标题:使用深度学习模型的单字听觉注意力解码
链接:https://arxiv.org/abs/2410.19793
作者:Nhan Duc Thanh Nguyen,  Huy Phan,  Kaare Mikkelsen,  Preben Kidmose
备注:5 pages, 3 figures
摘要:通过比较听觉刺激和相应的大脑反应来识别听觉注意,被称为听觉注意解码(AAD)。大多数AAD算法利用所谓的包络夹带(entrainment)机制,即通过听觉流的包络如何驱动脑电图(EEG)信号的变化来识别听觉注意。然而,神经处理也可以基于内源性认知反应来解码,在本文中即由对语音流中特定单词的注意所诱发的神经反应。这种方法在AAD领域基本上尚未被探索,但它引出了单词级听觉注意解码问题,即将与特定单词对齐的EEG信号时段(epoch)标记为被注意或未被注意。本文提出了一种基于EEGNet的深度学习方法来应对这一挑战。我们在一个基于事件的AAD数据集上进行了与被试无关(subject-independent)的评估,该数据集包含三种不同的范式:词类别oddball、带竞争说话人的词类别,以及带目标词的竞争语音流。结果表明,经过适配的模型能够利用与认知相关的时空EEG特征,在最接近真实场景的竞争范式下对未见过的被试达到至少58%的准确率。据我们所知,这是第一个处理该问题的研究。
摘要:Identifying auditory attention by comparing auditory stimuli and corresponding brain responses, is known as auditory attention decoding (AAD). The majority of AAD algorithms utilize the so-called envelope entrainment mechanism, whereby auditory attention is identified by how the envelope of the auditory stream drives variation in the electroencephalography (EEG) signal. However, neural processing can also be decoded based on endogenous cognitive responses, in this case, neural responses evoked by attention to specific words in a speech stream. This approach is largely unexplored in the field of AAD but leads to a single-word auditory attention decoding problem in which an epoch of an EEG signal timed to a specific word is labeled as attended or unattended. This paper presents a deep learning approach, based on EEGNet, to address this challenge. We conducted a subject-independent evaluation on an event-based AAD dataset with three different paradigms: word category oddball, word category with competing speakers, and competing speech streams with targets. The results demonstrate that the adapted model is capable of exploiting cognitive-related spatiotemporal EEG features and achieving at least 58% accuracy on the most realistic competing paradigm for the unseen subjects. To our knowledge, this is the first study dealing with this problem.

【4】 A Novel Numerical Method for Relaxing the Minimal Configurations of  TOA-Based Joint Sensors and Sources Localization
标题:一种放宽基于TOA的传感器与源联合定位最小配置要求的新型数值方法
链接:https://arxiv.org/abs/2410.19772
作者:Faxian Cao,  Yongqiang Cheng,  Adil Mehmood Khan,  Zhijing Yang,  Yingxiu Chang
备注:13 pages, 6 figures
摘要:本文提出了一种新的数值方法,放宽了在3D空间中使用到达时间(TOA)测量进行传感器与源联合定位(JSSL)的最小配置要求。传统上,该原理要求有效方程(TOA测量)的数量必须等于或大于未知变量(传感器和源位置)的数量。最先进的文献表明,定位所需的传感器和源的最小数量分别为四至六个和六至四个。然而,这些严格的配置限制了JSSL在传感器和源数量不足场景中的应用。为克服这一限制,我们提出了一种数值方法,减少了所需的传感器和源数量,使JSSL配置更加灵活。首先,我们将JSSL任务表述为一系列三角形,并应用余弦定理来确定与一对传感器和三对源相关联的四个未知距离。接下来,利用三角不等式,我们基于已知的TOA测量为这些未知量建立上下界。随后,数值方法在这些边界内搜索全局最优解,证明3D空间中的JSSL仅用四个传感器和四个源即可实现,从而显著放宽了最小配置要求。理论证明和仿真结果证实了该方法的可行性和有效性。
摘要:This work introduces a novel numerical method that relaxes the minimal configuration requirements for joint sensors and sources localization (JSSL) in 3D space using time of arrival (TOA) measurements. Traditionally, the principle requires that the number of valid equations (TOA measurements) must be equal to or greater than the number of unknown variables (sensor and source locations). State-of-the-art literature suggests that the minimum numbers of sensors and sources needed for localization are four to six and six to four, respectively. However, these stringent configurations limit the application of JSSL in scenarios with an insufficient number of sensors and sources. To overcome this limitation, we propose a numerical method that reduces the required number of sensors and sources, enabling more flexible JSSL configurations. First, we formulate the JSSL task as a series of triangles and apply the law of cosines to determine four unknown distances associated with one pair of sensors and three pairs of sources. Next, by utilizing triangle inequalities, we establish the lower and upper boundaries for these unknowns based on the known TOA measurements. The numerical method then searches within these boundaries to find the global optimal solutions, demonstrating that JSSL in 3D space is achievable with only four sensors and four sources, thus significantly relaxing the minimal configuration requirements. Theoretical proofs and simulation results confirm the feasibility and effectiveness of the proposed method.
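下面用一个小的数值例子示意摘要中的两个要素:用余弦定理由一对"传感器-源"距离与夹角求第三边,以及在夹角未知时用三角不等式给出上下界(数值法即在此类边界内搜索);所有数值均为虚构示例。

```python
import numpy as np

c = 343.0                                    # 假设的传播速度 (m/s)
toa_a, toa_b = 0.00583, 0.00875              # 源S1到传感器A、B的TOA(虚构)
d_a, d_b = c * toa_a, c * toa_b              # 换算为距离,约 2.0 m 与 3.0 m

# 余弦定理:若已知S1处A、B两方向的夹角theta,可求传感器间距AB
theta = np.deg2rad(40.0)                     # 示意角度
d_ab = np.sqrt(d_a**2 + d_b**2 - 2 * d_a * d_b * np.cos(theta))

# 三角不等式:即使夹角未知,AB也被限制在 |d_a-d_b| 与 d_a+d_b 之间,
# 数值法正是在这类上下界内搜索全局最优解
lower, upper = abs(d_a - d_b), d_a + d_b
print(d_ab, (lower, upper))
```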

【5】 GPT-4o System Card
标题:GPT-4o系统卡
链接:https://arxiv.org/abs/2410.21276
作者:OpenAI:  Aaron Hurst,  Adam Lerer,  Adam P. Goucher,  Adam Perelman,  Aditya Ramesh,  Aidan Clark,  AJ Ostrow,  Akila Welihinda,  Alan Hayes,  Alec Radford,  Aleksander Mądry,  Alex Baker-Whitcomb,  Alex Beutel,  Alex Borzunov,  Alex Carney,  Alex Chow,  Alex Kirillov,  Alex Nichol,  Alex Paino,  Alex Renzin,  Alex Tachard Passos,  Alexander Kirillov,  Alexi Christakis,  Alexis Conneau,  Ali Kamali,  Allan Jabri,  Allison Moyer,  Allison Tam,  Amadou Crookes,  Amin Tootoochian,  Amin Tootoonchian,  Ananya Kumar,  Andrea Vallone,  Andrej Karpathy,  Andrew Braunstein,  Andrew Cann,  Andrew Codispoti,  Andrew Galu,  Andrew Kondrich,  Andrew Tulloch,  Andrey Mishchenko,  Angela Baek,  Angela Jiang,  Antoine Pelisse,  Antonia Woodford,  Anuj Gosalia,  Arka Dhar,  Ashley Pantuliano,  Avi Nayak,  Avital Oliver,  Barret Zoph,  Behrooz Ghorbani,  et al. (365 additional authors not shown)
摘要:GPT-4o是一个自回归全模态(omni)模型,它接受文本、音频、图像和视频的任意组合作为输入,并生成文本、音频和图像输出的任意组合。它在文本、视觉和音频上进行端到端训练,这意味着所有输入和输出都由同一个神经网络处理。GPT-4o可以在短至232毫秒内对音频输入做出响应,平均为320毫秒,与人类在对话中的响应时间相近。它在英文文本和代码上与GPT-4 Turbo性能相当,在非英文文本上有显著改进,同时在API中也更快,且便宜50%。与现有模型相比,GPT-4o在视觉和音频理解方面尤其出色。根据我们对安全构建人工智能的承诺,以及我们对白宫的自愿承诺,我们正在分享GPT-4o系统卡,其中包括我们的准备框架(Preparedness Framework)评估。在本系统卡中,我们详细介绍了GPT-4o的能力、局限性和多个类别的安全评估,重点关注语音到语音,同时也评估了文本和图像能力,以及我们为确保模型安全和对齐而实施的措施。我们还包括对危险能力的第三方评估,以及对GPT-4o的文本和视觉能力的潜在社会影响的讨论。
摘要:GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

【6】 OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
标题:OmniSep:基于Query-Mixup的统一全模态声音分离
链接:https://arxiv.org/abs/2410.21269
作者:Xize Cheng,  Siqi Zheng,  Zehan Wang,  Minghui Fang,  Ziang Zhang,  Rongjie Huang,  Ziyang Ma,  Shengpeng Ji,  Jialong Zuo,  Tao Jin,  Zhou Zhao
备注:Working in progress
摘要:近年来,规模的扩大在视觉和语言领域取得了巨大的成功。然而,当涉及到音频时,研究人员在扩大训练数据方面遇到了一个重大挑战,因为大多数自然音频都包含各种干扰信号。为了解决这一限制,我们引入了全模态声音分离(OmniSep),这是一种能够基于全模态查询(包括单模态和多模态组合查询)分离出干净音轨的新框架。具体来说,我们引入了Query-Mixup策略,该策略在训练过程中混合来自不同模态的查询特征。这使得OmniSep能够同时优化多个模态,有效地将所有模态纳入统一的声音分离框架。我们进一步增强了这种灵活性,允许查询对声音分离产生正向或负向影响,从而按需保留或去除特定声音。最后,OmniSep采用了一种名为Query-Aug的检索增强方法,实现开放词汇的声音分离。在MUSIC、VGGSOUND-CLEAN+和MUSIC-CLEAN+数据集上的实验评估证明了OmniSep的有效性,其在文本、图像和音频查询的声音分离任务中均达到了最先进的性能。有关示例和更多信息,请访问演示页面:\url{https://omnisep.github.io/}。
摘要:The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at \url{https://omnisep.github.io/}.

【7】 ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time  Optimization
标题:ST-ITO:通过推理时优化控制音频效果以实现风格迁移
链接:https://arxiv.org/abs/2410.21233
作者:Christian J. Steinmetz,  Shubhr Singh,  Marco Comunità,  Ilias Ibnyahya,  Shanxin Yuan,  Emmanouil Benetos,  Joshua D. Reiss
备注:Accepted to ISMIR 2024. Code available this https URL
摘要:音频制作风格迁移是指对输入进行处理,使其带上参考录音中的风格元素。现有的方法通常训练神经网络来估计一组音频效果器的控制参数。然而,这些方法的局限在于它们只能控制一组固定的效果器,且这些效果器必须可微分,或需要采用专门的训练技术。在这项工作中,我们介绍了ST-ITO(Style Transfer with Inference-Time Optimization),一种在推理时搜索音频效果链参数空间的方法。该方法能够控制任意音频效果链,包括未见过的和不可微分的效果器。我们的方法采用一个通过简单且可扩展的自监督预训练策略学习得到的音频制作风格度量,并结合无梯度优化器。鉴于现有的音频制作风格迁移评估方法有限,我们引入了一个多部分基准,用于评估音频制作风格度量和风格迁移系统。该评估表明,我们的音频表示能更好地捕捉与音频制作相关的属性,并通过控制任意音频效果实现富有表现力的风格迁移。
摘要:Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects.

【8】 SepMamba: State-space models for speaker separation using Mamba
标题:SepMamba:使用Mamba进行说话人分离的状态空间模型
链接:https://arxiv.org/abs/2410.20997
作者:Thor Højhus Avenstrup,  Boldizsár Elek,  István László Mádi,  András Bence Schin,  Morten Mørup,  Bjørn Sand Jensen,  Kenny Falkær Olsen
摘要:近年来,基于深度学习的单通道说话人分离取得了显著进步,这主要归功于基于transformer的注意力机制的引入。然而,这些改进是以巨大的计算需求为代价的,使其难以用于许多实际应用。Mamba最近被提出,作为一种具有类似建模能力、但计算效率更高的替代方案。我们提出SepMamba,一种主要由双向Mamba层组成的基于U-Net的架构。我们发现,我们的方法在WSJ0 2-speaker数据集上优于规模相近的主流模型(包括基于transformer的模型),同时显著降低了计算成本、内存使用量和前向传播时间。我们还报告了SepMamba因果变体的强劲结果。对于深度语音分离,我们的方法为基于transformer的架构提供了一种在计算上更有利的替代方案。
摘要:Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.

【9】 Atrial Fibrillation Detection System via Acoustic Sensing for Mobile  Phones
标题:通过声学传感的手机心房颤动检测系统
链接:https://arxiv.org/abs/2410.20852
作者:Xuanyu Liu,  Jiao Li,  Haoxian Liu,  Zongqi Yang,  Yi Huang,  Jin Zhang
备注:This paper has been submitted to ACM Transactions on Sensor Networks (TOSN)
摘要:心房颤动(AF)的特征是起源于心房的不规则电脉冲,可导致严重的并发症甚至死亡。由于AF的间歇性,早期和及时的AF监测对于防止患者病情进一步恶化至关重要。尽管动态心电图(Holter)监测仪可以提供准确的监测,但这些设备的高成本阻碍了它们的广泛采用。目前基于手机的AF检测系统提供了便携式解决方案,然而这些系统存在各种适用性问题,例如容易受环境因素影响,并且需要用户付出大量努力。为了克服上述限制,我们提出了MobileAF,一种使用扬声器和麦克风的新型智能手机AF检测系统。为了捕捉微小的心脏活动,我们提出了一种多通道脉搏波探测方法。此外,我们通过引入三级脉搏波净化流水线来提高信号质量。更重要的是,我们构建了一个基于ResNet的网络模型,以实现准确可靠的AF检测。我们使用智能手机上的数据收集应用收集了23名参与者的数据。大量实验结果表明,我们的系统性能优越,准确率为97.9%,精确率为96.8%,召回率为97.2%,特异性为98.3%,F1分数为97.0%。
摘要:Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of the AF, early and timely monitoring of AF is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution, however, these systems have various applicability issues such as being easily affected by environmental factors and requiring significant user effort. To overcome the above limitations, we present MobileAF, a novel smartphone-based AF detection system using speakers and microphones. In order to capture minute cardiac activities, we propose a multi-channel pulse wave probing method. In addition, we enhance the signal quality by introducing a three-stage pulse wave purification pipeline. What's more, a ResNet-based network model is built to implement accurate and reliable AF detection. We collect data from 23 participants utilizing our data collection application on the smartphone. Extensive experimental results demonstrate the superior performance of our system, with 97.9% accuracy, 96.8% precision, 97.2% recall, 98.3% specificity, and 97.0% F1 score.

【10】 Data-Efficient Low-Complexity Acoustic Scene Classification via  Distilling and Progressive Pruning
标题:通过蒸馏和渐进式剪枝实现数据高效、低复杂度的声学场景分类
链接:https://arxiv.org/abs/2410.20775
作者:Bing Han,  Wen Huang,  Zhengyang Chen,  Anbai Jiang,  Pingyi Fan,  Cheng Lu,  Zhiqiang Lv,  Jia Liu,  Wei-Qiang Zhang,  Yanmin Qian
备注:submitted to ICASSP 2025
摘要:声学场景分类(ASC)任务的目标是将录音分类到预定义的声学场景类别之一。然而,在现实场景中,ASC系统经常遇到诸如录音设备不匹配、低复杂度约束以及标注数据有限等挑战。为了缓解这些问题,本文以新的模型架构和更好的训练策略构建了一个数据高效且低复杂度的ASC系统。具体来说,我们首先设计了一种名为Rep-Mobile的新型低复杂度架构,它集成了可在推理时重参数化的多卷积分支。与其他模型相比,它实现了更好的性能和更低的计算复杂度。然后,我们应用知识蒸馏策略,并比较了不同架构的教师模型在数据效率上的差异。最后,我们提出了一种渐进式剪枝策略,即分多次、每次少量地对模型进行剪枝,与单步剪枝相比可获得更好的性能。实验在TAU数据集上进行。借助Rep-Mobile和这些训练策略,我们提出的ASC系统取得了迄今最先进(SOTA)的结果,并在DCASE2024挑战赛中以明显优势获得第一名。
摘要:The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording device mismatch, low-complexity constraints, and the limited availability of labeled data. To alleviate these issues, in this paper, a data-efficient and low-complexity ASC system is built with a new model architecture and better training strategies. Specifically, we firstly design a new low-complexity architecture named Rep-Mobile by integrating multi-convolution branches which can be reparameterized at inference. Compared to other models, it achieves better performance and less computational complexity. Then we apply the knowledge distillation strategy and provide a comparison of the data efficiency of the teacher model with different architectures. Finally, we propose a progressive pruning strategy, which involves pruning the model multiple times in small amounts, resulting in better performance compared to a single step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves the state-of-the-art (SOTA) results so far, while also winning the first place with a significant advantage over others in the DCASE2024 Challenge.

【11】 An Ensemble Approach to Music Source Separation: A Comparative Analysis  of Conventional and Hierarchical Stem Separation
标题:音乐源分离的集成方法:传统与分层音轨(stem)分离的比较分析
链接:https://arxiv.org/abs/2410.20773
作者:Saarth Vardhan,  Pavani R Acharya,  Samarth S Rao,  Oorjitha Ratna Jasthi,  S Natarajan
摘要:音乐源分离(MSS)是从混合音频信号中分离出单个声源(即音轨,stem)的任务。本文提出了一种MSS的集成方法,结合了几种最先进的架构,在传统的人声、鼓、贝斯(VDB)音轨上实现卓越的分离性能,并扩展到底鼓(kick)、军鼓(snare)、主唱和背景人声等子音轨的第二级分层分离。我们的方法通过利用各种模型的互补优势,解决了依赖单一模型的局限性,从而在各个音轨上获得更平衡的结果。在音轨选择上,我们使用了信噪比(SNR)和信号失真比(SDR)的调和平均值,确保极端值不会扭曲结果,并且两个指标都得到有效加权。除了在VDB音轨上保持一贯的高性能外,我们还探索了第二级分层分离,揭示了关于MSS复杂性、以及流派和乐器编制等因素如何影响模型性能的重要见解。虽然第二级分离结果仍有改进空间,但分离子音轨的能力标志着一个重大进步。我们的研究结果为MSS的进一步研究铺平了道路,特别是在将模型能力扩展到VDB之外,以及改进吉他和钢琴等小众音轨的分离方面。
摘要:Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals. This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance across traditional Vocal, Drum, and Bass (VDB) stems, as well as expanding into second-level hierarchical separation for sub-stems like kick, snare, lead vocals, and background vocals. Our method addresses the limitations of relying on a single model by utilising the complementary strengths of various models, leading to more balanced results across stems. For stem selection, we used the harmonic mean of Signal-to-Noise Ratio (SNR) and Signal-to-Distortion Ratio (SDR), ensuring that extreme values do not skew the results and that both metrics are weighted effectively. In addition to consistently high performance across the VDB stems, we also explored second-level hierarchical separation, revealing important insights into the complexities of MSS and how factors like genre and instrumentation can influence model performance. While the second-level separation results show room for improvement, the ability to isolate sub-stems marks a significant advancement. Our findings pave the way for further research in MSS, particularly in expanding model capabilities beyond VDB and improving niche stem separations such as guitar and piano.

【12】 Mitigating Unauthorized Speech Synthesis for Voice Protection
标题:缓解未经授权的语音合成以实现语音保护
链接:https://arxiv.org/abs/2410.20742
作者:Zhisheng Zhang,  Qianyi Yang,  Derui Wang,  Pengyang Huang,  Yuxin Cao,  Kai Ye,  Jie Hao
备注:Accepted to ACM CCS Workshop (LAMPS) 2024
摘要:近年来,只需几个语音样本就可以近乎完美地复制说话者的声音,而恶意的语音利用(例如为获取非法经济利益而实施的电信诈骗)给我们的日常生活带来了巨大危害。因此,保护包含敏感信息(例如个人声纹)的可公开访问的语音数据至关重要。以往的防御方法大多着眼于在音色相似度上欺骗说话人验证系统,但合成的deepfake语音仍然具有很高的质量。为了应对不断上升的风险,我们设计了一种有效、可迁移且鲁棒的主动保护技术,称为关键目标扰动(Pivotal Objective Perturbation, POP),它对原始语音样本施加不可感知的误差最小化噪声,防止它们被文本到语音(TTS)合成模型有效学习,从而无法生成高质量的deepfake语音。我们在最先进(SOTA)的TTS模型上进行了广泛实验,利用客观和主观指标全面评估我们提出的方法。实验结果表明该方法在各种模型之间具有出色的有效性和可迁移性。与在无保护样本上训练的语音合成器21.94%的语音不清晰度分数相比,POP保护的样本将其显著提高到127.31%。此外,我们的方法对降噪和数据增强技术表现出鲁棒性,从而大大降低了潜在危害。
摘要:With just a few speech samples, it is possible to perfectly replicate a speaker's voice in recent years, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards in our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples to prevent them from being effectively learned for text-to-speech (TTS) synthesis models so that high-quality deepfake speeches cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.
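下面是一个带假设的简化示意,展示“误差最小化噪声”这一类主动保护方法的一般形式(与 unlearnable examples 的做法类似):在不可感知的 L∞ 预算内优化扰动,使替代模型的训练损失尽量小。其中的 surrogate_loss 只是占位函数,并非论文中 POP 的真实目标。
```python
# 示意:PGD 式的误差最小化扰动(替代损失为占位符,非论文实现)
import torch

def craft_error_minimizing_noise(wave, surrogate_loss, epsilon=0.002, steps=20, alpha=4e-4):
    """wave: [B, T] 原始语音; surrogate_loss: 输入扰动后语音、返回标量损失的可调用对象。"""
    delta = torch.zeros_like(wave, requires_grad=True)
    for _ in range(steps):
        loss = surrogate_loss(wave + delta)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # 梯度下降:最小化替代训练损失
            delta.clamp_(-epsilon, epsilon)      # 限制在不可感知的 L∞ 球内
        delta.grad.zero_()
    return (wave + delta).detach()

# 用法示例:用一个简单的可微函数代替真实 TTS 训练损失
wave = torch.randn(2, 16000)
fake_loss = lambda x: (x ** 2).mean()
protected = craft_error_minimizing_noise(wave, fake_loss)
```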

【13】 Using Confidence Scores to Improve Eyes-free Detection of Speech  Recognition Errors
标题:使用置信度分数改进语音识别错误的免视(eyes-free)检测
链接:https://arxiv.org/abs/2410.20564
作者:Sadia Nowrin,  Keith Vertanen
摘要:对话系统严重依赖语音识别来理解并响应用户的命令和查询。然而识别错误难以避免,并可能显著影响此类系统的性能。虽然视觉反馈有助于发现错误,但它并不总是可行,尤其对盲人或低视力用户而言。在本研究中,我们探讨了如何根据识别器对其结果的置信度来调整转录文本的语音播报,从而提升错误检测效果。结果表明,在识别器表现出不确定性时有选择地放慢音频,相比统一放慢音频,参与者的错误检测能力相对提升了12%。
摘要:Conversational systems rely heavily on speech recognition to interpret and respond to user commands and queries. Nevertheless, recognition errors may occur, which can significantly affect the performance of such systems. While visual feedback can help detect errors, it may not always be practical, especially for people who are blind or low-vision. In this study, we investigate ways to improve error detection by manipulating the audio output of the transcribed text based on the recognizer's confidence level in its result. Our findings show that selectively slowing down the audio when the recognizer exhibited uncertainty led to a relative increase of 12% in participants' error detection ability compared to uniformly slowing down the audio.
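为说明“按置信度有选择地放慢音频”的思路,下面给出一个假设性的小例子:根据逐词置信度生成带 prosody 标签的 SSML,低置信度的词以较慢语速播报。阈值与语速数值均为虚构,论文中的实际操纵方式以原文为准。
```python
# 示意:为低置信度词段生成放慢语速的 SSML(参数为演示用假设值)
def to_ssml(words_with_conf, threshold=0.6, slow_rate="70%"):
    parts = ["<speak>"]
    for word, conf in words_with_conf:
        if conf < threshold:
            parts.append(f'<prosody rate="{slow_rate}">{word}</prosody>')  # 低置信度:放慢
        else:
            parts.append(word)                                             # 高置信度:正常语速
    parts.append("</speak>")
    return " ".join(parts)

print(to_ssml([("send", 0.95), ("the", 0.9), ("invoice", 0.42), ("today", 0.88)]))
```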

【14】 Automatic Estimation of Singing Voice Musical Dynamics
标题:歌唱声音音乐力度的自动估计
链接:https://arxiv.org/abs/2410.20540
作者:Jyoti Narang,  Nazif Can Tamer,  Viviana De La Vega,  Xavier Serra
备注:To be published in ISMIR 2024, 6 pages
摘要:音乐力度(dynamics)是富有表现力的歌唱表演的核心组成部分。然而,歌唱声音力度的自动分析一直缺乏关注,部分原因在于缺少合适的数据集和明确的评估框架。为应对这一挑战,我们提出了一套数据集构建方法,并据此借助最先进的源分离与对齐技术,整理出一个包含509段带力度标注的歌唱表演、并与163份乐谱文件对齐的数据集。这些乐谱来自以表达性标注丰富著称的OpenScore Lieder浪漫主义时期作品语料库。基于该数据集,我们训练了不同窗口大小的多头注意力CNN模型,以评估力度估计的效果,并比较了两种具有感知依据的输入表示:对数梅尔(log-Mel)频谱和基于Bark刻度的特征。在测试方面,我们与一位专业歌手合作,人工整理了另外25段带力度标注的表演数据。实验表明,在歌唱力度预测任务上,基于Bark刻度的特征优于log-Mel特征。数据集和代码均已公开,以便后续研究。
摘要:Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of romantic-era compositions, widely known for its wealth of expressive annotations. Utilizing the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explored two distinct perceptually motivated input representations for the model training: log-Mel spectrum and bark-scale based features. For testing, we manually curate another dataset of 25 musical dynamics annotated performances in collaboration with a professional vocalist. We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction. The dataset along with the code is shared publicly for further research on the topic.
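下面是一个带假设的特征提取示意,对应摘要中比较的两类输入表示:log-Mel 频谱与基于 Bark 刻度的特征。其中 Bark 频率转换采用 Traunmüller 近似式、分带方式为简单能量聚合,未必与论文实现一致。
```python
# 示意:两类感知输入表示的最小实现(非论文配置)
import numpy as np
import librosa

def log_mel(y, sr, n_mels=80):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                       # [n_mels, frames]

def hz_to_bark(f):
    return 26.81 * f / (1960.0 + f) - 0.53                # Traunmüller (1990) 近似式

def bark_features(y, sr, n_fft=1024, n_bands=24):
    spec = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2       # 功率谱 [freq_bins, frames]
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    band_idx = np.clip(hz_to_bark(freqs).astype(int), 0, n_bands - 1)
    bands = np.zeros((n_bands, spec.shape[1]))
    for b in range(n_bands):
        mask = band_idx == b
        if mask.any():
            bands[b] = spec[mask].sum(axis=0)              # 按 Bark 临界带聚合能量
    return np.log(bands + 1e-10)

y = librosa.tone(440, sr=22050, duration=1.0)              # 1 秒 440Hz 正弦波作演示音频
print(log_mel(y, 22050).shape, bark_features(y, 22050).shape)
```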

【15】 MidiTok Visualizer: a tool for visualization and analysis of tokenized  MIDI symbolic music
标题:MidiTok Visualizer:一个用于可视化和分析标记化(tokenized)MIDI符号音乐的工具
链接:https://arxiv.org/abs/2410.20518
作者:Michał Wiszenko,  Kacper Stefański,  Piotr Malesa,  Łukasz Pokorzyński,  Mateusz Modrzejewski
备注:in Extended Abstracts for the Late-Breaking Demo Sessionof the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024
摘要:符号音乐研究在音乐相关的机器学习中起着至关重要的作用,但MIDI数据对缺乏音乐专业知识的人来说可能较为复杂。为此,我们提出了MidiTok Visualizer,这是一个Web应用,旨在帮助用户探索和可视化MidiTok Python包中的各种标记化(tokenization)方法。MidiTok Visualizer提供了许多可自定义的参数,用户可以上传MIDI文件,在交互式钢琴卷帘(piano roll)旁查看标记化后的数据。
摘要:Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll.
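为直观说明“钢琴卷帘可视化”的概念,下面给出一个极简的示意:把若干 (音高, 起始拍, 时值) 音符画成水平短条。音符列表为手工构造,并非 MidiTok 的真实 token 格式,也不是该工具的实现。
```python
# 示意:用 matplotlib 画一个极简的钢琴卷帘(音符数据为手工构造)
import matplotlib.pyplot as plt

# 每个音符: (MIDI 音高, 起始拍, 时值拍)
notes = [(60, 0.0, 1.0), (64, 1.0, 1.0), (67, 2.0, 1.0), (72, 3.0, 2.0)]

fig, ax = plt.subplots(figsize=(6, 2.5))
for pitch, start, dur in notes:
    ax.broken_barh([(start, dur)], (pitch - 0.4, 0.8))   # 每个音符画成一条水平短条
ax.set_xlabel("Time (beats)")
ax.set_ylabel("MIDI pitch")
ax.set_title("Minimal piano roll")
plt.tight_layout()
plt.savefig("piano_roll.png")
```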

【16】 Symbotunes: unified hub for symbolic music generative models
标题:Symbotunes:符号音乐生成模型的统一中心
链接:https://arxiv.org/abs/2410.20515
作者:Paweł Skierś,  Maksymilian Łazarski,  Michał Kopeć,  Mateusz Modrzejewski
摘要:流行的符号音乐生成模型的实现通常在所使用的库和整体项目结构方面存在显着差异。因此,直接比较这些方法或熟悉它们可能会带来挑战。为了缓解这个问题,我们引入了Symbotunes,这是一个用于符号音乐生成模型的开源统一中心。Symbotunes包含用于符号音乐生成的著名方法的现代Python实现,以及用于生成和训练的统一管道。
摘要:Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.

【17】 MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
标题:MusicFlow:用于文本引导音乐生成的级联流匹配
链接:https://arxiv.org/abs/2410.20478
作者:K R Prajwal,  Bowen Shi,  Matthew Lee,  Apoorv Vyas,  Andros Tjandra,  Mahi Luthra,  Baishan Guo,  Huiyu Wang,  Triantafyllos Afouras,  David Kant,  Wei-Ning Hsu
备注:ICML 2024
摘要:我们提出MusicFlow,一种基于流匹配(flow matching)的级联文本到音乐生成模型。借助自监督表示在文本描述与音乐音频之间架起桥梁,我们构建了两个流匹配网络,分别对语义特征和声学特征的条件分布建模。此外,我们以掩码预测作为训练目标,使模型能够以零样本方式泛化到音乐填充(infilling)和续写等其他任务。在MusicCaps上的实验表明,尽管模型规模小$2\sim5$倍、所需迭代步数少$5$倍,MusicFlow生成的音乐在质量和文本一致性上仍更胜一筹。同时,该模型还能执行其他音乐生成任务,并在音乐填充和续写上取得有竞争力的性能。我们的代码和模型将公开发布。
摘要:We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over $2\sim5$ times smaller and requiring $5$ times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.
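下面给出条件流匹配训练目标的一个通用示意(带假设):在噪声 x0 与数据特征 x1 之间取线性插值路径,让网络回归速度场 x1 - x0。这只是该类方法的常见写法,并非 MusicFlow 的具体网络结构或特征定义。
```python
# 示意:通用的条件流匹配损失(占位模型仅用于检查维度)
import torch

def flow_matching_loss(model, x1, cond):
    """x1: [B, D] 目标特征; cond: [B, C] 条件(如文本/语义特征); model(x_t, t, cond) -> [B, D]."""
    x0 = torch.randn_like(x1)                  # 先验噪声
    t = torch.rand(x1.size(0), 1)              # t ~ U(0, 1)
    x_t = (1.0 - t) * x0 + t * x1              # 线性插值路径
    v_target = x1 - x0                         # 该路径对应的恒定速度场
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

net = torch.nn.Linear(8 + 1 + 4, 8)            # 占位网络:输入 [x_t, t, cond] 拼接
model = lambda x, t, c: net(torch.cat([x, t, c], dim=-1))
loss = flow_matching_loss(model, torch.randn(16, 8), torch.randn(16, 4))
loss.backward()
```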

【18】 Conditional GAN for Enhancing Diffusion Models in Efficient and  Authentic Global Gesture Generation from Audios
标题:用条件GAN增强扩散模型,实现从音频高效且逼真地生成全局手势
链接:https://arxiv.org/abs/2410.20359
作者:Yongkang Cheng,  Mingjiang Liang,  Shaoli Huang,  Gaoge Han,  Jifeng Ning,  Wei Liu
备注:Accepted by WACV 2025 (Round 1)
摘要:音频驱动的同步手势生成对于人机通信、AI游戏和电影制作至关重要。虽然以前的研究已经显示出希望,但仍然存在局限性。基于VAE的方法伴随着局部抖动和全局不稳定性的问题,而基于扩散模型的方法受到低生成效率的阻碍。这是因为后者中的DDPM的去噪过程依赖于这样的假设,即在每一步添加的噪声是从单峰分布中采样的,并且噪声值很小。DDIM借用了欧拉方法求解微分方程的思想,打乱了马尔可夫链过程,并增加了噪声步长,以减少去噪步骤的数量,从而加速生成。然而,在逐步去噪过程中简单地增加步长会导致结果逐渐偏离原始数据分布,导致生成的动作质量显著下降,并出现不自然的伪影。本文突破了DDPM的假设条件,在去噪速度和保真度方面取得了突破性进展。具体来说,我们引入了一个条件GAN来捕获音频控制信号,并在同一采样步骤内隐式地匹配扩散和去噪步骤之间的多模态去噪分布,旨在采样更大的噪声值并应用更少的去噪步骤以实现高速生成。
摘要:Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution, and the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieve breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation.
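下面是一个高度简化的概念示意,说明“用条件 GAN 逼近大步长下的多峰去噪分布”的大致框架:生成器以带噪样本、时间步和音频特征为条件输出更干净的样本,判别器在相同条件下判别真假。网络结构与加噪方式均为假设,论文中的后验采样等细节在此省略。
```python
# 概念示意:条件 GAN 与大步长去噪的结合(大量简化,非论文实现)
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, dim=64, audio_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + audio_dim, 256), nn.ReLU(),
                                 nn.Linear(256, dim))
    def forward(self, x_t, t, audio):
        return self.net(torch.cat([x_t, t, audio], dim=-1))   # 预测更干净的手势表示

class CondDiscriminator(nn.Module):
    def __init__(self, dim=64, audio_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + audio_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, x, t, audio):
        return self.net(torch.cat([x, t, audio], dim=-1))     # 条件真假判别

G, D = CondGenerator(), CondDiscriminator()
bce = nn.BCEWithLogitsLoss()
x_clean, audio = torch.randn(8, 64), torch.randn(8, 32)
t = torch.rand(8, 1)
x_t = x_clean + torch.randn_like(x_clean) * t                 # 简化的大步长加噪
fake = G(x_t, t, audio)
d_loss = bce(D(x_clean, t, audio), torch.ones(8, 1)) + \
         bce(D(fake.detach(), t, audio), torch.zeros(8, 1))   # 判别器损失
g_loss = bce(D(fake, t, audio), torch.ones(8, 1))             # 生成器对抗损失
```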

【19】 An approach to hummed-tune and song sequences matching
标题:哼唱和歌曲序列匹配的方法
链接:https://arxiv.org/abs/2410.20352
作者:Loc Bao Pham,  Huong Hoang Luong,  Phu Thien Tran,  Phuc Hoang Ngo,  Vi Hoang Nguyen,  Thinh Nguyen
备注:None
摘要:萦绕在脑海中的旋律,也被称为“耳虫”,除非再听一遍或大声唱出来,否则很难摆脱。但如果你想不起那首歌的名字呢?那一定很难受。仅凭哼唱来识别歌曲名称对人来说并不容易,这类任务更适合交给机器完成,然而目前几乎没有关于哼唱识别的公开研究论文。本文的工作源自Hum2Song Zalo AI Challenge 2021,这是一个根据用户哼唱的曲调查询歌曲名称的比赛,类似于Google的“哼唱搜索”(Hum to Search)。本文详细介绍了如何将原始数据(mp3)预处理为可用于训练和推理的形式。在特征提取阶段的嵌入模型训练中,我们使用ResNet、VGG、AlexNet、MobileNetV2等多种最先进的模型进行了实验;在推理阶段,我们使用Faiss模块高效检索与哼唱序列相匹配的歌曲。最终在公开测试集上的MRR@10指标接近94%,并在公开排行榜上位列第一。
摘要:Melody stuck in your head, also known as "earworm", is tough to get rid of, unless you listen to it again or sing it out loud. But what if you can not find the name of that song? It must be an intolerable feeling. Recognizing a song name based on humming sound is not an easy task for a human being and should be done by machines. However, there is no research paper published about hummed-tune recognition. This work adapts from the Hum2Song Zalo AI Challenge 2021 - a competition about querying the name of a song from the user's hummed tune, which is similar to Google's Hum to Search. This paper covers details about pre-processing the data from the original format (mp3) into a usable form for training and inference. For the embedding model in the feature extraction phase, we ran experiments with several state-of-the-art architectures, such as ResNet, VGG, AlexNet, and MobileNetV2. For the inference phase, we use the Faiss module to efficiently search for the song that matches the hummed sequence. The result reaches nearly 94% on the MRR@10 metric on the public test set, along with the top-1 result on the public leaderboard.
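下面是一个带假设的检索与评测示意:用 Faiss 做内积(余弦)最近邻检索,把哼唱嵌入与歌曲嵌入库匹配,并计算 MRR@10。嵌入向量用随机数代替真实模型输出,仅用于说明推理与指标计算流程。
```python
# 示意:Faiss 检索 + MRR@10(嵌入与标签均为随机占位数据)
import numpy as np
import faiss

d, n_songs, n_queries = 128, 1000, 50
song_emb = np.random.randn(n_songs, d).astype("float32")
query_emb = np.random.randn(n_queries, d).astype("float32")
true_song = np.random.randint(0, n_songs, size=n_queries)     # 每条哼唱对应的正确歌曲(虚构)

faiss.normalize_L2(song_emb)                                   # 归一化后内积即余弦相似度
faiss.normalize_L2(query_emb)
index = faiss.IndexFlatIP(d)
index.add(song_emb)
_, topk = index.search(query_emb, 10)                          # 每条查询取前 10 首候选歌曲

# MRR@10:命中时取 1/排名,未命中记 0
rr = [1.0 / (list(row).index(t) + 1) if t in row else 0.0
      for row, t in zip(topk, true_song)]
print("MRR@10 =", float(np.mean(rr)))
```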

【20】 Get Large Language Models Ready to Speak: A Late-fusion Approach for  Speech Generation
标题:让大型语言模型准备好说话:语音生成的后期融合方法
链接:https://arxiv.org/abs/2410.20336
作者:Maohao Shen,  Shun Zhang,  Jilong Wu,  Zhiping Xiu,  Ehab AlBadawy,  Yiting Lu,  Mike Seltzer,  Qing He
摘要:大型语言模型(LLM)凭借在各类文本任务上的出色表现,彻底改变了自然语言处理(NLP)。然而,将以文本为主的LLM扩展到语音生成任务仍缺乏充分探索。在这项工作中,我们提出了一个由微调Llama模型驱动的文本到语音(TTS)系统,命名为TTS-Llama,其语音合成性能达到最先进水平。在TTS-Llama的基础上,我们进一步提出MoLE-Llama,这是一种仅通过后期融合的参数高效微调(PEFT)和专家混合(mixture-of-experts)架构构建的文本与语音多模态LLM。大量实验结果表明,MoLE-Llama在纯文本问答(QA)和TTS任务上均具有竞争力,并缓解了两种模态各自的灾难性遗忘问题。最后,我们进一步在“文本输入、语音输出”的QA任务上探索MoLE-Llama,展示了其作为具备语音生成能力的多模态对话系统的巨大潜力。
摘要:Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
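为说明“专家混合(MoE)”这一组件的基本形态,下面给出一个最小的 top-1 路由 MoE 前馈层示意。它只是通用写法,MoLE-Llama 中专家的划分方式以及与 PEFT 的结合细节以原文为准。
```python
# 示意:top-1 路由的专家混合前馈层(通用实现,非论文配置)
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim=256, n_experts=2, hidden=512):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, x):                      # x: [tokens, dim]
        gate = self.router(x).softmax(dim=-1)  # 每个 token 的专家分布
        top1 = gate.argmax(dim=-1)             # top-1 路由
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # 被路由到专家 i 的 token 由该专家处理,并乘以门控权重
                out[mask] = expert(x[mask]) * gate[mask, i:i + 1]
        return out

moe = Top1MoE()
print(moe(torch.randn(10, 256)).shape)         # torch.Size([10, 256])
```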

【21】 emg2qwerty: A Large Dataset with Baselines for Touch Typing using  Surface Electromyography
标题:emg2qwerty:一个带有表面肌电触摸打字基线的大型数据集
链接:https://arxiv.org/abs/2410.20081
作者:Viswanath Sivakumar,  Jeffrey Seely,  Alan Du,  Sean R Bittner,  Adam Berenzweig,  Anuoluwapo Bolarinwa,  Alexandre Gramfort,  Michael I Mandel
备注:Submitted to NeurIPS 2024 Datasets and Benchmarks Track
摘要:表面肌电图(sEMG)以非侵入方式测量肌肉活动产生的信号,其灵敏度足以检测单个脊髓神经元的活动,信息量也足以区分数十种手势及其细微差别。基于手腕的可穿戴sEMG传感器有潜力提供低阻碍、不显眼、信息丰富且随时可用的人机输入。为此,我们推出emg2qwerty:一个在QWERTY键盘上进行触摸打字时于手腕处采集的非侵入式肌电信号大规模数据集,并附带真实标注和可复现的基线。该数据集共包含1,135个会话,覆盖108名用户、总计346小时的录制,是迄今为止同类公开数据集中规模最大的。这些数据展示了非平凡但结构清晰的层次关系:既体现在从神经元到肌肉再到肌肉组合的生成过程上,也体现在跨用户和跨会话的域偏移上。我们应用来自密切相关领域(自动语音识别,ASR)的标准建模技术,证明了仅凭sEMG信号预测按键即可获得强有力的基线性能。我们相信该任务和数据集的丰富性将推动机器学习和神经科学界共同关注的若干问题取得进展。数据集和代码可在https://github.com/facebookresearch/emg2qwerty获取。
摘要:Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low friction, subtle, information rich, always available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial, but well defined hierarchical relationships both in terms of the generative process, from neurons to muscles and muscle combinations, as well as in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at https://github.com/facebookresearch/emg2qwerty.
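下面是一个带假设的基线示意,体现摘要中“借用 ASR 的标准建模技术(此处以 CTC 为例)从 sEMG 特征预测按键”的思路。特征维度、网络结构与字符表均为占位设置,并非 emg2qwerty 的官方基线配置。
```python
# 示意:sEMG 帧级特征 -> 双向 GRU 编码器 -> CTC 按键预测(占位配置)
import torch
import torch.nn as nn

class EmgCtcBaseline(nn.Module):
    def __init__(self, in_feats=32, hidden=128, n_keys=29):     # 28 个按键类别 + blank
        super().__init__()
        self.encoder = nn.GRU(in_feats, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_keys)

    def forward(self, x):                     # x: [B, T, in_feats] 的 sEMG 帧级特征
        h, _ = self.encoder(x)
        return self.head(h).log_softmax(-1)   # [B, T, n_keys]

model = EmgCtcBaseline()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(4, 200, 32)                   # 4 条样本、每条 200 帧
targets = torch.randint(1, 29, (4, 20))       # 每条样本 20 个按键标签(0 留给 blank)
log_probs = model(x).transpose(0, 1)          # CTCLoss 需要 [T, B, C]
loss = ctc(log_probs, targets,
           torch.full((4,), 200, dtype=torch.long),
           torch.full((4,), 20, dtype=torch.long))
loss.backward()
```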

【22】 Do Discrete Self-Supervised Representations of Speech Capture Tone  Distinctions?
标题:语音的离散自监督表示能否捕捉声调差异?
链接:https://arxiv.org/abs/2410.19935
作者:Opeyemi Osakuade,  Simon King
备注:Submitted to ICASSP 2025
摘要:从自监督学习(SSL)基础模型获得的语音离散表示被广泛使用,尤其是在下游任务数据有限的情况下,例如低资源语言。通常,通过对SSL模型的潜在表示进行无监督聚类,将语音离散化为符号序列。我们的研究以普通话和约鲁巴语这两种声调语言为例,评估用k-means得到的离散符号是否充分保留了声调信息。我们比较了来自HuBERT base、MandarinHuBERT或XLS-R的潜在向量与离散符号在元音和声调分类上的表现。我们发现,即使使用面向特定语言的SSL模型,离散符号也会造成大量声调信息的丢失。我们建议离散化过程需要具备任务感知能力,尤其是对依赖声调的下游任务。
摘要:Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks.
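下面是一个带假设的小示例,演示“对 SSL 潜在表示做 k-means 聚类得到离散符号”的一般流程。特征矩阵用随机数代替真实的 HuBERT/XLS-R 输出,聚类数 K 仅为示例。
```python
# 示意:把帧级 SSL 特征离散化为符号序列(特征为随机占位数据)
import numpy as np
from sklearn.cluster import KMeans

T, D, K = 500, 768, 100                      # 500 帧、768 维特征(HuBERT base 的隐层维度)
feats = np.random.randn(T, D)                # 占位:应替换为 SSL 模型输出的帧级特征

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(feats)
discrete_units = kmeans.predict(feats)       # 每帧一个离散符号 ID,形如 [17, 17, 42, ...]
print(discrete_units[:10])

# 声调信息是否保留,可进一步用这些离散 ID(或对应的连续特征)训练一个简单的声调分类器来检验。
```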


机器翻译由腾讯交互翻译提供,仅供参考
