Speech/Audio Processing Academic Digest [11.8]

Digest   2024-11-08 18:00   Beijing

Today's collection: 6 papers in cs.SD (Speech) and 7 in eess.AS (Audio Processing).

Reposted with authorization from arXiv每日学术速递 (arXiv Daily).

WeChat official account: arXiv_Daily

cs.SD (Speech)
【1】 Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages
Link: https://arxiv.org/abs/2411.04573
Authors: Leena G Pillai, Kavya Manohar, Basil K Raju, Elizabeth Sherly
Abstract: This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI's Whisper model. In this approach, we aim to build an ASR model for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. We experimented with this on the Malasar language, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. The Malasar language faces critical challenges for technological intervention due to its lack of a native script and the absence of digital or spoken data resources. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcriptions in Tamil script, a closely related major language. To build an ASR model for Malasar, we first build an intermediate Tamil ASR, leveraging the higher availability of annotated Tamil speech. This intermediate model is subsequently fine-tuned on Malasar data, allowing for more effective ASR adaptation despite limited resources. The multistage fine-tuning strategy demonstrated significant improvements over direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, a 4.5% absolute reduction compared to the direct fine-tuning method. A further WER reduction to 47.3% was achieved through punctuation removal in post-processing, which addresses formatting inconsistencies that impact evaluation. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.
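
To make the pipeline concrete, here is a minimal Python sketch of the two ideas above: sequential adaptation followed by punctuation-stripping before scoring. It assumes the Hugging Face transformers and jiwer packages; the finetune helper, checkpoint and dataset names, and the toy transcripts are hypothetical stand-ins, not the authors' code.

```python
# Minimal sketch, assuming Hugging Face transformers and jiwer are installed;
# dataset names and the finetune() helper are hypothetical placeholders.
import re

import jiwer
from transformers import WhisperForConditionalGeneration, WhisperProcessor

def finetune(model, dataset):
    """Stand-in for a standard seq2seq fine-tuning loop
    (e.g. transformers.Seq2SeqTrainer); omitted for brevity."""
    return model

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Stage 1: adapt to the higher-resource related language (Tamil).
tamil_asr = finetune(base, "tamil_annotated_speech")
# Stage 2: fine-tune the intermediate model on the low-resource target (Malasar).
malasar_asr = finetune(tamil_asr, "malasar_spoken_corpus")

def strip_punctuation(text: str) -> str:
    # Post-processing step the paper reports cuts WER from 51.9% to 47.3%.
    return re.sub(r"[^\w\s]", "", text)

ref = strip_punctuation("a reference transcript , rendered in Tamil script .")
hyp = strip_punctuation("a reference transcript rendered in tamil script")
print(jiwer.wer(ref, hyp))  # word error rate after punctuation removal
```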

【2】 The Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing
Link: https://arxiv.org/abs/2411.04366
Authors: Christopher Tralie, Ben Cantil
Note: 12 pages, 6 figures, Accepted for Publication in The International Society for Music Information Retrieval Proceedings, 2024
Abstract: We present "The Concatenator," a real-time system for audio-guided concatenative synthesis. Similarly to Driedger et al.'s "musaicing" (or "audio mosaicing") technique, we concatenate a set number of windows within a corpus of audio to re-create the harmonic and percussive aspects of a target audio stream. Unlike Driedger's NMF-based technique, however, we instead take an explicitly Bayesian point of view, where corpus window indices are hidden states and the target audio stream is an observation. We use a particle filter to infer the best hidden corpus states in real time. Our transition model includes a tunable parameter to control the time-continuity of corpus grains, and our observation model allows users to prioritize how quickly windows change to match the target. Because the computational complexity of the system is independent of the corpus size, our system scales to corpora that are hours long, an important feature in the age of vast audio data collections. Within The Concatenator module itself, composers can vary grain length, fit to target, and pitch shift in real time while reacting to the sounds they hear, enabling them to rapidly iterate on ideas. To conclude our work, we evaluate our system with extensive quantitative tests of the effects of parameters, as well as a qualitative evaluation with artistic insights. Based on the quality of the results, we believe the real-time capability unlocks new avenues for musical expression and control, suitable for live performance and modular synthesis integration, and represents an essential breakthrough in concatenative synthesis technology.
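
The Bayesian framing can be illustrated with a toy particle-filter step, sketched below with NumPy only. The feature vectors, continuity probability, and similarity kernel are illustrative assumptions, not the authors' implementation.

```python
# Toy particle-filter step for musaicing: hidden states are corpus window
# indices, the observation is the current target window. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N_CORPUS, N_PARTICLES, P_CONTINUE = 500, 100, 0.9
corpus_feats = rng.random((N_CORPUS, 40))  # stand-ins for corpus window features
target_feat = rng.random(40)               # stand-in for the target window

particles = rng.integers(0, N_CORPUS, N_PARTICLES)  # hidden corpus indices

# Transition model: with probability P_CONTINUE advance to the next grain
# (time continuity); otherwise jump to a random corpus window.
jump = rng.random(N_PARTICLES) >= P_CONTINUE
particles = np.where(jump,
                     rng.integers(0, N_CORPUS, N_PARTICLES),
                     (particles + 1) % N_CORPUS)

# Observation model: weight each particle by how well its window matches
# the target; resample in proportion to the weights.
dists = np.linalg.norm(corpus_feats[particles] - target_feat, axis=1)
weights = np.exp(-dists)
weights /= weights.sum()
particles = rng.choice(particles, size=N_PARTICLES, p=weights)

# Output the best-supported corpus window for this frame; per-frame cost
# depends on N_PARTICLES, not N_CORPUS, hence the corpus-size independence.
best = np.bincount(particles, minlength=N_CORPUS).argmax()
print("chosen corpus window:", best)
```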

【3】 Model and Deep learning based Dynamic Range Compression Inversion
Link: https://arxiv.org/abs/2411.04337
Authors: Haoran Sun, Dominique Fourer, Hichem Maaref
Abstract: Dynamic Range Compression (DRC) is a popular audio effect used to control the dynamic range of a signal. Inverting DRC can also help to restore the original dynamics, to produce new mixes and/or to improve the overall quality of the audio signal. Since state-of-the-art DRC inversion techniques either ignore the parameters or require precise parameters that are difficult to estimate, we fill this gap by combining a model-based approach with neural networks for DRC inversion. To this end, depending on the scenario, we use different neural networks to estimate the DRC parameters. A model-based inversion is then completed to restore the original audio signal. Our experimental results show the effectiveness and robustness of the proposed method in comparison to several state-of-the-art methods when applied to two music datasets.
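
The model-based half of the pipeline can be made concrete with a simple static compression curve: once its parameters (here a hypothetical threshold T in dB and ratio R) are known, e.g. estimated by the networks mentioned above, the curve inverts exactly. A NumPy sketch, not the paper's exact compressor model:

```python
# Static (memoryless) DRC and its exact inverse; T and R are illustrative.
import numpy as np

def compress(x, T=-20.0, R=4.0, eps=1e-8):
    level_db = 20 * np.log10(np.abs(x) + eps)
    over = np.maximum(level_db - T, 0.0)          # dB above threshold
    gain_db = -over * (1.0 - 1.0 / R)             # output level: T + (L - T)/R
    return x * 10 ** (gain_db / 20)

def invert(y, T=-20.0, R=4.0, eps=1e-8):
    # Above threshold the compressed level is L_y = T + (L_x - T)/R, so the
    # original level is recovered as L_x = T + (L_y - T) * R.
    level_db = 20 * np.log10(np.abs(y) + eps)
    over = np.maximum(level_db - T, 0.0)
    gain_db = over * (R - 1.0)
    return y * 10 ** (gain_db / 20)

x = 0.8 * np.sin(np.linspace(0, 2 * np.pi, 1000))
print(np.max(np.abs(invert(compress(x)) - x)))    # ~0: exact reconstruction
```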

【4】 Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection
Link: https://arxiv.org/abs/2411.04158
Authors: Nana Lin, Youxiang Zhu, Xiaohui Liang, John A. Batsis, Caroline Summerour
Abstract: Mild cognitive impairment (MCI) is a major public health concern due to its high risk of progressing to dementia. This study investigates the potential of detecting MCI with spontaneous voice assistant (VA) commands from 35 older adults in a controlled setting. Specifically, a command-generation task is designed with pre-defined intents for participants to freely generate commands that are more associated with cognitive ability than read commands. We develop MCI classification and regression models with audio, textual, intent, and multimodal fusion features. We find the command-generation task outperforms the command-reading task with an average classification accuracy of 82%, achieved by leveraging multimodal fusion features. In addition, generated commands correlate more strongly with memory and attention subdomains than read commands. Our results confirm the effectiveness of the command-generation task and imply the promise of using longitudinal in-home commands for MCI detection.
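
As a schematic of the feature-level fusion step, the sketch below uses scikit-learn, with random arrays standing in for the paper's audio, text, and intent features; the dimensions and classifier choice are assumptions, and this toy obviously does not reproduce the reported 82%.

```python
# Early (feature-level) multimodal fusion for MCI classification; all feature
# arrays are random stand-ins for real audio/text/intent embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 35                                # participants, as in the study
audio = rng.random((n, 64))           # e.g. acoustic embeddings
text = rng.random((n, 128))           # e.g. transcript embeddings
intent = rng.random((n, 8))           # e.g. intent-level features
X = np.hstack([audio, text, intent])  # concatenate modalities
y = rng.integers(0, 2, n)             # MCI vs. control labels

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())  # mean classification accuracy
```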

【5】 A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning
Link: https://arxiv.org/abs/2411.04152
Authors: Antonin Gagnere (LTCI, IDS, S2A), Geoffroy Peeters (LTCI, S2A, IDS), Slim Essid (IDS, S2A, LTCI)
Abstract: In this paper, we propose a novel self-supervised learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without knowledge of the ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e., with just a few annotated examples, to achieve competitive beat-tracking performance.
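
The anchor/positive/negative mining can be approximated with librosa's Predominant Local Pulse (PLP) function, as in the sketch below. Treating positives as tatums a power-of-two number of tatum steps from the anchor is one reading of the abstract; the contrastive (InfoNCE-style) training itself is omitted.

```python
# Mining contrastive samples from PLP local maxima (proxy tatum positions);
# the power-of-two positive scheme is an assumption, not the authors' code.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))       # example clip (downloaded on first use)
pulse = librosa.beat.plp(y=y, sr=sr)              # Predominant Local Pulse
tatums = np.flatnonzero(librosa.util.localmax(pulse))  # candidate tatum frames

anchor = tatums[len(tatums) // 2]                 # pick one anchor tatum
idx = np.searchsorted(tatums, anchor)
# Candidate positives: tatums 1, 2, 4, ... steps ahead of the anchor.
positives = [tatums[idx + 2**k] for k in range(3) if idx + 2**k < len(tatums)]
# Negatives: all remaining frame positions.
negatives = np.setdiff1d(np.arange(len(pulse)),
                         np.concatenate(([anchor], positives)))
print(anchor, positives, negatives[:5])
```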

【6】 Unified Pathological Speech Analysis with Prompt Tuning
Link: https://arxiv.org/abs/2411.04142
Authors: Fei Yang, Xuenan Xu, Mengyue Wu, Kai Yu
Note: This work has been submitted to the IEEE for possible publication
Abstract: Pathological speech analysis has been of interest for the detection of certain diseases, such as depression and Alzheimer's disease, and attracts much interest from researchers. However, previous pathological speech analysis models are commonly designed for a specific disease while overlooking the connections between diseases, which may constrain performance and lower training efficiency. Instead of fine-tuning deep models for different tasks, prompt tuning is a much more efficient training paradigm. We thus propose a unified pathological speech analysis system for as many as three diseases using the prompt tuning technique. The system uses prompt tuning to adjust only a small part of the parameters to detect different diseases from the speech of potential patients. Our system leverages a pre-trained spoken language model and demonstrates strong performance across multiple disorders while fine-tuning only a fraction of the parameters. This efficient training approach leads to faster convergence and improved F1 scores by allowing knowledge to be shared across tasks. Our experiments on Alzheimer's disease, depression, and Parkinson's disease show competitive results, highlighting the effectiveness of our method in pathological speech analysis.
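
A generic soft-prompt-tuning sketch in PyTorch appears below: the backbone is frozen and only small per-disease prompts (plus light heads) are trained. The Encoder class, dimensions, and disease names are stand-ins; the paper's actual spoken language model is not reproduced here.

```python
# Soft prompt tuning: freeze the backbone, learn one small prompt per task.
import torch
import torch.nn as nn

class Encoder(nn.Module):                     # stand-in for a frozen SLM
    def __init__(self, d=256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
    def forward(self, x):
        return self.layer(x).mean(dim=1)      # pooled representation

d, n_prompt = 256, 8
encoder = Encoder(d)
for p in encoder.parameters():
    p.requires_grad = False                   # backbone stays frozen

# One small learnable prompt per disease; only these (and the heads) train.
prompts = nn.ParameterDict({
    name: nn.Parameter(torch.randn(n_prompt, d) * 0.02)
    for name in ["alzheimers", "depression", "parkinsons"]
})
heads = nn.ModuleDict({name: nn.Linear(d, 2) for name in prompts})

speech_feats = torch.randn(4, 100, d)         # batch of speech-frame features
task = "depression"
x = torch.cat([prompts[task].expand(4, -1, -1), speech_feats], dim=1)
logits = heads[task](encoder(x))
print(logits.shape)                           # (4, 2): disease vs. control
```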

eess.AS (Audio Processing)

【1】 A Pre-training Framework that Encodes Noise Information for Speech Quality Assessment
Link: https://arxiv.org/abs/2411.04379
Authors: Subrina Sultana, Donald S. Williamson
Abstract: Self-supervised learning (SSL) has grown in interest within the speech processing community, since it produces representations that are useful for many downstream tasks. SSL uses global and contextual methods to produce robust representations, where it can even outperform supervised models. Most self-supervised approaches, however, are limited to embedding information about, e.g., the phonemes, speaker identity, and emotion into the extracted representations, which become invariant to background sounds due to contrastive and auto-regressive learning. This is limiting, because many downstream tasks leverage noise information to function accurately. Therefore, we propose a pre-training framework that learns information pertaining to background noise in a supervised manner, while jointly embedding speech information using a self-supervised strategy. We experiment with multiple encoders and show that our framework is useful for perceptual speech quality estimation, which relies on background cues. Our results show that the proposed approach improves performance with fewer parameters, in comparison to multiple baselines.
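
The joint objective can be sketched as a two-headed encoder: a supervised head classifies the background-noise type while a self-supervised head predicts the next frame. A PyTorch sketch follows; the noise taxonomy, loss weighting, and next-frame SSL proxy are illustrative assumptions, not the paper's exact recipe.

```python
# Joint supervised (noise class) + self-supervised (next-frame) pre-training.
import torch
import torch.nn as nn

class NoiseAwareEncoder(nn.Module):
    def __init__(self, d=128, n_noise_classes=10):
        super().__init__()
        self.backbone = nn.GRU(40, d, batch_first=True)
        self.noise_head = nn.Linear(d, n_noise_classes)  # supervised branch
        self.ssl_head = nn.Linear(d, 40)                 # predicts next frame

    def forward(self, x):
        h, _ = self.backbone(x)
        return h

model = NoiseAwareEncoder()
x = torch.randn(8, 200, 40)                  # batch of log-mel frames
noise_labels = torch.randint(0, 10, (8,))    # background-noise class per clip

h = model(x)
noise_logits = model.noise_head(h.mean(dim=1))
pred_next = model.ssl_head(h[:, :-1])        # simple autoregressive SSL proxy

loss = nn.functional.cross_entropy(noise_logits, noise_labels) \
     + nn.functional.mse_loss(pred_next, x[:, 1:])
loss.backward()
print(float(loss))
```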

【2】 Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection
Link: https://arxiv.org/abs/2411.04158
Authors: Nana Lin, Youxiang Zhu, Xiaohui Liang, John A. Batsis, Caroline Summerour
Abstract: Mild cognitive impairment (MCI) is a major public health concern due to its high risk of progressing to dementia. This study investigates the potential of detecting MCI with spontaneous voice assistant (VA) commands from 35 older adults in a controlled setting. Specifically, a command-generation task is designed with pre-defined intents for participants to freely generate commands that are more associated with cognitive ability than read commands. We develop MCI classification and regression models with audio, textual, intent, and multimodal fusion features. We find the command-generation task outperforms the command-reading task with an average classification accuracy of 82%, achieved by leveraging multimodal fusion features. In addition, generated commands correlate more strongly with memory and attention subdomains than read commands. Our results confirm the effectiveness of the command-generation task and imply the promise of using longitudinal in-home commands for MCI detection.

【3】 A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning
Link: https://arxiv.org/abs/2411.04152
Authors: Antonin Gagnere (LTCI, IDS, S2A), Geoffroy Peeters (LTCI, S2A, IDS), Slim Essid (IDS, S2A, LTCI)
Abstract: In this paper, we propose a novel self-supervised learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without knowledge of the ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e., with just a few annotated examples, to achieve competitive beat-tracking performance.

【4】 Unified Pathological Speech Analysis with Prompt Tuning
Link: https://arxiv.org/abs/2411.04142
Authors: Fei Yang, Xuenan Xu, Mengyue Wu, Kai Yu
Note: This work has been submitted to the IEEE for possible publication
Abstract: Pathological speech analysis has been of interest for the detection of certain diseases, such as depression and Alzheimer's disease, and attracts much interest from researchers. However, previous pathological speech analysis models are commonly designed for a specific disease while overlooking the connections between diseases, which may constrain performance and lower training efficiency. Instead of fine-tuning deep models for different tasks, prompt tuning is a much more efficient training paradigm. We thus propose a unified pathological speech analysis system for as many as three diseases using the prompt tuning technique. The system uses prompt tuning to adjust only a small part of the parameters to detect different diseases from the speech of potential patients. Our system leverages a pre-trained spoken language model and demonstrates strong performance across multiple disorders while fine-tuning only a fraction of the parameters. This efficient training approach leads to faster convergence and improved F1 scores by allowing knowledge to be shared across tasks. Our experiments on Alzheimer's disease, depression, and Parkinson's disease show competitive results, highlighting the effectiveness of our method in pathological speech analysis.

【5】 Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages
Link: https://arxiv.org/abs/2411.04573
Authors: Leena G Pillai, Kavya Manohar, Basil K Raju, Elizabeth Sherly
Abstract: This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI's Whisper model. In this approach, we aim to build an ASR model for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. We experimented with this on the Malasar language, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. The Malasar language faces critical challenges for technological intervention due to its lack of a native script and the absence of digital or spoken data resources. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcriptions in Tamil script, a closely related major language. To build an ASR model for Malasar, we first build an intermediate Tamil ASR, leveraging the higher availability of annotated Tamil speech. This intermediate model is subsequently fine-tuned on Malasar data, allowing for more effective ASR adaptation despite limited resources. The multistage fine-tuning strategy demonstrated significant improvements over direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, a 4.5% absolute reduction compared to the direct fine-tuning method. A further WER reduction to 47.3% was achieved through punctuation removal in post-processing, which addresses formatting inconsistencies that impact evaluation. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.

【6】 The Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing
Link: https://arxiv.org/abs/2411.04366
Authors: Christopher Tralie, Ben Cantil
Note: 12 pages, 6 figures, Accepted for Publication in The International Society for Music Information Retrieval Proceedings, 2024
Abstract: We present "The Concatenator," a real-time system for audio-guided concatenative synthesis. Similarly to Driedger et al.'s "musaicing" (or "audio mosaicing") technique, we concatenate a set number of windows within a corpus of audio to re-create the harmonic and percussive aspects of a target audio stream. Unlike Driedger's NMF-based technique, however, we instead take an explicitly Bayesian point of view, where corpus window indices are hidden states and the target audio stream is an observation. We use a particle filter to infer the best hidden corpus states in real time. Our transition model includes a tunable parameter to control the time-continuity of corpus grains, and our observation model allows users to prioritize how quickly windows change to match the target. Because the computational complexity of the system is independent of the corpus size, our system scales to corpora that are hours long, an important feature in the age of vast audio data collections. Within The Concatenator module itself, composers can vary grain length, fit to target, and pitch shift in real time while reacting to the sounds they hear, enabling them to rapidly iterate on ideas. To conclude our work, we evaluate our system with extensive quantitative tests of the effects of parameters, as well as a qualitative evaluation with artistic insights. Based on the quality of the results, we believe the real-time capability unlocks new avenues for musical expression and control, suitable for live performance and modular synthesis integration, and represents an essential breakthrough in concatenative synthesis technology.

【7】 Model and Deep learning based Dynamic Range Compression Inversion
Link: https://arxiv.org/abs/2411.04337
Authors: Haoran Sun, Dominique Fourer, Hichem Maaref
Abstract: Dynamic Range Compression (DRC) is a popular audio effect used to control the dynamic range of a signal. Inverting DRC can also help to restore the original dynamics, to produce new mixes and/or to improve the overall quality of the audio signal. Since state-of-the-art DRC inversion techniques either ignore the parameters or require precise parameters that are difficult to estimate, we fill this gap by combining a model-based approach with neural networks for DRC inversion. To this end, depending on the scenario, we use different neural networks to estimate the DRC parameters. A model-based inversion is then completed to restore the original audio signal. Our experimental results show the effectiveness and robustness of the proposed method in comparison to several state-of-the-art methods when applied to two music datasets.


Resume submission: join@speechhome.com

语音之家 (SpeechHome): a community for AI speech developers