Speech/Audio Processing Academic Digest [11.7]

Digest   2024-11-07 18:04   Beijing
Today's paper collection: 4 papers in cs.SD (Sound), 4 in eess.AS (Audio and Speech Processing).

Reprinted with permission from the arXiv Daily Digest.

WeChat official account: arXiv_Daily

cs.SD (Sound)
【1】 Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks
Link: https://arxiv.org/abs/2411.03948
Authors: Felipe Marra, Lucas N. Ferreira
Note: Paper accepted at the LAMIR 2024 workshop
Abstract: This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
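The control loop the abstract describes — transcribed speech arriving over time, an LLM turning each chunk into a music description, and that description conditioning a text-to-music model — can be sketched as follows. All names, the window size, and the mood heuristic are hypothetical stand-ins, not the authors' implementation:

```python
# Illustrative sketch of an adaptive-prompt soundtrack loop (NOT the
# authors' code): each window of transcriptions is mapped to a music
# description that would condition a text-to-music model.

def describe_music(transcript: str) -> str:
    """Stand-in for the LLM that maps dialogue to a music description."""
    mood = "tense" if "battle" in transcript.lower() else "calm"
    return f"{mood} orchestral background music"

def generate_soundtrack(transcripts, window=2):
    """Emit one music prompt per window of consecutive transcriptions."""
    prompts = []
    for i in range(0, len(transcripts), window):
        chunk = " ".join(transcripts[i:i + window])
        prompts.append(describe_music(chunk))
    return prompts

segments = [
    "The party enters the tavern and orders drinks.",
    "An old bard shares rumors of a dragon.",
    "Suddenly goblins burst in and a battle begins!",
    "Steel clashes as the fight rages on.",
]
print(generate_soundtrack(segments))
```

The baseline version in the paper would correspond to skipping `describe_music` and passing the transcription straight to the text-to-music model.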

【2】 Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward
Link: https://arxiv.org/abs/2411.03866
Authors: Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke
Note: Submitted to ICASSP 2025 SALMA Workshop
Abstract: Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that the SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
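A shape-level sketch of the kind of linear connector the abstract refers to: encoder frames are downsampled by stacking adjacent frames, then mapped into the LLM's embedding space by a single affine projection. The stacking factor and dimensions here are illustrative assumptions, not SLAM-ASR's exact configuration:

```python
import numpy as np

# Hypothetical linear connector between a speech encoder and an LLM:
# stack k consecutive encoder frames, then apply one affine projection
# so each stacked group becomes one LLM-space "speech token".

rng = np.random.default_rng(0)

def linear_connector(enc_out, W, b, k=4):
    """Stack k adjacent frames, then apply one affine projection."""
    t, d = enc_out.shape
    t_ds = t // k                        # drop any trailing remainder
    stacked = enc_out[: t_ds * k].reshape(t_ds, k * d)
    return stacked @ W + b               # (t_ds, d_llm)

d_enc, d_llm, k = 512, 1024, 4           # illustrative dimensions
W = rng.standard_normal((k * d_enc, d_llm)) * 0.02
b = np.zeros(d_llm)

frames = rng.standard_normal((50, d_enc))    # 50 encoder output frames
tokens = linear_connector(frames, W, b, k)
print(tokens.shape)                      # one speech token per k frames
```

In the trained system, only `W` and `b` (the connector) would be learned while the encoder and LLM stay frozen or lightly tuned; the paper's ablations probe how robust this thin interface is to domain shift and perturbations.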

【3】 MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
Link: https://arxiv.org/abs/2411.03715
Authors: Wen-Chin Huang, Erica Cooper, Tomoki Toda
Note: Submitted to Transactions on Audio, Speech and Language Processing. This work has been submitted to the IEEE for possible publication
Abstract: Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse collection of datasets. In addition, we also introduce SHEET, an open-source toolkit containing complete recipes to conduct SSQA experiments. We provided benchmark results for MOS-Bench, and we also explored multi-dataset training to enhance generalization. Additionally, we proposed a new performance metric, best score difference/ratio, and used latent space visualizations to explain model behavior, offering valuable insights for future research.
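The abstract names a "best score difference/ratio" metric without defining it. Purely as an illustration of the general idea of comparing each system against the best performer per test set — not the paper's actual definition — one could compute each model's gap to, and fraction of, the best score on every benchmark:

```python
# Hypothetical per-testset best-score comparison (the paper's actual
# metric definition is not given in the abstract; this is only a sketch
# of the comparison-to-best idea). Dataset names are illustrative.

def best_score_gaps(scores):
    """scores: {model: {testset: score}}; higher scores are better."""
    testsets = next(iter(scores.values())).keys()
    best = {t: max(m[t] for m in scores.values()) for t in testsets}
    report = {}
    for model, per_set in scores.items():
        report[model] = {
            t: {"diff": best[t] - s, "ratio": s / best[t]}
            for t, s in per_set.items()
        }
    return report

scores = {
    "model_a": {"set1": 0.90, "set2": 0.70},
    "model_b": {"set1": 0.85, "set2": 0.80},
}
print(best_score_gaps(scores)["model_a"]["set2"])
```

A model that generalizes well would show small differences (ratios near 1) across all of MOS-Bench's test sets, not just the in-domain ones.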

【4】 Mobile Recording Device Recognition Based Cross-Scale and Multi-Level Representation Learning
Link: https://arxiv.org/abs/2411.03668
Authors: Chunyan Zeng, Yuhao Zhao, Zhifeng Wang
Note: 16 pages
Abstract: This paper introduces a modeling approach that employs multi-level global processing, encompassing both short-term frame-level and long-term sample-level feature scales. In the initial stage of shallow feature extraction, various scales are employed to extract multi-level features, including Mel-Frequency Cepstral Coefficients (MFCC) and pre-Fbank log energy spectrum. The construction of the identification network model involves considering the input two-dimensional temporal features from both frame and sample levels. Specifically, the model initially employs one-dimensional convolution-based Convolutional Long Short-Term Memory (ConvLSTM) to fuse spatiotemporal information and extract short-term frame-level features. Subsequently, bidirectional long Short-Term Memory (BiLSTM) is utilized to learn long-term sample-level sequential representations. The transformer encoder then performs cross-scale, multi-level processing on global frame-level and sample-level features, facilitating deep feature representation and fusion at both levels. Finally, recognition results are obtained through Softmax. Our method achieves an impressive 99.6% recognition accuracy on the CCNU_Mobile dataset, exhibiting a notable improvement of 2% to 12% compared to the baseline system. Additionally, we thoroughly investigate the transferability of our model, achieving an 87.9% accuracy in a classification task on a new dataset.
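The stages above (frame-level ConvLSTM, sample-level BiLSTM, transformer-encoder fusion, softmax output) can be traced at the tensor-shape level. In this sketch the trained modules are replaced by random affine maps purely to follow the data flow; nothing here reproduces the paper's actual model or its accuracy, and all sizes are illustrative:

```python
import numpy as np

# Shape-level walkthrough of the cross-scale pipeline described in the
# abstract. Random affine maps stand in for the trained ConvLSTM,
# BiLSTM, and transformer-encoder modules.

rng = np.random.default_rng(7)

def affine(x, d_out):
    """Random affine map standing in for a trained layer."""
    return x @ rng.standard_normal((x.shape[-1], d_out)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_frames, n_mfcc, n_devices = 100, 39, 10          # illustrative sizes
mfcc = rng.standard_normal((n_frames, n_mfcc))     # frame-level MFCC features

frame_feats = affine(mfcc, 64)                     # ConvLSTM stand-in: short-term, frame level
segments = frame_feats.reshape(10, 10, 64).mean(axis=1)  # pool frames into sample-level chunks
sample_feats = affine(segments, 128)               # BiLSTM stand-in: long-term, sample level

# Transformer-encoder stand-in: joint processing of both feature scales.
fused = affine(np.vstack([affine(frame_feats, 128), sample_feats]), 128)
probs = softmax(affine(fused.mean(axis=0, keepdims=True), n_devices)[0])
print(probs.shape)                                 # one probability per candidate device
```

The point of the cross-scale design is visible in the `vstack`: the fusion stage sees frame-level and sample-level sequences side by side rather than a single summary vector.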

eess.AS (Audio and Speech Processing): the same four papers as above, cross-listed


Chinese machine translations in the original digest were provided by Tencent TranSmart, for reference only.

Ongoing perk: direct resume submission
Resume submission: join@speechhome.com

SpeechHome (语音之家)
A community for AI speech developers