This article is reposted with authorization from arXiv每日学术速递 (arXiv Daily Academic Digest).
Link: https://arxiv.org/abs/2410.22299
Comments: 2024 6th Asian Digital Image Processing Conference
Abstract: Generating music from images can enhance various applications, including background music for photo slideshows, social media experiences, and video creation. This paper presents an emotion-guided image-to-music generation framework that leverages the Valence-Arousal (VA) emotional space to produce music that aligns with the emotional tone of a given image. Unlike previous models that rely on contrastive learning for emotional consistency, the proposed approach directly integrates a VA loss function to enable accurate emotional alignment. The model employs a CNN-Transformer architecture, featuring pre-trained CNN image feature extractors and three Transformer encoders to capture complex, high-level emotional features from MIDI music. Three Transformer decoders refine these features to generate musically and emotionally consistent MIDI sequences. Experimental results on a newly curated emotionally paired image-MIDI dataset demonstrate the proposed model's superior performance across metrics such as Polyphony Rate, Pitch Entropy, Groove Consistency, and loss convergence.
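As a rough illustration of the VA-loss idea described in this abstract, the sketch below penalizes the distance between the valence-arousal estimate of the generated music and that of the conditioning image; `music_va_head` and `image_va_model` are hypothetical placeholder modules, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def va_loss(pred_tokens_emb, image, music_va_head, image_va_model):
    """Penalize the distance between the VA coordinates of the generated
    music and those predicted from the conditioning image."""
    music_va = music_va_head(pred_tokens_emb)      # (B, 2): valence, arousal of generated MIDI
    with torch.no_grad():
        target_va = image_va_model(image)          # (B, 2): VA estimate of the input image
    return F.mse_loss(music_va, target_va)

# Total objective (assumed form): token-level cross-entropy plus the alignment term,
# e.g. loss = ce_loss + lambda_va * va_loss(...)
```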
Title: Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
Link: https://arxiv.org/abs/2410.22179
Comments: Submitted to NAACL
Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
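The abstract does not spell out the alignment mechanism, but one hedged reading is a cross-attention whose logits are biased by the offset between each encoder position and a learned, monotonically advancing alignment position. The single-step sketch below illustrates that reading only; the shapes and the Gaussian bias are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def aligned_cross_attention(q, k, v, prev_pos, delta_logit, sigma=3.0):
    """q: (B, D) decoder query for one step; k, v: (B, T, D) encoder memory.
    prev_pos: (B,) previous alignment position; delta_logit: (B,) predicted step size."""
    pos = prev_pos + F.softplus(delta_logit)                  # monotone advance of the latent alignment
    scores = torch.einsum("bd,btd->bt", q, k) / q.shape[-1] ** 0.5
    idx = torch.arange(k.shape[1], device=k.device).float()   # encoder positions 0..T-1
    rel = idx.unsqueeze(0) - pos.unsqueeze(1)                 # relative offset of each memory slot
    scores = scores - (rel ** 2) / (2 * sigma ** 2)           # bias attention around the alignment position
    attn = scores.softmax(dim=-1)
    return torch.einsum("bt,btd->bd", attn, v), pos
```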
Title: RankUp: Boosting Semi-Supervised Regression with an Auxiliary Ranking Classifier
Link: https://arxiv.org/abs/2410.22124
Comments: Accepted at NeurIPS 2024 (Poster)
Abstract: State-of-the-art (SOTA) semi-supervised learning techniques, such as FixMatch and its variants, have demonstrated impressive performance in classification tasks. However, these methods are not directly applicable to regression tasks. In this paper, we present RankUp, a simple yet effective approach that adapts existing semi-supervised classification techniques to enhance the performance of regression tasks. RankUp achieves this by converting the original regression task into a ranking problem and training it concurrently with the original regression objective. This auxiliary ranking classifier outputs a classification result, thus enabling integration with existing semi-supervised classification methods. Moreover, we introduce regression distribution alignment (RDA), a complementary technique that further enhances RankUp's performance by refining pseudo-labels through distribution alignment. Despite its simplicity, RankUp, with or without RDA, achieves SOTA results across a range of regression benchmarks, including computer vision, audio, and natural language processing tasks. Our code and log data are open-sourced at https://github.com/pm25/semi-supervised-regression.
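A minimal sketch of the core conversion described above, turning regression targets into pairwise ranking labels for an auxiliary classifier; pair sampling and the integration with FixMatch-style methods are omitted, and `rank_head` is an assumed module.

```python
import torch
import torch.nn.functional as F

def auxiliary_ranking_loss(features, targets, rank_head):
    """Turn regression targets into pairwise ranking labels and train a
    2-way classifier on concatenated feature pairs."""
    idx = torch.randperm(features.size(0), device=features.device)
    f_i, f_j = features, features[idx]
    y_i, y_j = targets, targets[idx]
    pair = torch.cat([f_i, f_j], dim=-1)            # (B, 2D) feature pair
    logits = rank_head(pair)                        # (B, 2): "i > j" vs "i <= j"
    labels = (y_i > y_j).long()                     # ranking label derived from the regression targets
    return F.cross_entropy(logits, labels)

# Joint objective on a labeled batch (assumed form):
# loss = F.mse_loss(reg_head(features), targets) + lambda_rank * auxiliary_ranking_loss(features, targets, rank_head)
```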
Title: USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis
Link: https://arxiv.org/abs/2410.22076
Abstract: Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its superior ubiquity and performance. However, inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition makes existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes correspondence between visual and ultrasonic modalities by leveraging audible audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show USpeech achieves remarkable performance using synthetic ultrasound data comparable to physical data, significantly outperforming state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at https://github.com/aiot-lab/USpeech/.
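The contrastive video-audio pre-training step can be illustrated with a standard symmetric InfoNCE objective over paired clip embeddings; this is a generic sketch, not USpeech's exact loss.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/audio clips, pulling
    matched pairs together in the shared semantic space."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```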
Title: Sing It, Narrate It: Quality Musical Lyrics Translation
Link: https://arxiv.org/abs/2410.22066
Abstract: Translating lyrics for musicals presents unique challenges due to the need to ensure high translation quality while adhering to singability requirements such as length and rhyme. Existing song translation approaches often prioritize these singability constraints at the expense of translation quality, which is crucial for musicals. This paper aims to enhance translation quality while maintaining key singability features. Our method consists of three main components. First, we create a dataset to train reward models for the automatic evaluation of translation quality. Second, to enhance both singability and translation quality, we implement a two-stage training process with filtering techniques. Finally, we introduce an inference-time optimization framework for translating entire songs. Extensive experiments, including both automatic and human evaluations, demonstrate significant improvements over baseline methods and validate the effectiveness of each component in our approach.
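The inference-time optimization is not detailed in the abstract; as one hedged illustration, the sketch below reranks candidate translations by a learned reward minus a syllable-count singability penalty. `generate_candidates`, `reward_model`, and `count_syllables` are hypothetical stand-ins.

```python
def rerank_line(source_line, target_syllables, generate_candidates, reward_model,
                count_syllables, alpha=1.0):
    """Pick the candidate translation that balances translation quality
    (reward-model score) against a singability constraint (syllable match)."""
    best, best_score = None, float("-inf")
    for cand in generate_candidates(source_line):          # e.g. an n-best list from a translation model
        quality = reward_model(source_line, cand)          # learned quality score
        penalty = abs(count_syllables(cand) - target_syllables)
        score = quality - alpha * penalty
        if score > best_score:
            best, best_score = cand, score
    return best
```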
Title: CHORDONOMICON: A Dataset of 666,000 Songs and Their Chord Progressions
Link: https://arxiv.org/abs/2410.22046
Abstract: Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date - created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in their ability to be represented in multiple formats (e.g. text, graph) and the wealth of information chords convey in given contexts, such as their harmonic function. These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.
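A small sketch of one way such progressions could be consumed once loaded, here building a first-order chord-transition model as a toy baseline; the input format is an assumption, since the release format is not described in the abstract.

```python
from collections import Counter, defaultdict

def chord_transition_model(progressions):
    """progressions: iterable of chord-symbol lists, e.g. [['C', 'G', 'Am', 'F'], ...].
    Returns per-chord next-chord probabilities (a first-order Markov baseline)."""
    counts = defaultdict(Counter)
    for prog in progressions:
        for a, b in zip(prog, prog[1:]):
            counts[a][b] += 1
    return {c: {n: k / sum(nxt.values()) for n, k in nxt.items()} for c, nxt in counts.items()}

model = chord_transition_model([["C", "G", "Am", "F"], ["C", "Am", "F", "G"]])
print(model["C"])   # e.g. {'G': 0.5, 'Am': 0.5}
```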
Title: Semi-Supervised Self-Learning Enhanced Music Emotion Recognition
Link: https://arxiv.org/abs/2410.21897
Abstract: Music emotion recognition (MER) aims to identify the emotions conveyed in a given musical piece. However, the public datasets currently available in the MER field have limited sample sizes. Recently, segment-based methods for emotion-related tasks have been proposed, which train backbone networks on shorter segments instead of entire audio clips, thereby naturally augmenting training samples without requiring additional resources. The predicted segment-level results are then aggregated to obtain a prediction for the entire song. The most common practice is for each segment to inherit the label of the clip containing it, but music emotion is not constant over the whole clip, so this introduces label noise and makes training prone to overfitting. To handle the noisy-label issue, we propose a semi-supervised self-learning (SSSL) method, which can differentiate between samples with correct and incorrect labels in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method can achieve better or comparable performance.
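The abstract does not specify the self-learning rule; as a generic stand-in for handling the inherited-label noise, the sketch below keeps the small-loss segments as trusted labels and treats the rest as candidates for re-labeling by the model itself.

```python
import torch
import torch.nn.functional as F

def split_clean_noisy(model, segments, inherited_labels, keep_ratio=0.7):
    """Rank segments by per-sample loss and keep the smallest-loss fraction as
    'trusted'; the remainder can be re-labeled from the model's own predictions."""
    with torch.no_grad():
        logits = model(segments)
        losses = F.cross_entropy(logits, inherited_labels, reduction="none")
    k = int(keep_ratio * len(losses))
    order = torch.argsort(losses)
    trusted = order[:k]          # small-loss segments: inherited clip label likely correct
    noisy = order[k:]            # large-loss segments: treat as unlabeled / pseudo-label them
    return trusted, noisy
```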
Title: Application of Audio Fingerprinting Techniques for Real-Time Scalable Speech Retrieval and Speech Clustering
Link: https://arxiv.org/abs/2410.21876
Abstract: Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple's Shazam or Google's Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.
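For context, systems of this kind typically build on classic landmark fingerprinting; the sketch below shows the standard spectral-peak pair hashing idea with illustrative parameters, not the paper's specific adaptation to speech.

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def fingerprint(audio, sr=8000, fan_out=5):
    """Landmark hashing: spectrogram peaks are paired into (f1, f2, dt) hashes
    that are robust to noise and degradation. `audio` is a 1-D numpy waveform."""
    f, t, spec = signal.spectrogram(audio, fs=sr, nperseg=512, noverlap=256)
    logspec = np.log(spec + 1e-10)
    peaks = (logspec == maximum_filter(logspec, size=(15, 15))) & (logspec > np.median(logspec))
    fi, ti = np.nonzero(peaks)                       # frequency / time indices of local maxima
    order = np.argsort(ti)
    fi, ti = fi[order], ti[order]
    hashes = []
    for i in range(len(ti)):
        for j in range(i + 1, min(i + 1 + fan_out, len(ti))):
            dt = ti[j] - ti[i]
            if 0 < dt < 64:
                hashes.append(((int(fi[i]), int(fi[j]), int(dt)), int(ti[i])))  # (hash key, anchor time)
    return hashes
```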
Title: RDSinger: Reference-Based Diffusion Network for Singing Voice Synthesis
Link: https://arxiv.org/abs/2410.21641
Abstract: Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores, requiring a detailed understanding of notes, pitch, and duration, unlike text-to-speech tasks. Although diffusion models have shown exceptional performance in various generative tasks like image and video creation, their application in SVS is hindered by time complexity and the challenge of capturing acoustic features, particularly during pitch transitions. Some networks learn from the prior distribution and use the compressed latent state as a better start in the diffusion model, but the denoising step doesn't consistently improve quality over the entire duration. We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks. Our approach is inspired by Animate Anyone, a diffusion image network that maintains intricate appearance features from reference images. RDSinger utilizes a FastSpeech2 mel-spectrogram as a reference to mitigate denoising step artifacts. Additionally, existing models can be influenced by misleading information in the compressed latent state during pitch transitions. We address this issue by applying Gaussian blur to parts of the reference mel-spectrogram and adjusting loss weights in these regions. Extensive ablation studies demonstrate the efficiency of our method. Evaluations on OpenCpop, a Chinese singing dataset, show that RDSinger outperforms current state-of-the-art SVS methods.
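A hedged sketch of the blurred-reference idea: Gaussian-blur the reference mel-spectrogram only in marked pitch-transition regions and up-weight those regions in the loss. How the transition mask is obtained and the weight values are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def blurred_reference_and_weights(ref_mel, transition_mask, kernel_size=9, sigma=3.0, w_transition=2.0):
    """ref_mel: (B, 1, n_mels, T) FastSpeech2-style reference mel-spectrogram.
    transition_mask: (B, 1, 1, T) boolean mask marking pitch-transition frames."""
    blurred = TF.gaussian_blur(ref_mel, kernel_size=[kernel_size, kernel_size], sigma=[sigma, sigma])
    ref = torch.where(transition_mask.expand_as(ref_mel), blurred, ref_mel)   # blur only transition regions
    weights = 1.0 + (w_transition - 1.0) * transition_mask.float()            # up-weight those regions
    return ref, weights

def weighted_l1(pred_mel, target_mel, weights):
    return (weights * (pred_mel - target_mel).abs()).mean()
```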
Title: Audio Classification of Low Feature Spectrograms Utilizing Convolutional Neural Networks
Link: https://arxiv.org/abs/2410.21561
Comments: None
Abstract: Modern audio signal classification techniques lack the ability to classify low feature audio signals in the form of spectrographic temporal frequency data representations. Additionally, currently utilized techniques rely on full, diverse data sets that are often not representative of real-world distributions. This paper derives several first-of-its-kind machine learning methodologies to analyze these low feature audio spectrograms given data distributions that may have normalized, skewed, or even limited training sets. In particular, this paper proposes several novel customized convolutional architectures to extract identifying features using binary, one-class, and siamese approaches to identify the spectrographic signature of a given audio signal. Utilizing these novel convolutional architectures as well as the proposed classification methods, these experiments demonstrate state-of-the-art classification accuracy and improved efficiency compared to traditional audio classification methods.
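One of the mentioned approaches, the siamese route, can be sketched as a small CNN embedding trained with a contrastive pair loss; the architecture sizes below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallSpectrogramEncoder(nn.Module):
    """Compact CNN embedding for low-feature spectrogram patches (1-channel input)."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_pair_loss(emb_a, emb_b, same_class, margin=0.5):
    """Siamese objective: pull same-class spectrograms together, push others apart."""
    d = (emb_a - emb_b).pow(2).sum(-1).sqrt()
    return (same_class * d.pow(2) + (1 - same_class) * F.relu(margin - d).pow(2)).mean()
```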
Title: A Novel Score-CAM Based Denoiser for Spectrographic Signature Extraction without Ground Truth
Link: https://arxiv.org/abs/2410.21557
Comments: None
Abstract: Sonar-based audio classification techniques are a growing area of research in the field of underwater acoustics. Usually, underwater noise picked up by passive sonar transducers contains all types of signals that travel through the ocean and is transformed into spectrographic images. As a result, the corresponding spectrograms intended to display the temporal-frequency data of a certain object often include the tonal regions of abundant extraneous noise that can effectively interfere with a 'contact'. So, a majority of spectrographic samples extracted from underwater audio signals are rendered unusable due to their clutter and lack the required distinguishability between different objects. With limited clean true data for supervised training, creating classification models for these audio signals is severely bottlenecked. This paper derives several new techniques to combat this problem by developing a novel Score-CAM based denoiser to extract an object's signature from noisy spectrographic data without being given any ground truth data. In particular, this paper proposes a novel generative adversarial network architecture for learning and producing spectrographic training data in similar distributions to low-feature spectrogram inputs. In addition, this paper also proposes a generalizable class activation mapping based denoiser for different distributions of acoustic data, even real-world data distributions. Utilizing these novel architectures and proposed denoising techniques, these experiments demonstrate state-of-the-art noise reduction accuracy and improved classification accuracy compared to current audio classification standards. As such, this approach has applications not only to audio data but to countless data distributions used all around the world for machine learning.
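For reference, the standard Score-CAM computation that such a denoiser could build on looks roughly as follows; `feature_extractor` and `classifier_head` are assumed modules, and the paper's full GAN-based pipeline is not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(feature_extractor, classifier_head, spectrogram, target_class):
    """spectrogram: (1, 1, H, W). Standard Score-CAM: weight each activation map by the
    class score obtained when the input is masked with that (upsampled, normalized) map."""
    acts = feature_extractor(spectrogram)                      # (1, C, h, w) conv activations
    maps = F.interpolate(acts, size=spectrogram.shape[-2:], mode="bilinear", align_corners=False)
    weights = []
    for c in range(maps.shape[1]):
        m = maps[:, c:c + 1]
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)          # normalize mask to [0, 1]
        score = classifier_head(feature_extractor(spectrogram * m))[0, target_class]
        weights.append(score)
    cam = F.relu((torch.stack(weights).view(1, -1, 1, 1) * maps).sum(dim=1, keepdim=True))
    return cam / (cam.max() + 1e-8)                             # saliency mask usable to suppress clutter

# A denoised spectrogram can then be obtained as `spectrogram * score_cam(...)`.
```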
Title: Enhancing TTS Stability in Hebrew Using Discrete Semantic Units
Link: https://arxiv.org/abs/2410.21502
Abstract: This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts like Hebrew. Utilizing HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves robustness and controllability of the speech output, owing to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further aids in capturing the unique vocal characteristics of the speaker, contributing to the naturalness of the synthesized speech. Our experimental results demonstrate that this approach not only maintains high performance in Hebrew but also shows adaptability to English, underscoring its effectiveness in enhancing stability in TTS systems universally. Our method, named LOTHM (Language of The Hebrew Man), outperforms existing methods in terms of stability while achieving naturalness and speaker similarity on par with previous methods, making it a compelling choice for future speech synthesis applications. Samples can be found on our page: pages.cs.huji.ac.il/adiyoss-lab/LoTHM .
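One common way to obtain HuBERT-style discrete units, assuming the Hugging Face `facebook/hubert-base-ls960` checkpoint and an offline k-means codebook; the layer index and cluster count are illustrative and not necessarily those used in the paper.

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor
from sklearn.cluster import KMeans

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def hubert_features(waveform_16k):
    """waveform_16k: 1-D numpy array at 16 kHz. Returns (T, 768) frame features."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    out = hubert(**inputs, output_hidden_states=True)
    return out.hidden_states[9].squeeze(0)           # an intermediate layer is a typical choice

def fit_codebook(corpus_features, n_units=500):
    """Offline k-means over features pooled from a corpus; predict() then yields unit IDs."""
    return KMeans(n_clusters=n_units, n_init=10).fit(corpus_features)

# units = fit_codebook(corpus_features).predict(hubert_features(wav).numpy())  # discrete units for the TTS LM
```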
Title: Knowledge Distillation for Real-Time Classification of Early Media in Voice Communications
Link: https://arxiv.org/abs/2410.21478
Abstract: This paper investigates the industrial setting of real-time classification of early media exchanged during the initialization phase of voice calls. We explore the application of state-of-the-art audio tagging models and highlight some limitations when they are applied to the classification of early media. While most existing approaches leverage convolutional neural networks, we propose a novel approach with low resource requirements based on gradient-boosted trees. Our approach not only demonstrates a substantial improvement in runtime performance, but also exhibits comparable accuracy. We show that leveraging knowledge distillation and class aggregation techniques to train a simpler and smaller model accelerates the classification of early media in voice calls. We provide a detailed analysis of the results on a proprietary and a publicly available dataset, regarding accuracy and runtime performance. We additionally report a case study of the achieved performance improvements at a regional data center in India.
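A hedged sketch of the distillation-plus-class-aggregation idea: a large audio-tagging teacher provides (here hard, aggregated) pseudo-labels on which a small gradient-boosted tree student is fitted over cheap features. The class map, feature set, and use of hard labels are simplifying assumptions.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Hypothetical mapping from fine-grained teacher tags onto a small set of early-media classes.
CLASS_AGGREGATION = {"ringback_tone": "ringing", "busy_signal": "busy",
                     "speech": "voice", "announcement": "voice", "silence": "silence"}

def distill_to_gbt(light_features, teacher_probs, teacher_classes):
    """light_features: (N, D) cheap per-call features (e.g. spectral statistics).
    teacher_probs: (N, C) soft predictions from the large audio-tagging teacher.
    teacher_classes: list of C teacher tag names."""
    hard = [CLASS_AGGREGATION.get(teacher_classes[i], "other")
            for i in np.argmax(teacher_probs, axis=1)]          # aggregated pseudo-labels
    student = HistGradientBoostingClassifier(max_depth=6, learning_rate=0.1)
    student.fit(light_features, hard)                            # small model, fast CPU inference
    return student
```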
Title: Producers and Rappers: Who Dominates the Sound of Hip Hop Music? A Case Study
Link: https://arxiv.org/abs/2410.21297
Comments: many SOMs
Abstract: In hip-hop music, rappers and producers play important, but rather different roles. However, both contribute to the overall sound, as rappers bring in their voice, while producers are responsible for the music composition and mix. In this case report, we trained Self-Organizing Maps (SOMs) with songs produced by Dr. Dre, Rick Rubin and Timbaland, using the goniometer and Mel Frequency Cepstral Coefficients (MFCCs). With these maps, we investigate whether hip hop producers have a unique sound profile. Then, we test whether collaborations with the rappers Eminem, Jay-Z, LL Cool J and Nas stick to, or break out of, this sound profile. As these rappers are also producers of some songs, we investigate how much their sound profile is influenced by the producers who introduced them to beat making. The results are clear: producers have their own sound profile that is distinctive with respect to the goniometer, and less distinct with respect to MFCCs. They dominate the sound of hip hop music over rappers, who emulate the sound profile of the producers who introduced them to beat making.
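A sketch of a feature-plus-SOM pipeline of this kind, using librosa MFCCs, a crude goniometer-style stereo statistic, and the MiniSom package; stereo input and the exact feature choices are assumptions, not the authors' setup.

```python
import numpy as np
import librosa
from minisom import MiniSom

def song_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=44100, mono=False)          # stereo assumed, to mimic the goniometer view
    left, right = y[0], y[1]
    mid, side = (left + right) / 2, (left - right) / 2
    width = np.std(side) / (np.std(mid) + 1e-9)               # crude stereo-width statistic
    corr = np.corrcoef(left, right)[0, 1]                     # L/R correlation (phase relationship)
    mfcc = librosa.feature.mfcc(y=librosa.to_mono(y), sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return np.concatenate([[width, corr], mfcc])

# Train a SOM on per-song features and inspect which producers cluster together.
data = np.stack([song_features(p) for p in song_paths])        # song_paths: list of audio files (assumed)
data = (data - data.mean(0)) / (data.std(0) + 1e-9)
som = MiniSom(8, 8, data.shape[1], sigma=1.0, learning_rate=0.5)
som.train_random(data, 5000)
cells = [som.winner(x) for x in data]                          # map each song to its SOM cell
```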
Title: Retrieval-Augmented Approach for Unsupervised Anomalous Sound Detection and Captioning without Model Training
Link: https://arxiv.org/abs/2410.22056
Abstract: This paper proposes a method for unsupervised anomalous sound detection (UASD) and captioning the reason for detection. While there is a method that captions the difference between given normal and anomalous sound pairs, it is assumed to be trained and used separately from the UASD model. Therefore, the obtained caption can be irrelevant to the differences that the UASD model captured. In addition, it requires many caption labels representing differences between anomalous and normal sounds for model training. The proposed method employs a retrieval-augmented approach for captioning of anomalous sounds. Difference captioning in the embedding space output by the pre-trained CLAP (contrastive language-audio pre-training) model makes the anomalous sound detection results consistent with the captions and does not require training. Experiments based on subjective evaluation and a sample-wise analysis of the output captions demonstrate the effectiveness of the proposed method.
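The retrieval-augmented captioning can be illustrated as nearest-caption lookup in an embedding space: compare the shift of the anomalous clip away from the normal centroid against text embeddings of candidate captions. The embeddings are assumed to come from a pre-trained CLAP model; this is a simplified reading, not the paper's exact procedure.

```python
import numpy as np

def retrieve_difference_caption(anomalous_emb, normal_embs, caption_texts, caption_embs):
    """anomalous_emb: (D,) CLAP audio embedding of the detected anomaly.
    normal_embs: (N, D) embeddings of reference normal clips.
    caption_embs: (M, D) CLAP text embeddings of candidate difference captions."""
    diff = anomalous_emb - normal_embs.mean(axis=0)            # how the anomaly moved away from "normal"
    diff = diff / (np.linalg.norm(diff) + 1e-9)
    caps = caption_embs / (np.linalg.norm(caption_embs, axis=1, keepdims=True) + 1e-9)
    scores = caps @ diff                                       # cosine similarity to each candidate caption
    return caption_texts[int(np.argmax(scores))]
```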
Title: Timbre Difference Capturing in Anomalous Sound Detection
Link: https://arxiv.org/abs/2410.22033
Abstract: This paper proposes a framework for explaining anomalous machine sounds in the context of anomalous sound detection (ASD). While ASD has been extensively explored, identifying how anomalous sounds differ from normal sounds is also beneficial for machine condition monitoring. However, existing sound difference captioning methods require anomalous sounds for training, which is impractical in typical machine condition monitoring settings where such sounds are unavailable. To solve this issue, we propose a new strategy for explaining anomalous differences that does not require anomalous sounds for training. Specifically, we introduce a framework that explains differences in predefined timbre attributes instead of using free-form text captions. Objective metrics of timbre attributes can be computed using timbral models developed through psycho-acoustical research, enabling the estimation of how and what timbre attributes have changed from normal sounds without training machine learning models. Additionally, to accurately determine timbre differences regardless of variations in normal training data, we developed a method that jointly conducts anomalous sound detection and timbre difference estimation based on a k-nearest neighbors method in an audio embedding space. Evaluation using the MIMII DG dataset demonstrated the effectiveness of the proposed method.
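A minimal sketch of joint kNN-based detection and timbre-difference explanation in an embedding space; the timbre attributes are assumed to be pre-computed with psychoacoustic timbral models, and the attribute names are placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_normal_model(normal_embs, k=5):
    """Index the embeddings of normal clips for kNN scoring."""
    return NearestNeighbors(n_neighbors=k).fit(normal_embs)

def score_and_explain(test_emb, test_timbre, nn, normal_timbre, timbre_names):
    """test_timbre / normal_timbre: objective timbre attributes (e.g. sharpness, roughness)
    computed with psychoacoustic timbral models; no model training is needed for the explanation."""
    dist, idx = nn.kneighbors(test_emb[None, :])
    anomaly_score = dist.mean()                                # kNN distance in the embedding space
    delta = test_timbre - normal_timbre[idx[0]].mean(axis=0)   # change relative to the nearest normal sounds
    explanation = {n: float(d) for n, d in zip(timbre_names, delta)}
    return anomaly_score, explanation
```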
Title: Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Link: https://arxiv.org/abs/2410.21951
Comments: 5 pages, 3 figures, 3 tables. Submitted to ICASSP 2025
Abstract: Auto-regressive architectures, like GPTs, are widely used in modern Text-to-Speech (TTS) systems. However, they incur substantial inference time, particularly due to the challenges in next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
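A generic speculative-decoding loop for speech tokens, with the tolerance mechanism read (as one possible interpretation) as accepting a draft token that falls in the target model's top-k; it assumes Hugging Face-style causal LMs exposing `.logits` and is not VADUSA's exact procedure.

```python
import torch

@torch.no_grad()
def speculative_step(target_lm, draft_lm, prefix, gamma=4, tolerance_topk=3):
    """prefix: (1, T) token ids. The draft proposes `gamma` speech tokens; the target
    model verifies them in one forward pass and keeps the longest accepted run."""
    draft = prefix
    for _ in range(gamma):                                       # cheap auto-regressive drafting
        nxt = draft_lm(draft).logits[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=1)
    proposed = draft[:, prefix.shape[1]:]                        # (1, gamma)
    logits = target_lm(draft).logits[:, prefix.shape[1] - 1:-1]  # target predictions for each proposed slot
    topk = logits.topk(tolerance_topk, dim=-1).indices           # (1, gamma, k)
    accepted = []
    for t in range(proposed.shape[1]):
        if (topk[0, t] == proposed[0, t]).any():                 # tolerant acceptance: in the target's top-k
            accepted.append(proposed[:, t:t + 1])
        else:
            accepted.append(logits[:, t].argmax(-1, keepdim=True))  # fall back to the target's own token
            break
    return torch.cat([prefix] + accepted, dim=1)
```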
Title: Representation Learning for Anomalous Sound Detection Systems with a Source Separation Model
Link: https://arxiv.org/abs/2410.21797
Comments: Published at the DCASE 2024 Workshop
Abstract: The detection of anomalous sounds in machinery operation presents a significant challenge due to the difficulty in generalizing anomalous acoustic patterns. This task is typically approached as an unsupervised learning or novelty detection problem, given the complexities associated with the acquisition of comprehensive anomalous acoustic data. Conventional methodologies for training anomalous sound detection systems primarily employ auto-encoder architectures or representational learning with auxiliary tasks. However, both approaches have inherent limitations. Auto-encoder structures are constrained to utilizing only the target machine's operational sounds, while training with auxiliary tasks, although capable of incorporating diverse acoustic inputs, may yield representations that lack correlation with the characteristic acoustic signatures of anomalous conditions. We propose a training method based on the source separation model (CMGAN) that aims to isolate non-target machine sounds from a mixture of target and non-target class acoustic signals. This approach enables the effective utilization of diverse machine sounds and facilitates the training of complex neural network architectures with limited sample sizes. Our experimental results demonstrate that the proposed method yields better performance compared to both conventional auto-encoder training approaches and source separation techniques that focus on isolating target machine signals. Moreover, they show that the proposed method exhibits the potential for enhanced representation learning as the quantity of non-target data increases, even while maintaining a constant volume of target class data.
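A sketch of the data side of this idea: build mixtures of target and non-target machine clips and train a (placeholder) separator to recover the non-target component; CMGAN itself is not reproduced, and the SNR range and L1 objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_training_batch(target_clips, nontarget_clips, snr_db_range=(-5.0, 5.0)):
    """target_clips, nontarget_clips: (B, N) waveforms. Returns (mixture, non-target reference)."""
    snr = torch.empty(target_clips.size(0), 1).uniform_(*snr_db_range)
    gain = 10.0 ** (snr / 20.0)                                # scale the target relative to the non-target
    mixture = gain * target_clips + nontarget_clips
    return mixture, nontarget_clips

def separation_loss(separator, mixture, nontarget_ref):
    """The separator is trained to pull the *non-target* machine sound out of the mixture,
    so the learned representation must encode what makes the target machine distinct."""
    est = separator(mixture)
    return F.l1_loss(est, nontarget_ref)
```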
Title: An Overview of Clinical Speech AI Development: From Data Collection to Model Validation
Link: https://arxiv.org/abs/2410.21640
Comments: 76 pages, 24 figures
Abstract: There has been a surge of interest in leveraging speech as a marker of health for a wide spectrum of conditions. The underlying premise is that any neurological, mental, or physical deficits that impact speech production can be objectively assessed via automated analysis of speech. Recent advances in speech-based Artificial Intelligence (AI) models for diagnosing and tracking mental health, cognitive, and motor disorders often use supervised learning, similar to mainstream speech technologies like recognition and verification. However, clinical speech AI has distinct challenges, including the need for specific elicitation tasks, small available datasets, diverse speech representations, and uncertain diagnostic labels. As a result, application of the standard supervised learning paradigm may lead to models that perform well in controlled settings but fail to generalize in real-world clinical deployments. With translation into real-world clinical scenarios in mind, this tutorial paper provides an overview of the key components required for robust development of clinical speech AI. Specifically, this paper covers the design of speech elicitation tasks and protocols most appropriate for different clinical conditions, collection of data and verification of hardware, development and validation of speech representations designed to measure clinical constructs of interest, development of reliable and robust clinical prediction models, and ethical and participant considerations for clinical speech AI. The goal is to provide comprehensive guidance on building models whose inputs and outputs link to the more interpretable and clinically meaningful aspects of speech, that can be interrogated and clinically validated on clinical datasets, and that adhere to ethical, privacy, and security considerations by design.
Link: https://arxiv.org/abs/2410.22271
Abstract: This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which processes video and audio embeddings extracted with ResNet50 and with an audio encoder pre-trained on SELD, respectively. This model outperformed the audio-visual baseline of the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving the F1 by more than 3x. Our second system performs a temporal ensemble from the outputs of the AV-Conformer. We then extended the model with features for distance estimation, such as direct and reverberant signal components extracted from the omnidirectional audio channel, and depth maps extracted from the video frames. While the new system improved the RDE of our previous model by about 3 percentage points, it achieved a lower F1 score. This may be caused by sound classes that rarely appear in the training set and that the more complex system does not detect, as analysis can determine. To overcome this problem, our fourth and final system consists of an ensemble strategy combining the predictions of the other three. Many opportunities to refine the system and training strategy can be tested in future ablation experiments, and likely achieve incremental performance gains for this audio-visual task.
Title: Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models
Link: https://arxiv.org/abs/2410.21455
Comments: Submitted to ICASSP2025
Abstract: We propose an approach for simultaneous diarization and separation of meeting data. It consists of a complex Angular Central Gaussian Mixture Model (cACGMM) for speech source separation, and a von-Mises-Fisher Mixture Model (VMFMM) for diarization in a joint statistical framework. Through the integration, both spatial and spectral information are exploited for diarization and separation. We also develop a method for counting the number of active speakers in a segment of a meeting to support block-wise processing. While the total number of speakers in a meeting may be known, it is usually not known on a per-segment level. With the proposed speaker counting, joint diarization and source separation can be done segment-by-segment, and the permutation problem across segments is solved, thus allowing for block-online processing in the future. Experimental results on the LibriCSS meeting corpus show that the integrated approach outperforms a cascaded approach of diarization and speech enhancement in terms of WER, both on a per-segment and on a per-meeting level.
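The cross-segment permutation problem mentioned above can be illustrated by matching per-segment speaker direction vectors (e.g. vMF mean directions) between consecutive segments with the Hungarian algorithm; this sketches only the bookkeeping, not the cACGMM/VMFMM inference itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_segment_permutation(prev_means, cur_means):
    """prev_means, cur_means: (K, D) unit-norm speaker direction vectors estimated
    independently in consecutive segments. Returns cur_means reordered so that
    speaker k in this segment corresponds to speaker k in the previous one."""
    sim = prev_means @ cur_means.T                             # cosine similarity between speaker models
    row, col = linear_sum_assignment(-sim)                     # maximize total similarity
    perm = np.empty(len(col), dtype=int)
    perm[row] = col
    return cur_means[perm], perm
```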
标题:非常细心的Tacotron:基于自回归转换器的文本到语音中的鲁棒且无限长度概括
链接:https://arxiv.org/abs/2410.22179
备注:Submitted to NAACL
摘要:已知基于自回归(AR)变换的序列模型难以推广到比训练期间看到的序列更长的序列。当应用于文本到语音(TTS)时,这些模型往往会丢失或重复单词或产生不稳定的输出,尤其是对于较长的话语。在本文中,我们介绍了增强针对AR变换为基础的编码器-解码器TTS系统,解决这些鲁棒性和长度泛化问题。我们的方法使用对齐机制提供交叉注意操作与相对位置信息。关联的对齐位置通过反向传播作为模型的潜在属性来学习,并且在训练期间不需要外部对齐信息。虽然该方法是针对TTS输入输出对齐的单调性质而定制的,但它仍然能够受益于交织多头自注意和交叉注意操作的灵活建模能力。结合这些改进的系统,我们称之为“非常专注的Tacotron”,与基于T5的TTS系统的自然性和表现力相匹配,同时消除了重复或丢失单词的问题,并能够推广到任何实际的话语长度。
摘要:Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
标题:排名上升:使用辅助排名分类器增强半监督回归
链接:https://arxiv.org/abs/2410.22124
备注:Accepted at NeurIPS 2024 (Poster)
摘要:最先进的(SOTA)半监督学习技术,如FixMatch及其变体,在分类任务中表现出令人印象深刻的性能。然而,这些方法并不直接适用于回归任务。在本文中,我们提出了RankUp,这是一种简单而有效的方法,它可以适应现有的半监督分类技术,以提高回归任务的性能。RankUp通过将原始回归任务转换为排名问题并与原始回归目标同时进行训练来实现这一点。该辅助排序分类器输出分类结果,从而能够与现有的半监督分类方法集成。此外,我们引入了回归分布对齐(RDA),这是一种补充技术,通过分布对齐改进伪标签,进一步增强了RankUp的性能。尽管RankUp很简单,但无论是否有RDA,它都能在一系列回归基准测试中实现SOTA结果,包括计算机视觉、音频和自然语言处理任务。我们的代码和日志数据在https://github.com/pm25/semi-supervised-regression上开源。
摘要:State-of-the-art (SOTA) semi-supervised learning techniques, such as FixMatch and it's variants, have demonstrated impressive performance in classification tasks. However, these methods are not directly applicable to regression tasks. In this paper, we present RankUp, a simple yet effective approach that adapts existing semi-supervised classification techniques to enhance the performance of regression tasks. RankUp achieves this by converting the original regression task into a ranking problem and training it concurrently with the original regression objective. This auxiliary ranking classifier outputs a classification result, thus enabling integration with existing semi-supervised classification methods. Moreover, we introduce regression distribution alignment (RDA), a complementary technique that further enhances RankUp's performance by refining pseudo-labels through distribution alignment. Despite its simplicity, RankUp, with or without RDA, achieves SOTA results in across a range of regression benchmarks, including computer vision, audio, and natural language processing tasks. Our code and log data are open-sourced at https://github.com/pm25/semi-supervised-regression.
标题:USpeech:通过跨模式合成,以最少的人力实现超声波增强语音
链接:https://arxiv.org/abs/2410.22076
摘要:语音增强在人机交互中至关重要,特别是对于无处不在的设备。基于超声波的语音增强由于其优越的普遍性和性能而成为一种有吸引力的选择。然而,在音频超声数据采集过程中,来自意外和非预期来源的不可避免的干扰使得现有解决方案严重依赖于人工进行数据收集和处理。这导致数据严重不足,限制了基于超声波的语音增强的全部潜力。为了解决这个问题,我们提出了USpeech,一个跨模态的超声语音增强合成框架,以最小的人力。其核心是一个两阶段的框架,通过利用可听音频作为桥梁,建立视觉和超声模态之间的对应关系。这种方法克服了缺乏成对的视频超声数据集和视频和超声数据之间的固有异质性的挑战。我们的框架结合了对比视频-音频预训练,将模态投影到共享的语义空间中,并采用音频-超声编码器-解码器进行超声合成。然后,我们提出了一个语音增强网络,增强语音在时间-频率域和恢复干净的语音波形通过神经声码器。综合实验表明,USpeech使用与物理数据相当的合成超声数据实现了卓越的性能,显著优于最先进的基于超声的语音增强基线。USpeech在https://github.com/aiot-lab/USpeech/上是开源的。
摘要:Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its superior ubiquity and performance. However, inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition makes existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes correspondence between visual and ultrasonic modalities by leveraging audible audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show USpeech achieves remarkable performance using synthetic ultrasound data comparable to physical data, significantly outperforming state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at https://github.com/aiot-lab/USpeech/.
标题:唱它,讲述它:优质歌词翻译
链接:https://arxiv.org/abs/2410.22066
摘要:音乐剧歌词翻译面临着独特的挑战,因为需要确保高质量的翻译,同时遵守长度和押韵等可唱性要求。现有的歌曲翻译方法往往优先考虑这些可唱性限制,而牺牲了翻译质量,这对音乐剧至关重要。本文旨在提高翻译质量,同时保持关键的可唱性特征。我们的方法包括三个主要组成部分。首先,我们创建一个数据集来训练奖励模型,用于自动评估翻译质量。其次,为了提高可唱性和翻译质量,我们采用过滤技术实现了两阶段训练过程。最后,我们介绍了一个推理时间优化框架,翻译整首歌曲。广泛的实验,包括自动和人工评估,展示了基线方法的显着改进,并验证了我们的方法中每个组件的有效性。
摘要:Translating lyrics for musicals presents unique challenges due to the need to ensure high translation quality while adhering to singability requirements such as length and rhyme. Existing song translation approaches often prioritize these singability constraints at the expense of translation quality, which is crucial for musicals. This paper aims to enhance translation quality while maintaining key singability features. Our method consists of three main components. First, we create a dataset to train reward models for the automatic evaluation of translation quality. Second, to enhance both singability and translation quality, we implement a two-stage training process with filtering techniques. Finally, we introduce an inference-time optimization framework for translating entire songs. Extensive experiments, including both automatic and human evaluations, demonstrate significant improvements over baseline methods and validate the effectiveness of each component in our approach.
标题:CHORDONOMICON:包含666,000首歌曲及其和弦进行的数据集
链接:https://arxiv.org/abs/2410.22046
摘要:和弦进行包含了关于音乐的重要信息,涉及其结构和传达的情感。它们是音乐创作的支柱,在许多情况下,它们是音乐家演奏和跟随音乐所需的唯一信息。尽管它们的重要性,和弦进行作为一个数据域仍然没有得到充分的探索。缺乏适合深度学习应用的大规模数据集,并且探索和弦进行作为输入形式的研究有限。在这项工作中,我们提出了Chordonomicon,一个包含超过666,000首歌曲及其和弦进行的数据集,注释了结构部分,流派和发布日期-通过抓取用户生成的进行和相关元数据的各种来源创建。我们展示了Chordonomicon数据集在分类和生成任务中的实际用途,并讨论了其为研究界提供有价值见解的潜力。和弦进行是独特的,因为它们能够以多种格式(例如文本,图形)表示,并且和弦在给定的上下文中传达丰富的信息,例如它们的和声功能。这些特征使Chordonomicon成为探索高级机器学习技术的理想测试平台,包括Transformers、图形机器学习以及结合知识表示和机器学习的混合系统。
摘要:Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date - created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in their ability to be represented in multiple formats (e.g. text, graph) and the wealth of information chords convey in given contexts, such as their harmonic function . These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.
标题:半监督自我学习增强音乐情感识别
链接:https://arxiv.org/abs/2410.21897
摘要:音乐情感识别(MER)的目的是识别给定音乐作品中传达的情感。但目前在MER领域,可用的公共数据集的样本量有限。最近,已经提出了用于情感相关任务的基于段的方法,其在较短的段而不是整个音频片段上训练骨干网络,从而自然地增强训练样本而不需要额外的资源。然后,预测的片段级结果被聚合以获得整个歌曲预测。最常用的方法是片段继承包含它的片段的标签,但音乐情感在整个片段中并不恒定。这样做会引入标签噪声,并使训练很容易过拟合。为了处理噪声标签问题,我们提出了一种半监督自学习(SSSL)方法,该方法可以以自学习的方式区分具有正确和不正确标签的样本,从而有效地利用增强的片段级数据。在三个公开的情感数据集上的实验表明,该方法可以获得更好的或相当的性能。
摘要:Music emotion recognition (MER) aims to identify the emotions conveyed in a given musical piece. But currently in the field of MER, the available public datasets have limited sample sizes. Recently, segment-based methods for emotion-related tasks have been proposed, which train backbone networks on shorter segments instead of entire audio clips, thereby naturally augmenting training samples without requiring additional resources. Then, the predicted segment-level results are aggregated to obtain the entire song prediction. The most commonly used method is that segment inherits the label of the clip containing it, but music emotion is not constant during the whole clip. Doing so will introduce label noise and make the training overfit easily. To handle the noisy label issue, we propose a semi-supervised self-learning (SSSL) method, which can differentiate between samples with correct and incorrect labels in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method can achieve better or comparable performance.
标题:音频指纹技术在实时可扩展语音检索和语音数字化中的应用
链接:https://arxiv.org/abs/2410.21876
摘要:近年来,音频指纹技术取得了很大的进步,即使在被查询的音频样本已经高度恶化或在噪声条件下记录的条件下,也能够实现准确和快速的音频检索。可以预料的是,大多数现有的工作都是围绕音乐进行的,流行的音乐识别服务,如苹果的Shazam或谷歌的Now Playing,是为移动设备上的个人音频识别而设计的。然而,语音的频谱内容与音乐的频谱内容不同,需要对当前的音频指纹识别方法进行修改。本文为调整现有技术以应对电信和云通信平台中语音检索的专业挑战提供了新的见解。重点是在批处理中实现快速准确的音频检索,而不是促进单个请求,通常在集中式服务器上。此外,本文还演示了如何利用这种方法来支持基于语音转录的音频聚类,而无需进行实际的语音到文本的转换。这种优化可以显著加快处理速度,而无需GPU计算,这是一种通常与最先进的语音转文本工具相关的实时操作要求。
摘要:Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even in conditions when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple's Shazam or Google's Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.
标题:RDSinger:用于歌唱声音合成的基于参考的扩散网络
链接:https://arxiv.org/abs/2410.21641
摘要:歌唱声音合成(SVS)旨在从乐谱生成高保真的歌唱音频,需要对音符、音高和时值有细致的理解,这与文本到语音任务不同。尽管扩散模型在图像和视频生成等各种生成任务中表现出色,但其在SVS中的应用受到时间复杂度以及捕获声学特征(尤其是音高过渡期间)的挑战的阻碍。一些网络从先验分布中学习,并在扩散模型中使用压缩的潜在状态作为更好的起点,但去噪步骤并不能在整个时长内始终如一地提高质量。我们提出了RDSinger,一种基于参考的去噪扩散网络,可为SVS任务生成高质量音频。我们的方法受Animate Anyone的启发,后者是一个能够从参考图像中保留复杂外观特征的扩散图像网络。RDSinger利用FastSpeech2的梅尔频谱图作为参考,以减轻去噪步骤的伪影。此外,现有模型在音高过渡处可能会受到压缩潜在状态中误导性信息的影响;我们通过对参考梅尔频谱图的局部区域施加高斯模糊并调整这些区域的损失权重来解决该问题。大量消融实验证明了我们方法的有效性。在中文歌唱数据集OpenCpop上的评估表明,RDSinger的性能优于当前最先进的SVS方法。
摘要:Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores, requiring a detailed understanding of notes, pitch, and duration, unlike text-to-speech tasks. Although diffusion models have shown exceptional performance in various generative tasks such as image and video creation, their application to SVS is hindered by time complexity and the challenge of capturing acoustic features, particularly during pitch transitions. Some networks learn from the prior distribution and use the compressed latent state as a better starting point in the diffusion model, but the denoising step does not consistently improve quality over the entire duration. We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks. Our approach is inspired by Animate Anyone, a diffusion image network that maintains intricate appearance features from reference images. RDSinger utilizes the FastSpeech2 mel-spectrogram as a reference to mitigate denoising-step artifacts. Additionally, existing models can be influenced by misleading information in the compressed latent state during pitch transitions; we address this issue by applying Gaussian blur to parts of the reference mel-spectrogram and adjusting the loss weights in these regions. Extensive ablation studies demonstrate the efficiency of our method, and evaluations on OpenCpop, a Chinese singing dataset, show that RDSinger outperforms current state-of-the-art SVS methods.
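下面是一个示意性草图(阈值、参数名均为假设,并非RDSinger的实现):对参考梅尔频谱图中音高过渡附近的帧施加高斯模糊,并为这些区域生成相应的损失权重。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_pitch_transitions(ref_mel, f0, jump_cents=100.0, halo=3, sigma=2.0, w_trans=2.0):
    """ref_mel: (n_mels, T)参考梅尔频谱; f0: (T,)帧级基频(Hz), 0表示清音。"""
    f0 = np.where(f0 > 0, f0, np.nan)
    cents = 1200.0 * np.log2(f0 / 55.0)                      # 换算为音分,清音为NaN
    jumps = np.abs(np.diff(cents, prepend=cents[0])) > jump_cents
    trans = np.convolve(jumps.astype(float), np.ones(2 * halo + 1), mode="same") > 0
    blurred = gaussian_filter(ref_mel, sigma=(0.0, sigma))   # 仅沿时间轴模糊
    ref_out = np.where(trans[None, :], blurred, ref_mel)     # 只在音高过渡帧使用模糊结果
    loss_weights = np.where(trans, w_trans, 1.0)             # 对这些帧调整损失权重
    return ref_out, loss_weights
```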
标题:利用卷积神经网络对低特征谱图进行音频分类
链接:https://arxiv.org/abs/2410.21561
备注:None
摘要:现代音频信号分类技术缺乏对以谱图时频数据形式表示的低特征音频信号进行分类的能力。此外,目前使用的技术依赖于完整且多样的数据集,而这些数据集往往不能代表真实世界的分布。本文推导了几种首创的机器学习方法,用于在数据分布可能经过归一化、存在偏斜甚至训练集有限的情况下分析这些低特征音频谱图。具体而言,本文提出了若干新颖的定制卷积架构,采用二分类、一类分类和孪生(siamese)方法提取判别特征,以识别给定音频信号的谱图特征签名。利用这些新颖的卷积架构以及所提出的分类方法,实验证明了最先进的分类精度以及相比传统音频分类方法更高的效率。
摘要:Modern-day audio signal classification techniques lack the ability to classify low-feature audio signals in the form of spectrographic temporal-frequency data representations. Additionally, currently utilized techniques rely on fully diverse datasets that are often not representative of real-world distributions. This paper derives several first-of-its-kind machine learning methodologies to analyze these low-feature audio spectrograms given data distributions that may be normalized, skewed, or even limited in training set size. In particular, this paper proposes several novel customized convolutional architectures to extract identifying features using binary, one-class, and siamese approaches to identify the spectrographic signature of a given audio signal. Utilizing these novel convolutional architectures as well as the proposed classification methods, these experiments demonstrate state-of-the-art classification accuracy and improved efficiency over traditional audio classification methods.
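下面给出一个示意性草图(基于PyTorch,网络结构细节为假设,并非论文架构):一个小型孪生(siamese)卷积网络,用于度量两个谱图样本的相似度,可配合对比损失训练。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectroEncoder(nn.Module):
    """把单通道谱图编码为单位长度的嵌入向量。"""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):                  # x: (batch, 1, freq, time)
        return F.normalize(self.fc(self.features(x).flatten(1)), dim=1)

class SiameseSpectroNet(nn.Module):
    """孪生网络:两个输入共享同一编码器,输出余弦距离(越小越相似)。"""
    def __init__(self):
        super().__init__()
        self.encoder = SpectroEncoder()

    def forward(self, a, b):
        return 1.0 - (self.encoder(a) * self.encoder(b)).sum(dim=1)

def contrastive_loss(dist, same, margin=0.5):
    """对比损失:拉近同类样本对,推远异类样本对。"""
    return (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()
```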
标题:一种基于Score-CAM的新型去噪器,用于无需真实标注的谱图特征签名提取
链接:https://arxiv.org/abs/2410.21557
备注:None
摘要:基于声纳的音频分类技术是水下声学领域中一个不断发展的研究方向。通常,被动声纳换能器拾取的水下噪声包含在海洋中传播的各类信号,并被转换为谱图图像。因此,本应显示某一目标时频数据的对应谱图往往包含大量外来噪声的音调区域,这些噪声会严重干扰"目标接触(contact)"的识别。于是,从水下音频信号中提取的大多数谱图样本由于杂波过多而无法使用,不同目标之间也缺乏所需的可区分性。由于可用于监督训练的干净真实数据有限,为这些音频信号建立分类模型受到严重制约。本文提出了几种新技术来解决这一问题:开发了一种新颖的基于Score-CAM的去噪器,无需任何真实标注即可从含噪谱图数据中提取目标的特征签名。具体而言,本文提出了一种新的生成对抗网络架构,用于学习并生成与低特征谱图输入分布相似的谱图训练数据。此外,本文还提出了一种可推广的基于类激活映射的去噪算法,适用于不同的声学数据分布,甚至真实世界的数据分布。利用这些新颖的架构和所提出的去噪技术,实验证明了最先进的降噪精度以及相比当前音频分类标准更高的分类精度。因此,这种方法不仅适用于音频数据,也适用于世界各地用于机器学习的无数数据分布。
摘要:Sonar-based audio classification techniques are a growing area of research in the field of underwater acoustics. Typically, the underwater noise picked up by passive sonar transducers contains all kinds of signals that travel through the ocean and is transformed into spectrographic images. As a result, the corresponding spectrograms intended to display the temporal-frequency data of a certain object often include tonal regions of abundant extraneous noise that can effectively interfere with a 'contact'. Consequently, a majority of the spectrographic samples extracted from underwater audio signals are rendered unusable due to their clutter and lack the required distinguishability between different objects. With limited clean ground-truth data for supervised training, creating classification models for these audio signals is severely bottlenecked. This paper derives several new techniques to combat this problem by developing a novel Score-CAM based denoiser that extracts an object's signature from noisy spectrographic data without being given any ground-truth data. In particular, this paper proposes a novel generative adversarial network architecture for learning and producing spectrographic training data in distributions similar to low-feature spectrogram inputs. In addition, this paper also proposes a generalizable class-activation-mapping-based denoiser for different distributions of acoustic data, including real-world data distributions. Utilizing these novel architectures and the proposed denoising techniques, these experiments demonstrate state-of-the-art noise reduction accuracy and improved classification accuracy over current audio classification standards. As such, this approach has applications not only to audio data but to countless data distributions used all around the world for machine learning.
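下面是一个Score-CAM式显著图的示意性草图(假设已有一个分类器model及其某个卷积层conv_layer,均为示意性名称,并非论文中的去噪器实现):将显著图归一化后可作为掩码抑制谱图中与目标无关的能量。

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam_mask(model, conv_layer, spec, target_class):
    """spec: (1, 1, F, T)对数谱图; 返回取值在[0,1]的(F, T)显著图掩码。"""
    acts = {}
    handle = conv_layer.register_forward_hook(lambda m, i, o: acts.setdefault("a", o))
    model(spec)                                    # 触发前向,记录该卷积层的激活
    handle.remove()
    a = acts["a"][0]                               # (C, f', t')各通道激活图
    a = F.interpolate(a[None], size=spec.shape[-2:], mode="bilinear", align_corners=False)[0]
    a_min = a.amin(dim=(1, 2), keepdim=True)
    a_max = a.amax(dim=(1, 2), keepdim=True)
    norm = (a - a_min) / (a_max - a_min + 1e-8)    # 每个通道归一化到[0,1]
    scores = []
    for c in range(norm.shape[0]):
        masked = spec * norm[c]                    # 用该通道激活图遮罩输入
        scores.append(model(masked)[0, target_class])
    weights = torch.softmax(torch.stack(scores), dim=0)       # 各通道的重要性权重
    cam = F.relu((weights[:, None, None] * norm).sum(dim=0))  # 加权叠加得到显著图
    return cam / (cam.max() + 1e-8)

# 随后可用显著图对谱图做门控: denoised = spec * score_cam_mask(model, conv_layer, spec, c)
```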
标题:使用离散语义单元增强希伯来语的TTS稳定性
链接:https://arxiv.org/abs/2410.21502
摘要:这项研究介绍了一种改进的文本到语音(TTS)生成方法,显著提高了跨语言的采样稳定性,尤其是对希伯来语。通过利用从自监督模型中获得的、与语音相关性更高的离散语义单元,我们的方法解决了TTS系统中常见的固有不稳定性,特别是那些处理希伯来语等通常不标注变音符号的文字的系统。利用HuBERT码,我们的模型生成针对TTS任务优化的离散表示,从而减少对基于变音符号的文本处理的依赖。这一改进不仅简化了语言建模过程,还提高了鲁棒性,并且由于语义单元的解耦(disentanglement)特性,语音输出具有可控性。在声码器中加入说话人嵌入进一步有助于捕获说话人独特的嗓音特性,提升合成语音的自然度。实验结果表明,该方法不仅在希伯来语上保持了高性能,还展现出对英语的适应性,凸显了其在普遍提升TTS系统稳定性方面的有效性。我们的方法名为LOTHM(Language of The Hebrew Man),在稳定性方面优于现有方法,同时在自然度和说话人相似度上与以往方法相当,使其成为未来语音合成应用的一个有吸引力的选择。样例可在我们的网页pages.cs.huji.ac.il/adiyoss-lab/LoTHM上找到。
摘要:This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts such as Hebrew. Utilizing HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves robustness and provides controllability of the speech output, owing to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further aids in capturing the speaker's unique vocal characteristics, contributing to the naturalness of the synthesized speech. Our experimental results demonstrate that this approach not only maintains high performance in Hebrew but also shows adaptability to English, underscoring its effectiveness in enhancing the stability of TTS systems universally. Our method, named LOTHM (Language of The Hebrew Man), outperforms existing methods in terms of stability while achieving naturalness and speaker similarity on par with previous methods, making it a compelling choice for future speech synthesis applications. Samples can be found on our page pages.cs.huji.ac.il/adiyoss-lab/LoTHM .
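下面给出一个示意性草图(假设帧级自监督特征,如HuBERT隐状态,已提前提取;聚类数等参数为假设,并非论文配置):用k-means把连续特征量化为离散语义单元,并合并相邻重复单元,得到基于单元的TTS所需的token序列。

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def features_to_units(features, n_units=100, seed=0):
    """features: (帧数, 特征维度)的自监督隐状态;返回聚类器与去重后的离散单元序列。"""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=seed).fit(features)
    unit_ids = km.predict(features)              # 每帧一个离散单元ID
    deduped = [k for k, _ in groupby(unit_ids)]  # 合并相邻重复单元
    return km, deduped

# 玩具用法:用随机数组代替真实的HuBERT隐状态
rng = np.random.default_rng(0)
km, units = features_to_units(rng.normal(size=(500, 768)), n_units=50)
print(units[:20])
```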
标题:用于语音通信早期媒体实时分类的知识蒸馏
链接:https://arxiv.org/abs/2410.21478
摘要:本文研究了在语音呼叫初始化阶段交换的早期媒体(early media)实时分类的工业应用场景。我们探索了最先进的音频标注模型的应用,并指出了它们应用于早期媒体分类时的一些局限。虽然大多数现有方法采用卷积神经网络,我们提出了一种基于梯度提升树的新方法,以满足低资源需求。我们的方法不仅在运行时性能上有显著提升,准确率也与之相当。我们表明,利用知识蒸馏和类别聚合技术训练一个更简单、更小的模型,可以加速语音通话中早期媒体的分类。我们就准确率和运行时性能,对一个专有数据集和一个公开数据集上的结果进行了详细分析。我们还报告了在印度某区域数据中心取得性能改进的案例研究。
摘要:This paper investigates the industrial setting of real-time classification of early media exchanged during the initialization phase of voice calls. We explore the application of state-of-the-art audio tagging models and highlight some limitations when they are applied to the classification of early media. While most existing approaches leverage convolutional neural networks, we propose a novel approach with low resource requirements based on gradient-boosted trees. Our approach not only demonstrates a substantial improvement in runtime performance, but also exhibits comparable accuracy. We show that leveraging knowledge distillation and class aggregation techniques to train a simpler and smaller model accelerates the classification of early media in voice calls. We provide a detailed analysis of the results on a proprietary and a publicly available dataset, with regard to accuracy and runtime performance. We additionally report a case study of the performance improvements achieved at a regional data center in India.
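下面是一个示意性草图(教师模型输出、特征及类别分组均为假设,并非论文实现):通过类别聚合与教师置信度加权,将音频标注教师模型蒸馏为一个小型梯度提升树学生模型。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# 假设的细粒度教师类别到粗粒度早期媒体类别的映射
CLASS_GROUPS = {"ringback": [0, 1, 2], "announcement": [3, 4], "silence": [5]}

def aggregate(teacher_probs):
    """在每个粗类别组内对教师概率求和 -> (样本数, 组数)。"""
    return np.stack([teacher_probs[:, idx].sum(axis=1) for idx in CLASS_GROUPS.values()], axis=1)

def distill(features, teacher_probs):
    coarse = aggregate(teacher_probs)
    pseudo_labels = coarse.argmax(axis=1)    # 由聚合后的教师输出得到伪标签
    confidence = coarse.max(axis=1)          # 教师置信度作为样本权重
    student = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    student.fit(features, pseudo_labels, sample_weight=confidence)
    return student

# 玩具用法:用随机数据代替真实特征与教师输出
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))               # 例如各梅尔频带的统计量
P = rng.dirichlet(np.ones(6), size=200)      # 教师对6个细类别的概率
student = distill(X, P)
```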
标题:制作人与说唱歌手:谁主导了嘻哈音乐的声音?一项案例研究
链接:https://arxiv.org/abs/2410.21297
备注:many SOMs
摘要:在嘻哈音乐中,说唱歌手和制作人扮演着重要但相当不同的角色。不过,两者都对整体声音有所贡献:说唱歌手带来自己的嗓音,而制作人负责音乐的编曲与混音。在本案例报告中,我们使用Dr. Dre、Rick Rubin和Timbaland制作的歌曲,基于立体声相位(goniometer)特征和梅尔频率倒谱系数(MFCC)训练自组织映射(SOM)。借助这些映射,我们考察嘻哈制作人是否拥有独特的声音特征(sound profile);随后检验与说唱歌手Eminem、Jay-Z、LL Cool J和Nas的合作是遵循还是打破了这种声音特征。由于这些说唱歌手也是部分歌曲的制作人,我们进一步考察他们的声音特征在多大程度上受到引领他们进入节拍制作的制作人的影响。结果十分明确:制作人拥有自己的声音特征,这种独特性在goniometer特征上十分明显,在MFCC上则不那么显著;相比说唱歌手,制作人更主导嘻哈音乐的声音,而说唱歌手往往模仿引领他们进入节拍制作的制作人的声音特征。
摘要:In hip-hop music, rappers and producers play important, but rather different roles. However, both contribute to the overall sound, as rappers bring in their voice, while producers are responsible for the music composition and mix. In this case report, we trained Self-Organizing Maps (SOMs) on songs produced by Dr. Dre, Rick Rubin and Timbaland, using the goniometer and Mel Frequency Cepstral Coefficients (MFCCs). With these maps, we investigate whether hip-hop producers have a unique sound profile. We then test whether collaborations with the rappers Eminem, Jay-Z, LL Cool J and Nas stick to, or break out of, this sound profile. As these rappers are also producers of some songs, we investigate how much their sound profile is influenced by the producers who introduced them to beat making. The results are clear: producers have their own sound profile, which is distinctive with respect to the goniometer and less distinct with respect to MFCCs. They dominate the sound of hip-hop music over rappers, who emulate the sound profile of the producers who introduced them to beat making.
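下面给出一个示意性草图(使用第三方minisom库;歌曲特征为随机占位数据,特征维度与网格大小均为假设,并非论文所用配置):在逐歌曲的MFCC统计特征上训练SOM,并查看每位制作人的歌曲在映射上的分布。

```python
import numpy as np
from minisom import MiniSom

# 玩具占位数据:60首歌 x 26维特征(例如每首歌MFCC的均值与方差)
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 26))
producers = np.array(["Dr. Dre", "Rick Rubin", "Timbaland"] * 20)

som = MiniSom(8, 8, input_len=features.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(features, num_iteration=2000)

# 把每首歌映射到其最佳匹配单元,统计每位制作人占据的网格数
cells = {p: [som.winner(f) for f in features[producers == p]] for p in set(producers)}
for p, c in cells.items():
    print(p, "占据", len(set(c)), "个不同的映射单元")
```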