Reprinted with permission from arXiv每日学术速递 (arXiv Daily Academic Digest).
【1】Title: Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
Link: https://arxiv.org/abs/2410.24019
Authors: Ioannis Tsiamas, Matthias Sperber, Andrew Finch, Sarthak Garg
Note: WMT 2024
Abstract: The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and (c) certain cascaded systems also capture prosodic information in the translation, but only to a lesser extent that depends on the particulars of the transcript's surface form.
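ContraProST scores systems with contrastive pairs: for each prosodically ambiguous utterance, a prosody-consistent reference translation is paired with a contrastive alternative, and a system is credited when it prefers the former. The sketch below illustrates only that pairwise protocol; `score_translation` is a hypothetical stand-in for a model's score (e.g., log-probability) of a candidate translation given the audio, not part of the released benchmark.

```python
from typing import Callable, Sequence

def contrastive_accuracy(
    examples: Sequence[dict],
    score_translation: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the model scores the prosody-consistent
    translation higher than the contrastive alternative.

    Each example is assumed to hold:
      "audio":       path to the synthesized utterance
      "positive":    translation matching the spoken prosody
      "contrastive": translation matching the alternative prosodic reading
    """
    wins = 0
    for ex in examples:
        pos = score_translation(ex["audio"], ex["positive"])
        neg = score_translation(ex["audio"], ex["contrastive"])
        wins += int(pos > neg)
    return wins / max(len(examples), 1)

# Usage with a hypothetical scorer: a prosody-blind system hovers near 0.5,
# while a prosody-aware one moves clearly above it.
# acc = contrastive_accuracy(benchmark_examples, my_model.log_prob)
```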
Title: Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
Link: https://arxiv.org/abs/2410.23861
Abstract: Large Multimodal Models (LMMs) have demonstrated the ability to interact with humans under real-world conditions by combining Large Language Models (LLMs) and modality encoders to align multimodal information (visual and auditory) with text. However, such models raise new safety challenges of whether models that are safety-aligned on text also exhibit consistent safeguards for multimodal inputs. Despite recent safety-alignment research on vision LMMs, the safety of audio LMMs remains under-explored. In this work, we comprehensively red team the safety of five advanced audio LMMs under three settings: (i) harmful questions in both audio and text formats, (ii) harmful questions in text format accompanied by distracting non-speech audio, and (iii) speech-specific jailbreaks. Our results under these settings demonstrate that open-source audio LMMs suffer an average attack success rate of 69.14% on harmful audio questions, and exhibit safety vulnerabilities when distracted with non-speech audio noise. Our speech-specific jailbreaks on Gemini-1.5-Pro achieve an attack success rate of 70.67% on the harmful query benchmark. We provide insights on what could cause these reported safety-misalignments. Warning: this paper contains offensive examples.
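Attack success rate (ASR) in this setting is simply the share of harmful prompts that elicit an unsafe reply, computed separately per condition. A minimal hedged sketch, where `query_model` and `is_unsafe` are hypothetical placeholders for the audio LMM under test and the safety judge:

```python
def attack_success_rate(prompts, query_model, is_unsafe) -> float:
    """Percentage of harmful prompts (audio or text) that elicit an unsafe reply."""
    unsafe = sum(int(is_unsafe(query_model(p))) for p in prompts)
    return 100.0 * unsafe / max(len(prompts), 1)

# e.g. attack_success_rate(audio_questions, model.chat, judge.flags_unsafe)
# reported per setting: audio questions, text + distractor audio, speech jailbreaks
```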
Title: The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge
Link: https://arxiv.org/abs/2410.23815
Note: Accepted by ISCSLP 2024
Abstract: This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. The Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves the second place and the first place in Track 1 and Track 2 respectively.
Title: Improving Snore Detection Under Limited Dataset Through Harmonic/Percussive Source Separation and Convolutional Neural Networks
Link: https://arxiv.org/abs/2410.23796
Abstract: Snoring, an acoustic biomarker commonly observed in individuals with Obstructive Sleep Apnoea Syndrome (OSAS), holds significant potential for diagnosing and monitoring this recognized clinical disorder. Irrespective of snoring types, most snoring instances exhibit identifiable harmonic patterns manifested through distinctive energy distributions over time. In this work, we propose a novel method to differentiate monaural snoring from non-snoring sounds by analyzing the harmonic content of the input sound using harmonic/percussive sound source separation (HPSS). The resulting feature, based on the harmonic spectrogram from HPSS, is employed as input data for conventional neural network architectures, aiming to enhance snoring detection performance even under a limited data learning framework. To evaluate the performance of our proposal, we studied two different scenarios: 1) using a large dataset of snoring and interfering sounds, and 2) using a reduced training set composed of around 1% of the data material. In the former scenario, the proposed HPSS-based feature provides competitive results compared to other input features from the literature. However, the key advantage of the proposed method lies in the superior performance of the harmonic spectrogram derived from HPSS in a limited data learning context. In this particular scenario, using the proposed harmonic feature significantly enhances the performance of all the studied architectures in comparison to the classical input features documented in the existing literature. This finding clearly demonstrates that incorporating harmonic content enables more reliable learning of the essential time-frequency characteristics that are prevalent in most snoring sounds, even in scenarios where the amount of training data is limited.
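The feature pipeline can be approximated with off-the-shelf tools: take an STFT, run harmonic/percussive separation on the magnitude spectrogram, and keep only the harmonic component as the CNN input. The librosa sketch below is an illustration under assumed parameter values, not the authors' exact configuration.

```python
import librosa
import numpy as np

def harmonic_spectrogram(path: str, sr: int = 16000, n_fft: int = 1024,
                         hop_length: int = 256) -> np.ndarray:
    """Log-magnitude harmonic spectrogram obtained via HPSS, intended as
    CNN input for snore / non-snore classification."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    # Median-filtering HPSS: keep the harmonic part, discard the percussive part.
    H, _P = librosa.decompose.hpss(np.abs(S))
    return librosa.amplitude_to_db(H, ref=np.max)

# feat = harmonic_spectrogram("snore_0001.wav")  # shape: (n_fft // 2 + 1, frames)
```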
Title: Neurobench: DCASE 2020 Acoustic Scene Classification Benchmark on XyloAudio 2
Link: https://arxiv.org/abs/2410.23776
Abstract: XyloAudio is a line of ultra-low-power audio inference chips, designed for in- and near-microphone analysis of audio in real-time energy-constrained scenarios. Xylo is designed around a highly efficient integer-logic processor which simulates parameter- and activity-sparse spiking neural networks (SNNs) using a leaky integrate-and-fire (LIF) neuron model. Neurons on Xylo are quantised integer devices operating in synchronous digital CMOS, with neuron and synapse state quantised to 16 bit, and weight parameters quantised to 8 bit. Xylo is tailored for real-time streaming operation, as opposed to accelerated-time operation in the case of an inference accelerator. XyloAudio includes a low-power audio encoding interface for direct connection to a microphone, designed for sparse encoding of incident audio for further processing by the inference core. In this report we present the results of the DCASE 2020 acoustic scene classification audio benchmark dataset deployed to XyloAudio 2. We describe the benchmark dataset; the audio preprocessing approach; and the network architecture and training approach. We present the performance of the trained model, and the results of power and latency measurements performed on the XyloAudio 2 development kit. This benchmark is conducted as part of the Neurobench project.
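The neuron model described here (integer LIF dynamics with 16-bit state and 8-bit weights) reduces to a very small update rule. The following is an illustrative integer LIF step in NumPy under assumed decay and threshold values; it is not the Xylo implementation.

```python
import numpy as np

def lif_step(v, spikes_in, weights, decay_num=15, decay_den=16,
             threshold=1024, v_min=-32768, v_max=32767):
    """One integer leaky integrate-and-fire step.

    v          : int16 membrane potentials, shape (n_neurons,)
    spikes_in  : binary input spike vector, shape (n_inputs,)
    weights    : int8 synaptic weights, shape (n_inputs, n_neurons)
    Returns (new_v, output_spikes).
    """
    v = v.astype(np.int32)
    # Leak: multiply by decay_num / decay_den in integer arithmetic.
    v = (v * decay_num) // decay_den
    # Integrate weighted input spikes.
    v += spikes_in.astype(np.int32) @ weights.astype(np.int32)
    # Fire, then reset by subtracting the threshold.
    out = (v >= threshold).astype(np.int8)
    v -= out.astype(np.int32) * threshold
    # Clamp back into the 16-bit state range.
    v = np.clip(v, v_min, v_max).astype(np.int16)
    return v, out
```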
Title: DC-Spin: A Speaker-Invariant Speech Tokenizer for Spoken Language Models
Link: https://arxiv.org/abs/2410.24177
Note: Preprint
Abstract: Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach to enable streamable DC-Spin without retraining and degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.
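DC-Spin's specific contribution is a double-codebook, speaker-invariant clustering scheme. As a generic point of reference only, discrete speech tokens are commonly obtained by k-means clustering of frame-level SSL features; the hedged sketch below shows that baseline pattern, with the feature extractor and codebook size as placeholder assumptions rather than the DC-Spin design.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_codebook(features: np.ndarray, n_tokens: int = 512) -> MiniBatchKMeans:
    """Fit a single k-means codebook over frame-level SSL features
    (shape: [n_frames, feat_dim]) pooled from a training corpus."""
    return MiniBatchKMeans(n_clusters=n_tokens, batch_size=1024,
                           n_init=3, random_state=0).fit(features)

def tokenize(codebook: MiniBatchKMeans, utterance_feats: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest codeword index -> discrete token sequence."""
    return codebook.predict(utterance_feats)

# tokens = tokenize(train_codebook(train_feats), utt_feats)  # e.g. [17, 17, 301, ...]
```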
Title: Task-Aware Unified Source Separation
Link: https://arxiv.org/abs/2410.23987
Note: Submitted to ICASSP 2025
Abstract: Several attempts have been made to handle multiple source separation tasks such as speech enhancement, speech separation, sound event separation, music source separation (MSS), or cinematic audio source separation (CASS) with a single model. These models are trained on large-scale data including speech, instruments, or sound events and can often successfully separate a wide range of sources. However, it is still challenging for such models to cover all separation tasks because some of them are contradictory (e.g., musical instruments are separated in MSS while they have to be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks mentioned earlier. We also provide some audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.
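To make the prompt mechanism concrete: the separator receives one learnable embedding per requested source, and the network's outputs are turned into one mask per prompt, so the same weights can serve different task definitions. The toy PyTorch module below illustrates only this conditioning pattern, not the actual TUSS architecture or training objective.

```python
import torch
import torch.nn as nn

class PromptConditionedSeparator(nn.Module):
    """Toy prompt-conditioned mask estimator (illustration only, not TUSS)."""

    def __init__(self, n_prompt_types: int = 5, n_freq: int = 257, d_model: int = 128):
        super().__init__()
        self.prompts = nn.Embedding(n_prompt_types, d_model)   # learnable source prompts
        self.frame_proj = nn.Linear(n_freq, d_model)            # mixture frames -> model dim
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_head = nn.Linear(d_model, n_freq)

    def forward(self, spec: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        # spec: (B, T, F) mixture magnitude spectrogram
        # prompt_ids: (B, S) indices of the sources to extract (variable S)
        p = self.prompts(prompt_ids)                     # (B, S, D)
        x = self.frame_proj(spec)                        # (B, T, D)
        h = self.encoder(torch.cat([p, x], dim=1))       # joint prompt/frame processing
        n_src = prompt_ids.shape[1]
        prompt_out, frame_out = h[:, :n_src], h[:, n_src:]
        # Each prompt modulates the frame features; a shared head predicts its T-F mask.
        cond = frame_out.unsqueeze(1) * prompt_out.unsqueeze(2)   # (B, S, T, D)
        masks = torch.sigmoid(self.mask_head(cond))               # (B, S, T, F)
        return masks * spec.unsqueeze(1)                          # one estimate per prompt

# sep = PromptConditionedSeparator()
# est = sep(torch.randn(2, 200, 257), torch.tensor([[0, 3], [1, 1]]))  # (2, 2, 200, 257)
```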
Title: Transfer Learning in Vocal Education: Technical Evaluation of Limited Samples Describing Mezzo-soprano
Link: https://arxiv.org/abs/2410.23325
Abstract: Vocal education in the music field is difficult to quantify due to the individual differences in singers' voices and the different quantitative criteria of singing techniques. Deep learning has great potential to be applied in music education due to its efficiency to handle complex data and perform quantitative analysis. However, accurate evaluations with limited samples over rare vocal types, such as Mezzo-soprano, requires extensive well-annotated data support using deep learning models. In order to attain the objective, we perform transfer learning by employing deep learning models pre-trained on the ImageNet and Urbansound8k datasets for the improvement on the precision of vocal technique evaluation. Furthermore, we tackle the problem of the lack of samples by constructing a dedicated dataset, the Mezzo-soprano Vocal Set (MVS), for vocal technique assessment. Our experimental results indicate that transfer learning increases the overall accuracy (OAcc) of all models by an average of 8.3%, with the highest accuracy at 94.2%. We not only provide a novel approach to evaluating Mezzo-soprano vocal techniques but also introduce a new quantitative assessment method for music education.
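The transfer-learning recipe reduces to loading a pretrained backbone and replacing its classification head with one sized for the vocal-technique classes, then fine-tuning on spectrogram-like inputs from the MVS data. A minimal torchvision sketch under an assumed class count follows; an Urbansound8k-pretrained audio backbone would be swapped in the same way.

```python
import torch.nn as nn
from torchvision import models

def build_vocal_classifier(n_classes: int = 4, freeze_backbone: bool = True) -> nn.Module:
    """ImageNet-pretrained ResNet-18 with a new head for vocal-technique labels."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    if freeze_backbone:                      # optionally fine-tune only the new head first
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, n_classes)
    return model

# model = build_vocal_classifier(n_classes=4)  # class count is a placeholder
# train on 3-channel spectrogram images resized to 224x224, e.g. with CrossEntropyLoss
```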
Title: Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for Text-to-Speech Synthesis
Link: https://arxiv.org/abs/2410.23320
Note: Preprint
Abstract: Neural codec language models have achieved state-of-the-art performance in text-to-speech (TTS) synthesis, leveraging scalable architectures like autoregressive transformers and large-scale speech datasets. By framing voice cloning as a prompt continuation task, these models excel at cloning voices from short audio samples. However, this approach is limited in its ability to handle numerous or lengthy speech excerpts, since the concatenation of source and target speech must fall within the maximum context length which is determined during training. In this work, we introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures like Gated Linear Attention (GLA). Building on the success of initial-state tuning on RWKV, we extend this technique to voice cloning, enabling the use of multiple speech samples and full utilization of the context window in synthesis. This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes. Notably, Lina-Speech matches or outperforms state-of-the-art baseline models, including some with a parameter count up to four times higher or trained in an end-to-end style. We release our code and checkpoints. Audio samples are available at https://theodorblackbird.github.io/blog/demo_lina/.
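For readers unfamiliar with GLA, its recurrent form is a gated outer-product state update, which is what lets the model condition on long or multiple reference samples at constant memory per step. Below is a minimal single-head NumPy sketch of that recurrence; it is not the Lina-Speech code and omits the chunk-parallel training form.

```python
import numpy as np

def gla_recurrence(q, k, v, alpha):
    """Recurrent gated linear attention.

    q, k   : (T, d_k) queries / keys
    v      : (T, d_v) values
    alpha  : (T, d_k) per-step forget gates in (0, 1)
    Returns outputs of shape (T, d_v).
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # running key-value state
    out = np.zeros((T, d_v))
    for t in range(T):
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])  # gated decay + rank-1 update
        out[t] = q[t] @ S                                  # read the state with the query
    return out

# Initial-state tuning (as used here for voice cloning) would replace the zero
# initialization of S with a state learned from the reference speech.
```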
Title: DDMD: An AI-Driven Digital Drug Music Detector
Link: https://arxiv.org/abs/2410.23293
Note: 14 pages
Abstract: We present the first version of DDMD (Digital Drug Music Detector), a binary classifier that distinguishes digital drug music from normal music. In the literature, digital drug music is primarily explored regarding its psychological, neurological, or social impact. However, despite numerous studies on using machine learning in Music Information Retrieval (MIR), including music genre classification, digital drug music has not been considered in this field. In this study, we initially collected a dataset of 3,176 audio files divided into two classes (1,676 digital drugs and 1,500 non-digital drugs). We extracted machine learning features, including MFCCs, chroma, spectral contrast, and frequency analysis metrics (mean and standard deviation of detected frequencies). Using a Random Forest classifier, we achieved an accuracy of 93%. Finally, we developed a web application to deploy the model, enabling end users to detect digital drug music.
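The described pipeline is standard MIR feature extraction followed by a Random Forest. A hedged librosa/scikit-learn sketch is shown below; the paper's exact frequency-analysis features and hyperparameters are not specified here, so they are placeholders (spectral centroid statistics stand in for the frequency metrics).

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(path: str, sr: int = 22050) -> np.ndarray:
    """Clip-level feature vector: MFCC, chroma and spectral-contrast statistics
    plus simple frequency statistics (mean/std of the spectral centroid)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    feats = [m for f in (mfcc, chroma, contrast) for m in (f.mean(axis=1), f.std(axis=1))]
    feats.append(np.array([centroid.mean(), centroid.std()]))
    return np.concatenate(feats)

# X = np.stack([extract_features(p) for p in paths]); y = labels  # 0 = normal, 1 = digital drug
# clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
```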
Title: Cough-E: A Multimodal, Privacy-Preserving Cough Detection Algorithm for the Edge
Link: https://arxiv.org/abs/2410.24066
Note: 14 pages, 10 figures
Abstract: Continuous cough monitors can greatly aid doctors in home monitoring and treatment of respiratory diseases. Although many algorithms have been proposed, they still face limitations in data privacy and short-term monitoring. Edge-AI offers a promising solution by processing privacy-sensitive data near the source, but challenges arise in deploying resource-intensive algorithms on constrained devices. From a suitable selection of audio and kinematic signals, our methodology aims at the optimal selection of features via Recursive Feature Elimination with Cross-Validation (RFECV), which exploits the explainability of the selected XGB model. Additionally, it analyzes the use of Mel spectrogram features, instead of the more common MFCC. Moreover, a set of hyperparameters for a multimodal implementation of the classifier is explored. Finally, it evaluates the performance based on clinically relevant event-based metrics. We apply our methodology to develop Cough-E, an energy-efficient, multimodal and edge AI cough detection algorithm. It exploits audio and kinematic data in two distinct classifiers, jointly cooperating for a balanced energy and performance trade-off. We demonstrate that our algorithm can be executed in real-time on an ARM Cortex M33 microcontroller. Cough-E achieves a 70.56% energy saving when compared to the audio-only approach, at the cost of a 1.26% relative performance drop, resulting in a 0.78 F1-score. Both Cough-E and the edge-aware model optimization methodology are publicly available as open-source code. This approach demonstrates the benefits of the proposed hardware-aware methodology to enable privacy-preserving cough monitors on the edge, paving the way to efficient cough monitoring.
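The feature-selection stage is recursive feature elimination with cross-validation wrapped around an XGBoost classifier, scored with an event-relevant metric. A minimal scikit-learn/XGBoost sketch under assumed settings:

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def select_cough_features(X: np.ndarray, y: np.ndarray) -> RFECV:
    """RFECV over audio + kinematic features using an XGBoost estimator.
    Returns the fitted selector; selector.support_ marks the retained features."""
    selector = RFECV(
        estimator=XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss"),
        step=1,                                  # drop one feature per iteration
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        scoring="f1",                            # stand-in for the event-based metric
        n_jobs=-1,
    )
    return selector.fit(X, y)

# sel = select_cough_features(X_train, y_train)
# X_reduced = X_train[:, sel.support_]          # reduced feature set for the edge model
```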
Title: An Empirical Analysis of Multi-Resolution Speech Self-Supervised Learning
Link: https://arxiv.org/abs/2410.23955
Abstract: Self-supervised learning (SSL) models have become crucial in speech processing, with recent advancements concentrating on developing architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the precise reasons for these enhancements have poor empirical support. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself, and (2) downsampling to lower resolutions neither improves downstream performance nor correlates with higher-level information (e.g., words), though it does improve computational efficiency. These findings challenge assumptions about the multi-scale nature of MR-HuBERT and motivate the importance of disentangling computational efficiency from learning better representations.
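The analysis tooling is generic: canonical correlation between two sets of aligned activations (e.g., a low-resolution MR-HuBERT layer versus word-level targets), alongside mutual-information estimates. The scikit-learn sketch below uses plain CCA as an assumed stand-in for the paper's exact CCA variant.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def mean_cca_similarity(X: np.ndarray, Y: np.ndarray, n_components: int = 10) -> float:
    """Average canonical correlation between two representation matrices
    X: (n_frames, d_x) and Y: (n_frames, d_y), aligned on the same time steps."""
    cca = CCA(n_components=n_components, max_iter=1000).fit(X, Y)
    Xc, Yc = cca.transform(X, Y)
    corrs = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

# e.g. similarity between a downsampled MR-HuBERT layer and word-level embeddings:
# score = mean_cca_similarity(layer_activations, word_embeddings)
```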
Title: Novel View Acoustic Parameter Estimation
Link: https://arxiv.org/abs/2410.23523
Note: 10 pages main text, 27 pages total; submitted to ICLR 2025, under review
Abstract: The task of Novel View Acoustic Synthesis (NVAS) - generating Room Impulse Responses (RIRs) for unseen source and receiver positions in a scene - has recently gained traction, especially given its relevance to Augmented Reality (AR) and Virtual Reality (VR) development. However, many of these efforts suffer from similar limitations: they infer RIRs in the time domain, which prove challenging to optimize; they focus on scenes with simple, single-room geometries; they infer only single-channel, directionally-independent acoustic characteristics; and they require inputs, such as 3D geometry meshes with material properties, that may be impractical to obtain for on-device applications. On the other hand, research suggests that sample-wise accuracy of RIRs is not required for perceptual plausibility in AR and VR. Standard acoustic parameters like Clarity Index (C50) or Reverberation Time (T60) have been shown to capably describe pertinent characteristics of the RIRs, especially late reverberation. To address these gaps, this paper introduces a new task centered on estimating spatially distributed acoustic parameters that can be then used to condition a simple reverberator for arbitrary source and receiver positions. The approach is modelled as an image-to-image translation task, which translates 2D floormaps of a scene into 2D heatmaps of acoustic parameters. We introduce a new, large-scale dataset of 1000 scenes consisting of complex, multi-room apartment conditions, and show that our method outperforms statistical baselines significantly. Moreover, we show that the method also works for directionally-dependent (i.e. beamformed) parameter prediction. Finally, the proposed method operates on very limited information, requiring only a broad outline of the scene and a single RIR at inference time.
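Framing the task as image-to-image translation means a 2D network maps floormap channels (plus whatever minimal conditioning is available, such as a source-position map) to per-pixel acoustic parameters such as T60 or C50. The toy fully convolutional sketch below only fixes that input/output contract; the channel encoding and the paper's actual architecture are assumptions.

```python
import torch
import torch.nn as nn

class FloormapToAcousticMap(nn.Module):
    """Toy encoder-decoder: floormap channels in, acoustic-parameter heatmaps out."""

    def __init__(self, in_ch: int = 2, out_ch: int = 2, width: int = 32):
        # in_ch: e.g. occupancy map + source-position map (assumed encoding)
        # out_ch: e.g. one channel each for T60 and C50 heatmaps
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, floormap: torch.Tensor) -> torch.Tensor:
        return self.net(floormap)          # (B, out_ch, H, W) parameter heatmaps

# pred = FloormapToAcousticMap()(torch.randn(1, 2, 128, 128))  # -> (1, 2, 128, 128)
```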
Title: Leveraging Linguistic Similarities Among African Languages for Speech-to-Speech Translation
Link: https://arxiv.org/abs/2410.23323
Abstract: This paper presents a pilot study on direct speech-to-speech translation (S2ST) by leveraging linguistic similarities among selected African languages within the same phylum, particularly in cases where traditional data annotation is expensive or impractical. We propose a segment-based model that maps speech segments both within and across language phyla, effectively eliminating the need for large paired datasets. By utilizing paired segments and guided diffusion, our model enables translation between any two languages in the dataset. We evaluate the model on a proprietary dataset from the Kenya Broadcasting Corporation (KBC), which includes five languages: Swahili, Luo, Kikuyu, Nandi, and English. The model demonstrates competitive performance in segment pairing and translation quality, particularly for languages within the same phylum. Our experiments reveal that segment length significantly influences translation accuracy, with average-length segments yielding the highest pairing quality. Comparative analyses with traditional cascaded ASR-MT techniques show that the proposed model delivers nearly comparable translation performance. This study underscores the potential of exploiting linguistic similarities within language groups to perform efficient S2ST, especially in low-resource language contexts.