Reproduced with permission from arXiv每日学术速递.
Title: Temporal Convolution-Based Hybrid Model Approach with Representation Learning for Real-Time Acoustic Anomaly Detection
Link: https://arxiv.org/abs/2410.19722
Notes: 10 pages, 10 figures, ICMLC2024
Abstract: The early detection of potential failures in industrial machinery components is paramount for ensuring the reliability and safety of operations, thereby preserving Machine Condition Monitoring (MCM). This research addresses this imperative by introducing an innovative approach to Real-Time Acoustic Anomaly Detection. Our method combines semi-supervised temporal convolution with representation learning and a hybrid model strategy with Temporal Convolutional Networks (TCN) to handle various intricate anomaly patterns found in acoustic data effectively. The proposed model demonstrates superior performance compared to established research in the field, underscoring the effectiveness of this approach. Not only do we present quantitative evidence of its superiority, but we also employ visual representations, such as t-SNE plots, to further substantiate the model's efficacy.
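The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea: a stack of dilated temporal-convolution blocks used as an autoencoder over log-mel frames, with reconstruction error as the anomaly score. The layer sizes, the causal-padding trick, and the scoring rule are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated, causal 1-D convolution block with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation          # pad, then trim, to stay causal
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                           # x: (batch, channels, frames)
        out = self.conv(x)[..., :x.shape[-1]]       # drop the right-side overhang
        return self.relu(out + x)

class TCNAutoencoder(nn.Module):
    """TCN stack that reconstructs its input; reconstruction error = anomaly score."""
    def __init__(self, n_mels=64, hidden=64, n_blocks=4):
        super().__init__()
        self.inp = nn.Conv1d(n_mels, hidden, 1)
        self.blocks = nn.Sequential(*[TCNBlock(hidden, dilation=2 ** i)
                                      for i in range(n_blocks)])
        self.out = nn.Conv1d(hidden, n_mels, 1)

    def forward(self, x):                           # x: (batch, n_mels, frames)
        return self.out(self.blocks(self.inp(x)))

def anomaly_score(model, mel):
    """Mean squared reconstruction error per clip; higher means more anomalous."""
    with torch.no_grad():
        recon = model(mel)
    return ((recon - mel) ** 2).mean(dim=(1, 2))

if __name__ == "__main__":
    model = TCNAutoencoder()
    clips = torch.randn(8, 64, 200)                 # stand-in for log-mel features
    print(anomaly_score(model, clips))
```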
Title: Arabic Music Classification and Generation using Deep Learning
Link: https://arxiv.org/abs/2410.19719
Abstract: This paper proposes a machine learning approach for classifying classical and new Egyptian music by composer and generating new similar music. The proposed system utilizes a convolutional neural network (CNN) for classification and a CNN autoencoder for generation. The dataset used in this project consists of new and classical Egyptian music pieces composed by different composers. To classify the music by composer, each sample is normalized and transformed into a mel spectrogram. The CNN model is trained on the dataset using the mel spectrograms as input features and the composer labels as output classes. The model achieves 81.4% accuracy in classifying the music by composer, demonstrating the effectiveness of the proposed approach. To generate new music similar to the original pieces, a CNN autoencoder is trained on a similar dataset. The model is trained to encode the mel spectrograms of the original pieces into a lower-dimensional latent space and then decode them back into the original mel spectrogram. The generated music is produced by sampling from the latent space and decoding the samples back into mel spectrograms, which are then transformed into audio. In conclusion, the proposed system provides a promising approach to classifying and generating classical Egyptian music, which can be applied in various musical applications, such as music recommendation systems, music production, and music education.
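Purely to illustrate the pipeline the abstract describes (peak-normalize each sample, convert it to a mel spectrogram, train a CNN with composer labels as output classes), a small librosa/PyTorch sketch could look like the one below; the sample rate, layer sizes, and helper names are assumptions, not the paper's configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def to_log_mel(path, sr=22050, n_mels=128, duration=30.0):
    """Load an audio file, peak-normalize it, and return a log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    y = y / (np.max(np.abs(y)) + 1e-9)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)      # shape: (n_mels, frames)

class ComposerCNN(nn.Module):
    """Small 2-D CNN over log-mel spectrograms, one logit per composer."""
    def __init__(self, n_composers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(32, n_composers)

    def forward(self, x):                            # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

# One training step on a dummy batch, to show the shapes involved.
model = ComposerCNN(n_composers=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(4, 1, 128, 256)                  # would come from to_log_mel(...)
labels = torch.randint(0, 5, (4,))
optimizer.zero_grad()
loss = nn.CrossEntropyLoss()(model(batch), labels)
loss.backward()
optimizer.step()
```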
Title: CloserMusicDB: A Modern Multipurpose Dataset of High Quality Music
Link: https://arxiv.org/abs/2410.19540
Abstract: In this paper, we introduce CloserMusicDB, a collection of full-length, studio-quality tracks annotated by a team of human experts. We describe the selected qualities of our dataset, along with three example tasks that can be performed using this dataset: hook detection, contextual tagging, and artist identification. We conduct baseline experiments and provide initial benchmarks for these tasks.
Title: Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis
Link: https://arxiv.org/abs/2410.19199
Abstract: Recent studies have outlined the accessibility challenges faced by blind or visually impaired and less-literate people in interacting with social networks, in spite of facilitating technologies such as monotone text-to-speech (TTS) screen readers and audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to a lack of expressive emotional rendering. In real-life communications, the duration of phonemes can vary, since the same sentence might be spoken in a variety of ways depending on the speakers' emotional states or accents (referred to as the one-to-many problem of text-to-speech generation). As a result, an advanced voice synthesis system is required to account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. Our system also showcases competitive inference-time performance when benchmarked against state-of-the-art TTS models, making it suitable for real-time accessibility applications.
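The system itself is not reproduced in this digest, so the sketch below only marks out the two-stage idea in the abstract: derive the conveyed emotion from the text, then hand text plus emotion to an emotion-conditioned synthesizer. The Hugging Face pipeline and model name are assumptions, and `synthesize` is a hypothetical placeholder for the TTS backend.

```python
from transformers import pipeline

# Stage 1 (assumption): any off-the-shelf text emotion classifier would do here.
emotion_clf = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")

def synthesize(text: str, emotion: str) -> bytes:
    """Hypothetical stage 2: an emotion-conditioned TTS backend.
    This stub only marks where text plus the predicted emotion would be
    rendered to audio; it is not the paper's synthesizer."""
    raise NotImplementedError(f"render {text!r} with emotion {emotion!r}")

def speak(post_text: str) -> bytes:
    emotion = emotion_clf(post_text)[0]["label"]     # e.g. "joy", "anger"
    return synthesize(post_text, emotion)

# Example (stage 1 only): print the emotion predicted for a social-media post.
print(emotion_clf("I finally got the job, I can't believe it!")[0])
```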
Title: AlignCap: Aligning Speech Emotion Captioning to Human Preferences
Link: https://arxiv.org/abs/2410.19134
Notes: Accepted to EMNLP2024 main conference
Abstract: Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which aligns speech emotion captioning to human preferences based on a large language model (LLM), with two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) regularization; 2) Human Preference Alignment, where we design Preference Optimization (PO) regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt for enriching fine-grained information under KD regularization. Experiments demonstrate that AlignCap achieves stronger performance than other state-of-the-art methods on the zero-shot SEC task.
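The paper's exact loss is not given here; assuming the speech-text alignment term takes a standard knowledge-distillation form, "minimizing the divergence between the LLM's response prediction distributions for speech and text inputs" could be sketched as below (the KL direction and temperature are assumptions).

```python
import torch
import torch.nn.functional as F

def kd_regularization(speech_logits: torch.Tensor,
                      text_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the LLM's next-token distributions when it is
    conditioned on speech versus on the transcript. Shapes: (batch, seq, vocab)."""
    log_p_speech = F.log_softmax(speech_logits / temperature, dim=-1)
    p_text = F.softmax(text_logits / temperature, dim=-1)
    return F.kl_div(log_p_speech, p_text, reduction="batchmean")

# Toy shapes: batch of 2, sequence length 5, vocabulary of 100.
loss = kd_regularization(torch.randn(2, 5, 100), torch.randn(2, 5, 100))
print(loss)
```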
Title: Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation
Link: https://arxiv.org/abs/2410.19595
Notes: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract: Due to their robustness and flexibility, neural-driven beamformers are a popular choice for speech separation in challenging environments with a varying number of simultaneous speakers alongside noise and reverberation. Time-frequency masks and relative directions of the speakers with respect to a fixed spatial grid can be used to estimate the beamformer's parameters. To some degree, speaker independence is achieved by ensuring a greater number of spatial partitions than speech sources. In this work, we analyze how to encode both mask and positioning into such a grid to enable joint estimation of both quantities. We propose mask-weighted spatial likelihood coding and show that it achieves considerable performance in both tasks compared to baseline encodings optimized for either localization or mask estimation. In the same setup, we demonstrate superiority for joint estimation of both quantities. Finally, we propose a universal approach which can replace an upstream sound source localization system solely by adapting the training framework, making it highly relevant in performance-critical scenarios.
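The abstract does not spell out the encoding, so the following is only one plausible reading of "mask-weighted spatial likelihood coding": each partition of the direction grid receives a speaker's time-frequency mask, weighted by a likelihood that the speaker's direction of arrival falls into that partition. The Gaussian weighting, its width, and the grid resolution are assumptions made for illustration.

```python
import numpy as np

def encode_targets(masks, doas, grid_deg, sigma=15.0):
    """Hypothetical target encoding on a spatial grid.
    masks: (n_spk, T, F) time-frequency masks, doas: (n_spk,) directions in degrees,
    grid_deg: (n_grid,) partition centers in degrees. Returns (n_grid, T, F)."""
    targets = np.zeros((len(grid_deg), *masks.shape[1:]), dtype=masks.dtype)
    for mask, doa in zip(masks, doas):
        diff = np.abs((grid_deg - doa + 180.0) % 360.0 - 180.0)   # wrapped angular distance
        weights = np.exp(-0.5 * (diff / sigma) ** 2)              # assumed Gaussian spatial likelihood
        targets += weights[:, None, None] * mask[None]
    return np.clip(targets, 0.0, 1.0)

# Toy example: two speakers, 36 partitions of 10 degrees each.
masks = np.random.rand(2, 100, 257)
targets = encode_targets(masks, np.array([30.0, 200.0]), np.arange(0, 360, 10))
print(targets.shape)   # (36, 100, 257)
```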
Title: MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Link: https://arxiv.org/abs/2410.19168
Notes: Project Website: this https URL
Abstract: The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.