The Complete Toolkit of Speech-Processing Components: Speech LLMs, Digital Humans, Speech-to-Text, and Text-to-Speech


Today is Sunday, January 5, 2025. Beijing, cloudy.

Let's first look at a few recent developments on the speech side: an overview of the evolution of speech models (Awesome-Audio-LLM), plus 7 open-source digital human projects, 23 speech-to-text projects, and 29 text-to-speech (TTS) tools, most of them open source. All of these can serve as building blocks for speech-synthesis-related products and demos.

Working topic by topic, and systematically, leads to deeper thinking. Let's keep at it together.

I. An Overview of the Development of Speech Models

Awesome-Audio-LLM (https://github.com/AudioLLMs/Awesome-Audio-LLM) is a curated collection of large models for the audio domain. The project is very well done; anyone working on digital humans or speech processing should keep an eye on it, and I find it quite comprehensive.

Its list of released audio language models is reproduced below (a short usage sketch for one of the listed models, Qwen2-Audio, follows the list):

1)TangoFlux:https://arxiv.org/abs/2412.21037;(added here as a supplement)

2)MERaLiON-AudioLLM:https://arxiv.org/abs/2412.09818;

3)ADU-Bench:https://arxiv.org/abs/2412.05167;

4)Dynamic-SUPERB Phase-2:https://arxiv.org/pdf/2411.05361;

5)AudioLLM:https://arxiv.org/pdf/2411.07111;

6)WavChat-Survey:https://arxiv.org/abs/2411.13577;

7)SpeechLLM-Survey:https://arxiv.org/pdf/2410.18908v2;

8)VoiceBench:https://arxiv.org/pdf/2410.17196;

9)SPIRIT LM:https://arxiv.org/pdf/2402.05755;

10)DiVA:https://arxiv.org/pdf/2410.02678;

11)SpeechEmotionLlama:https://arxiv.org/pdf/2410.01162;

12)SpeechLM-Survey:https://arxiv.org/pdf/2410.03751;

13)MMAU:https://arxiv.org/pdf/2410.19168;

14)SALMon:https://arxiv.org/abs/2409.07437;

15)EMOVA:https://arxiv.org/pdf/2409.18042;

16)Moshi:https://arxiv.org/pdf/2410.00037;

17)LLaMA-Omni:https://arxiv.org/pdf/2409.06666v1;

18)Ultravox:https://github.com/fixie-ai/ultravox;

19)MoWE-Audio:https://arxiv.org/pdf/2409.06635;

20)AudioBERT:https://arxiv.org/pdf/2409.08199;

21)DeSTA2:https://arxiv.org/pdf/2409.20007;

22)ASRCompare:https://arxiv.org/pdf/2409.00800v1;

23)MooER:https://arxiv.org/pdf/2408.05101;

24)MuChoMusic:https://arxiv.org/abs/2408.01337;

25)Mini-Omni:https://arxiv.org/pdf/2408.16725;

26)FunAudioLLM:https://arxiv.org/pdf/2407.04051v3;

27)Qwen2-Audio:https://arxiv.org/pdf/2407.10759;

28)GAMA:https://arxiv.org/abs/2406.11768;

29)LLaST:https://arxiv.org/pdf/2407.15415;

30)Decoder-only LLMs for STT:https://arxiv.org/pdf/2407.03169;

31)AudioEntailment:https://arxiv.org/pdf/2407.18062;

32)CompA:https://arxiv.org/abs/2310.08753;

33)DeSTA:https://arxiv.org/abs/2406.18871;

34)Audio Hallucination:https://arxiv.org/pdf/2406.08402;

35)CodecFake:https://arxiv.org/abs/2406.07237;

36)SD-Eval:https://arxiv.org/pdf/2406.13340;

37)Speech ReaLLM:https://arxiv.org/pdf/2406.09569;

38)AudioBench:https://arxiv.org/abs/2406.16020;

39)AIR-Bench:https://aclanthology.org/2024.acl-long.109/;

40)Audio Flamingo:https://arxiv.org/abs/2402.01831;

41)VoiceJailbreak:https://arxiv.org/pdf/2405.19103;

42)SALMONN:https://arxiv.org/pdf/2310.13289;

43)WavLLM:https://arxiv.org/pdf/2404.00656;

44)AudioLM-Survey:https://arxiv.org/abs/2402.13236;

45)SLAM-LLM:https://arxiv.org/pdf/2402.08846;

46)Pengi:https://arxiv.org/pdf/2305.11834;

47)Qwen-Audio:https://arxiv.org/pdf/2311.07919;

48)CoDi-2:https://arxiv.org/pdf/2311.18775;

49)UniAudio:https://arxiv.org/abs/2310.00704;

50)Dynamic-SUPERB:https://arxiv.org/abs/2309.09510;

51)LLaSM:https://arxiv.org/pdf/2308.15930;

52)Segment-level Q-Former:https://arxiv.org/pdf/2309.13963;

53)Prompting LLMs with Speech Recognition:https://arxiv.org/pdf/2307.11795;

54)Macaw-LLM:https://arxiv.org/pdf/2306.09093;

55)SpeechGPT:https://arxiv.org/pdf/2305.11000;

56)AudioGPT:https://arxiv.org/pdf/2304.12995.
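To make the list above a bit more actionable, here is a minimal sketch of querying one of the listed models, Qwen2-Audio-7B-Instruct, through Hugging Face transformers. It follows the pattern on the official model card, but the audio file name and prompt are placeholders, and argument names should be checked against the card for your transformers version:

```python
# Minimal sketch: ask Qwen2-Audio-7B-Instruct a question about a local clip.
# Requires: pip install transformers librosa torch (a GPU is advisable for 7B).
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Chat-style input: one audio clip plus a text question ("sample.wav" is a placeholder).
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "What is being said in this clip?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens and keep only the model's reply.
reply = generated[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(reply, skip_special_tokens=True)[0])
```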

In addition, here is a library of free sound-effect material, about 1,800 sounds in total, useful for speech synthesis and video sound design: https://taira-komori.jpn.org/freesoundcn.html

Worth bookmarking.

II. Open-Source Projects for Digital Humans, Speech-to-Text, and Text-to-Speech

1. Seven Open-Source Digital Human Projects

A sketch of a toy talking-head pipeline built from these components follows the list.

1)Fay:https://github.com/xszyou/Fay

2)SadTalker:https://sadtalker.github.io/,https://modelscope.cn/studios/CVstudio/cv_human_portrait

3)Hallo:https://fudan-generative-vision.github.io/hallo/#/,https://modelscope.cn/studios/AI-ModelScope/Hallo

4)EchoMimic/EchoMimic_v2:https://badtobest.github.io/echomimic,https://modelscope.cn/studios/BadToBest/BadToBest,https://antgroup.github.io/ai/echomimic_v2/,https://github.com/antgroup/echomimic_v2

5)Wav2Lip:https://github.com/Rudrabha/Wav2Lip

6)MuseTalk:https://github.com/TMElyralab/MuseTalk

7)LivePortrait:https://github.com/KwaiVGI/LivePortrait
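Most of these projects expose either a Python API or a command-line inference script, so a toy talking-head demo can be wired together in a few lines. Below is a rough sketch, not taken from any of the repos above, that voices a script with edge-tts (see the TTS section) and lip-syncs a face video with Wav2Lip's inference script; the checkpoint name and paths follow the Wav2Lip README, but treat them as assumptions to verify against your own checkout:

```python
# Toy pipeline sketch: text -> speech (edge-tts) -> lip-synced video (Wav2Lip).
# Assumes: pip install edge-tts, a local Wav2Lip clone, and its pretrained
# checkpoint downloaded to Wav2Lip/checkpoints/ per the repo README.
import asyncio
import subprocess

import edge_tts

async def synthesize(text: str, out_path: str) -> None:
    # Pick any voice from `edge-tts --list-voices`; Xiaoxiao is a common zh-CN choice.
    await edge_tts.Communicate(text, voice="zh-CN-XiaoxiaoNeural").save(out_path)

def lip_sync(face_video: str, audio_path: str) -> None:
    # Mirrors the Wav2Lip README invocation; output lands in Wav2Lip/results/.
    subprocess.run(
        [
            "python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
            "--face", face_video,
            "--audio", audio_path,
        ],
        cwd="Wav2Lip",
        check=True,
    )

if __name__ == "__main__":
    asyncio.run(synthesize("欢迎收看今天的语音技术早报。", "speech.mp3"))
    # Paths are relative to the Wav2Lip directory because of cwd= above.
    lip_sync("../anchor.mp4", "../speech.mp3")
```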

2. Twenty-Three Speech-to-Text Projects

A minimal Whisper transcription sketch follows the list.

1)Moonshine:https://github.com/usefulsensors/moonshine,https://hf-mirror.com/UsefulSensors/moonshine,https://arxiv.org/abs/2410.15608

2)Paraformer:https://github.com/modelscope/FunASR,https://arxiv.org/abs/2206.08317,https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary

3)Whisper-large-v3:https://hf-mirror.com/openai/whisper-large-v3

4)SenseVoice:https://github.com/FunAudioLLM/SenseVoice,https://fun-audio-llm.github.io/pdf/FunAudioLLM.pdf,https://fun-audio-llm.github.io/,https://www.modelscope.cn/studios/iic/SenseVoice

5)Whisper-turbo:https://hf-mirror.com/openai/whisper-large-v3-turbo

6)Qwen2-Audio:https://github.com/QwenLM/Qwen2-Audio,https://arxiv.org/abs/2407.10759,https://qwenlm.github.io/blog/qwen2-audio,https://hf-mirror.com/Qwen/Qwen2-Audio-7B-Instruct,https://www.modelscope.cn/studios/qwen/Qwen2-Audio-Instruct-Demo

7)FunASR:https://github.com/alibaba/FunASR

8)ESPnet:https://github.com/espnet/espnet

9)DeepSpeech:https://github.com/mozilla/DeepSpeech,https://deepspeech.readthedocs.io/en/r0.9/,https://linux.cn/article-14233-1.html

10)PaddleSpeech:https://github.com/PaddlePaddle/PaddleSpeech

11)MASR:https://github.com/nobody132/masr,https://blog.csdn.net/HELLOWORLD2424/article/details/12366787

12)SpeechBrain:https://github.com/speechbrain/speechbrain

13)WeNet:https://github.com/wenet-e2e/wenet,https://arxiv.org/abs/2203.15455

14)ASRT:https://github.com/nl8590687/ASRT_SpeechRecognition

15)Massively Multilingual Speech:https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/,https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md,https://ai.meta.com/blog/multilingual-model-speech-recognition

16)OpenSeq2Seq:https://github.com/NVIDIA/OpenSeq2Seq

17)Vosk:https://github.com/alphacep/vosk-api

18)TensorFlowASR:https://github.com/TensorSpeech/TensorFlowASR

19)Athena:https://github.com/athena-team/athena

20)Flashlight ASR:https://github.com/flashlight/wav2letter

21)Reverb:https://github.com/revdotcom/reverb/tree/main/asr

22)Kaldi:https://github.com/kaldi-asr/kaldi,https://kaldi-asr.org/

23)Coqui STT:https://github.com/coqui-ai/STT
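For a quick baseline among these, Whisper-large-v3 (item 3) can be driven through the transformers ASR pipeline in a few lines. A minimal sketch, with an illustrative file name:

```python
# Minimal sketch: transcribe a local recording with Whisper-large-v3 via the
# Hugging Face pipeline. Requires: pip install transformers torch (GPU recommended).
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# return_timestamps=True enables chunked long-form decoding (beyond 30 s)
# and adds per-segment timestamps to the result.
result = asr("meeting.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```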

3. Twenty-Nine Text-to-Speech (TTS) Tools

Most of these are open source; a few (TTS Maker, Microsoft Azure, OpenAI's Voice Engine) are hosted services. A minimal voice-cloning sketch follows the list.

1)TTS Maker:https://ttsmaker.com/zh-cn

2)Microsoft Azure TTS:https://azure.microsoft.com/en-us/products/cognitive-services/text-to-speech/

3)PaddleSpeech:https://github.com/PaddlePaddle/PaddleSpeech

4)VoiceVox:https://github.com/VOICEVOX/voicevox

5)TensorFlowTTS:https://github.com/TensorSpeech/TensorFlowTTS

6)TTSKit:https://github.com/kuangdd/ttskit

7)OpenTTS:https://github.com/synesthesiam/opentts

8)eSpeak NG:https://github.com/espeak-ng/espeak-ng

9)F5-TTS:https://github.com/SWivid/F5-TTS,https://huggingface.co/SWivid/F5-TTS,https://arxiv.org/pdf/2410.06885,https://huggingface.co/spaces/mrfakename/E2-F5-TTS

10)Edge-TTS:https://github.com/rany2/edge-tts

11)ChatTTS:https://github.com/2noise/ChatTTS

12)ChatTTS-ui:https://github.com/jianchang512/ChatTTS-ui

13)Seed-TTS:https://bytedancespeech.github.io/seedtts_tech_report/,https://arxiv.org/pdf/2406.02430,https://github.com/BytedanceSpeech/seed-tts-eval/

14)Fish Speech:https://github.com/fishaudio/fish-speech,https://fish.audio/zh-CN/

15)GPT-SoVITS:https://github.com/RVC-Boss/GPT-SoVITS

16)OpenVoice:https://github.com/myshell-ai/OpenVoice,https://arxiv.org/pdf/2312.01479.pdf

17)Parler-TTS:https://github.com/huggingface/parler-tts

18)FunAudioLLM CosyVoice:https://github.com/FunAudioLLM/CosyVoice

19)VoiceCraft:https://github.com/jasonppy/VoiceCraft

20)EmotiVoice:https://github.com/netease-youdao/EmotiVoice

21)MetaVoice-1B:https://github.com/metavoiceio/metavoice-src

22)OpenAI Voice Engine:https://ai-bot.cn/openai-voice-engine/

23)Bark:https://github.com/suno-ai/bark

24)MaskGCT:https://hf-mirror.com/amphion/MaskGCT

25)Coqui TTS:https://github.com/coqui-ai/tts,https://huggingface.co/spaces/coqui/xtts,https://tts.readthedocs.io/en/dev/models/xtts.html

26)So-VITS-SVC:https://github.com/svc-develop-team/so-vits-svc

27)MockingBird:https://github.com/babysor/MockingBird,https://www.bilibili.com/video/BV17Q4y1B7mY

28)Real-Time-Voice-Cloning:https://github.com/CorentinJ/Real-Time-Voice-Cloning

29)voice-pro:https://github.com/abus-aikorea/voice-pro
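As one concrete entry point, Coqui TTS (item 25) bundles XTTS-v2, which does zero-shot voice cloning from a few seconds of reference audio. A minimal sketch following the Coqui docs; the model name comes from their documentation, while the file paths are placeholders:

```python
# Minimal sketch: zero-shot voice cloning with Coqui TTS / XTTS-v2.
# Requires: pip install TTS (the coqui-ai package; the model downloads on first run).
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="今天是星期日,北京多云。",
    speaker_wav="reference_speaker.wav",  # a short clip of the voice to clone
    language="zh-cn",                     # XTTS uses lowercase locale codes
    file_path="cloned_output.wav",
)
```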

References

1. https://github.com/AudioLLMs/Awesome-Audio-LLM

About Us

Lao Liu, an open-source NLP enthusiast and practitioner. Homepage: https://liuhuanyong.github.io.

If you are interested in large models, knowledge graphs, RAG, and document understanding, as well as the daily morning report, the archive of Lao Liu on NLP online talks, and exchanging notes with others, you are welcome to join the community, which keeps accepting new members.

How to join: follow the official account, then in the backend menu choose Member Community -> Join the Member Group.

