Today is Sunday, January 5, 2025; Beijing, cloudy.
Let's first look at several recent developments in speech: an overview of the evolution of audio language models, Awesome-Audio-LLM, plus 7 open-source digital-human projects, 23 speech-to-text projects, and 29 open-source text-to-speech (TTS) tools, all of which can serve as a foundation for speech-synthesis products and demos.
Working topic by topic, and systematically, leads to deeper thinking. Let's keep at it together.
I. An Overview of Audio Language Model Development
AudioLLMs/Awesome-Audio-LLM, a curated collection of large models for the audio domain: https://github.com/AudioLLMs/Awesome-Audio-LLM. This project is well worth following for anyone working on digital humans or speech processing; I find it quite comprehensive.
Its list of released audio language models is as follows (a minimal usage sketch for one entry, Qwen2-Audio, appears after the list):
1)TangoFlux:https://arxiv.org/abs/2412.21037;
2)MERaLiON-AudioLLM:https://arxiv.org/abs/2412.09818;
3)ADU-Bench:https://arxiv.org/abs/2412.05167;
4)Dynamic-SUPERB Phase-2:https://arxiv.org/pdf/2411.05361;
5)-- AudioLLM:https://arxiv.org/pdf/2411.07111;
6)WavChat-Survey:https://arxiv.org/abs/2411.13577;
7)SpeechLLM-Survey:https://arxiv.org/pdf/2410.18908v2;
8)VoiceBench:https://arxiv.org/pdf/2410.17196;
9)SPIRIT LM:https://arxiv.org/pdf/2402.05755;
10)DiVA:https://arxiv.org/pdf/2410.02678;
11)SpeechEmotionLlama:https://arxiv.org/pdf/2410.01162;
12)SpeechLM-Survey:https://arxiv.org/pdf/2410.03751;
13)MMAU:https://arxiv.org/pdf/2410.19168;
14)SALMon:https://arxiv.org/abs/2409.07437;
15)EMOVA:https://arxiv.org/pdf/2409.18042;
16)Moshi:https://arxiv.org/pdf/2410.00037;
17)LLaMA-Omni:https://arxiv.org/pdf/2409.06666v1;
18)Ultravox:https://github.com/fixie-ai/ultravox;
19)MoWE-Audio:https://arxiv.org/pdf/2409.06635;
20)AudioBERT:https://arxiv.org/pdf/2409.08199;
21)DeSTA2:https://arxiv.org/pdf/2409.20007;
22)ASRCompare:https://arxiv.org/pdf/2409.00800v1;
23)MooER:https://arxiv.org/pdf/2408.05101;
24)MuChoMusic:https://arxiv.org/abs/2408.01337;
25)Mini-Omni:https://arxiv.org/pdf/2408.16725;
26)FunAudioLLM:https://arxiv.org/pdf/2407.04051v3;
27)Qwen2-Audio:https://arxiv.org/pdf/2407.10759;
28)GAMA:https://arxiv.org/abs/2406.11768;
29)LLaST:https://arxiv.org/pdf/2407.15415;
30)Decoder-only LLMs for STT:https://arxiv.org/pdf/2407.03169;
31)AudioEntailment:https://arxiv.org/pdf/2407.18062;
32)CompA:https://arxiv.org/abs/2310.08753;
33)DeSTA:https://arxiv.org/abs/2406.18871;
34)Audio Hallucination:https://arxiv.org/pdf/2406.08402;
35)CodecFake:https://arxiv.org/abs/2406.07237;
36)SD-Eval:https://arxiv.org/pdf/2406.13340;
37)Speech ReaLLM:https://arxiv.org/pdf/2406.09569;
38)AudioBench:https://arxiv.org/abs/2406.16020;
39)AIR-Bench:https://aclanthology.org/2024.acl-long.109/;
40)Audio Flamingo:https://arxiv.org/abs/2402.01831;
41)VoiceJailbreak:https://arxiv.org/pdf/2405.19103;
42)SALMONN:https://arxiv.org/pdf/2310.13289.pdf;
43)WavLLM:https://arxiv.org/pdf/2404.00656;
44)AudioLM-Survey:https://arxiv.org/abs/2402.13236;
45)SLAM-LLM:https://arxiv.org/pdf/2402.08846;
46)Pengi:https://arxiv.org/pdf/2305.11834.pdf;
47)Qwen-Audio:https://arxiv.org/pdf/2311.07919.pdf;
48)CoDi-2:https://arxiv.org/pdf/2311.18775;
49)UniAudio:https://arxiv.org/abs/2310.00704;
50)Dynamic-SUPERB:https://arxiv.org/abs/2309.09510;
51)LLaSM:https://arxiv.org/pdf/2308.15930.pdf;
52)Segment-level Q-Former:https://arxiv.org/pdf/2309.13963;
53)Prompting LLMs with Speech Recognition:https://arxiv.org/pdf/2307.11795;
54)Macaw-LLM:https://arxiv.org/pdf/2306.09093;
55)SpeechGPT:https://arxiv.org/pdf/2305.11000.pdf;
56)AudioGPT:https://arxiv.org/pdf/2304.12995.pdf
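To make the list concrete, here is the promised minimal usage sketch for one entry, Qwen2-Audio (entry 27 above), through its Hugging Face transformers integration. This is a sketch under assumptions, not an official recipe: it assumes a recent transformers release with Qwen2-Audio support plus librosa are installed, and sample.wav is a placeholder for your own audio file.

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Load the instruct-tuned checkpoint and its paired processor.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

# One chat turn pairing an audio clip with a text question.
conversation = [{"role": "user", "content": [
    {"type": "audio", "audio_url": "sample.wav"},   # placeholder file
    {"type": "text", "text": "What is being said in this audio?"},
]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the waveform at the sampling rate the feature extractor expects.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens, keep only the newly generated reply.
reply = processor.batch_decode(generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(reply)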
In addition, a library of free sound-effect assets, about 1,800 sounds, useful for speech synthesis and video sound design: https://taira-komori.jpn.org/freesoundcn.html
Worth bookmarking.
II. Open-Source Projects for Digital Humans, Speech-to-Text, and Text-to-Speech
1. 7 Open-Source Digital-Human Projects (a minimal lip-sync sketch follows the list)
1)Fay:https://github.com/xszyou/Fay
2)Sadtalker:https://sadtalker.github.io/,https://modelscope.cn/studios/CVstudio/cv_human_portrait
3)Hallo:https://fudan-generative-vision.github.io/hallo/#/,https://modelscope.cn/studios/AI-ModelScope/Hallo
4)EchoMimic / EchoMimic_v2:https://badtobest.github.io/echomimic,https://modelscope.cn/studios/BadToBest/BadToBest,https://antgroup.github.io/ai/echomimic_v2/,https://github.com/antgroup/echomimic_v2
5)Wav2Lip:https://github.com/Rudrabha/Wav2Lip
6)MuseTalk:https://github.com/TMElyralab/MuseTalk
7)LivePortrait:https://github.com/KwaiVGI/LivePortrait
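As referenced above, here is a minimal lip-sync sketch driving item 5, Wav2Lip, through its command-line inference script. The assumptions are labeled in the comments: the Wav2Lip repo is cloned into ./Wav2Lip with its dependencies installed, the pretrained wav2lip_gan.pth checkpoint has been downloaded into its checkpoints/ directory, and host.mp4 / speech.wav are placeholder inputs of your own.

import subprocess

# Run Wav2Lip's inference script on a face video plus a driving audio track.
# The repo writes the synced result to results/result_voice.mp4 by default.
subprocess.run(
    ["python", "inference.py",
     "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained lip-sync weights
     "--face", "../host.mp4",      # source talking-head video (a still image also works)
     "--audio", "../speech.wav"],  # audio the lips should follow
    cwd="Wav2Lip", check=True)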
2. 23 Speech-to-Text Projects (a minimal transcription sketch follows the list)
1)Moonshine:https://github.com/usefulsensors/moonshine,https://hf-mirror.com/UsefulSensors/moonshine,https://arxiv.org/abs/2410.15608
2)Paraformer:https://github.com/modelscope/FunASR,https://arxiv.org/abs/2206.08317,https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
3)Whisper-large-v3:https://hf-mirror.com/openai/whisper-large-v3
4)SenseVoice:https://github.com/FunAudioLLM/SenseVoice,https://fun-audio-llm.github.io/pdf/FunAudioLLM.pdf,https://fun-audio-llm.github.io/,https://www.modelscope.cn/studios/iic/SenseVoice
5)Whisper-turbo:https://hf-mirror.com/openai/whisper-large-v3-turbo
6)Qwen2-Audio:https://github.com/QwenLM/Qwen2-audio,https://arxiv.org/abs/2407.10759,https://qwenlm.github.io/blog/qwen2-audio,https://hf-mirror.com/Qwen/Qwen2-Audio-7B-Instruct,https://www.modelscope.cn/studios/qwen/Qwen2-Audio-Instruct-Demo
7)FunASR:https://github.com/alibaba/FunASR
8)ESPnet:https://github.com/espnet/espnet
9)DeepSpeech:https://github.com/mozilla/DeepSpeech,https://deepspeech.readthedocs.io/en/r0.9/,https://linux.cn/article-14233-1.html
10)PaddleSpeech:https://github.com/PaddlePaddle/PaddleSpeech
11)MASR:https://github.com/nobody132/masr,https://blog.csdn.net/HELLOWORLD2424/article/details/12366787
12)SpeechBrain:https://github.com/speechbrain/speechbrain
13)WeNet:https://github.com/wenet-e2e/wenet,https://arxiv.org/abs/2203.15455
14)ASRT:https://github.com/nl8590687/ASRT_SpeechRecognition
15)Massively Multilingual Speech:https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/,https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md,https://ai.meta.com/blog/multilingual-model-speech-recognition
16)OpenSeq2Seq:https://github.com/NVIDIA/OpenSeq2Seq
17)Vosk:https://github.com/alphacep/vosk-api
18)TensorFlow ASR:https://github.com/TensorSpeech/TensorFlowASR
19)Athena:https://github.com/athena-team/athena
20)Flashlight ASR:https://github.com/flashlight/wav2letter
21)Reverb:https://github.com/revdotcom/reverb/tree/main/asr
22)Kaldi:https://github.com/kaldi-asr/kaldi,https://kaldi-asr.org/
23)Coqui STT:https://github.com/coqui-ai/STT
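As referenced above, a minimal transcription sketch covering two entries from the list: Whisper-large-v3 (entry 3) via the Hugging Face transformers pipeline, and Paraformer (entry 2) via FunASR's AutoModel. Assumptions: the transformers and funasr packages are installed, and speech.wav is a placeholder for your own recording.

from transformers import pipeline
from funasr import AutoModel

# Whisper-large-v3: strong multilingual general-purpose ASR.
whisper = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
print(whisper("speech.wav")["text"])

# Paraformer: non-autoregressive ASR, particularly strong on Mandarin.
paraformer = AutoModel(model="paraformer-zh")
result = paraformer.generate(input="speech.wav")  # returns a list of result dicts
print(result[0]["text"])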
3. 29 Open-Source Text-to-Speech (TTS) Tools (a minimal synthesis sketch follows the list)
1)TTS Maker:https://ttsmaker.com/zh-cn
2)Microsoft Azure:https://azure.microsoft.com/en-us/products/cognitive-services/text-to-speech/
3)PaddleSpeech:https://github.com/PaddlePaddle/PaddleSpeech
4)VoiceVox:https://github.com/VOICEVOX/voicevox
5)TensorFlowTTS:https://github.com/TensorSpeech/TensorFlowTTS
6)TTSKit:https://github.com/kuangdd/ttskit
7)OpenTTS:https://github.com/synesthesiam/opentts
8)eSpeak NG:https://github.com/espeak-ng/espeak-ng
9)F5-TTS:https://github.com/SWivid/F5-TTS,https://huggingface.co/SWivid/F5-TTS,https://arxiv.org/pdf/2410.06885,https://huggingface.co/spaces/mrfakename/E2-F5-TTS
10)Edge-TTS:https://github.com/rany2/edge-tts
11)ChatTTS:https://github.com/2noise/ChatTTS
12)ChatTTS-ui:https://github.com/jianchang512/ChatTTS-ui
13)Seed-TTS:https://bytedancespeech.github.io/seedtts_tech_report/,https://arxiv.org/pdf/2406.02430,https://github.com/BytedanceSpeech/seed-tts-eval/
14)Fish Speech:https://github.com/fishaudio/fish-speech,https://fish.audio/zh-CN/
15)GPT-SoVITS:https://github.com/RVC-Boss/GPT-SoVITS
16)OpenVoice:https://github.com/myshell-ai/OpenVoice,https://arxiv.org/pdf/2312.01479.pdf
17)Parler-TTS:https://github.com/huggingface/parler-tts
18)FunAudioLLM-CosyVoice:https://github.com/FunAudioLLM/CosyVoice
19)VoiceCraft:https://github.com/jasonppy/VoiceCraft
20)EmotiVoice:https://github.com/netease-youdao/EmotiVoice
21)MetaVoice-1B:https://github.com/metavoiceio/metavoice-src
22)Voice Engine:https://ai-bot.cn/openai-voice-engine/
23)Bark:https://github.com/suno-ai/bark
24)MaskGCT:https://hf-mirror.com/amphion/MaskGCT
25)Coqui TTS:https://github.com/coqui-ai/tts,https://huggingface.co/spaces/coqui/xtts,https://tts.readthedocs.io/en/dev/models/xtts.html
26)So-VITS-SVC:https://github.com/svc-develop-team/so-vits-svc
27)MockingBird:https://github.com/babysor/MockingBird,https://www.bilibili.com/video/BV17Q4y1B7mY
28)Real-Time-Voice-Cloning:https://github.com/CorentinJ/Real-Time-Voice-Cloning
29)voice-pro:https://github.com/abus-aikorea/voice-pro
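As referenced above, a minimal synthesis sketch with item 10, Edge-TTS, which calls Microsoft Edge's online neural voices and therefore needs no local model or GPU. Assumptions: the edge-tts package is installed (pip install edge-tts); the voice name and output path are placeholders you can swap.

import asyncio
import edge_tts

async def synthesize() -> None:
    # Pick any voice listed by `edge-tts --list-voices`; here a Mandarin one.
    tts = edge_tts.Communicate("你好，欢迎收听今天的语音早报。", voice="zh-CN-XiaoxiaoNeural")
    await tts.save("hello.mp3")  # streams the synthesized audio to an MP3 file

asyncio.run(synthesize())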
References
1. https://github.com/AudioLLMs/Awesome-Audio-LLM
About Us
Lao Liu, an NLP open-source enthusiast and practitioner. Homepage: https://liuhuanyong.github.io.
If you are interested in large models, knowledge graphs, RAG, and document understanding, and in the daily morning report, the archive of past Lao Liu Says NLP online talks, and exchanging ideas, you are welcome to join the community, which is continuously accepting new members.
To become a member: follow the official account and, in the backend menu, click Member Community -> Join the Member Group.