An Ultra-Lightweight Digital Human That Runs in Real Time on Your Phone


Project Overview

A digital human model that can run in real time on mobile devices. As far as I know, this is the first open-source digital human model that is this lightweight.


Train

Training your own digital human is easy. I will show you step by step.

Install PyTorch and other libs

conda create -n dh python=3.10
conda activate dh
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install mkl=2024.0
pip install opencv-python
pip install transformers
pip install numpy==1.23.5
pip install soundfile
pip install librosa
pip install onnxruntime
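After installation, a quick sanity check (a minimal sketch; whether CUDA is available depends on your machine) confirms the right PyTorch build is in place:

import torch

# Verify the PyTorch version and CUDA availability before training.
print(torch.__version__)          # expect 1.13.1
print(torch.cuda.is_available())  # True if the CUDA 11.7 build is working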

I have only run this on pytorch==1.13.1; other versions should also work.

Download the wenet encoder.onnx from https://drive.google.com/file/d/1e4Z9zS053JEWl6Mj3W9Lbc9GDtzHIg6b/view?usp=drive_link and put it in data_utils/.
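To verify the download, you can load the encoder with onnxruntime (a minimal sketch; the filename encoder.onnx and its placement follow the instructions above, but the input/output names depend on how the model was exported):

import onnxruntime as ort

# Load the downloaded wenet encoder and list its expected inputs/outputs.
sess = ort.InferenceSession("data_utils/encoder.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print("input:", inp.name, inp.shape)
for out in sess.get_outputs():
    print("output:", out.name, out.shape)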


Data preprocessing

Prepare your video; 3 to 5 minutes is enough. Make sure the person's full face is visible in every frame and the audio is clear without any noise, then put the video in a new folder. I will provide a demo video, taken from a broadcast by the anchor Kang Hui (it will be removed on request).
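If you want a rough automated check that a face is detectable in every frame (a sketch only, not part of the project: Haar detection is a loose proxy for "full face exposed", and your_video.mp4 is a placeholder):

import cv2

# Count frames where OpenCV's bundled frontal-face Haar cascade finds no face.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture("your_video.mp4")
total, missing = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if len(cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)) == 0:
        missing += 1
    total += 1
cap.release()
print(f"{missing} of {total} frames had no detectable face")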

First of all, we need to extract the audio features. I use two different extractors, from wenet and hubert; thanks to both teams for their great work.

When using wenet, you need to make sure your video frame rate is 20 fps; for hubert, it must be 25 fps.
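You can check the frame rate with OpenCV before picking an extractor (a minimal sketch; your_video.mp4 is a placeholder, and the suggested ffmpeg re-encode is one common way to fix a mismatch):

import cv2

# Read the frame rate so you know whether to use wenet (20 fps) or hubert (25 fps).
cap = cv2.VideoCapture("your_video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
print(f"video frame rate: {fps:.2f} fps")
if round(fps) == 25:
    print("use --asr hubert")
elif round(fps) == 20:
    print("use --asr wenet")
else:
    print("re-encode first, e.g.: ffmpeg -i your_video.mp4 -r 25 your_video_25fps.mp4")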

In my experiments, hubert performs better, but wenet is faster and can run in real time on mobile devices.

The other steps are handled by data_utils/process.py; just run it like this:

cd data_utils
python process.py YOUR_VIDEO_PATH --asr hubert

Then wait for it to finish.


Train

After the preprocessing step, you can start training the model.

Train a syncnet first for better results.

cd ..
python syncnet.py --save_dir ./syncnet_ckpt/ --dataset_dir ./data_dir/ --asr hubert

Then find the checkpoint with the lowest loss to train the digital human model.
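A hypothetical helper for picking that checkpoint, assuming syncnet.py embeds the loss in each filename (e.g. 5_0.123.pth; adapt the regex to whatever naming your run actually produces):

import os, re

# Scan the syncnet checkpoint directory and report the lowest-loss file.
# Assumes names like "<epoch>_<loss>.pth"; this naming is an assumption.
ckpt_dir = "./syncnet_ckpt/"
best = None
for name in os.listdir(ckpt_dir):
    m = re.match(r".*_([0-9.]+)\.pth$", name)
    if m:
        loss = float(m.group(1))
        if best is None or loss < best[0]:
            best = (loss, name)
if best:
    print(f"lowest loss {best[0]}: {os.path.join(ckpt_dir, best[1])}")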

cd ..
python train.py --dataset_dir ./data_dir/ --save_dir ./checkpoint/ --asr hubert --use_syncnet --syncnet_checkpoint syncnet_ckpt

Inference

Before running inference, you need to extract the test audio's features (I will merge this step into the inference step later). Run:

python data_utils/hubert.py --wav your_test_audio.wav  # when using hubert
or
python data_utils/wenet_infer.py your_test_audio.wav  # when using wenet

This gives you your_test_audio_hu.npy or your_test_audio_wenet.npy.
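A quick way to confirm the feature file looks sane (a sketch; the exact shape depends on the extractor and the audio length):

import numpy as np

# Inspect the extracted audio features before feeding them to inference.py.
feat = np.load("your_test_audio_hu.npy")
print("shape:", feat.shape, "dtype:", feat.dtype)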

Then run:

python inference.py --asr hubert --dataset ./your_data_dir/ --audio_feat your_test_audio_hu.npy --save_path xxx.mp4 --checkpoint your_trained_ckpt.pth

To merge the audio and the video, run:

ffmpeg -i xxx.mp4 -i your_audio.wav -c:v libx264 -c:a aac result_test.mp4

Project Link

http://github.com/anliyuan/Ultralight-Digital-Human
