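I wrote about Bark before in [AIGC Learning] Bark Text-To-Speech. At first the tool could only generate audio of up to about 13 seconds, but last month the team published an update designed specifically for long-form audio generation.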
https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb
Before we start, we need to install the required environment.
#@title Install the environment - run this cell no matter which audio you generate
! pip install git+https://github.com/suno-ai/bark.git
import os
# Select the GPU before bark (and therefore torch) is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import nltk  # we'll use this to split into sentences
import numpy as np
from IPython.display import Audio
from bark import SAMPLE_RATE, generate_audio
from bark.api import semantic_to_waveform
from bark.generation import (
    generate_text_semantic,
    preload_models,
)
nltk.download('punkt')  # tokenizer data used by nltk.sent_tokenize
preload_models()  # download and cache the model weights
I tried a few long-form examples:
#@title Generate long-form audio
speaker = "v2/en_speaker_6"
script = """
Hey, have you heard about this new text-to-audio model called "Bark"?
Apparently, it's the most realistic and natural-sounding text-to-audio model
out there right now. People are saying it sounds just like a real person speaking.
I think it uses advanced machine learning algorithms to analyze and understand the
nuances of human speech, and then replicates those nuances in its own speech output.
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts.
In fact, I heard that some publishers are already starting to use Bark to create audiobooks.
It would be like having your own personal voiceover artist. I really think Bark is going to
be a game-changer in the world of text-to-audio technology! [end]
""".replace("\n", " ").strip()
sentences = nltk.sent_tokenize(script)
GEN_TEMP = 0.6
silence = np.zeros(int(0.1 * SAMPLE_RATE))
pieces = []
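for sentence in sentences:
    # Step 1: text -> semantic tokens, one sentence at a time
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=speaker,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # this controls how likely the generation is to end
    )
    # Step 2: semantic tokens -> waveform, using the same speaker prompt
    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=speaker)
    # Add a short pause after every sentence
    pieces += [audio_array, silence.copy()]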
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
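Audio(...) only plays the result inside the notebook. To keep the clip, the concatenated array can also be written to a WAV file. The following is a minimal sketch and not part of the Bark notebook: `save_wav` is a hypothetical helper, the 24 kHz rate is assumed to match `bark.SAMPLE_RATE`, and two sine tones stand in for generated sentences so the block runs without loading the models.

```python
import wave

import numpy as np

SAMPLE_RATE = 24_000  # assumption: same value as bark.SAMPLE_RATE


def save_wav(path, audio, rate=SAMPLE_RATE):
    """Write a float waveform in [-1, 1] to a 16-bit mono WAV file (hypothetical helper)."""
    clipped = np.clip(audio, -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")  # little-endian 16-bit samples
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / rate  # duration in seconds


# Stand-ins for two generated sentences: 0.5 s sine tones instead of speech.
t = np.arange(int(0.5 * SAMPLE_RATE)) / SAMPLE_RATE
silence = np.zeros(int(0.1 * SAMPLE_RATE))
pieces = []
for freq in (220.0, 440.0):
    pieces += [0.3 * np.sin(2 * np.pi * freq * t), silence.copy()]

duration = save_wav("long_form_demo.wav", np.concatenate(pieces))
print(f"wrote {duration:.2f} s of audio")
```

In the notebook, the same helper could be called with `np.concatenate(pieces)` from the loop above in place of the sine tones.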
Here is how the audio turned out:
We can also generate a long dialogue:
#@title Generate a long dialogue
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_6"}
script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?
John: No, I haven't. What's so special about it?
Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.
John: Wow, that sounds amazing. How does it work?
Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.
John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?
Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.
John: I can imagine. It would be like having your own personal voiceover artist.
Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script
pieces = []
silence = np.zeros(int(0.1*SAMPLE_RATE))
for line in script:
    # Split only on the first ": " so a colon inside the spoken text is kept
    speaker, text = line.split(": ", 1)
    audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
    pieces += [audio_array, silence.copy()]
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
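Splitting each line into a speaker and a text is the fragile step of this cell: the spoken text may itself contain a colon, and blank lines can creep into a longer script. As a rough sketch (the helper name `parse_dialogue` is made up for illustration and is not part of Bark), the parsing can be isolated and checked before any audio is generated:

```python
def parse_dialogue(script):
    """Turn 'Name: text' lines into (speaker, text) pairs (hypothetical helper)."""
    turns = []
    for raw_line in script.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between turns
        # partition splits on the first colon only, so later colons stay in the text
        speaker, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"missing 'Speaker:' prefix in line: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


demo = """
Samantha: Here is the plan: we record the intro first.

John: Sounds good.
"""
for speaker, text in parse_dialogue(demo):
    print(f"{speaker} -> {text}")
```

Each pair can then be looked up in `speaker_lookup` and passed to `generate_audio` exactly as in the loop above.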
Here is how the audio turned out:
Supported sound effects and prompt markers:
[laughter], [laughs], [sighs], [music], [gasps], [clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word
[MAN] and [WOMAN] for a male or a female voice
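A misspelled tag such as [laugh] is simply treated as more text by the model, so it can help to check a prompt before spending GPU time on it. The sketch below is only an illustration: `unknown_tags` is a hypothetical helper, the tag set is copied from the list above and may not be exhaustive, and Bark itself performs no such validation.

```python
import re

# Tags copied from the list above; the list may not be exhaustive.
KNOWN_TAGS = {
    "laughter", "laughs", "sighs", "music", "gasps", "clears throat",
    "MAN", "WOMAN",
}


def unknown_tags(prompt):
    """Return bracketed tags in the prompt that are not in KNOWN_TAGS (hypothetical helper)."""
    found = re.findall(r"\[([^\[\]]+)\]", prompt)
    return [tag for tag in found if tag not in KNOWN_TAGS]


prompt = "[WOMAN] I can't believe it worked [laughs] ... that is AMAZING [laugh]"
print(unknown_tags(prompt))
```

An empty result means every bracketed tag in the prompt appears in the list; anything returned is worth a second look before generating.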
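You can also change the language and the voice.
The results are impressive across the board, and Bark is released under the MIT license, so it is friendly to commercial use.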
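The voice is chosen through the `history_prompt` string, as with "v2/en_speaker_6" and "v2/en_speaker_9" earlier in this post. The sketch below assumes that presets follow the pattern "v2/<language>_speaker_<0-9>" and that the language codes listed are the ones Bark ships presets for; both the helper name `speaker_preset` and the code list are assumptions to verify against the Bark speaker library.

```python
# Assumption: language codes for which Bark ships "v2" speaker presets.
LANGUAGES = {"en", "zh", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr"}


def speaker_preset(language, index):
    """Build a history_prompt name such as 'v2/zh_speaker_3' (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if not 0 <= index <= 9:
        raise ValueError("speaker index must be between 0 and 9")
    return f"v2/{language}_speaker_{index}"


print(speaker_preset("zh", 3))
```

The returned string would be passed as `history_prompt` to `generate_audio`, with the text itself written in the matching language.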