In the field of artificial intelligence, multimodal technology is rising fast and is poised to drive the next major leap forward. Looking at the trajectory of generative AI in 2024, I will make a bold prediction: multimodal models will go mainstream before the end of the year, and may well become the industry's new normal.
Traditional AI systems are often limited to a single data type, for example parsing text only. Yet the world around us is multidimensional. Humans perceive and understand their environment through multiple senses (sight, hearing, touch, and so on), and the goal of multimodal AI is to mimic this richer way of perceiving.
Ideally, a multimodal AI should accept a combination of data types (text, images, video, audio, and more) and process them within a single, unified model. This capability greatly broadens the adaptability and range of applications of generative AI, and makes interaction between people and machines more natural and efficient.
Large language models (LLMs), powerful as they are, have well-known limitations such as a finite context window and a knowledge cutoff date. One way to work around these constraints is a technique called Retrieval-Augmented Generation (RAG). The process consists of two core steps: retrieval and generation.
First, in response to a user query, RAG retrieves the most relevant information from the available data sources. That data is then used to help the LLM generate a more accurate, context-aware answer for the user.
The vector database plays a key role in this process. By storing embeddings of text, images, audio, video, and other data types, it enables efficient multimodal retrieval. This ensures the LLM not only understands the user query but can also dynamically access the most relevant multimodal data, producing richer and more complete responses.
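To make the two steps concrete before diving into the full pipeline, here is a toy sketch with made-up two-dimensional "embeddings" (everything in it is illustrative); the real pipeline below replaces each piece with ImageBind/OpenAI embeddings, KDB.AI search, and an LLM call.
# Toy sketch of the two RAG steps (all data here is made up for illustration):
# 1) retrieval picks the stored snippet most similar to the query embedding,
# 2) generation hands that snippet to an LLM as context.
import numpy as np
docs = ["Deer have antlers.", "Penguins live in Antarctica."]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])      # pretend document embeddings
query_vec = np.array([0.9, 0.1])                   # pretend embedding of "animal with antlers"
best = docs[int(np.argmax(doc_vecs @ query_vec))]  # retrieval: nearest by similarity
prompt = f"Answer the question using this context:\n{best}"  # generation: context for the LLM
print(prompt)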
In this article we walk through the two key stages of multimodal RAG: retrieval and generation. First, we introduce two vector-database-backed approaches to multimodal retrieval and explain how to store and retrieve text and image data efficiently. Then, in the generation stage, we show how to combine the user query and the retrieved data with large language models (LLMs) to produce accurate, context-relevant responses.
By the end of both stages you will have a complete picture of how multimodal RAG brings its unique strengths to bear on complex tasks.
Our goal is to embed images and text into a unified vector space so that a single vector search can span both media types. To do this, we embed the data (convert it into numeric vector representations) and store it in the KDB.AI vector database. There are several ways to achieve this; today we will explore two main approaches:
- Use a multimodal embedding model to embed both text and images.
- Use a multimodal LLM to summarize the images, then pass the summaries and the text data to a text embedding model (such as OpenAI's "text-embedding-3-small").
In the remainder of this article we combine theory and code, using a dataset of animal images and descriptions, to show how to implement both multimodal retrieval methods. The resulting system can take a user query and return the relevant image and text data, providing a powerful retrieval mechanism for a multimodal retrieval-augmented generation (RAG) application.
Method 1: Embed Text and Images with a Multimodal Embedding Model
A multimodal embedding model maps text, images, and other data types into a single shared vector space, enabling vector similarity search across modalities. This greatly improves the flexibility and efficiency of the KDB.AI vector database when handling multimodal data.
The retrieved embeddings can then be passed to a multimodal large language model (LLM), helping it generate more accurate, context-relevant responses and closing the loop of the retrieval-augmented generation (RAG) pipeline.
ImageBind GitHub repository: https://github.com/facebookresearch/ImageBind
!git clone https://github.com/facebookresearch/ImageBind

import os
os.chdir('./ImageBind')
!pip install .

import pandas as pd
import PIL
from PIL import Image
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Instantiate the ImageBind model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)
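Before embedding the whole dataset, a quick optional sanity check (illustrative, assuming the model loaded above) confirms the embedding dimensionality; the flat index created later uses dims=1024 to match.
# Optional sanity check: ImageBind "huge" returns 1024-dimensional embeddings
sample_inputs = {ModalityType.TEXT: data.load_and_transform_text(["a brown deer"], device)}
with torch.no_grad():
    print(model(sample_inputs)[ModalityType.TEXT].shape)  # expected: torch.Size([1, 1024])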
# Helper functions to create embeddings
def getEmbeddingVector(inputs):
    with torch.no_grad():
        embedding = model(inputs)
    for key, value in embedding.items():
        vec = value.reshape(-1)
        # Move to CPU before converting to NumPy (needed when running on GPU)
        vec = vec.cpu().numpy()
        return(vec)
def dataToEmbedding(dataIn, dtype):
    if dtype == 'image':
        data_path = [dataIn]
        inputs = {
            ModalityType.VISION: data.load_and_transform_vision_data(data_path, device)
        }
    elif dtype == 'text':
        txt = [dataIn]
        inputs = {
            ModalityType.TEXT: data.load_and_transform_text(txt, device)
        }
    vec = getEmbeddingVector(inputs)
    return(vec)
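As a quick illustration of the shared vector space (the image file name below is assumed), the same helper can embed an image and a sentence into vectors of identical shape, so they can be compared directly.
# Illustrative usage of the helper: one image and one piece of text,
# both landing in the same 1024-dimensional ImageBind space
img_vec = dataToEmbedding("./data/images/deer.jpg", "image")   # file name assumed
txt_vec = dataToEmbedding("A brown deer with large antlers.", "text")
print(img_vec.shape, txt_vec.shape)  # expected: (1024,) (1024,)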
# Helper function to read the text from a file
def read_text_from_file(filename):
    try:
        # Open the file in read mode ('r')
        with open(filename, 'r') as file:
            # Read the contents of the file into a string
            text = file.read()
            return text
    except IOError as e:
        # Handle any I/O errors
        print(f"An error occurred: {e}")
        return None
#Define a dataframe to put our embeddings and metadata into
#this will later be used to load our vector database
columns = ['path','media_type','embeddings']
df = pd.DataFrame(columns=columns)
#Get a list of paths for images, text
images = os.listdir("./data/images")
texts = os.listdir("./data/text")
#loop through images, append a row in the dataframe containing each
# image's path, media_type (image), and embeddings
for image in images:
    path = "./data/images/" + image
    media_type = "image"
    embedding = dataToEmbedding(path, media_type)
    new_row = {'path': path,
               'media_type': media_type,
               'embeddings': embedding}
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
#loop through texts, append a row in the dataframe containing each
# text's path, media_type (text), and embeddings
for text in texts:
    path = "./data/text/" + text
    media_type = "text"
    txt_file = read_text_from_file(path)
    embedding = dataToEmbedding(txt_file, media_type)
    new_row = {'path': path,
               'media_type': media_type,
               'embeddings': embedding}
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
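Optionally, inspect the dataframe before loading it into the vector database; each row should carry a path, a media type, and a 1024-dimensional embedding.
# Optional: quick look at what will be inserted into KDB.AI
print(df.head())
print(len(df['embeddings'].iloc[0]))  # expected: 1024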
import os
from getpass import getpass
import kdbai_client as kdbai
import time
#Set up KDB.AI endpoint and API key
#Go to kdb.ai to sign up for free if you don't already have an account!
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
#connect to KDB.AI
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)
# Connect to the default database in KDB.AI
db = session.database('default')
#Set up a table with three columns: path, media_type, and embeddings
table_schema = [
    {"name": "path", "type": "str"},
    {"name": "media_type", "type": "str"},
    {"name": "embeddings", "type": "float64s"},
]

# Define the index (flat index on the embeddings column,
# 1024 dimensions to match ImageBind, cosine similarity metric)
indexes = [
    {
        'type': 'flat',
        'name': 'flat_index',
        'column': 'embeddings',
        'params': {'dims': 1024, 'metric': "CS"},
    },
]
Make sure no table with the same name already exists, then create a table called "multi_modal_ImageBind":
# First ensure the table does not already exist
try:
    db.table("multi_modal_ImageBind").drop()
except kdbai.KDBAIException:
    pass

# Create the table called "multi_modal_ImageBind"
table = db.create_table(table="multi_modal_ImageBind", schema=table_schema, indexes=indexes)
Now let's load our data into the KDB.AI table!
#Insert the data into the table, split into 2000 row batches
from tqdm import tqdm
n = 2000  # chunk row size
for i in tqdm(range(0, df.shape[0], n)):
    table.insert(df[i:i+n].reset_index(drop=True))
# Helper function to create a query vector from a natural language query
def QuerytoEmbedding(text):
    text = [text]
    inputs = {
        ModalityType.TEXT: data.load_and_transform_text(text, device)
    }
    vec = getEmbeddingVector(inputs)
    return(vec)
# Helper function to view the results of our similarity search
def viewResults(results):
    for index, row in results[0].iterrows():
        if row["media_type"] == 'image':
            image = Image.open(row["path"])
            display(image)
        elif row["media_type"] == 'text':
            text = read_text_from_file(row["path"])
            print(text)
# Multimodal search function, identifies the most relevant images and text within the vector database
def mm_search(query):
    image_results = table.search(vectors={"flat_index": query}, n=2, filter=[("like", "media_type", "image")])
    text_results = table.search(vectors={"flat_index": query}, n=1, filter=[("like", "media_type", "text")])
    results = [pd.concat([image_results[0], text_results[0]], ignore_index=True)]
    viewResults(results)
    return(results)
Let's run a similarity search to retrieve the data most relevant to our query:
#Create a query vector to do similarity search against
query_vector = [QuerytoEmbedding("brown animal with antlers").tolist()]
#Execute a multimodal similarity search
results = mm_search(query_vector)
Method 2: Summarize Images with a Multimodal LLM, Then Embed with a Text Embedding Model
In this second approach we use GPT-4o to generate a text summary of each image, embed those summaries (along with the raw text files) using OpenAI's "text-embedding-3-small", and store everything in KDB.AI.
import base64
import openai  # assumes the OPENAI_API_KEY environment variable is set
# Helper function to convert a file to base64 representation
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
# Takes in a base64 encoded image and prompt (requesting an image summary)
# Returns a response from the LLM (image summary)
def image_summarize(img_base64, prompt):
    ''' Image summary '''
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{img_base64}",
                        },
                    },
                ],
            }
        ],
        max_tokens=150,
    )
    content = response.choices[0].message.content
    return content
# Embed text with OpenAI's "text-embedding-3-small" model (OpenAI Python SDK v1.x API)
def TexttoEmbedding(text):
    embeddings = openai.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = embeddings.data[0].embedding
    return(embedding)
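A quick optional check: "text-embedding-3-small" produces 1536-dimensional vectors, which is why the index for this method is created with dims=1536 below.
# Optional sanity check of the embedding dimensionality
print(len(TexttoEmbedding("a brown deer with antlers")))  # expected: 1536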
#Define a dataframe to put our embeddings and metadata into
#this will later be used to load our vector database
columns = ['path','media_type','text','embeddings']
df = pd.DataFrame(columns=columns)
#Get a list of paths for images, text
images = os.listdir("./data/images")
texts = os.listdir("./data/text")
# Embed texts, store relevant info in data frame
for text in texts:
    path = "./data/text/" + text
    media_type = "text"
    text1 = read_text_from_file(path)
    embedding = TexttoEmbedding(text1)
    new_row = {'path': path,
               'media_type': 'text',
               'text': text1,
               'embeddings': embedding}
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
# Encode images with base64 encoding,
# Get text summary for encoded images,
# Embed summary, store relevant info in data frame
for image in images:
    path = "./data/images/" + image
    media_type = "image"
    base64_image = encode_image(path)
    prompt = "Describe the image in detail."
    summarization = image_summarize(base64_image, prompt)
    embedding = TexttoEmbedding(summarization)
    new_row = {'path': path,
               'media_type': media_type,
               'text': summarization,
               'embeddings': embedding}
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
database = session.database("default")
# Define table schema for our table with columns for path, media_type, text, and embeddings
table_schema = [
    {"name": "path", "type": "str"},
    {"name": "media_type", "type": "str"},
    {"name": "text", "type": "str"},
    {"name": "embeddings", "type": "float64s"}
]

# Define the index that will be applied to the embeddings column defined above
indexes = [
    {
        "name": "flat_index",
        "column": "embeddings",
        "type": "flat",
        "params": {"dims": 1536, "metric": "CS"}
    }
]
# First ensure the table does not already exist
try:
    database.table("multi_modal_demo").drop()
except kdbai.KDBAIException:
    pass

# Create the table called "multi_modal_demo"
table = database.create_table("multi_modal_demo", schema=table_schema, indexes=indexes)
Let's populate our table:
#Insert the data into the table, split into 2000 row batches
from tqdm import tqdm
n = 2000  # chunk row size
for i in tqdm(range(0, df.shape[0], n)):
    table.insert(df[i:i+n].reset_index(drop=True))
# Embed a natural language query and search against the "flat_index" defined above
query_vector = [TexttoEmbedding("animals with antlers")]
results = table.search(vectors={"flat_index": query_vector}, n=3)
viewResults(results)
With retrieval covered, let's turn to the generation stage. End to end, a multimodal RAG pipeline looks like this:
- Data preprocessing: ingest the data, chunk or summarize it, and generate embeddings.
- Embedding storage: store the generated embeddings in a vector database for later retrieval.
- Semantic search: embed the user query as a vector and run a semantic similarity search against the data in the vector database.
- Data retrieval: retrieve the data most relevant to the query from the database; in a multimodal RAG scenario this can include both text and images.
- Data hand-off: pass the retrieved data to the large language model (LLM) together with the user query.
- Response generation: the LLM uses the retrieved data as context to generate a targeted, accurate answer for the user.
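Gemini's generate_content accepts a list that mixes strings and PIL images, so multimodal search results can be passed to it almost directly. The snippet below is a minimal sketch (not from the original notebook) of how retrieved_data_for_RAG might be assembled, assuming results holds a multimodal search result like the one returned by mm_search in Method 1.
# Minimal sketch (assumed): build the list passed to Gemini from search results plus the user's question
user_query = "Which animal in these results has antlers, and what does the text say about it?"
retrieved_data_for_RAG = [user_query]
for _, row in results[0].iterrows():
    if row["media_type"] == "image":
        retrieved_data_for_RAG.append(Image.open(row["path"]))            # PIL image part
    else:
        retrieved_data_for_RAG.append(read_text_from_file(row["path"]))   # plain text part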
import google.generativeai as genai
genai.configure(api_key = os.environ['GOOGLE_API_KEY'])
# Use Gemini Pro Vision model to handle multimodal inputs
vision_model = genai.GenerativeModel('gemini-1.5-flash')
# Generate a response by passing in a list containing:
# the user prompt and retrieved text & image data
response = vision_model.generate_content(retrieved_data_for_RAG)
print(response.text)
# Helper function to execute RAG, we pass in a list of retrieved data, and the prompt
# The function returns the LLM's response
def RAG(retrieved_data, prompt):
    messages = ""
    messages += prompt + "\n"
    if retrieved_data:
        for data in retrieved_data:
            messages += data + "\n"
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": messages},
                ],
            }
        ],
        max_tokens=300,
    )
    content = response.choices[0].message.content
    return content
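Here is a hypothetical end-to-end usage of the Method 2 pipeline (the query text is assumed): retrieve text rows, which include the GPT-4o image summaries, from the table and let the RAG helper answer with that context.
# Hypothetical usage: retrieve relevant text/summaries, then generate an answer with GPT-4o
user_query = "Which of these animals have antlers?"
hits = table.search(vectors={"flat_index": [TexttoEmbedding(user_query)]}, n=3)
retrieved_text = list(hits[0]["text"])   # image summaries and/or raw descriptions
print(RAG(retrieved_text, user_query))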
Integrating multiple data types, such as text and images, into LLMs significantly improves their ability to generate comprehensive responses. This integration lets AI understand complex queries better and provide answers that fit the context more closely. This article focused on the central role that vector databases such as KDB.AI play in multimodal retrieval systems: by coupling relevant multimodal data with the LLMs in a RAG application, they enable far more capable generation.
Although our focus today was images and text, additional data types (such as audio and video) can be integrated in the future, further extending where AI can be applied and opening up broader possibilities for new kinds of AI applications.
Multimodal RAG is not just a technical advance; it is also an important attempt to mimic human-like perception within AI. So, how do you see multimodal RAG evolving? Which applications are you most looking forward to? Let's discuss!