01.
An Introduction to Embedding Vectors and Embedding Models
Embedding vectors are a core concept in artificial intelligence (AI): they represent complex, unstructured data (such as images, text, video, or audio files) as numerical vectors that machines can understand and process. These vectors capture the semantic meaning of the data and the relationships within it, helping AI models analyze, compare, and generate content more effectively. In natural language processing (NLP), words, sentences, or entire documents are converted into dense vectors, so algorithms can grasp not only the meaning of individual words but also the contextual relationships between them.
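To make this concrete, here is a minimal sketch (not part of the original tutorial) that embeds two sentences with OpenAI's embeddings API and compares them with cosine similarity; the model choice and the API key placeholder are assumptions.
# Minimal sketch: assumes the openai and numpy packages are installed and that
# "your-openai-api-key" is replaced with a real key.
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")

def embed(text, model="text-embedding-3-small"):
    # Each input string is mapped to a dense vector of floats.
    resp = client.embeddings.create(model=model, input=text)
    return np.array(resp.data[0].embedding)

a = embed("A cat is sleeping on the sofa.")
b = embed("A kitten naps on the couch.")

# Cosine similarity: semantically related sentences score close to 1.0.
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")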
02.
OpenAI's Embedding Models
Among the many embedding models available today, OpenAI's text-embedding-ada-002 model and the newer text-embedding-3-small and text-embedding-3-large models stand out. text-embedding-3-small and text-embedding-3-large were released on January 25, 2024, marking a notable advance in the AI field, and they build on the solid foundation laid by text-embedding-ada-002. The key characteristics of the three models are compared below (a short sketch for checking their output dimensions follows the list):
text-embedding-3-large: built for high-precision tasks that must capture subtle nuances of language. It outputs larger, 3,072-dimensional vectors that can encode detailed semantic information, making it well suited to demanding applications such as deep semantic search, advanced recommendation systems, and sophisticated text analysis.
text-embedding-3-small: strikes an excellent balance between performance and resource efficiency. It outputs 1,536-dimensional vectors and improves on the older text-embedding-ada-002 model while keeping the embeddings compact. It is a good fit for real-time applications or resource-constrained environments, because it delivers high accuracy at a modest computational cost.
text-embedding-ada-002: OpenAI's best embedding model before the two newer models arrived. It takes a balanced approach to a broad range of NLP tasks and outputs 1,536-dimensional vectors. Compared with its predecessors, it offers improved performance and high-quality embeddings suitable for applications including semantic search, classification, and clustering.
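A quick way to verify the output dimensions mentioned above is to request an embedding for the same text from each model and check the vector length. The sketch below assumes the openai package is installed and the API key placeholder is replaced.
# Minimal sketch: prints each model's output dimension.
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")
text = "Artificial intelligence was founded as an academic discipline in 1956."

for model in ("text-embedding-ada-002",
              "text-embedding-3-small",
              "text-embedding-3-large"):
    resp = client.embeddings.create(model=model, input=text)
    # Expected lengths: 1536, 1536, and 3072 respectively.
    print(model, len(resp.data[0].embedding))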
03.
Using OpenAI Embedding Models with Zilliz Cloud
This walkthrough relies on the following tools (an install command follows the list):
PyMilvus: the Python SDK for the Milvus vector database, which integrates seamlessly with models such as text-embedding-ada-002.
OpenAI library: the Python SDK provided by OpenAI.
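Both can be installed from PyPI; one typical setup (the exact command is a suggestion, and the pymilvus[model] extra pulls in the embedding-function helpers used in the code below) is:
pip install "pymilvus[model]" openai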
The overall workflow is:
1. Sign up for a Zilliz Cloud account.
2. Create a Serverless cluster in Zilliz Cloud and obtain the cluster's public endpoint and API key.
3. Create a collection in the cluster and insert your embedding vectors.
4. Run semantic searches over the stored vectors.
Generating vectors with text-embedding-ada-002 and inserting them into Zilliz Cloud for semantic search
from pymilvus.model.dense import OpenAIEmbeddingFunction
from pymilvus import MilvusClient

OPENAI_API_KEY = "your-openai-api-key"
ZILLIZ_PUBLIC_ENDPOINT = "your-zilliz-cluster-public-endpoint"
ZILLIZ_API_KEY = "your-zilliz-api-key"

# Embedding function backed by OpenAI's text-embedding-ada-002 model
ef = OpenAIEmbeddingFunction("text-embedding-ada-002", api_key=OPENAI_API_KEY)

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]
# Generate embeddings for documents
docs_embeddings = ef(docs)

queries = ["When was artificial intelligence founded",
           "Where was Alan Turing born?"]
# Generate embeddings for queries
query_embeddings = ef(queries)

# Connect to Zilliz Cloud with the cluster's public endpoint and API key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)

COLLECTION = "documents"
# Recreate the collection so the example starts from a clean state
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True)

# Insert each document together with its embedding vector
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run semantic search over the stored vectors
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])
Generating vectors with text-embedding-3-small and inserting them into Zilliz Cloud for semantic search
from pymilvus import model, MilvusClient

OPENAI_API_KEY = "your-openai-api-key"
ZILLIZ_PUBLIC_ENDPOINT = "your-zilliz-cluster-public-endpoint"
ZILLIZ_API_KEY = "your-zilliz-api-key"

# Embedding function backed by OpenAI's text-embedding-3-small model
ef = model.dense.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small",
    api_key=OPENAI_API_KEY,
)

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]
# Generate embeddings for documents
docs_embeddings = ef.encode_documents(docs)

queries = ["When was artificial intelligence founded",
           "Where was Alan Turing born?"]
# Generate embeddings for queries
query_embeddings = ef.encode_queries(queries)

# Connect to Zilliz Cloud with the cluster's public endpoint and API key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)

COLLECTION = "documents"
# Recreate the collection so the example starts from a clean state
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True)

# Insert each document together with its embedding vector
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run semantic search over the stored vectors
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])
Generating vectors with text-embedding-3-large and inserting them into Zilliz Cloud for semantic search
from pymilvus.model.dense import OpenAIEmbeddingFunction
from pymilvus import MilvusClient

OPENAI_API_KEY = "your-openai-api-key"
ZILLIZ_PUBLIC_ENDPOINT = "your-zilliz-cluster-public-endpoint"
ZILLIZ_API_KEY = "your-zilliz-api-key"

# Embedding function backed by OpenAI's text-embedding-3-large model
ef = OpenAIEmbeddingFunction("text-embedding-3-large", api_key=OPENAI_API_KEY)

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]
# Generate embeddings for documents
docs_embeddings = ef(docs)

queries = ["When was artificial intelligence founded",
           "Where was Alan Turing born?"]
# Generate embeddings for queries
query_embeddings = ef(queries)

# Connect to Zilliz Cloud with the cluster's public endpoint and API key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)

COLLECTION = "documents"
# Recreate the collection so the example starts from a clean state
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True)

# Insert each document together with its embedding vector
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run semantic search over the stored vectors
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])
04.
Other Popular Embedding Models
Voyage-large-2-instruct: an instruction-tuned embedding model optimized for tasks such as clustering, classification, and retrieval. It performs especially well on tasks that come with explicit instructions or queries.
Cohere Embed v3: designed to evaluate how relevant documents are to a query and the quality of their content. It improves document reranking and handles complex datasets effectively.
Google Gecko: a compact model that distills knowledge from large language models, delivering strong retrieval performance while remaining efficient.
Mxbai-embed-2d-large-v1: features a dual dimensionality-reduction strategy that cuts both the number of layers and the embedding dimensions, yielding a more compact model while preserving strong performance.
Nomic-embed-text-v1: an open-source, reproducible embedding model that emphasizes transparency and accessibility through open training code and data.
BGE-M3: supports more than 100 languages and excels at multilingual and cross-lingual retrieval. It can handle dense, multi-vector, and sparse vector search (see the sketch after this list).
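BGE-M3 is also exposed through PyMilvus. The sketch below is an assumption-laden illustration (it presumes the pymilvus[model] extra and its FlagEmbedding dependency are installed, and that the model weights download on first use) showing that a single encode call yields both dense and sparse outputs.
# Minimal sketch: load BGE-M3 locally and inspect its outputs.
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

bge_m3_ef = BGEM3EmbeddingFunction(model_name="BAAI/bge-m3", device="cpu", use_fp16=False)

docs = ["Artificial intelligence was founded as an academic discipline in 1956."]
embeddings = bge_m3_ef.encode_documents(docs)

# One call produces both representations.
print(list(embeddings.keys()))       # expected: ['dense', 'sparse']
print(len(embeddings["dense"][0]))   # dense vector dimension (1024 for bge-m3)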
05.
Summary
OpenAI's latest embedding models, text-embedding-3-small and text-embedding-3-large, represent a significant improvement over text-embedding-ada-002. These advanced models deliver better performance for tasks such as semantic search, real-time processing, and high-precision applications.
About the Author
Fendy Feng
Technical Marketing Writer at Zilliz