一、优化被检索的知识内容
1、结构化处理:从无序到有序的转变
案例:将非结构化文本转化为表格
import pandas as pd
# 模拟医疗报告数据
data = {
'patient_id': [101, 102, 103],
'diagnosis': ['Diabetes', 'Hypertension', 'Coronary Artery Disease'],
'treatment': ['Insulin', 'Lisinopril', 'Aspirin'],
'doctor_notes': ['Patient responds well to treatment', 'Blood pressure needs monitoring', 'Recommend lifestyle changes']
}
# 转化为DataFrame
df = pd.DataFrame(data)
print(df)
结果:
patient_id diagnosis treatment doctor_notes
0 101 Diabetes Insulin Patient responds well to treatment
1 102 Hypertension Lisinopril Blood pressure needs monitoring
2 103 Coronary Artery Disease Aspirin Recommend lifestyle changes
2、标准化处理:确保数据一致性
案例:标准化处理时间格式
from datetime import datetime
# 原始事件记录
event_data = ['12-08-2021', '08/12/2021', '2021.08.12']
# 标准化处理
standardized_dates = [datetime.strptime(date, '%d-%m-%Y').strftime('%Y-%m-%d') for date in event_data]
print(standardized_dates)
结果:
['2021-08-12', '2021-08-12', '2021-08-12']
3、聚焦处理:通过业务场景缩小检索范围
案例:聚焦处理在医疗领域的应用
# 假设我们有一系列文档,其中部分与糖尿病有关
documents = [
"This research discusses the effects of insulin on diabetes treatment.",
"This paper explores hypertension treatment methods.",
"An analysis on the causes of coronary artery disease."
]
# 聚焦处理,筛选出与糖尿病相关的文档
focused_docs = [doc for doc in documents if "diabetes" in doc.lower()]
print(focused_docs)
结果:
['This research discusses the effects of insulin on diabetes treatment.']
二、专注于生成(G)的能力提升
1、模型微调(Finetune):增强领域适应性
案例:对BERT进行微调以应对医疗问答
from transformers import BertForQuestionAnswering, BertTokenizer, Trainer, TrainingArguments
# 加载预训练的BERT模型和tokenizer
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# 假设我们有医疗领域的问答数据集
train_dataset = ... # 数据加载代码省略
# 微调模型
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=16,
save_steps=10_000,
save_total_limit=2,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
trainer.train()
2、多轮对话:在上下文中提升答案质量
案例:基于对话的RAG问答系统,以下是一个简单的多轮对话流程示例:
from transformers import Conversation, ConversationalPipeline
# 假设我们使用一个训练好的生成模型
pipeline = ConversationalPipeline(model=model)
# 创建对话
conversation = Conversation("What are the symptoms of diabetes?")
conversation.add_user_input("How is it diagnosed?")
# 模型生成答案
response = pipeline(conversation)
print(response)
3、Prompt Engineering:优化输入提示
案例:Prompt设计在生成问答中的应用
prompt = "Based on the research papers on diabetes treatment, explain the role of insulin and cite the relevant sources."
response = model.generate(prompt)
print(response)
三、专注于检索(R)的能力提升
1、选择合适的Embedding和Rank模型:精准语义匹配
案例:使用Sentence-BERT进行语义检索
from sentence_transformers import SentenceTransformer, util
# 加载Sentence-BERT模型
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# 知识库文档
documents = ["Insulin is important for diabetes treatment.",
"Hypertension is treated with Lisinopril.",
"Aspirin is used for coronary artery disease."]
# 用户问题
query = "What is used to treat diabetes?"
# 将文档和查询向量化
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)
# 计算相似度
similarities = util.cos_sim(query_embedding, doc_embeddings)
most_similar_doc = documents[similarities.argmax()]
print(most_similar_doc)
结果:
'Insulin is important for diabetes treatment.'
2、引入Rerank与精排机制:提高检索精度
案例:简单的Rerank机制
# 初步检索结果
documents = [
{"text": "Older study on insulin", "date": "2010", "citations": 50},
{"text": "Recent study on insulin", "date": "2022", "citations": 10},
]
# 基于业务规则重新排序
reranked_docs = sorted(documents, key=lambda x: x['date'], reverse=True)
print(reranked_docs)
结果:
[{'text': 'Recent study on insulin', 'date': '2022', 'citations': 10}, {'text': 'Older study on insulin', 'date': '2010', 'citations': 50}]
3、问题改写(Rewrite):提升问题的表达清晰度
案例:问题改写提高检索精度
from transformers import pipeline
# 使用一个简单的问答改写模型
question_rewriter = pipeline("text2text-generation", model="t5-small")
# 用户问题
original_question = "How can I manage diabetes?"
# 改写问题
rewritten_question = question_rewriter(original_question)
print(rewritten_question)
通过改写,系统可能将问题重构为更具搜索指向性的问题,如“Effective methods to manage diabetes”。这有助于系统更好地找到相关文档。
4、自行判断问题可回答性:提升模型的准确性
案例:基于置信度的可回答性判断
# 假设我们通过生成模型计算置信度
def can_answer(query, docs):
# 模拟模型返回的置信度
confidence = model.predict_confidence(query, docs)
return confidence > 0.5
# 判断是否能回答
if can_answer("What is diabetes?", documents):
print("Generating answer...")
else:
print("Unable to answer the question.")