Crawl4AI is an open-source web crawler and data-extraction tool designed specifically for large language models (LLMs) and AI applications. It is fast, outputs LLM-friendly formats such as JSON, cleaned HTML, and Markdown, and supports crawling multiple URLs concurrently. It can extract all media tags (images, audio, video) and all internal and external links, lets you customize the user agent, capture page screenshots, and even execute custom JavaScript before crawling.
1. Overview
Crawl4AI aims to simplify web-page scraping and extraction, giving data scientists, researchers, developers, and AI application builders an efficient, convenient way to acquire data. It is optimized for LLM training and application scenarios, efficiently extracting high-quality data from a wide range of websites to supply LLMs with rich, high-quality training material.
2. Core Advantages
Open source and free: Crawl4AI is completely free and open source; users can freely use, modify, and distribute it.
LLM-friendly: Supports LLM-friendly output formats such as JSON, cleaned HTML, and Markdown, simplifying downstream data processing and model training.
Multi-URL support: Crawls multiple URLs concurrently, improving data-collection throughput.
Advanced extraction strategies: Offers multiple advanced extraction strategies, such as cosine clustering and LLM-based extraction, to help users extract exactly the data they need.
Rich features: Extracts and returns all media tags (images, audio, and video), all internal and external links, and page metadata.
Flexible configuration: Provides custom hooks for authentication, headers, and pre-crawl page modification, plus user-agent customization, page screenshots, and execution of multiple custom JavaScript scripts to cover diverse needs.
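The cosine-clustering strategy mentioned above groups text chunks by vector similarity. As a minimal, pure-Python sketch of the underlying idea (an illustration only, not Crawl4AI's actual implementation):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using bag-of-words count vectors."""
    vec_a, vec_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Chunks about the same topic score higher than unrelated ones
print(cosine_similarity("gpt-4 pricing per token", "pricing per token for gpt-4"))  # high
print(cosine_similarity("gpt-4 pricing per token", "company history and mission"))  # low
```

A clustering strategy would compute such similarities between page chunks (in practice over embedding vectors, not raw word counts) and group chunks whose similarity exceeds a threshold.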
3. Use Cases
Crawl4AI fits any scenario that requires quickly extracting large amounts of data from web pages, such as:
Data science research: Provides rich data sources for data scientists to support analysis, mining, and modeling.
AI model training: Supplies high-quality training data to AI application developers, helping improve model performance and accuracy.
Web data mining: Helps researchers mine valuable information from large numbers of web pages to support academic research.
4. Getting Started
Crawl4AI ships with detailed official documentation and an installation guide. To get up and running:
Visit the official site: Go to the Crawl4AI website or its documentation site for a detailed introduction and usage instructions.
Install the tool: Install Crawl4AI via pip, or use the Docker container to simplify setup. If you hit Playwright-related errors during installation, try installing Playwright manually.
Write crawler code: Follow the example code and API documentation in the official docs to write a crawler that fits your needs.
Run the crawler: Execute your crawler code to start extracting data from web pages.
5. Step-by-Step Example
Step 1: Installation and setup
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
Step 2: Data extraction
from crawl4ai import WebCrawler
# Create an instance of WebCrawler
crawler = WebCrawler()
# Warm up the crawler (load necessary models)
crawler.warmup()
# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")
# Print the extracted content
print(result.markdown)
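result.markdown is plain Markdown text, so it can be post-processed with standard tooling. For example, a sketch that pulls link targets out of a Markdown string using only the standard library (the sample text below is made up for illustration):

```python
import re

# Hypothetical sample of what crawled Markdown output might look like
sample_markdown = """
# Pricing
See [GPT-4](https://openai.com/gpt-4) and [API docs](https://platform.openai.com/docs).
"""

# Match [text](url) Markdown links
links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", sample_markdown)
for text, url in links:
    print(f"{text} -> {url}")
# GPT-4 -> https://openai.com/gpt-4
# API docs -> https://platform.openai.com/docs
```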
Step 3: Using an LLM
Define an extraction strategy with an LLM (large language model) and convert the extracted data into a structured format:
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)
print(result.extracted_content)
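With extraction_type="schema", extracted_content comes back as a JSON string, so it can be loaded with the standard library for further processing. A sketch using a hypothetical sample payload:

```python
import json

# Hypothetical example of what extracted_content might contain
extracted_content = '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}]'

# Parse the JSON string into a list of dicts matching the schema fields
models = json.loads(extracted_content)
for m in models:
    print(m["model_name"], m["input_fee"], m["output_fee"])
```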
Step 4: Integrating with AI agents
Integrate Crawl4AI with Praison CrewAI agents for efficient data processing:
pip install praisonai
Create a tool file (tools.py) to wrap the Crawl4AI crawler as an agent tool:
# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool
class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")
class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()
        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        return result.extracted_content
if __name__ == "__main__":
    # Test the ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)
Configure the AI agents to use the Crawl4AI tool for web scraping and data extraction:
framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
    backstory: An expert in web scraping with a deep understanding of extracting structured
      data from online sources. https://openai.com/api/pricing/ https://www.anthropic.com/pricing https://cohere.com/pricing
    goal: Gather model pricing data from various websites
    role: Web Scraper
    tasks:
      scrape_model_pricing:
        description: Scrape model pricing information from the provided list of websites.
        expected_output: Raw HTML or JSON containing model pricing data.
    tools:
    - 'ModelFeeTool'
  data_cleaner:
    backstory: Specialist in data cleaning, ensuring that all collected data is accurate
      and properly formatted.
    goal: Clean and organize the scraped pricing data
    role: Data Cleaner
    tasks:
      clean_pricing_data:
        description: Process the raw scraped data to remove any duplicates and inconsistencies,
          and convert it into a structured format.
        expected_output: Cleaned and organized JSON or CSV file with model pricing data.
    tools:
    - ''
  data_analyzer:
    backstory: Data analysis expert focused on deriving actionable insights from structured
      data.
    goal: Analyze the cleaned pricing data to extract insights
    role: Data Analyzer
    tasks:
      analyze_pricing_data:
        description: Analyze the cleaned data to extract trends, patterns, and insights
          on model pricing.
        expected_output: Detailed report summarizing model pricing trends and insights.
    tools:
    - ''
dependencies: []
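Before handing this config to the framework, it can help to sanity-check the YAML structure. A sketch using PyYAML (assumed installed alongside the agent framework; the config is inlined here, abbreviated, for illustration):

```python
import yaml  # PyYAML; assumed available in the environment

# Inlined, abbreviated copy of the agents config above
agents_yaml = """
framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
    role: Web Scraper
    goal: Gather model pricing data from various websites
    tools:
    - 'ModelFeeTool'
"""

# Parse and verify the fields the framework will look for
config = yaml.safe_load(agents_yaml)
assert config["framework"] == "crewai"
assert "ModelFeeTool" in config["roles"]["web_scraper"]["tools"]
print("config OK")
```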
Crawl4AI is a powerful tool that enables AI agents to perform web crawling and data extraction more efficiently and accurately.