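Scrapling: An Adaptive Web Scraping Tool

Digest 2024-11-11 09:02 Hubei

Project Overview

An ultra-fast, adaptive web scraping tool for Python that automatically adapts to website changes and significantly improves scraping performance.

Dealing with scrapers that fail whenever a website changes?

Scrapling is a high-performance, intelligent Python web scraping library that automatically adapts to website changes while significantly outperforming popular alternatives. Whether you are a beginner or an expert, Scrapling offers powerful features while staying simple.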

from scrapling import Adaptor

# Scrape data that survives website changes
page = Adaptor(html, auto_match=True)
products = page.css('.product', auto_save=True)
# Later, even if selectors change:
products = page.css('.product', auto_match=True)  # Still finds them!

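Key Features

Adaptive Scraping

🔄 Smart element tracking: uses an intelligent similarity system and integrated storage to relocate previously identified elements after the website's structure changes.
🎯 Flexible querying: use CSS selectors, XPath, text search, or regular expressions, and chain them however you like!
🔍 Find similar elements: automatically locate elements similar to the one you want on the page (for example, the other products like the one you found).
🧠 Smart content scraping: extract data from multiple websites without site-specific selectors, using Scrapling's powerful features.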
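
To give a rough feel for what similarity-based element tracking could look like, here is a minimal standard-library sketch. It is not Scrapling's actual algorithm: the fingerprint fields, the helper names `fingerprint`, `score`, and `relocate`, and the equal weighting of fields are all assumptions made for illustration. The idea is to save a few properties of an element, then pick the most similar element from a page whose structure has changed.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def fingerprint(root, element):
    """Collect a few properties that describe an element (hypothetical scheme)."""
    parents = {child: parent for parent in root.iter() for child in parent}
    path = []
    node = element
    while node is not None:
        path.append(node.tag)
        node = parents.get(node)
    return {
        "tag": element.tag,
        "attrs": " ".join(f"{k}={v}" for k, v in sorted(element.attrib.items())),
        "text": (element.text or "").strip(),
        "path": "/".join(reversed(path)),
    }


def score(saved, candidate):
    """Average string similarity over all fingerprint fields (0.0 to 1.0)."""
    ratios = [SequenceMatcher(None, saved[key], candidate[key]).ratio() for key in saved]
    return sum(ratios) / len(ratios)


def relocate(saved, new_root):
    """Return the element in the new tree that best matches the saved fingerprint."""
    return max(new_root.iter(), key=lambda el: score(saved, fingerprint(new_root, el)))


# The "old" page, where the element was first identified.
old_page = ET.fromstring(
    '<body><div class="product" id="p1"><span>Blue mug</span></div></body>'
)
# The "new" page: the class changed and an extra wrapper was added.
new_page = ET.fromstring(
    '<body><section><div class="item product-card" id="p1">'
    '<span>Blue mug</span></div></section></body>'
)

saved = fingerprint(old_page, old_page.find("div"))
match = relocate(saved, new_page)
print(match.tag, match.attrib)
```

A real implementation would also need persistent storage for the saved fingerprints and a threshold below which no match is reported; both are left out of this sketch.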

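Performance

🚀 Lightning fast: built from the ground up with performance in mind, outperforming the most popular Python scraping libraries (237x faster than BeautifulSoup in the project's own tests).
🔋 Memory efficient: optimized data structures for a minimal memory footprint.
⚡ Fast JSON serialization: 10x faster than the standard json library, with more options.

Developer Experience

🛠️ Powerful navigation API: traverse the DOM tree easily in any direction and get the information you want (parent, ancestors, siblings, children, next/previous element, and more).
🧬 Rich text processing: all strings have built-in methods for regex matching, cleaning, and more. All element attributes are read-only dictionaries that are faster than standard dictionaries and come with added methods.
📝 Automatic selector generation: create robust CSS/XPath selectors for any element.
🔌 Scrapy-compatible API: familiar methods and similar pseudo-elements for Scrapy users.
📘 Type hints: complete type coverage for better IDE support and fewer bugs.

Getting Started

Let's walk through a basic example that demonstrates a small part of Scrapling's core functionality: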

import requests
from scrapling import Adaptor

# Fetch a web page
url = 'https://quotes.toscrape.com/'
response = requests.get(url)

# Create an Adaptor instance
page = Adaptor(response.text, url=url)
# Get all strings in the full page
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes, any of these methods will return a list of strings (TextHandlers)
quotes = page.css('.quote .text::text')  # CSS selector
quotes = page.xpath('//span[@class="text"]/text()')  # XPath
quotes = page.css('.quote').css('.text::text')  # Chained selectors
quotes = [element.text for element in page.css('.quote').css('.text')]  # Slower than bulk query above

# Get the first quote element
quote = page.css('.quote').first  # or [0] or .get()

# Working with elements
quote.html_content  # Inner HTML
quote.prettify()  # Prettified version of Inner HTML
quote.attrib  # Element attributes
quote.path  # DOM path to element (List)
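
The idea behind collecting all page text while skipping tags such as script and style can be sketched with the standard library alone. This is only an illustration of the technique, not how Scrapling implements `get_all_text`; the `TextCollector` class and the `separator` parameter are assumptions introduced here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text while skipping the content of ignored tags."""

    def __init__(self, ignore_tags=("script", "style")):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.depth = 0  # greater than 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside an ignored tag.
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def get_all_text(html, ignore_tags=("script", "style"), separator="\n"):
    """Return the page text joined by `separator` (hypothetical helper)."""
    collector = TextCollector(ignore_tags)
    collector.feed(html)
    collector.close()
    return separator.join(collector.chunks)


html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>"
)
text = get_all_text(html)
print(text)
```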

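For simplicity, all methods can be chained with one another, as long as you are chaining methods that return an element (called an Adaptor object) or a list of Adaptors (called an Adaptors object).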
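
Chaining of this kind works when the list type exposes the same query methods as a single element. The toy classes below illustrate that design under stated assumptions: `Node`, `Nodes`, and `find_all` are hypothetical names built on the standard library, not Scrapling's Adaptor and Adaptors classes.

```python
import xml.etree.ElementTree as ET


class Node:
    """Toy wrapper around one element (stand-in for a single-element object)."""

    def __init__(self, element):
        self.element = element

    @property
    def text(self):
        return (self.element.text or "").strip()

    def find_all(self, tag):
        # Return a Nodes collection, never a plain list, so chaining can continue.
        return Nodes(Node(el) for el in self.element.iter(tag) if el is not self.element)


class Nodes(list):
    """Toy list of Node objects that exposes the same query method."""

    def find_all(self, tag):
        # Run the query on every member and flatten the results.
        return Nodes(found for node in self for found in node.find_all(tag))

    @property
    def first(self):
        return self[0] if self else None


page = Node(ET.fromstring(
    "<body>"
    "<div><span>First quote</span></div>"
    "<div><span>Second quote</span></div>"
    "</body>"
))

# A query on a collection returns another collection, so calls can be chained.
texts = [node.text for node in page.find_all("div").find_all("span")]
print(texts)
```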

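Installation

Getting started with Scrapling is a breeze. It only requires Python 3.7 or later, and the remaining requirements are installed automatically with the package.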

# Using pip
pip install scrapling

# Or the latest from GitHub
pip install git+https://github.com/D4Vinci/Scrapling.git@master
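
If it helps, the version requirement and the result of the install can be checked from Python itself. The snippet below is a small sketch using only the standard library; `check_environment` is a hypothetical helper name, and the minimum version is the one stated above.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the installation notes


def check_environment(package="scrapling"):
    """Report whether the interpreter is new enough and the package is importable."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "installed": importlib.util.find_spec(package) is not None,
    }


status = check_environment()
print(status)
```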

Advanced Features

Smart Navigation

>>> quote.tag
'div'
>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>
>>> quote.parent.tag
'div'
>>> quote.children
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>, <data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>, <data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]
>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>, <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>, <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,...]
>>> quote.next  # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>
>>> quote.children.css(".author::text")
['Albert Einstein']
>>> quote.has_class('quote')
True
# Generate new selectors for any element
>>> quote.css_selector
'body > div > div:nth-of-type(2) > div > div'
# Test these selectors on your favorite browser or reuse them again in the library in other methods!
>>> quote.xpath_selector
'//body/div/div[2]/div/div'
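
One plausible way to generate a selector of this shape is to walk from the element up to the root and add an nth-of-type index wherever a tag is ambiguous among its siblings. The sketch below uses only the standard library; `css_path` is a hypothetical helper, and Scrapling's own rules for when to emit an index may differ.

```python
import xml.etree.ElementTree as ET


def css_path(root, target):
    """Build a 'parent > child' CSS path from the root down to the target."""
    parents = {child: parent for parent in root.iter() for child in parent}
    parts = []
    node = target
    while node is not None:
        parent = parents.get(node)
        step = node.tag
        if parent is not None:
            same_tag = [sibling for sibling in parent if sibling.tag == node.tag]
            if len(same_tag) > 1:
                # nth-of-type is 1-based and counts only siblings with the same tag.
                step += f":nth-of-type({same_tag.index(node) + 1})"
        parts.append(step)
        node = parent
    return " > ".join(reversed(parts))


doc = ET.fromstring(
    "<body>"
    "<div><p>header</p></div>"
    "<div><span>first</span><span>second</span></div>"
    "</body>"
)
target = doc.findall("div")[1].findall("span")[1]
selector = css_path(doc, target)
print(selector)
```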

If you need more than an element's parent, you can iterate over the whole ancestor tree of any element, like this:

for ancestor in quote.iterancestors():
    # do something with it...
    pass

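You can search for a specific ancestor of an element that satisfies a function. All you need to do is pass a function that takes an Adaptor object as its argument and returns True if the condition is satisfied or False otherwise, like this: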

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>
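
The ancestor iteration and predicate search described above can be illustrated with a parent map over a standard-library element tree. This is a sketch of the general technique rather than Scrapling's implementation; `iter_ancestors` and `find_ancestor` here are standalone hypothetical functions, and the markup is a simplified stand-in for the quotes page.

```python
import xml.etree.ElementTree as ET


def iter_ancestors(root, element):
    """Yield the parent, grandparent, and so on, up to the root."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node = parents.get(element)
    while node is not None:
        yield node
        node = parents.get(node)


def find_ancestor(root, element, predicate):
    """Return the nearest ancestor for which predicate(ancestor) is true, else None."""
    return next((a for a in iter_ancestors(root, element) if predicate(a)), None)


doc = ET.fromstring(
    '<div class="container"><div class="row"><div class="col-md-8">'
    '<div class="quote">text</div></div></div></div>'
)
quote = doc.find(".//div[@class='quote']")

ancestor_classes = [a.get("class") for a in iter_ancestors(doc, quote)]
row = find_ancestor(doc, quote, lambda a: "row" in a.get("class", "").split())
print(ancestor_classes)
print(row.get("class"))
```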

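Content-Based Selection and Finding Similar Elements

You can select elements by their text content in several ways. Here is a full example on another website: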

>>> response = requests.get('https://books.toscrape.com/index.html')
>>> page = Adaptor(response.text, url=response.url)
>>> page.find_by_text('Tipping the Velvet')  # Find the first element that its text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
>>> page.find_by_text('Tipping the Velvet', first_match=False)  # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.find_by_regex(r'£[\d\.]+')  # Get the first element that its text content matches my price regex
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
>>> page.find_by_regex(r'£[\d\.]+', first_match=False)  # Get all elements that matches my price regex
[<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>, <data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>, <data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>, <data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>, ...]
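
The difference between an exact text match and a regex match can be sketched with the standard library as follows. The two functions reuse the names from the session above for readability, but they are simplified stand-ins written for this illustration, and the markup is a cut-down, assumed version of the books page.

```python
import re
import xml.etree.ElementTree as ET


def find_by_text(root, text, first_match=True):
    """Find elements whose own text equals `text` exactly (after stripping)."""
    matches = [el for el in root.iter() if (el.text or "").strip() == text]
    if first_match:
        return matches[0] if matches else None
    return matches


def find_by_regex(root, pattern, first_match=True):
    """Find elements whose own text contains a match for `pattern`."""
    regex = re.compile(pattern)
    matches = [el for el in root.iter() if regex.search(el.text or "")]
    if first_match:
        return matches[0] if matches else None
    return matches


shelf = ET.fromstring(
    "<ol>"
    '<li><h3><a href="a.html">Tipping the Velvet</a></h3>'
    '<p class="price_color">£53.74</p></li>'
    '<li><h3><a href="b.html">Soumission</a></h3>'
    '<p class="price_color">£50.10</p></li>'
    "</ol>"
)

link = find_by_text(shelf, "Tipping the Velvet")
prices = [el.text for el in find_by_regex(shelf, r"£[\d\.]+", first_match=False)]
print(link.get("href"), prices)
```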

Find all elements that are similar to the current element in location and attributes

# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>, <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>, <data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,...]
# You will notice that the number of elements is 19 not 20 because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19
# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html', 'catalogue/soumission_998/index.html', 'catalogue/sharp-objects_997/index.html', ...]
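
A simplified reading of "similar in location and attributes" is: same chain of tags up to the root, and the same set of attribute names once the ignored ones are dropped. The sketch below implements only that simplified rule with the standard library; it is an assumption about the general approach, not Scrapling's matching logic, and the markup is invented for the example.

```python
import xml.etree.ElementTree as ET


def find_similar(root, element, ignore_attributes=()):
    """Return other elements with the same tag path and the same attribute names."""
    parents = {child: parent for parent in root.iter() for child in parent}

    def tag_path(node):
        path = []
        while node is not None:
            path.append(node.tag)
            node = parents.get(node)
        return tuple(path)

    def attr_names(node):
        return {name for name in node.attrib if name not in ignore_attributes}

    wanted_path, wanted_attrs = tag_path(element), attr_names(element)
    return [
        el for el in root.iter(element.tag)
        if el is not element  # the current element is not included
        and tag_path(el) == wanted_path
        and attr_names(el) == wanted_attrs
    ]


shelf = ET.fromstring(
    "<ol>"
    '<li><h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3></li>'
    '<li><h3><a href="b.html" title="Tipping the Velvet">Tipping the Velvet</a></h3></li>'
    '<li><h3><a href="c.html">Soumission</a></h3></li>'
    '<li><p><a href="next.html">next</a></p></li>'
    "</ol>"
)
current = shelf.findall(".//h3/a")[1]

strict = [el.get("href") for el in find_similar(shelf, current)]
relaxed = [el.get("href") for el in find_similar(shelf, current, ignore_attributes=["title"])]
print(strict)
print(relaxed)
```

Ignoring the title attribute lets the third link match even though it has no title, while the pagination link is still excluded because it sits under a different parent tag.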

To add a bit of complexity, suppose that for some reason we want to use that element as a starting point to get the data of all books:

>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
        print({
            "name": product.css('h3 a::text')[0],
            "price": product.css('.price_color')[0].re_first(r'[\d\.]+'),
            "stock": product.css('.availability::text')[-1].clean()
        })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...
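
The string helpers used in that loop, extracting the first regex match and cleaning whitespace, can be pictured as methods on a string subclass. The `Text` class below is a hypothetical stand-in written for illustration; its method names mirror the calls above, but the behavior of Scrapling's own text objects may differ in detail.

```python
import re


class Text(str):
    """Toy string subclass with two helper methods (hypothetical stand-in)."""

    def re_first(self, pattern, default=None):
        # Return the first regex match as a Text, or `default` if nothing matches.
        found = re.search(pattern, self)
        return Text(found.group(0)) if found else default

    def clean(self):
        # Collapse runs of whitespace (including newlines) and trim the ends.
        return Text(" ".join(self.split()))


price = Text("£51.77").re_first(r"[\d\.]+")
stock = Text("\n\n    In stock\n  ").clean()
print({"price": price, "stock": stock})
```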

Project Link

https://github.com/D4Vinci/Scrapling


http://mp.weixin.qq.com/s?__biz=MzkxNjQ4MzMyOA==&mid=2247491578&idx=1&sn=ce66ebc6d2c0c7ae8ea4d4877350f891

GitHubStore

Sharing interesting open-source projects



广东北京山东江苏河南浙江山西福建河北上海四川陕西湖南安徽湖北内蒙古江西云南广西甘肃辽宁黑龙江贵州新疆重庆吉林天津海南青海宁夏西藏香港澳门台湾美国加拿大澳大利亚日本新加坡英国西班牙新西兰韩国泰国法国德国意大利缅甸菲律宾马来西亚越南荷兰柬埔寨俄罗斯巴西智利卢森堡芬兰瑞典比利时瑞士土耳其斐济挪威朝鲜尼日利亚阿根廷匈牙利爱尔兰印度老挝葡萄牙乌克兰印度尼西亚哈萨克斯坦塔吉克斯坦希腊南非蒙古奥地利肯尼亚加纳丹麦津巴布韦埃及坦桑尼亚捷克阿联酋安哥拉