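[Python Study Notes] | Small Crawler, Big Impact! A Python Beginner's First Web Crawler

文摘科技 2024-06-26 09:49 Beijing

These notes are a personal study summary (illustrated version) and are for reference only. Links to the main references are at the end. If anything here infringes your rights, please contact me for removal.

All code is for learning purposes only.

Environment:
Python version: 3.11.7
Operating system: Windows 11 23H2

Contents:

What Is a Web Crawler?
Understanding Web Page Structure
BeautifulSoup Parsing Basics: lxml
BeautifulSoup Parsing Basics: CSS
BeautifulSoup Parsing Basics: Regular Expressions
The Versatile requests Library
Crawler Example: Baidu Baike
Crawler Example: Douban New Books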

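What Is a Web Crawler?

A web crawler, also called a web spider or web robot, is a program that browses the internet automatically and collects page content. It systematically visits and downloads pages, then extracts and stores the useful information according to rules you define.

How a crawler works

A crawler's basic workflow has the following steps:

Seed URLs: Start from one or more initial URLs (seed URLs), usually the home page or a specific page of a site you are interested in.
Download pages: The crawler visits these URLs and downloads each page's HTML.
Parse content: Use a parser (such as BeautifulSoup or lxml) to extract structured data from the page, such as text, links, and images.
Extract links: Extract all hyperlinks from the downloaded page and add them to the queue of URLs to visit.
Repeat: Repeat the process for the new links until a stop condition is met (for example a maximum crawl depth, a time limit, or a page count).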
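
The workflow above can be sketched as a small breadth-first loop. This is only a minimal illustration, not a production crawler: FAKE_SITE, download, LinkExtractor, and crawl are hypothetical names introduced here, and the "site" is an in-memory dictionary so the sketch runs without network access. In a real crawler, download would issue an HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" (assumption: stands in for real pages).
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.com/b": '<a href="/">home</a>',
    "https://example.com/c": "<p>no links here</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def download(url):
    """Stand-in for an HTTP GET; returns '' for unknown URLs."""
    return FAKE_SITE.get(url, "")


def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])   # URLs waiting to be visited
    seen = {seed_url}           # URLs already queued, to avoid loops
    visited = []                # URLs actually downloaded, in order
    while queue and len(visited) < max_pages:   # stop condition
        url = queue.popleft()
        html = download(url)
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)       # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


visited_pages = crawl("https://example.com/", max_pages=10)
print(visited_pages)
```

The seen set is what keeps the loop from revisiting the home page when /b links back to it, and max_pages plays the role of the stop condition.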

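Why learn web crawling?

For researchers, learning to use web crawlers offers several advantages:

1. Large-scale data collection

Broad data sources: A crawler can automatically collect large amounts of data from many kinds of websites, such as literature databases, news sites, social media platforms, and public data from governments and organizations.
Efficient automation: Compared with collecting data by hand, a crawler gathers large volumes of data quickly and saves considerable time and effort.

2. Data analysis and research

Up-to-date data: A crawler can fetch data on a schedule, so you get the latest research data and developments.
Cross-disciplinary research: By crawling sites in different fields, you can gather cross-disciplinary data for combined analysis.

3. Market and social analysis

Trend analysis: Crawling social media and news sites supports public-opinion monitoring and trend analysis, showing how the public views an event or phenomenon.
Competitive analysis: In business research, crawlers can collect information on competitors' products, prices, and marketing activities.

4. Literature and data management

Literature collection: Automatically fetch and organize academic literature to build your own literature database.
Data cleaning: Scraped data can be processed and cleaned in bulk as part of the same pipeline, which improves data quality.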

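Practical applications of crawlers

In research work, crawlers have a wide range of practical uses:

Academic resource collection: Automatically fetch the latest papers and related data from journal websites and databases.
Social science research: Crawl social media data to analyze social phenomena and behavior.
Environmental science research: Collect environmental data from meteorological and geographic websites to study environmental change.
Biomedical research: Crawl medical databases and hospital websites to collect case records, clinical trial data, and similar material.

How to learn web crawling

Learning to write crawlers usually involves the following steps:

Programming basics: Learn a programming language such as Python, which is the foundation for writing crawlers.
Web basics: Understand the HTTP protocol, HTML structure, CSS, and related web concepts.
Crawler libraries and tools: Learn libraries and tools such as Scrapy, BeautifulSoup, and Selenium.
Data storage and processing: Learn how to store scraped data in a database or file and how to process and analyze it afterwards.
Anti-crawling measures: Understand the anti-crawling mechanisms websites use, and learn how to deal with them legally and reasonably.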
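
One common starting point for staying within a site's rules is its robots.txt file. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt; the rules, the MyStudyBot user agent, and the example.com URLs are all assumptions for illustration. A real crawler would load the file from the target site instead of a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption: not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse rules from lines of text

# Ask whether a given user agent may fetch a given URL.
allowed_public = parser.can_fetch("MyStudyBot", "https://example.com/articles/1")
allowed_private = parser.can_fetch("MyStudyBot", "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, if declared.
delay = parser.crawl_delay("MyStudyBot")

print(allowed_public, allowed_private, delay)
```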

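Understanding Web Page Structure

Understanding how a web page is structured is the foundation for writing an effective crawler. Pages are usually built with HTML (HyperText Markup Language), which defines their content and structure. The details follow.

HTML basics

An HTML document is made up of a series of tags that define the different parts of the page. A typical HTML document looks like this:

<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
    <meta charset="UTF-8">
    <!-- other head information -->
</head>
<body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <a href="https://example.com">This is a link</a>
    <!-- other page content -->
</body>
</html>

<!DOCTYPE html>: declares the document type and tells the browser this is an HTML5 document.
<html>: the root element of the HTML document.
<head>: contains the document's metadata, such as the title, encoding, and styles.
<title>: the page title, shown in the browser tab.
<meta>: provides meta information about the document, such as the character encoding.
<body>: contains the main content of the page, such as headings, paragraphs, and links.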

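Common HTML tags

Heading tags: <h1> to <h6>, for headings of different levels.
Paragraph tag: <p>, for a paragraph.
Link tag: <a>, for a hyperlink; the href attribute gives the link address.
Image tag: <img>, for an image; the src attribute gives the image URL.
List tags: <ul> (unordered list), <ol> (ordered list), <li> (list item).
Table tags: <table>, <tr> (row), <td> (cell), <th> (header cell).
Form tags: <form>, <input>, <textarea>, <button>, for building interactive forms.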
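
As a rough illustration of why these tags matter to a crawler, the sketch below pulls image URLs, list items, and table rows out of a small made-up snippet with BeautifulSoup. The HTML content and the variable names are assumptions for the example only; the built-in html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment (assumption: for illustration only).
html_doc = """
<h2>Fruit</h2>
<img src="/static/apple.png" alt="apple">
<ul>
    <li>Apple</li>
    <li>Banana</li>
</ul>
<table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Apple</td><td>3</td></tr>
    <tr><td>Banana</td><td>2</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# <img>: read the src attribute of every image.
image_sources = [img["src"] for img in soup.find_all("img")]

# <li>: read the text of every list item.
list_items = [li.get_text(strip=True) for li in soup.find_all("li")]

# <tr>/<th>/<td>: rebuild the table as a list of rows.
table_rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in soup.find_all("tr")
]

print(image_sources)
print(list_items)
print(table_rows)
```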

CSS and JavaScript

Besides HTML, pages usually also contain CSS and JavaScript:

CSS (Cascading Style Sheets): controls the style and layout of the page.

<style>
    body { background-color: lightblue; }
    h1 { color: navy; margin-left: 20px; }
</style>

JavaScript: adds interactivity to the page.

<script>
    function displayMessage() {
        alert("Hello, world!");
    }
</script>

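DOM (Document Object Model)

When a browser loads an HTML document, it parses the HTML and builds a DOM (Document Object Model) tree. The DOM is the programming interface for the HTML document: it defines the document's structure, and programs can use JavaScript to manipulate the DOM and change the page's content and structure. Here is a simple example:

<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1>Heading</h1>
    <p>Paragraph</p>
</body>
</html>

The DOM tree has the following structure: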

HTML
  ├─HEAD
  │   └─TITLE
  └─BODY
      ├─H1
      └─P
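
A parser such as BeautifulSoup builds a similar tree in Python. The sketch below walks that tree and prints one tag per line, indented by depth. The outline helper is a hypothetical name introduced here, and this shows BeautifulSoup's parse tree rather than the browser's DOM itself.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1>Heading</h1>
    <p>Paragraph</p>
</body>
</html>"""


def outline(tag, depth=0):
    """Return the tag names under `tag`, indented two spaces per level."""
    lines = ["  " * depth + tag.name.upper()]
    for child in tag.children:
        if isinstance(child, Tag):      # skip text nodes such as whitespace
            lines.extend(outline(child, depth + 1))
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
tree_lines = outline(soup.html)
print("\n".join(tree_lines))
```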

Parsing HTML with Python

Crawlers usually rely on libraries such as BeautifulSoup and lxml to parse HTML and extract data. Here is a simple example that uses BeautifulSoup to parse a page and extract its content:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1>Heading</h1>
    <p>Paragraph</p>
    <a href="https://example.com">Link</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Get the page title
title = soup.title.string
print("Title:", title)

# Get all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
    print("Paragraph:", p.text)

# Get all links
links = soup.find_all('a')
for link in links:
    print("Link:", link.get('href'))

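BeautifulSoup Parsing Basics: lxml

Beautiful Soup (see the reference link at the end) is a Python library for extracting data from HTML and XML files. It works with the parser of your choice and gives you idiomatic ways to navigate, search, and modify the parse tree, which can save hours or even days of work. lxml is one of the parsers it can use: pass 'lxml' as the second argument, as in BeautifulSoup(html_doc, 'lxml'). lxml is a separate package and has to be installed first (pip install lxml).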
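
Since lxml is an optional dependency, one possible pattern is to try it first and fall back to Python's built-in html.parser. This is a minimal sketch under that assumption; make_soup is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html_doc):
    """Parse with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html_doc, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(html_doc, "html.parser")


soup = make_soup("<p>Hello, <b>crawler</b></p>")
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()
print(bold_text, "|", paragraph_text)
```

The two parsers can differ on malformed HTML (for example, lxml wraps fragments in html and body tags), so it is worth testing against real pages with the parser you actually deploy.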

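BeautifulSoup Parsing Basics: CSS

What is CSS?

CSS (Cascading Style Sheets) is a stylesheet language that describes the appearance and formatting of HTML or XML documents. With CSS you control a page's layout, colors, fonts, and so on. CSS uses selectors to apply styles to HTML elements and define how they are displayed.

CSS classes

A class is an HTML attribute that CSS can select on: a class selector applies the same style to every element that carries that class.

Definition: give an HTML element a class name with the class attribute.
Usage: in CSS, refer to the class name with a leading . and define the style rules.

Defining classes in a sample HTML document:

<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
    <link rel="stylesheet" type="text/css" href="styles.css">
</head>
<body>
    <h1 class="title">Heading</h1>
    <p class="content">First paragraph</p>
    <p class="content">Second paragraph</p>
    <a href="https://example.com" class="link">Link</a>
</body>
</html>

Defining class selectors in the stylesheet:

/* Stylesheet file: styles.css */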
.title {
    color: navy;
    margin-left: 20px;
}

.content {
    font-size: 16px;
    color: green;
}

.link {
    text-decoration: none;
    color: red;
}

Matching by class with BeautifulSoup

BeautifulSoup supports CSS selectors for choosing HTML elements, which makes matching by class straightforward. Some examples follow.

Sample HTML

<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1 class="title">Heading</h1>
    <p class="content">First paragraph</p>
    <p class="content">Second paragraph</p>
    <a href="https://example.com" class="link">Link</a>
</body>
</html>

Parsing the HTML with BeautifulSoup

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1 class="title">Heading</h1>
    <p class="content">First paragraph</p>
    <p class="content">Second paragraph</p>
    <a href="https://example.com" class="link">Link</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# Select every element whose class is content
content_paragraphs = soup.select('.content')
for p in content_paragraphs:
    print("Paragraph:", p.text)

# Select the first element whose class is title
title_element = soup.select_one('.title')
print("Title:", title_element.text)

# Select every element whose class is link
links = soup.select('.link')
for link in links:
    print("Link:", link['href'])

Explanation:

Create the BeautifulSoup object: first build a BeautifulSoup object from the HTML string.
Select elements by class name:

soup.select('.content'): uses the CSS class selector .content to select all elements with the class content.
soup.select_one('.title'): uses the CSS class selector .title to select the first element with the class title.
soup.select('.link'): uses the CSS class selector .link to select all elements with the class link.

Iterate and extract data: loop over the selected elements and extract their text or attribute values.
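
Class selectors can be combined with tag names, ids, and attribute tests. The sketch below shows a few such combinations next to the equivalent find_all call. The HTML fragment, including the id main and the extra class intro, is made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up fragment (assumption: the id and extra class are illustrative).
html_doc = """
<div id="main">
    <h1 class="title">Heading</h1>
    <p class="content intro">First paragraph</p>
    <p class="content">Second paragraph</p>
    <a href="https://example.com" class="link">Link</a>
    <a class="link">Anchor without href</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Tag name plus class: only <p> elements with class content.
by_tag_and_class = [p.get_text() for p in soup.select("p.content")]

# Two classes at once: elements that carry both content and intro.
by_two_classes = [p.get_text() for p in soup.select("p.content.intro")]

# Descendant plus attribute test: <a> inside #main that has an href.
by_descendant = [a.get_text() for a in soup.select("#main a[href]")]

# The same class match without CSS syntax, using find_all and class_.
by_find_all = [p.get_text() for p in soup.find_all("p", class_="content")]

print(by_tag_and_class, by_two_classes, by_descendant, by_find_all)
```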

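BeautifulSoup Parsing Basics: Regular Expressions

A regular expression (regex for short) is a tool for describing and matching string patterns. Regular expressions are powerful for text search and processing: they can validate string formats, find substrings that follow a pattern, replace text, and more.

Regular expression basics

Basic syntax

Literal characters: match themselves, for example a matches the character 'a'.
Metacharacters: characters with special meaning, for example . matches any single character (except a newline by default) and * means the preceding item repeats zero or more times.
Character classes: square brackets [] define a set of characters to match, for example [abc] matches 'a', 'b', or 'c'.
Ranges: a hyphen - gives a range of characters, for example [a-z] matches any lowercase letter.
Quantifiers: specify how many times to match, for example a{2,4} matches 'a' repeated 2 to 4 times.

Examples

import re

# Match a single character
pattern = r'a'
result = re.findall(pattern, 'abc acd aef')
print(result)  # Output: ['a', 'a', 'a']

# Match any character
pattern = r'.'
result = re.findall(pattern, 'abc')
print(result)  # Output: ['a', 'b', 'c']

# Match a character class
pattern = r'[aeiou]'
result = re.findall(pattern, 'hello world')
print(result)  # Output: ['e', 'o', 'o']

# Match a character range
pattern = r'[a-z]'
result = re.findall(pattern, 'Hello 123')
print(result)  # Output: ['e', 'l', 'l', 'o']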
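
The quantifiers from the syntax list are not covered by the examples above, so here is a short supplementary sketch. The input strings are made up for illustration.

```python
import re

# {2,4}: between 2 and 4 repetitions; a lone 'a' is too short, and a run
# of five is matched greedily as four, leaving one 'a' that cannot match.
repeat_2_to_4 = re.findall(r"a{2,4}", "a aa aaa aaaaa")
print(repeat_2_to_4)  # ['aa', 'aaa', 'aaaa']

# *: zero or more of the preceding item, so 'a' alone also matches.
zero_or_more = re.findall(r"ab*", "a ab abb")
print(zero_or_more)  # ['a', 'ab', 'abb']

# +: one or more of the preceding item, here one or more digits.
one_or_more_digits = re.findall(r"[0-9]+", "page1 page22 page333")
print(one_or_more_digits)  # ['1', '22', '333']
```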

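Using regular expressions in BeautifulSoup

Combining BeautifulSoup with regular expressions makes finding and extracting HTML elements more flexible. A regular expression can match tags, attributes, and text content that follow a particular pattern.

Sample HTML

<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1 class="title">Heading</h1>
    <p class="content">First paragraph</p>
    <p class="content">Second paragraph</p>
    <a href="https://example.com/page1">Link 1</a>
    <a href="https://example.com/page2">Link 2</a>
</body>
</html>

Matching elements with regular expressions

from bs4 import BeautifulSoup
import re

html_doc = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample document</title>
</head>
<body>
    <h1 class="title">Heading</h1>
    <p class="content">First paragraph</p>
    <p class="content">Second paragraph</p>
    <a href="https://example.com/page1">Link 1</a>
    <a href="https://example.com/page2">Link 2</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# Match all links whose href is https://example.com/page followed by digits
links = soup.find_all('a', href=re.compile(r'https://example\.com/page\d+'))
for link in links:
    print("Link:", link['href'])

# Match all text nodes that contain "First"
paragraphs = soup.find_all(string=re.compile(r'First'))
for p in paragraphs:
    print("Paragraph:", p)

Explanation

Create the BeautifulSoup object: first build a BeautifulSoup object from the HTML string.
Match links with a regular expression:

soup.find_all('a', href=re.compile(r'https://example\.com/page\d+')): uses the regular expression https://example\.com/page\d+ to match every <a> tag whose href attribute fits the pattern.
re.compile(r'https://example\.com/page\d+'): compiles the regular expression, which matches https://example.com/ followed by page and one or more digits.

Match paragraph content with a regular expression:

soup.find_all(string=re.compile(r'First')): uses the regular expression First to match every text node that contains "First". With string=, find_all returns the matching strings themselves, not the tags that contain them.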
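
Because a string= search returns text nodes, getting back to the enclosing tag takes one more step. The sketch below shows that step with .parent, and also anchors an href pattern with $ so that only URLs ending in page plus digits match. The HTML fragment, including the extra about link, is made up for this example.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment (assumption: for illustration only).
html_doc = """
<p class="content">First paragraph</p>
<p class="content">Second paragraph</p>
<a href="https://example.com/page1">Link 1</a>
<a href="https://example.com/about">About</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# string= returns the matching text nodes, not tags.
matches = soup.find_all(string=re.compile(r"First"))
match_types = [type(m).__name__ for m in matches]

# .parent walks from the text node back up to its enclosing tag.
parent_tags = [m.parent.name for m in matches]

# The pattern is searched anywhere in the href; $ pins it to the end.
page_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"/page\d+$"))]

print(match_types, parent_tags, page_links)
```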

The Versatile requests Library

Installing requests

First make sure the requests library is installed. If it is not, install it with:

pip install requests

Fetching a page

The requests.get method fetches the content of a page.

GET requests

Use requests.get to send a GET request.

from bs4 import BeautifulSoup
import requests

base_url = "https://baike.baidu.com"
url = "https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Collect the in-page links to other entries. The class name below comes
# from the page as observed when these notes were written and may change.
inner_links = soup.find_all("a", {'class': 'innerLink_yDkYh'})
for a in inner_links:
    title = a.get_text()
    link = a.get('href')
    print(f'Title: {title}, link: {base_url + link}')

print("==========")
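
GET requests often carry query parameters and custom headers. As a rough sketch, the code below builds such a request with requests.Request and prepares it without sending anything, so the final URL can be inspected offline. The example.com URL, the parameter names, and the User-Agent string are assumptions for illustration.

```python
import requests

# Build (but do not send) a GET request; URL and values are hypothetical.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "crawler", "page": 2},             # becomes the query string
    headers={"User-Agent": "my-study-crawler/0.1"},
)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing the same params and headers arguments to requests.get produces the same request and sends it in one step.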

POST requests

Use requests.post to send a POST request.

import requests
from bs4 import BeautifulSoup


post_url = "https://pythonscraping.com/pages/files/form.html"
post_url2 = "https://pythonscraping.com/pages/files/processing.php"

# Fetch the form page first to see which fields the form expects
soup = BeautifulSoup(requests.get(post_url).text, "lxml")
print(soup.find("form"))

# Form fields to submit
data = {"firstname": "John Doe",
        "lastname": "johndoe"
        }

# Submit the form data to the URL that processes it
response = requests.post(post_url2, data=data)
print(response.text)
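
To see what the data argument actually puts on the wire, the sketch below prepares a POST request without sending it and inspects the encoded body. The example.com URL and the field values are assumptions for illustration.

```python
import requests

# Build (but do not send) a form POST; URL and values are hypothetical.
prepared = requests.Request(
    "POST",
    "https://example.com/form",
    data={"firstname": "John", "lastname": "Doe"},
).prepare()

# A dict passed as data is form-encoded into the request body.
print(prepared.body)
print(prepared.headers["Content-Type"])
```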

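Uploading an image

Use requests.post with the files argument to upload an image.

import requests

# Open the image in binary mode; the with block closes it after the upload
with open('./image.png', 'rb') as image_file:
    files = {'uploadFile': image_file}
    r = requests.post('http://pythonscraping.com/files/processing2.php', files=files)
print(r.text)

Logging in and cookies

A login form is submitted with requests.post as well. The cookies the server sets in its reply are available on response.cookies.

import requests
from bs4 import BeautifulSoup

print("==============")

post_loginurl = "https://pythonscraping.com/pages/cookies/login.html"
post_loginurl2 = "https://pythonscraping.com/pages/cookies/welcome.php"

# Fetch the login page to see which fields the form expects
soup = BeautifulSoup(requests.get(post_loginurl).text, "lxml")
print(soup.find("form"))

data = {"username": "johndoe",
        "password": "password"
        }

# Submit the login form and inspect the cookies set by the server
response = requests.post(post_loginurl2, data=data)
print(response.text)
print(response.cookies)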

Session

requests.Session keeps a session alive across several requests, for example after logging in. A Session object shares cookies and other session data between requests, so you can keep visiting protected pages once you are logged in.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
print("==============")

post_loginurl = "https://pythonscraping.com/pages/cookies/login.html"
post_loginurl2 = "https://pythonscraping.com/pages/cookies/welcome.php"

soup = BeautifulSoup(session.get(post_loginurl).text, "lxml")
print(soup.find("form"))

data = {"username": "johndoe",
       "password": "password"
       }
# Log in through the session so the returned cookies are stored on it
response = session.post(post_loginurl2, data=data)
print(response.text)

# The stored cookies are sent automatically with later requests
response = session.get('http://pythonscraping.com/pages/cookies/profile.php')
print(f'cookie_info:{response.text}')
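
The sharing of cookies and headers can be observed without any network traffic. In the sketch below, a cookie and a default header are set on the session by hand and a request is prepared but not sent. The cookie name and value, the User-Agent string, and the example.com URL are assumptions for illustration; in practice the cookie would come from the server's login response.

```python
import requests

session = requests.Session()

# Defaults stored on the session apply to every request made through it.
session.headers.update({"User-Agent": "my-study-crawler/0.1"})
session.cookies.set("token", "abc123")   # hypothetical cookie

# Prepare (but do not send) a request through the session.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)

print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
session.close()
```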

Crawler Example: Baidu Baike

Goal: crawl the Baidu Baike page on "网络爬虫" (web crawler), then repeatedly pick a random internal link from the current page and crawl the page it points to.

from bs4 import BeautifulSoup
import requests
import random
import time

base_url = "https://baike.baidu.com"
url = "https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"

print("===============================================")

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

for j in range(20):
    # Download the current page; stop on an HTTP error status
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print(e)
        break

    soup = BeautifulSoup(response.text, "lxml")
    try:
        h1_title = soup.find("h1").get_text()
        print(f'Crawl {j + 1}: title: {h1_title}, link: {url}')
    except AttributeError:
        # soup.find returned None, so the page has no <h1>
        print("Could not find a title")
        break

    # Internal links to other entries
    sub_urls = soup.find_all("a", {'class': 'innerLink_yDkYh',
                                            'target': '_blank'})

    if sub_urls:
        sub_url = random.choice(sub_urls)['href']
        url = base_url + sub_url
        print(f'Crawl {j + 1}: randomly chosen sub-link: {url}\n =============')
    else:
        print("No matching sub-link found; stopping.")
        break

    time.sleep(1)  # pause between requests so the site is not hit too quickly

print("Done!")
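
The example builds the next URL with base_url + sub_url, which works as long as every href is a root-relative path. A slightly more general option, sketched below, is urllib.parse.urljoin from the standard library, which also handles absolute and page-relative links. The example.com URLs are assumptions for illustration.

```python
from urllib.parse import urljoin

# Hypothetical current page and hrefs (assumption: for illustration only).
page_url = "https://example.com/item/crawler/1"
hrefs = [
    "/item/spider/2",             # root-relative path
    "https://example.org/page",   # already absolute
    "next",                       # relative to the current page
]

resolved = [urljoin(page_url, href) for href in hrefs]
for original, absolute in zip(hrefs, resolved):
    print(f"{original!r} -> {absolute}")
```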

Crawler Example: Douban New Books

Goal: crawl the title, author, publisher, publication year, and price of the newest books on Douban and save them to a CSV file.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

def get_new_books_douban():
    """Return the list items for the books on the new-books page."""
    url = "https://book.douban.com/latest"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    return soup.select("ul.chart-dashed-list li")

def get_book_info(soup, info_key):
    """Return the text that follows the <span> labelled info_key."""
    try:
        return soup.find('span', string=f'{info_key}:').next_sibling.strip()
    except AttributeError:
        return f"No {info_key} information found"

def get_new_books_douban_book_info(new_perbooks_url):
    """Fetch one book's detail page and extract its fields."""
    book_info_response = requests.get(new_perbooks_url, headers=headers)
    book_info_soup = BeautifulSoup(book_info_response.text, "lxml")

    # Takes the first and second links inside #info as author and publisher
    author = book_info_soup.find('div', id='info').find('a').get_text(strip=True)
    publish = book_info_soup.find('div', id='info').find_all('a')[1].get_text(strip=True)
    # The keys below are the Chinese labels as they appear on the page
    year = get_book_info(book_info_soup, '出版年')
    price = get_book_info(book_info_soup, '定价')

    return {
        "Link": new_perbooks_url,
        "Author": author,
        "Publisher": publish,
        "Year": year,
        "Price": price
    }

def get_new_books_allinfo():
    books_info = []
    new_books_contents = get_new_books_douban()

    for book in new_books_contents:
        book_name = book.select_one('a.fleft').get_text(strip=True)
        book_url = book.find('a')['href']
        book_details = get_new_books_douban_book_info(book_url)
        books_info.append({"Title": book_name, **book_details})
        time.sleep(0.5)  # short delay so requests are not sent too quickly

    # Store the book information in a DataFrame
    df = pd.DataFrame(books_info)

    # Save the DataFrame as a CSV file
    df.to_csv('douban_books.csv', index=False, encoding='utf-8-sig')

    print("Book information saved to douban_books.csv")

if __name__ == '__main__':
    get_new_books_allinfo()
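
The get_book_info function relies on a label sitting in a <span> with its value in the text node right after it. The sketch below isolates that next_sibling technique on a made-up fragment so it can be tried offline. The HTML, the English labels, and the get_field helper are assumptions for illustration and are not taken from Douban's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment with label/value pairs (assumption: illustrative only).
html_doc = """
<div id="info">
    <span class="pl">Year:</span> 2024-6<br/>
    <span class="pl">Price:</span> 59.00<br/>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def get_field(soup, label):
    """Return the text right after the <span> whose text is '<label>:'."""
    span = soup.find("span", string=f"{label}:")
    if span is None or span.next_sibling is None:
        return None                       # label not present on this page
    return str(span.next_sibling).strip()


year = get_field(soup, "Year")
price = get_field(soup, "Price")
missing = get_field(soup, "ISBN")
print(year, price, missing)
```

Returning None for a missing label, rather than a message string, makes it easier to tell absent values apart from real ones later in the pipeline. That is a design choice of this sketch, not of the original code.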

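Main references:
莫烦Python
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

This article is a personal study note and may contain mistakes; corrections are welcome. It is for personal study only. If anything here infringes your rights, please contact me for removal.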

http://mp.weixin.qq.com/s?__biz=MzkyMjYzNjIxNA==&mid=2247485451&idx=1&sn=491eb96f339db235e0154150d8fd65fd

可凡的学习笔记本

Master's student; R and Python enthusiast