Scraping data single-threaded with urllib has a certain scrappy, can-do vibe, but the speed is nothing to write home about.

1. Bring in multi-threaded / multi-process crawlers!
Using the threading module

The threading module lets us handle multiple tasks concurrently inside a single process. Python's GIL (Global Interpreter Lock) does blunt the gains from multi-threading, but for I/O-bound work it still delivers a noticeable speedup. Let's try it with the requests library (much friendlier than urllib) and threading.

```python
import threading
import requests

def fetch_url(url):
    try:
        response = requests.get(url)
        print(f"Fetched {url} with status: {response.status_code}")
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")

urls = ["http://example.com/page" + str(i) for i in range(100)]  # say we have 100 pages

threads = []
for url in urls:
    t = threading.Thread(target=fetch_url, args=(url,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # wait for every thread to finish
```
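Spawning one thread per URL works for a small list but gets unwieldy fast. As a minimal sketch of the same idea with a bounded pool, the standard-library concurrent.futures module can reuse the fetch_url and urls defined above (the pool size of 10 is just an illustrative choice):

```python
from concurrent.futures import ThreadPoolExecutor

# Reuses fetch_url and urls from the snippet above; the pool caps
# how many requests are in flight at any moment.
with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(fetch_url, urls)
```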
Multi-process boost: the multiprocessing module

You can also use multiprocessing.Pool to fan the requests out across worker processes.

```python
from multiprocessing import Pool
import requests

def fetch_url(url):
    try:
        response = requests.get(url)
        print(f"Fetched {url} with status: {response.status_code}")
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")

urls = ["http://example.com/page" + str(i) for i in range(100)]

# Create a process pool; the __main__ guard keeps child processes from
# re-running this block on platforms that spawn rather than fork.
if __name__ == "__main__":
    with Pool(4) as p:  # 4 worker processes
        p.map(fetch_url, urls)
```
Threads or processes: which to pick? A crawler's bottleneck is almost always network I/O, so threads (or the coroutines below) are usually the lighter choice; multiprocessing mainly pays off when CPU-heavy work such as parsing becomes the bottleneck, since each process gets its own interpreter and GIL at the cost of extra memory.
2. Let coroutines carry the load: asyncio + aiohttp

The asyncio and aiohttp libraries are a killer combination for I/O-bound tasks and make high-concurrency requests almost effortless. This is also my personal first choice: strong performance and lighter on resources.

```python
import aiohttp
import asyncio

async def fetch_url(session, url):
    try:
        async with session.get(url) as response:
            print(f"Fetched {url} with status: {response.status}")
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        await asyncio.gather(*tasks)

urls = ["http://example.com/page" + str(i) for i in range(100)]

# Run the main coroutine
asyncio.run(main(urls))
```
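Firing off every request at once can overwhelm both your machine and the target site. Here is a minimal sketch of capping concurrency with asyncio.Semaphore; the limit of 20 and the helper names are arbitrary choices for illustration:

```python
import asyncio
import aiohttp

async def fetch_url_limited(session, sem, url):
    # Same fetch as above, but gated by the semaphore.
    async with sem:
        try:
            async with session.get(url) as response:
                print(f"Fetched {url} with status: {response.status}")
        except Exception as e:
            print(f"Failed to fetch {url}: {e}")

async def crawl_limited(urls, limit=20):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_url_limited(session, sem, u) for u in urls))

asyncio.run(crawl_limited(urls))  # urls comes from the snippet above
```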
asyncio is a staple of every crawler I write; if you have not used it yet, give it a try.

3. Request the data in batches

When the URL list gets huge, splitting it into batches keeps memory and concurrency under control and gives the target site a breather between bursts.
```python
import time

batch_size = 1000
for i in range(0, len(urls), batch_size):
    batch = urls[i:i + batch_size]
    asyncio.run(main(batch))  # main is the async entry point defined above
    time.sleep(5)  # play the "polite crawler" and rest between batches
```
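If you are already on asyncio, a variant worth considering keeps everything inside one event loop and pauses with asyncio.sleep instead of restarting the loop for every batch. This is just a sketch that reuses fetch_url from the aiohttp example above:

```python
import asyncio
import aiohttp

async def crawl_in_batches(urls, batch_size=1000, pause=5):
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            await asyncio.gather(*(fetch_url(session, u) for u in batch))
            await asyncio.sleep(pause)  # rest between batches without blocking the loop

asyncio.run(crawl_in_batches(urls))
```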
4. Cut unnecessary requests: caching helps

The requests-cache library can cache GET responses, so repeat visits to the same URL stop wasting bandwidth.

```python
from requests_cache import CachedSession

session = CachedSession(cache_name="demo_cache", backend="sqlite", expire_after=3600)

def fetch_url(url):
    response = session.get(url)
    print(f"Fetched {url} from {'cache' if response.from_cache else 'web'} with status: {response.status_code}")

for url in urls:
    fetch_url(url)
```
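requests-cache can also patch requests globally via requests_cache.install_cache, which is handy when you do not want to thread a session object through existing code; a minimal sketch:

```python
import requests
import requests_cache

# Patches requests so every requests.get() below is cached transparently.
requests_cache.install_cache("demo_cache", backend="sqlite", expire_after=3600)

response = requests.get("http://example.com/page1")
print(f"from cache: {response.from_cache}")
```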
5. Beat IP limits with a proxy pool

A proxy library is one decent option, and you can also pair it with free or paid proxy-pool services.

```python
import requests
proxies = [
    "http://proxy1.com:8080",
    "http://proxy2.com:8080",
    # more proxy addresses
]

def fetch_url(url):
    proxy = {"http": proxies[0]}  # better: pick a proxy at random for each request
    response = requests.get(url, proxies=proxy)
    print(f"Fetched {url} with proxy {proxy}")

for url in urls:
    fetch_url(url)
```
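To make the "pool" part real, rotate through the proxies and retry when one dies. A rough sketch, with made-up proxy addresses and an arbitrary retry count:

```python
import random
import requests

PROXIES = [
    "http://proxy1.example.com:8080",  # placeholder addresses
    "http://proxy2.example.com:8080",
]

def fetch_with_rotation(url, retries=3, timeout=10):
    """Try the request through randomly chosen proxies until one succeeds."""
    for attempt in range(1, retries + 1):
        proxy = random.choice(PROXIES)
        try:
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
        except requests.RequestException as e:
            print(f"Attempt {attempt} via {proxy} failed: {e}")
    return None
```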
6. Ready to go distributed?

When one machine is no longer enough, reach for Scrapy together with Scrapy-Redis, or PySpider, and the like.
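For reference, here is a minimal sketch of what a Scrapy-Redis setup typically looks like; the spider name, redis_key, and the local Redis URL are assumptions for illustration, not anything prescribed in this article:

```python
# settings.py: hand scheduling and de-duplication over to Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True               # keep the queue between runs
REDIS_URL = "redis://localhost:6379"   # assumed local Redis instance

# spiders/pages.py: every worker pops start URLs from the shared Redis queue
from scrapy_redis.spiders import RedisSpider

class PagesSpider(RedisSpider):
    name = "pages"                  # hypothetical spider name
    redis_key = "pages:start_urls"  # push URLs here, e.g. via redis-cli LPUSH

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```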