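Besides the Weibo data and collection approach I shared in earlier posts (if you are interested in Weibo data, see "City Weibo check-in data sharing & address decoding and coordinate correction tutorial" and "Beijing Weibo data with geographic coordinates: data sharing, collection methods, and research questions"), the hot topic lately seems to be scraping Xiaohongshu data. Xiaohongshu is billed as China's Instagram, though in practice it feels much like Weibo with a heavier emphasis on images. If you are interested in Weibo or Xiaohongshu data, feel free to send me a private message so we can learn from each other.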
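First, scraping by topic, i.e. posts shared with a tag such as #考公 (civil service exam). Data that carries a Tag like this can be fetched without a Cookie. Export the link from the Xiaohongshu App and paste it into a browser.
Looking at the request, there are a few core parameters: page_id identifies the Tag, cursor records the pagination position, and so on.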
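To make those parameters concrete, here is a minimal sketch that pulls them out of a request URL with the standard library. The helper name `extract_params` is my own, and the example URL is simply the one used by the script in this post with its sample cursor appended; it is an illustration, not something the site documents.

```python
from urllib.parse import urlsplit, parse_qs

def extract_params(url):
    """Return the query parameters of a URL as a flat dict (first value wins)."""
    query = urlsplit(url).query
    # keep_blank_values=True so empty parameters such as `sid=` are not dropped
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}

# Example URL assembled from the endpoint and sample values shown in this post
example_url = (
    "https://www.xiaohongshu.com/web_api/sns/v3/page/notes"
    "?page_size=6&sort=hot&page_id=5bebec9ec3c3cb0001b4efb5&sid="
    "&cursor=-Vemui03unFY3QeGkzhmoTKlzFL6-5KRd_h0nC8LGpg"
)
params = extract_params(example_url)
print(params["page_id"])  # the Tag identifier
print(params["cursor"])   # the pagination position
```

The same helper works on whatever link you copy out of the browser's network panel, so you can check which parameters change between pages.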
Then write code to turn the extraction into a pipeline.
import requests
import pandas as pd

# Initialize variables
page_id = "5bebec9ec3c3cb0001b4efb5"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
}
page = 1
# Sample cursor; it is overwritten by the value returned with each response
next_cursor = "-Vemui03unFY3QeGkzhmoTKlzFL6-5KRd_h0nC8LGpg"
note_data = []  # stores the information of every note

# Crawl loop
while True:
    # Build the URL (the first page is requested without a cursor)
    if page == 1:
        url = f'https://www.xiaohongshu.com/web_api/sns/v3/page/notes?page_size=6&sort=hot&page_id={page_id}&sid='
    else:
        url = f'https://www.xiaohongshu.com/web_api/sns/v3/page/notes?page_size=6&sort=hot&page_id={page_id}&sid=&cursor={next_cursor}'
    # Send the request; fail loudly on HTTP errors instead of parsing an error page
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    json_data = response.json()
    # Parse the data
    for note in json_data['data']['notes']:
        note_info = {
            'note_id': note['id'],
            'title': note['title'],
            'nickname': note['user']['nickname'],
            'avatar': note['user']['images'],
            'user_id': note['user']['userid'],
            'create_time': note['create_time'],
            'image_count': note['image_count'],
            'image_urls': [img['url_size_large'] for img in note['images_list']],
        }
        note_data.append(note_info)
    # Get the cursor for the next page
    next_cursor = json_data['data']['cursor']
    # Check whether there is more data
    if not json_data['data']['has_more']:
        print('No next page, stopping the loop!')
        break
    page += 1

# Save the data to a DataFrame
df = pd.DataFrame(note_data)
# Save to CSV (overwritten on each run, so the header row is not duplicated)
result_file = 'xiaohongshu_notes.csv'
df.to_csv(result_file, index=False, encoding='utf_8_sig')
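The cursor loop above can also be separated from the HTTP call, which makes it possible to try the pagination logic without touching the network. The following is only a sketch under my own assumptions: `collect_notes`, `fetch_page`, and the `max_pages` cap are hypothetical names, and the fake pages merely mimic the `notes` / `cursor` / `has_more` fields that the script reads.

```python
def collect_notes(fetch_page, max_pages=200):
    """Follow cursors until has_more is false.

    fetch_page(cursor) must return the 'data' dict of one response;
    cursor is None for the first page.
    """
    notes, cursor = [], None
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        data = fetch_page(cursor)
        notes.extend(data.get("notes", []))
        cursor = data.get("cursor")
        if not data.get("has_more") or not cursor:
            break
    return notes

# Fake two-page response standing in for the real API (illustrative data only)
fake_pages = {
    None: {"notes": [{"id": "a"}, {"id": "b"}], "cursor": "c1", "has_more": True},
    "c1": {"notes": [{"id": "c"}], "cursor": "c2", "has_more": False},
}
notes = collect_notes(lambda cursor: fake_pages[cursor])
print([n["id"] for n in notes])
```

In real use, `fetch_page` would wrap the `requests.get(...)` call and return `response.json()['data']`; adding a short `time.sleep` there is a sensible courtesy to the server.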
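The end result is the collated topic data, roughly 1,000 records it seems.
Note that the time captured here is in timestamp format and can be converted: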
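A minimal conversion sketch with pandas follows. The unit is an assumption on my part: the 13-digit `upload_time` value shown in the note details later in this post (1699509542000) looks like milliseconds, but the field returned by the tag endpoint may be in seconds, in which case pass `unit="s"`. The helper name `to_beijing_time` is hypothetical.

```python
import pandas as pd

def to_beijing_time(series, unit="ms"):
    """Convert epoch timestamps to timezone-aware Asia/Shanghai datetimes."""
    utc = pd.to_datetime(series, unit=unit, utc=True)
    return utc.dt.tz_convert("Asia/Shanghai")

# One-row frame using the sample timestamp from this post
df = pd.DataFrame({"upload_time": [1699509542000]})
df["upload_time_local"] = to_beijing_time(df["upload_time"], unit="ms")
print(df["upload_time_local"].iloc[0])
```

For the sample value this gives 13:59:02 on 2023-11-09 in Beijing time.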
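Of course, this only covers data that carries a Tag. So how do you fetch by keyword? For that you need to log in to Xiaohongshu, much as with Weibo. Find a request that carries a Cookie and note down the key fields.
Format them as JSON like this: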
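As a rough sketch of that formatting step, the snippet below turns a raw `Cookie` header string copied from the browser into a JSON object. The field names in `raw` are placeholders, not the real Xiaohongshu cookie names, and `cookie_string_to_dict` is a hypothetical helper; which fields actually matter has to be read off your own logged-in request.

```python
import json

def cookie_string_to_dict(cookie_string):
    """Split a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies

# Placeholder cookie string; replace with the one copied from your own request
raw = "field_a=value1; field_b=abc=def; field_c=123"
cookies = cookie_string_to_dict(raw)
print(json.dumps(cookies, indent=2, ensure_ascii=False))
```

The resulting dict can be passed directly as the `cookies=` argument of `requests.get`.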
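In addition, Xiaohongshu applies some obfuscation and encryption, so every request carries odd-looking strings. We take a JS file in which an expert has already worked out the decoding and run it natively with the execjs package, an approach I find more powerful than simulating a browser with selenium. The key algorithm comes from the cv-cat/Spider_XHS repository shared on GitHub.
The detailed information returned for each note looks like this: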
id: None
note_id: 654c7526000000003103dc16
user_id: 63f5e618000000001001de07
nickname: (omitted)
avatar: https://sns-avatar-qc.xhscdn.com/avatar/64aa9cc6dfb7520001e0f236.jpg
title: 真正懂公考的都知道,省考才是往届生的天堂
desc: 😅就是说往届生们别挤破脑袋想国考上岸了,一般大部分往届参加国考的不是去刷分就是去积累经验的,而真正懂公考的人都知道,只有省考才是往届生的天堂,国考岗位80%以上都是应届生,专业和应届一套操作限制下来,基本上没有几个岗位可以选择,即便选择也是竞争激烈。....(omitted)
liked_count: 464
collected_count: 456
comment_count: 160
share_count: 13
video_addr:
images: [{'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/63f9abe663ec965f0c7c9f73bccd56e9/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/663a9e8c15c358fc16fae80b75904042/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_wm_1'}], 'file_id': '', 'height': 1705, 'width': 1280, 'url': '', 'trace_id': ''}, {'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/6b17cbc9599183cdfc8ed12ed021f5c7/1040g2sg30r8bhcv71k0g5ovlsoc43ng7k50hq58!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/0db76d474349bfa8417ff45fead93de9/1040g2sg30r8bhcv71k0g5ovlsoc43ng7k50hq58!nd_whgt34_webp_wm_1'}]}, {'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/217a7c1205bfdbf755c7c83f61fedf3e/1040g2sg30r8bhcv71k105ovlsoc43ng7a5scb6o!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/0848496a23b6170964ed1745cbd5d6be/1040g2sg30r8bhcv71k105ovlsoc43ng7a5scb6o!nd_whgt34_webp_wm_1'}], 'file_id': '', 'height': 1720}, {'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/e42d215389a75fd1f82a540295090914/1040g2sg30r8bhcv71k1g5ovlsoc43ng7ovll68o!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/e49ad4f7d528f5552a59f2b87bd7b47e/1040g2sg30r8bhcv71k1g5ovlsoc43ng7ovll68o!nd_whgt34_webp_wm_1'}]}, {'info_list': [{'image_scene': 'CRD_PRV_WEBP',
'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/fa49d164fdc3df8a63a9aea9fbf93bb7/1040g2sg30r8bhcv71k205ovlsoc43ng704gagn8!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/bc6572d5ffd13a5a0d5928a6a463343b/1040g2sg30r8bhcv71k205ovlsoc43ng704gagn8!nd_whgt34_webp_wm_1'}], 'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': ''}, {'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/ff1d231a2905b28c0ecd455fdb56d920/1040g2sg30r8bhcv71k2g5ovlsoc43ng7oqh7u80!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/3b837b8b073b25e9fac856043856a955/1040g2sg30r8bhcv71k2g5ovlsoc43ng7oqh7u80!nd_whgt34_webp_wm_1'}]}, {'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/fd4803801cec0e957c4dfc6eb828de2f/1040g2sg30r8bhcv71k305ovlsoc43ng7enq393g!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/1c1670b8059b174cb051b1c441d589d0/1040g2sg30r8bhcv71k305ovlsoc43ng7enq393g!nd_whgt34_webp_wm_1'}]}, {'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/77dab91e9477cf7f97cc75e9364f8b28/1040g2sg30r8bhcv71k3g5ovlsoc43ng738v26bg!nd_whgt34_webp_prv_1', 'image_scene': 'CRD_PRV_WEBP'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/db116705075e5dc6712688ef6aa7cbe0/1040g2sg30r8bhcv71k3g5ovlsoc43ng738v26bg!nd_whgt34_webp_wm_1'}], 'file_id': '', 'height': 1720}, {'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/62cff2970f52040b66121605a1afdec9/1040g2sg30r8bhcv71k405ovlsoc43ng7pelho68!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 
'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/abe999ee14194c82c243c78285642393/1040g2sg30r8bhcv71k405ovlsoc43ng7pelho68!nd_whgt34_webp_wm_1'}], 'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': ''}, {'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/a2b2fb02421561f666a8c9e9b17a44cb/1040g2sg30r8bhcv71k4g5ovlsoc43ng742e5lho!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/89099c334af5049f51ea927f87e55a1d/1040g2sg30r8bhcv71k4g5ovlsoc43ng742e5lho!nd_whgt34_webp_wm_1'}]}, {'file_id': '', 'height': 1720, 'width': 1290, 'url': '', 'trace_id': '', 'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/28aca59f5a5f61b38e96cc8d16342443/1040g2sg30r8bhcv71k505ovlsoc43ng766ca560!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/47ad43ee9daffb6a30d4c812a91dce02/1040g2sg30r8bhcv71k505ovlsoc43ng766ca560!nd_whgt34_webp_wm_1'}]}]
tag_list: ['省考', '国考', '考公', '公务员考试', '公务员省考', '公务员面试', '公务员']
upload_time: 1699509542000
note_ip_location: 安徽
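The `images` field in the record above is deeply nested, so it is usually worth flattening before saving. A minimal sketch, assuming every image entry has an `info_list` whose items carry `image_scene` and `url` as in the sample; the helper name `pick_image_urls` and the choice of the `CRD_WM_WEBP` scene are mine.

```python
def pick_image_urls(images, scene="CRD_WM_WEBP"):
    """Return one URL per image, taking the variant whose image_scene matches."""
    urls = []
    for image in images:
        for info in image.get("info_list", []):
            if info.get("image_scene") == scene:
                urls.append(info["url"])
                break  # one URL per image is enough
    return urls

# First image entry copied from the sample record above
sample_images = [
    {
        "info_list": [
            {"image_scene": "CRD_PRV_WEBP",
             "url": "http://sns-webpic-qc.xhscdn.com/202312022321/63f9abe663ec965f0c7c9f73bccd56e9/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_prv_1"},
            {"image_scene": "CRD_WM_WEBP",
             "url": "http://sns-webpic-qc.xhscdn.com/202312022321/663a9e8c15c358fc16fae80b75904042/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_wm_1"},
        ],
        "file_id": "", "height": 1705, "width": 1280, "url": "", "trace_id": "",
    },
]
print(pick_image_urls(sample_images))
```

Note that the URLs in the sample embed what looks like a date-time segment (202312022321), so they may expire; downloading the images soon after fetching is probably safer than storing only the links.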
Not sure what this data is good for? I feel that only a full crawl would be really valuable.