Beyond Weibo: Collecting Xiaohongshu Data (Data, Code, and Approach)

Digest   Technology   2023-12-03 09:05   Beijing

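    Besides the Weibo data and collection approach I shared a while ago (if you are interested in Weibo data, see "City Weibo check-in data sharing & address decoding and coordinate correction tutorial" and "Beijing Weibo data with geographic coordinates & data acquisition methods and research questions"), what seems to be hot lately is collecting Xiaohongshu data. Xiaohongshu is billed as China's Instagram; in practice it feels much like Weibo, with more emphasis on images. If you are interested in Weibo or Xiaohongshu data, feel free to send a private message through the account so we can learn from each other.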

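    First, scraping by topic, that is, posts shared with a tag such as #考公 (preparing for the civil service exam). Data that carries a tag like this can be fetched without a Cookie. Export a share link from the Xiaohongshu app and paste it into a browser.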

    In the request you can see a few core parameters: page_id identifies the tag, and cursor tracks pagination, among others.
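To make these parameters concrete, here is a minimal sketch that pulls them out of a link with Python's standard library. The URL is assembled from the endpoint and values used in this post's script; a link you copy yourself may carry additional parameters.

```python
from urllib.parse import urlsplit, parse_qs

# URL assembled from the endpoint and values used in this post's script
url = (
    "https://www.xiaohongshu.com/web_api/sns/v3/page/notes"
    "?page_size=6&sort=hot"
    "&page_id=5bebec9ec3c3cb0001b4efb5"
    "&sid=&cursor=-Vemui03unFY3QeGkzhmoTKlzFL6-5KRd_h0nC8LGpg"
)

# keep_blank_values=True preserves empty parameters such as "sid="
query = parse_qs(urlsplit(url).query, keep_blank_values=True)

# parse_qs returns a list per key; each key appears once here, so take the first value
params = {key: values[0] for key, values in query.items()}

print(params["page_id"])  # identifies the tag page
print(params["cursor"])   # identifies the position in the result list
```

Holding the parameters in a dict also makes it easy to pass them to requests through its `params=` argument instead of formatting the URL by hand.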

    Then write code to turn the extraction into a pipeline.

import requests
import pandas as pd

# Initial variables
page_id = "5bebec9ec3c3cb0001b4efb5"  # ID of the tag (topic) page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
}
base_url = f'https://www.xiaohongshu.com/web_api/sns/v3/page/notes?page_size=6&sort=hot&page_id={page_id}&sid='
page = 1
next_cursor = None  # filled in from each response; the first page needs no cursor
note_data = []      # collects the information of every note

# Crawl loop
while True:
    # Build the URL: pages after the first carry the cursor of the previous response
    url = base_url if page == 1 else f'{base_url}&cursor={next_cursor}'

    # Send the request; fail early on network or HTTP errors
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    json_data = response.json()

    # Parse the notes on this page
    for note in json_data['data']['notes']:
        note_info = {
            'note_id': note['id'],
            'title': note['title'],
            'author_nickname': note['user']['nickname'],
            'author_avatar': note['user']['images'],
            'author_id': note['user']['userid'],
            'create_time': note['create_time'],
            'image_count': note['image_count'],
            'image_urls': [img['url_size_large'] for img in note['images_list']],
        }
        note_data.append(note_info)

    # Cursor for the next page
    next_cursor = json_data['data']['cursor']

    # Stop when there is no more data
    if not json_data['data']['has_more']:
        print('No more pages, stopping the loop.')
        break
    page += 1

# Collect everything into a DataFrame and save it as CSV.
# The file is overwritten on each run: the original append mode ('a+')
# would write the header row again every time the script is rerun.
df = pd.DataFrame(note_data)
result_file = 'xiaohongshu_notes.csv'
df.to_csv(result_file, index=False, encoding='utf_8_sig')

    The end result is the compiled topic data, roughly 1,000 records.

    The times captured here are Unix timestamps, which can be converted into readable dates.
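As a minimal sketch of that conversion, the snippet below uses only the standard library. The sample value is the upload_time from the note record shown later in this post, which is in milliseconds. Whether create_time from the tag endpoint is in seconds or milliseconds is an assumption to check against your own data (13 digits suggests milliseconds, 10 digits seconds). The helper name to_local_time is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# China Standard Time is a fixed UTC+8 offset with no daylight saving time
CST = timezone(timedelta(hours=8))

def to_local_time(timestamp, unit="ms"):
    """Convert a Unix timestamp (milliseconds or seconds) to a readable CST string."""
    seconds = timestamp / 1000 if unit == "ms" else timestamp
    return datetime.fromtimestamp(seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S")

print(to_local_time(1699509542000))            # millisecond timestamp
print(to_local_time(1699509542, unit="s"))     # the same instant in seconds
```

On a whole DataFrame column, pandas' `pd.to_datetime(df[column], unit='ms')` does the same job in one call.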

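    That only covers fetching data by tag, though. What about fetching by keyword? For that you need to log in to Xiaohongshu, much as with Weibo. Find a request that carries a Cookie and record the key fields.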

    Then format those fields as JSON.
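One way to do this, as a rough sketch: copy the raw Cookie header from the browser's developer tools and split it into name/value pairs. The field names and values below are placeholders, not real Xiaohongshu cookie fields, and the helper name cookie_string_to_dict is made up for this example.

```python
import json

def cookie_string_to_dict(raw_cookie):
    """Turn a raw 'k1=v1; k2=v2' Cookie header into a dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        # Split on the first '=' only, because values may themselves contain '='
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies

# Placeholder names and values (assumptions for illustration only)
raw = "field_a=abc123; field_b=xyz=789; field_c="
cookies = cookie_string_to_dict(raw)

print(json.dumps(cookies, indent=2))
```

The resulting dict can be saved to a JSON file or passed directly to requests through its `cookies=` argument.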

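    In addition, Xiaohongshu applies some obfuscation and encryption, so every request carries odd-looking strings. We take a JS file in which someone has already reverse-engineered the algorithm and run it natively from Python with the execjs package, which I find works better than simulating a browser with selenium. The key algorithm comes from the cv-cat/Spider_XHS repository shared on GitHub.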
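The general pattern of calling a JS function from Python can be sketched as follows. This is not the Spider_XHS code: the JS file path, the function name, the API path, the header name, and the shape of the return value are all hypothetical here, and the real signing function has to come from the JS file you actually use. `execjs.compile(...)` and `.call(...)` are the PyExecJS calls involved, and they need a JS runtime such as Node.js installed.

```python
import json

def load_js_signer(js_path, func_name):
    """Compile a local JS file with PyExecJS and expose one of its functions.

    js_path and func_name depend on the JS file you use; they are not known
    from this post.
    """
    import execjs  # PyExecJS; imported lazily so the dry run below works without it
    with open(js_path, encoding="utf-8") as f:
        ctx = execjs.compile(f.read())
    return lambda *args: ctx.call(func_name, *args)

def build_signed_request(signer, api_path, payload, cookies):
    """Serialize the payload and merge the signer's output into the headers.

    Assumes the signer returns a dict of extra header fields.
    """
    body = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    headers = {
        "Content-Type": "application/json;charset=UTF-8",
        "Cookie": "; ".join(f"{k}={v}" for k, v in cookies.items()),
    }
    headers.update(signer(api_path, body))
    return headers, body

# Stand-in signer so the flow can be checked without a JS runtime
def fake_signer(api_path, body):
    return {"x-demo-signature": f"{len(api_path)}-{len(body)}"}

headers, body = build_signed_request(
    fake_signer,
    "/api/demo/search",              # hypothetical API path
    {"keyword": "考公", "page": 1},
    {"field_a": "abc123"},           # placeholder cookie
)
print(headers)
print(body)
```

In real use, `fake_signer` would be replaced by something like `load_js_signer("path/to/sign.js", "name_of_sign_function")`, with both arguments taken from the JS file at hand.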

    The detailed information returned for each note (field values are shown as returned):

id: None
note_id: 654c7526000000003103dc16
user_id: 63f5e618000000001001de07
nickname: (omitted)
avatar: https://sns-avatar-qc.xhscdn.com/avatar/64aa9cc6dfb7520001e0f236.jpg
title: 真正懂公考的都知道,省考才是往届生的天堂
desc: 😅就是说往届生们别挤破脑袋想国考上岸了,一般大部分往届参加国考的不是去刷分就是去积累经验的,而真正懂公考的人都知道,只有省考才是往届生的天堂,国考岗位80%以上都是应届生,专业和应届一套操作限制下来,基本上没有几个岗位可以选择,即便选择也是竞争激烈。....(omitted)
liked_count: 464
collected_count: 456
comment_count: 160
share_count: 13
video_addr:
images: [{'info_list': [{'image_scene': 'CRD_PRV_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/63f9abe663ec965f0c7c9f73bccd56e9/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_prv_1'}, {'image_scene': 'CRD_WM_WEBP', 'url': 'http://sns-webpic-qc.xhscdn.com/202312022321/663a9e8c15c358fc16fae80b75904042/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_wm_1'}], 'file_id': '', 'height': 1705, 'width': 1280, 'url': '', 'trace_id': ''}, ... (10 more entries with the same structure omitted)]
tag_list: ['省考', '国考', '考公', '公务员考试', '公务员省考', '公务员面试', '公务员']
upload_time: 1699509542000
note_ip_location: 安徽
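The images field is deeply nested, so a small helper makes it easier to use. The sketch below picks one URL per image by its image_scene value. The sample entry is the first image of the record above. Reading CRD_WM_WEBP as the watermarked variant and CRD_PRV_WEBP as the preview is a guess from the names, not something the response documents, and the helper name pick_image_urls is made up for this example.

```python
def pick_image_urls(images, scene="CRD_WM_WEBP"):
    """Return one URL per image: the first info_list entry matching the given scene."""
    urls = []
    for image in images:
        for info in image.get("info_list", []):
            if info.get("image_scene") == scene:
                urls.append(info["url"])
                break  # one URL per image is enough
    return urls

# First image entry of the note record shown above
sample_images = [
    {
        "info_list": [
            {"image_scene": "CRD_PRV_WEBP",
             "url": "http://sns-webpic-qc.xhscdn.com/202312022321/63f9abe663ec965f0c7c9f73bccd56e9/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_prv_1"},
            {"image_scene": "CRD_WM_WEBP",
             "url": "http://sns-webpic-qc.xhscdn.com/202312022321/663a9e8c15c358fc16fae80b75904042/1040g2sg30r8bhcv71k005ovlsoc43ng75alpbc0!nd_whgt34_webp_wm_1"},
        ],
        "file_id": "", "height": 1705, "width": 1280, "url": "", "trace_id": "",
    },
]

image_urls = pick_image_urls(sample_images)
print(len(image_urls), image_urls[0])
```

Images without a matching scene are skipped, so the returned list can be shorter than the input.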

    I am not sure yet what this data is good for. My feeling is that it only becomes really valuable with a full crawl.

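城市感知计算 (literally "urban sensing and computing")
Understanding the world and changing it: a non-profit urban science sharing platform built by Dr. Zhang Yan (张岩) and a team of volunteers. Feel free to add us as a contact for academic exchange.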