Title: Mining Chinese Historical Sources At Scale: A Machine Learning-Approach to Qing State Capacity
Mining Chinese Historical Sources at Scale: Applying Machine Learning to the Study of Qing State Capacity
Wolfgang Keller
University of Colorado
Carol H. Shiue
University of Colorado
Sen Yan
University of Colorado
Primary historical sources are often bypassed in favor of secondary sources because accessing and extracting primary information carries high human costs, especially in lower-resource settings. We propose a supervised
machine-learning approach to the natural language processing of Chinese historical data. An application to identifying different forms of social unrest in the Veritable Records of the Qing Dynasty shows
that the approach dramatically cuts the cost of using primary source data while
being free from human bias, reproducible, and flexible enough to address particular questions. External
evidence on triggers of unrest also suggests that the computer-based approach is no less successful in
identifying social unrest than human researchers are.
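Primary historical sources are often displaced by secondary sources because of the high labor cost of accessing and extracting first-hand information, especially in lower-resource settings. This paper proposes a supervised machine-learning method (GUWEN-BERT) for the natural language processing of Chinese historical data. Applying the method to identify different forms of social unrest in the Veritable Records of the Qing Dynasty (《清实录》) shows that it greatly reduces the cost of using primary-source data while avoiding human bias, being reproducible, and remaining flexible enough to address specific questions. External evidence also indicates that the computer-based method is no worse than human researchers at identifying social unrest.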
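To make the idea of supervised classification of historical passages concrete, here is a minimal sketch. It is not the authors' model: it swaps in a much simpler technique, a multinomial Naive Bayes classifier over character unigrams and bigrams, purely for illustration. The six training passages, the labels `unrest` and `other`, and all function and class names are invented for this example and are not taken from the Veritable Records or from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=2):
    """Return all character 1..n_max-grams; classical Chinese has no word spacing."""
    chars = [c for c in text if not c.isspace()]
    return [
        "".join(chars[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(chars) - n + 1)
    ]


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative stand-in only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()               # labelled passages per class
        self.feature_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.class_counts[label] += 1
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log(
                        (counts[gram] + self.alpha)
                        / (total + self.alpha * vocab_size)
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labelled passages (invented, in the style of court records).
train_texts = [
    "匪徒聚眾搶掠村莊", "逆匪攻陷縣城", "賊匪滋事戕官",
    "諭令修築河堤", "賞給官員緞匹", "免除本年錢糧",
]
train_labels = ["unrest", "unrest", "unrest", "other", "other", "other"]

clf = NaiveBayesClassifier().fit(train_texts, train_labels)
print(clf.predict("匪徒搶掠縣城"))
print(clf.predict("免除錢糧"))
```

The workflow is the part that carries over: researchers label a modest sample of passages by hand, a model learns from those labels, and the fitted model then labels the remaining text at a far lower cost than manual reading. Character-level features are used here because classical Chinese is written without word boundaries; a pretrained language model would take the place of the n-gram counts in a realistic setting.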
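Tips: The workshop "Machine Learning Applications in Economics and Finance" will be held soon. Scholars and students interested in machine learning methods and their applications are welcome to register!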