Featured Article
统计学科大规模多层学术网络数据集——网络构建、描述分析与实际应用
Statistical Large-Scale Multi-Layer Academic Network Dataset—Network Construction, Descriptive Analysis, and Applications
[Originally published in 《经济管理学刊》, 2024, Vol. 3, No. 4] (published December 2024)
Authors
Tianchen Gao (高天辰), Yan Zhang (张妍), School of Economics, Xiamen University
Rui Pan (潘蕊), School of Statistics and Mathematics, Central University of Finance and Economics
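Abstract
Multi-layer academic networks describe the diverse relationships among academic entities and support both the study of academic development and the prediction of future directions. However, existing datasets usually contain only a single network, and a large-scale dataset that covers multiple academic network layers at once has been lacking. This paper provides a high-quality, large-scale multi-layer academic network dataset (the LMANStat dataset) comprising eight academic networks: the collaboration (co-authorship) network, the co-institution network, the citation network, the journal citation network, the author citation network, the co-citation network, the author-paper network, and the keyword co-occurrence network. Every layer of the constructed multi-layer network changes dynamically over time. The paper also provides node attributes, such as authors' research interests, productivity, region, and institution. The dataset is currently one of the larger and more comprehensive academic network datasets in statistics: it covers multiple network types, spans a long time period with time-varying structure, and carries rich node attribute information, filling a gap in existing datasets. With it, the development and evolution of statistics as a discipline can be studied from multiple perspectives, and it offers a rich basis for research on complex systems with multi-layer structure.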
Keywords
Multi-Layer Academic Networks; LMANStat Dataset; Citation Network; Collaboration Network
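Key Points
1. Research Background and Significance
Relational structures formed by multiple kinds of interactions among different entities are now very common. As an effective tool for analyzing such data, multi-layer networks have attracted considerable attention in recent years because they can capture the complexity of real-world systems. Multi-layer academic networks are a specific application of multi-layer networks, made up of multiple layers of relationships among academic entities (authors, institutions, papers, or journals). Each layer represents a different type of relationship. The layers can be analyzed separately to capture layer-specific information, or jointly to exploit information that may be shared across relationships. Analyzing multi-layer networks from multiple angles is therefore necessary for a fuller understanding of the underlying system.
Despite their usefulness, most available academic networks have only a single layer, with collaboration networks and citation networks being the most widely studied. In statistics, the research team of Professor Jiashun Jin (金加顺) has done earlier work in this area, which greatly inspired the present study. They carried out a large amount of original and constructive work on journal selection, publication data collection and cleaning, and academic network construction and analysis. For example, using a dataset of 83,331 articles published between 1975 and 2015 in 36 representative journals in statistics, probability, and machine learning, they studied the co-citation and co-authorship networks of statisticians. Professor Jin's team overcame many challenges, such as obtaining complete paper information, matching and cleaning names, and handling inconsistencies in online information. However, no existing dataset provides academic networks with a multi-layer structure: available data are limited to co-authorship and citation networks, which makes it difficult to study the interactions among academic entities (authors, institutions, papers, or journals), and few datasets provide node attributes. The distinctive contribution of this paper is therefore a more diverse and comprehensive collection of academic networks that goes beyond co-authorship and citation networks to a multi-layer academic network made up of eight networks. In addition, the paper introduces an innovative approach to data cleaning and to constructing network attributes.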
Summary
Relational structures consisting of different types of interactions among several groups of entities are very common nowadays. As a useful tool for analyzing this type of data, multi-layer networks have gained increasing attention in recent years due to their ability to capture the complexity of real-world systems. Multi-layer academic networks are a specific type of multi-layer network that consists of multiple layers of relationships among academic entities, such as researchers, institutions, papers, or journals. Typical examples include the collaboration network, which represents co-authorship relationships among researchers; the citation network, which represents citation relationships among papers; and the journal citation network, which represents citation relationships among journals. These networks have been used for various purposes, such as identifying research areas, evaluating research impact, predicting scientific trends, studying the diffusion of scientific knowledge, and supporting science policy and decision-making. Overall, multi-layer academic networks provide a powerful tool for understanding and analyzing the complex relationships that underlie academic communities and their impact on the production and dissemination of scientific knowledge.
In this work, we collect data from the Web of Science (www.webofscience.com) on papers published between 1981 and 2021 in 42 statistical journals. Our LMANStat dataset includes basic information on 97,436 papers: title, abstract, keywords, publisher, publication date, volume and pages, document type, citation counts, author information (name, ORCID, address, region, and institution), and reference lists. Based on this information, we construct multi-layer academic networks, including the collaboration network, co-institution network, citation network, co-citation network, journal citation network, author citation network, author-paper network, and keyword co-occurrence network. These networks change over time, which supports a dynamic perspective in analysis. We also include rich nodal attributes of authors, such as their research interests, to enhance the usefulness of the dataset. The LMANStat dataset is publicly available on GitHub and can be accessed directly at https://github.com/Gaotianchen97/LMANStat.
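As a rough illustration of how two of these layers can be derived from paper records, the sketch below builds a weighted collaboration network and a citation network from toy data. The record fields ("id", "authors", "refs") and the helper names are assumptions made for this example; they are not the actual LMANStat schema or the authors' construction code.

```python
from collections import Counter
from itertools import combinations

# Toy records. The field names are hypothetical, not the LMANStat schema.
papers = [
    {"id": "p1", "authors": ["A", "B"], "refs": []},
    {"id": "p2", "authors": ["B", "C", "A"], "refs": ["p1"]},
    {"id": "p3", "authors": ["C"], "refs": ["p1", "p2", "external"]},
]

def collaboration_edges(papers):
    """Undirected co-authorship edges, weighted by the number of joint papers."""
    weights = Counter()
    for paper in papers:
        # Sort so that each author pair always appears in the same order.
        for u, v in combinations(sorted(set(paper["authors"])), 2):
            weights[(u, v)] += 1
    return weights

def citation_edges(papers):
    """Directed (citing, cited) edges, keeping only references inside the dataset."""
    known = {paper["id"] for paper in papers}
    return [(p["id"], ref) for p in papers for ref in p["refs"] if ref in known]

collab = collaboration_edges(papers)
cites = citation_edges(papers)
```

Here the pair ("A", "B") receives weight 2 because the two authors share two papers, and the reference to "external" is dropped because it points outside the toy dataset. Other layers, such as a keyword co-occurrence network, could be built with the same pairing pattern applied to keywords instead of authors.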
We present a comprehensive overview of our methodology, which covers the complete workflow from data collection and data cleaning to the construction of the multi-layer academic networks. We then explain in detail how authors and papers are identified, how author attributes are extracted, and how the networks are constructed. Next, we validate the dataset through several potential scenarios for exploring and analyzing the multi-layer academic networks. To emphasize the usability of the dataset, we also provide key insights into the characteristics of the data, which align with historical research findings and the consensus among statisticians. More importantly, our research team has used the LMANStat dataset extensively to validate its usability. Among the multi-layer academic networks, the collaboration network and the citation network are the most commonly used, so we use them for verification. We also use the journal citation network, with journals as nodes, and the keyword co-occurrence network, with keywords as nodes, to validate the LMANStat dataset.
For the collaboration network, the scale-free phenomenon can be detected through a log-log degree distribution plot, a pattern also reported in research on collaboration networks. The average number of authors per paper increases year by year, indicating a trend toward more collaboration in statistical research. By visualizing a sub-network of the collaboration network, we identified the top 4 authors with the highest degrees, whose innovative methods and insights have had a profound impact on the field of statistics. As for the citation network, the in-degree of a paper is crucial because it represents the number of times the paper has been cited within the network: a higher in-degree means more citations. Within our dataset, the average in-degree per paper is 5.31, which correlates closely with the Impact Factor (IF) of the selected journals.
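The two descriptive quantities mentioned here, the degree distribution and the average in-degree, can be computed with a few lines of code. The following is a minimal sketch on toy edge lists; the data and function names are illustrative assumptions, and the output is unrelated to the 5.31 figure reported for the real dataset.

```python
import math
from collections import Counter

# Toy edge lists, not taken from the LMANStat dataset.
coauthor_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]  # (citing, cited)

def degree_distribution(edges):
    """Map each degree k to the number of nodes with that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

def loglog_points(distribution):
    """Points (log k, log count) for a log-log degree distribution plot."""
    return [(math.log(k), math.log(n)) for k, n in sorted(distribution.items())]

def average_in_degree(citations, n_papers):
    """Each citation adds one to some paper's in-degree, so the mean is edges / papers."""
    return len(citations) / n_papers

dist = degree_distribution(coauthor_edges)
points = loglog_points(dist)
avg_in = average_in_degree(citations, n_papers=3)
```

On real data, an approximately straight, downward-sloping line through the log-log points is the usual visual sign of a heavy-tailed degree distribution, though a plot alone is not a formal test of the scale-free property.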
In addition, journal citation networks are often employed for ranking journals, which is considered an important indicator for evaluating the quality and impact of publications in specific research fields. We therefore validate the usability of the journal citation network through journal ranking. By calculating the PageRank centrality of each node (journal), we can rank journals by their importance. Interestingly, the ranking based on PageRank centrality closely aligns with the expectations and intuitions of statisticians, which suggests that the PageRank-based approach resonates well with the perceptions of experts in the field.
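To make the ranking step concrete, here is a minimal sketch of PageRank by power iteration on a weighted, directed journal citation network. The journal labels, citation counts, damping factor of 0.85, and iteration count are all assumptions chosen for illustration; the source does not state which PageRank settings or software the authors used.

```python
# Hypothetical citation counts: (citing journal, cited journal) -> citations.
journal_citations = {
    ("J2", "J1"): 10,
    ("J3", "J1"): 8,
    ("J3", "J2"): 2,
    ("J1", "J2"): 3,
}

def pagerank(weights, damping=0.85, iters=100):
    """Weighted PageRank by power iteration; dangling mass is spread uniformly."""
    nodes = sorted({node for edge in weights for node in edge})
    n = len(nodes)
    out_total = {node: 0.0 for node in nodes}
    for (src, _dst), w in weights.items():
        out_total[src] += w
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # Rank held by nodes with no outgoing citations.
        dangling = sum(rank[node] for node in nodes if out_total[node] == 0)
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_total[src]
        rank = new_rank
    return rank

scores = pagerank(journal_citations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network J1 ranks first because it receives most of the citation weight from the other two journals, while J3, which nobody cites, ranks last. Whether journal self-citations are kept or removed before ranking is a design choice this sketch leaves open.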
It is important to note that the multi-layer academic networks presented in this paper are all dynamic. We use the citation network as an illustration, with snapshots of the network from different time periods: 1980-2006, 1980-2010, and 1980-2020. The visualization shows that the citation network has a community structure that changes constantly over time. In particular, the community associated with variable selection has grown continuously over the years, as indicated by the increasing number of papers on that topic. Because every network in the dataset is dynamic, the dataset supports the study of how these networks evolve.
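Cumulative snapshots like the three periods mentioned above can be produced by filtering papers on publication year. The sketch below assumes toy papers with hypothetical years and a hypothetical `snapshot` helper; it illustrates the idea only and is not the authors' procedure.

```python
# Toy data: paper id -> publication year, plus (citing, cited) pairs.
paper_year = {"p1": 1996, "p2": 2005, "p3": 2009, "p4": 2018}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p3")]

def snapshot(paper_year, citations, end_year):
    """Cumulative snapshot: papers published up to end_year and citations among them."""
    kept = {pid for pid, year in paper_year.items() if year <= end_year}
    edges = [(src, dst) for src, dst in citations if src in kept and dst in kept]
    return kept, edges

# Node and edge counts for each cumulative snapshot.
sizes = {}
for end_year in (2006, 2010, 2020):
    nodes, edges = snapshot(paper_year, citations, end_year)
    sizes[end_year] = (len(nodes), len(edges))
```

Because each snapshot is cumulative, later snapshots contain earlier ones, so node and edge counts can only grow. Community detection could then be run on each snapshot to follow how a topic cluster changes over time.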
In conclusion, the paper uses statistical publication data collected from the Web of Science to provide a large-scale, high-quality multi-layer academic network dataset (the LMANStat dataset). The study further validates the quality and usability of the constructed multi-layer academic networks from multiple perspectives. It discusses feasible research directions and application scenarios, including but not limited to exploring community structures within academic networks, tracing the development and evolution of research topics, investigating the mechanisms behind paper citation counts, discussing the impact of international and inter-institutional collaborations, exploring the career planning and development of researchers, and establishing more diversified journal ranking systems.
Original citation: 高天辰, 张妍, 潘蕊. 统计学科大规模多层学术网络数据集——网络构建、描述分析与实际应用[J]. 经济管理学刊, 2024, 3(4): 237-256.
Article editor: Hou Mandi (侯曼迪); Managing editor: Hou Zhenfeng (侯振锋); Reviewer: Zhu Helou (朱鹤楼)