Study: Transparency is often lacking in datasets used to train large language models


Researchers developed an easy-to-use tool that enables an AI practitioner to find data that suits the purpose of their model, which could improve accuracy and reduce bias.

In order to train more powerful large language models, researchers use vast dataset collections that blend diverse data from thousands of web sources.

But as these datasets are combined and recombined into multiple collections, important information about their origins and the restrictions on how they can be used is often lost or confounded in the shuffle.

Not only does this raise legal and ethical concerns, it can also damage a model's performance. For instance, if a dataset is miscategorized, someone training a machine-learning model for a certain task may end up unwittingly using data that were not designed for that task.

In addition, data from unknown sources could contain biases that cause a model to make unfair predictions when deployed.

To improve data transparency, a team of multidisciplinary researchers from MIT and elsewhere launched a systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of these datasets omitted some licensing information, while about 50 percent had information that contained errors.

Building off these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset's creators, sources, licenses, and allowable uses.

"These types of tools can help regulators and practitioners make informed decisions about AI deployment, and further the responsible development of AI," says Alex "Sandy" Pentland, an MIT professor, leader of the Human Dynamics Group in the MIT Media Lab, and co-author of a new open-access paper about the project.

The Data Provenance Explorer could help AI practitioners build more effective models by enabling them to select training datasets that fit their model's intended purpose. In the long run, this could improve the accuracy of AI models in real-world situations, such as those used to evaluate loan applications or respond to customer queries.

"One of the best ways to understand the capabilities and limitations of an AI model is understanding what data it was trained on. When you have misattribution and confusion about where data came from, you have a serious transparency issue," says Robert Mahari, a graduate student in the MIT Human Dynamics Group, a JD candidate at Harvard Law School, and co-lead author on the paper.

Mahari and Pentland are joined on the paper by co-lead author Shayne Longpre, a graduate student in the Media Lab; Sara Hooker, who leads the research lab Cohere for AI; as well as others at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. The research is published today in Nature Machine Intelligence.

Focus on fine-tuning

Researchers often use a technique called fine-tuning to improve the capabilities of a large language model that will be deployed for a specific task, like question-answering. For fine-tuning, they carefully build curated datasets designed to boost a model's performance for this one task.
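
As a rough illustration of the mechanics described above, the sketch below fine-tunes a small pretrained causal language model on a tiny, curated question-answering set using the Hugging Face transformers and datasets libraries. The model name, the toy examples, and the training settings are illustrative assumptions for this sketch, not details taken from the study.

```python
# Minimal fine-tuning sketch (assumed setup, not the study's pipeline): adapt a
# small causal language model to a tiny curated question-answering dataset.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # assumption: any small causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models define no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# The "curated dataset": task-specific examples, ideally with known license terms.
examples = [
    {"text": "Question: What is data provenance? Answer: The record of a dataset's sources, creation, and licensing."},
    {"text": "Question: Why do dataset licenses matter? Answer: They constrain how the data may legally be used for training."},
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the result is a model specialized for this one task
```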

The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic organizations, or companies and licensed for specific uses.

When crowdsourced platforms aggregate such datasets into larger collections for practitioners to use for fine-tuning, some of that original license information is often left behind.

"These licenses ought to matter, and they should be enforceable," Mahari says.

For instance, if the licensing terms of a dataset are wrong or missing, someone could spend a great deal of money and time developing a model they might later be forced to take down because some training data contained private information.

"People can end up training models where they don't even understand the capabilities, concerns, or risks of those models, which ultimately stem from the data," Longpre adds.

To begin this study, the researchers formally defined data provenance as the combination of a dataset's sourcing, creation, and licensing heritage, as well as its characteristics. From there, they developed a structured auditing procedure to trace the data provenance of more than 1,800 text dataset collections from popular online repositories.
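
The paper's definition is descriptive, but its general shape can be sketched as a small data structure. The field names and the audit check below are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Illustrative provenance record (assumed fields): a dataset's sourcing,
    creation, and licensing heritage plus a few basic characteristics."""
    name: str
    sources: list[str]                # where the text was originally collected
    creators: list[str]               # people or organizations who built it
    license: str = "unspecified"      # e.g. "CC BY-SA 4.0"
    allowed_uses: list[str] = field(default_factory=list)
    languages: list[str] = field(default_factory=list)

def needs_license_review(record: ProvenanceRecord) -> bool:
    """Flag the gap the audit worked backward to fill: missing license terms."""
    return record.license.strip().lower() in {"", "unspecified", "unknown"}
```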

After finding that more than 70 percent of these datasets contained "unspecified" licenses that omitted much information, the researchers worked backward to fill in the blanks. Through their efforts, they reduced the number of datasets with "unspecified" licenses to around 30 percent.

Their work also revealed that the correct licenses were often more restrictive than those assigned by the repositories.

In addition, they found that nearly all dataset creators were concentrated in the global north, which could limit a model's capabilities if it is deployed in a different region. For instance, a Turkish-language dataset created predominantly by people in the U.S. and China might not contain any culturally significant aspects, Mahari explains.

"We almost delude ourselves into thinking the datasets are more diverse than they actually are," he says.

Interestingly, the researchers also saw a dramatic spike in restrictions placed on datasets created in 2023 and 2024, which might be driven by concerns from academics that their datasets could be used for unintended commercial purposes.

A user-friendly tool

To help others obtain this information without the need for a manual audit, the researchers built the Data Provenance Explorer. In addition to sorting and filtering datasets based on certain criteria, the tool allows users to download a data provenance card that provides a succinct, structured overview of dataset characteristics.
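
As a hypothetical example of the kind of filtering such a tool supports, the snippet below keeps only catalog entries whose declared license is on an assumed commercial-use allow-list; the entries, license strings, and allow-list are invented for illustration and are not the Explorer's actual interface or data.

```python
# Hypothetical filtering step: keep datasets whose declared license is on an
# assumed allow-list of terms that permit commercial fine-tuning.
COMMERCIAL_OK = {"cc-by-4.0", "mit", "apache-2.0"}  # assumed allow-list

catalog = [  # toy entries; names and licenses are made up for the example
    {"name": "qa-set-a", "license": "cc-by-4.0"},
    {"name": "qa-set-b", "license": "unspecified"},
    {"name": "qa-set-c", "license": "cc-by-nc-4.0"},  # non-commercial terms
]

usable = [d for d in catalog if d["license"].lower() in COMMERCIAL_OK]
print([d["name"] for d in usable])  # -> ['qa-set-a']
```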

"We are hoping this is a step, not just to understand the landscape, but also to help people going forward make more informed choices about what data they are training on," Mahari says.

In the future, the researchers want to expand their analysis to investigate data provenance for multimodal data, including video and speech. They also want to study how the terms of service of websites that serve as data sources are echoed in datasets.

As they expand their research, they are also reaching out to regulators to discuss their findings and the unique copyright implications of fine-tuning data.

"We need data provenance and transparency from the outset, when people are creating and releasing these datasets, to make it easier for others to derive these insights," Longpre says.

"Many proposed policy interventions assume that we can correctly assign and identify licenses associated with data, and this work first shows that this is not the case, and then significantly improves the provenance information available," says Stella Biderman, executive director of EleutherAI, who was not involved with this work. "In addition, section 3 contains relevant legal discussion. This is very valuable to machine learning practitioners outside companies large enough to have dedicated legal teams. Many people who want to build AI systems for public good are currently quietly struggling to figure out how to handle data licensing, because the internet is not designed in a way that makes data provenance easy to figure out."

Paper: "A Large-Scale Audit of Dataset Licensing and Attribution in AI"


