Do AI models produce more original ideas than researchers?
An ideas generator powered by artificial intelligence (AI) came up with more original research ideas than did 50 scientists working independently, according to a preprint posted on arXiv this month1.
The human and AI-generated ideas were evaluated by reviewers, who were not told who or what had created each idea. The reviewers scored AI-generated concepts as more exciting than those written by humans, although the AI’s suggestions scored slightly lower on feasibility.
But scientists note the study, which has not been peer-reviewed, has limitations. It focused on one area of research and required human participants to come up with ideas on the fly, which probably hindered their ability to produce their best concepts.
AI in science
There are burgeoning efforts to explore how LLMs can be used to automate research tasks, including writing papers, generating code and searching literature. But it’s been difficult to assess whether these AI tools can generate fresh research angles at a level similar to that of humans. That’s because evaluating ideas is highly subjective and requires gathering researchers who have the expertise to assess them carefully, says study co-author Chenglei Si. “The best way for us to contextualise such capabilities is to have a head-to-head comparison,” says Si, a computer scientist at Stanford University in California.
The year-long project is one of the biggest efforts to assess whether large language models (LLMs) — the technology underlying tools such as ChatGPT — can produce innovative research ideas, says Tom Hope, a computer scientist at the Allen Institute for AI in Jerusalem. “More work like this needs to be done,” he says.
The team recruited more than 100 researchers in natural language processing — a branch of computer science that focuses on communication between AI and humans. Forty-nine participants were tasked with developing and writing ideas, based on one of seven topics, within ten days. As an incentive, the researchers paid the participants US$300 for each idea, with a $1,000 bonus for the five top-scoring ideas.
Meanwhile, the researchers built an idea generator using Claude 3.5, an LLM developed by Anthropic in San Francisco, California. The researchers prompted their AI tool to find papers relevant to the seven research topics using Semantic Scholar, an AI-powered literature-search engine. On the basis of these papers, the researchers then prompted their AI agent to generate 4,000 ideas on each research topic and instructed it to rank the most original ones.
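The generate-and-rank loop described above can be sketched roughly as follows. This is an illustrative reconstruction, not the study's actual code: `call_llm` and `originality_score` are placeholder stubs standing in for real model calls (such as to Claude 3.5), and the prompt wording is invented.

```python
# Illustrative sketch of a generate-then-rank idea pipeline.
# call_llm() is a stub standing in for a real LLM API call.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a dummy idea string."""
    return f"idea derived from: {prompt[:40]}"

def generate_ideas(topic: str, papers: list[str], n: int) -> list[str]:
    """Generate n candidate ideas for one topic, grounded in retrieved papers
    (the study retrieved papers via Semantic Scholar)."""
    context = "\n".join(papers)
    return [call_llm(f"Topic: {topic}\nPapers:\n{context}\nIdea #{i}:")
            for i in range(n)]

def originality_score(idea: str) -> float:
    """Placeholder scorer. The study had the LLM itself rank its outputs
    for originality; length is used here only so the sketch runs."""
    return float(len(idea))

def rank_ideas(ideas: list[str]) -> list[str]:
    """Order candidate ideas from most to least 'original'."""
    return sorted(ideas, key=originality_score, reverse=True)
```

In the study, this loop was run at scale: 4,000 candidate ideas per research topic, with the model then asked to rank its own outputs.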
Human reviewers
Next, the researchers randomly assigned the human- and AI-generated ideas to 79 reviewers, who scored each idea on its novelty, excitement, feasibility and expected effectiveness. To ensure that the ideas’ creators remained unknown to the reviewers, the researchers used another LLM to edit both types of text to standardize the writing style and tone without changing the ideas themselves.
On average, the reviewers scored the AI-generated ideas as more original and exciting than those written by human participants. However, when the team took a closer look at the 4,000 LLM-produced ideas, they found only around 200 that were truly unique, suggesting that the AI became less original as it churned out ideas.
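Flagging near-duplicates among thousands of generated ideas, as the team did to arrive at the figure of roughly 200 unique ones, can be done by keeping an idea only if it is sufficiently dissimilar to every idea kept so far. The study's exact deduplication method is not described here; the sketch below uses `difflib.SequenceMatcher` as a lightweight stand-in for a real similarity measure (such as embedding cosine similarity), and the 0.8 threshold is an arbitrary illustration.

```python
from difflib import SequenceMatcher

def dedupe(ideas: list[str], threshold: float = 0.8) -> list[str]:
    """Keep an idea only if its similarity to every already-kept idea
    falls below the threshold. SequenceMatcher is an illustrative
    stand-in for whatever similarity measure the study actually used."""
    unique: list[str] = []
    for idea in ideas:
        if all(SequenceMatcher(None, idea, kept).ratio() < threshold
               for kept in unique):
            unique.append(idea)
    return unique
```

Applied to 4,000 candidates, a filter like this collapses paraphrases of the same underlying idea into one representative, which is how a large raw count can shrink to a much smaller set of genuinely distinct ideas.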
When Si surveyed the participants, most admitted that their submitted ideas were average compared with those they had produced in the past.
The results suggest that LLMs might be able to produce ideas that are slightly more original than those in the existing literature, says Cong Lu, a machine-learning researcher at the University of British Columbia in Vancouver, Canada. But whether they can beat the most groundbreaking human ideas is an open question.
Another limitation is that the study compared written ideas that had been edited by an LLM, which altered the language and length of the submissions, says Jevin West, a computational social scientist at the University of Washington in Seattle. Such changes could have subtly influenced how reviewers perceived novelty, he says. West adds that pitting researchers against an LLM that can generate thousands of ideas in hours might not make for a totally fair comparison. “You have to compare apples to apples,” he says.
Si and his colleagues are planning to compare AI-generated ideas with leading conference papers to gain a better understanding of how LLMs stack up against human creativity. “We are trying to push the community to think harder about how the future should look when AI can take on a more active role in the research process,” he says.
doi: https://doi.org/10.1038/d41586-024-03070-5