AI firms must play fair when they use academic data in training

Researchers are among those who feel uneasy about the unrestrained use of their intellectual property in training commercial large language models. Firms and regulators need to agree the rules of engagement.

27 August 2024


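1. editorial [edɪˈtɔːrɪəl] adj. of editing or of an editor; of a leading article. n. a leading article, an opinion piece; the part of a publication written by its editors rather than by advertisers. Plural: editorials.

2. academic [ˌækəˈdemɪk] adj. relating to study or scholarship; of schools or colleges; good at studying; theoretical, of no practical relevance. n. a university teacher or scholar; the courses of a school or college; a university student. Plural: academics.

3. uneasy [ʌnˈiːzɪ] adj. worried, anxious; uncomfortable, unsettled; (of a situation) unstable; (of a book or music) difficult, obscure; ill-matched, conflicting. Comparative: uneasier.

4. intellectual [ˌɪntɪˈlektʃuəl] adj. of the intellect, rational; highly intelligent; requiring intelligence; of thought or thinking. n. an intellectual. Plural: intellectuals.

5. property [ˈprɒpətɪ] n. possessions, belongings; land and buildings, real estate; (properties) real-estate shares or investments; ownership, right of disposal; a characteristic or quality. Plural: properties.

6. engagement [ɪnˈgeɪdʒmənt] n. an agreement to marry; an appointment or arrangement; a battle, combat; a booking to perform; employment, hiring; participation, involvement; close connection with or understanding of something; (of gears) meshing. Plural: engagements.

7. scrape [skreɪp] v. to remove with a knife or similar tool; to rub or cause to rub roughly; to graze or scratch; to make a harsh grating sound; to manage with difficulty; to barely make a living (scrape by/along); to only just pass; to gather or save with difficulty (scrape sth. together/up); to be frugal; to dig a hollow; to pull one's hair back tightly (scrape sth. back); (humorous) to play the violin tunelessly; to spread butter or margarine thinly on bread; to download data from the web with a program. n. a graze or scratch; a scraping action or sound; a predicament; a hollow in the ground, especially one dug by a bird for courtship or nesting; a thin layer of butter or margarine on bread; (medicine) curettage. Forms: scrapes, scraping, scraped.

8. web [web] n. a spider's web; a web-like structure, an intricate set of things; a network, the internet; the skin between the toes of some birds and animals; a connecting plate or thin strip of metal; a roll of paper for continuous printing; the endless wire mesh of a papermaking machine; woven fabric. v. to cover with a web; to ensnare; to form a web. Forms: webs, webbing, webbed.

9. generate [ˈdʒenəˌreɪt] v. to produce, to cause. Forms: generates, generating, generated.

10. image [ˈɪmɪdʒ] n. the impression a person or thing gives; a picture, a reflected or displayed likeness; a figure of speech, imagery; a portrait, statue or sculpture; a person who closely resembles another; outward form or appearance; (optics) an image point; a disk backup; (biblical) an idol. v. to make a likeness of, to depict; to produce a visual representation by scanning with a detector or electromagnetic beam; to imagine. Also a French surname (Image). Forms: images, imaging, imaged.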

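11. clarity [ˈklærɪtɪ] n. the quality of being clear and easy to understand; clearness of thought; sharpness of a picture or sound; clearness, transparency. Also an English surname (Clarity).

12. boundary [ˈbaʊndərɪ] n. a dividing line, a border; a limit, a scope; (cricket) a hit that scores by crossing the boundary line. Plural: boundaries.

13. category [ˈkætɪgərɪ] n. a class, type or group. Plural: categories.

14. accord [əˈkɔːd] n. an agreement, a treaty; conformity, harmony. v. to grant (a status or treatment); to be consistent (with). Forms: accords, according, accorded.

15. frontier [ˈfrʌntjə] n. a border between countries; (usually singular) the western frontier, a remote region; the leading edge, especially of knowledge, a new field. adj. of a border or frontier. Also a French surname (Frontier). Plural: frontiers.

16. giant [ˈdʒaɪənt] n. (in legend) a giant; a very tall, strong person; a very large animal or plant; a very large company or a powerful country; an outstanding figure, a great person; (astronomy) a giant star. adj. huge, great. Plural: giants.

17. copyright [ˈkɒpɪraɪt] n. copyright, the right of authorship. adj. of copyright, protected by copyright. v. to secure the copyright of; to protect by copyright. Forms: copyrights, copyrighting, copyrighted.

18. fundamental [ˌfʌndəˈmentl] adj. basic, underlying; essential, indispensable; not further divisible. n. a basic principle; (music, physics) the fundamental tone or frequency. Plural: fundamentals.

19. current [ˈkʌrənt] adj. present, in effect now; in common use, prevalent; most recent. n. a flow of water or air; an electric current; a trend of thought, a tendency. Also an English surname (Current). Plural: currents.

20. allege [əˈledʒ] v. to claim or accuse without proof. Forms: alleges, alleging, alleged.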

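21. holder [ˈhəʊldə] n. a person who holds or possesses something; a support or container that holds something; a smallholder, a small tenant farmer. Also a surname (English, Romanian, Swedish, German). Plural: holders.

22. code [kəʊd] n. a cipher, a secret code; a postal code or telephone area code; (computing) code; a moral code, rules of conduct; a body of laws or regulations. v. to encode or number; to put into cipher; to write instructions for a computer. Also a surname (English, French, Spanish). Forms: codes, coding, coded.

23. creative [krɪ(ː)ˈeɪtɪv] adj. of creation, involving making new things; inventive, imaginative. n. a creative person; creative ideas or material.

24. impact [ˈɪmpækt] n. a collision, the force of a collision; a powerful effect or influence. v. to strike, to collide with; to press in, to pack tightly; to have an effect (on).

25. ensure [ɪnˈʃʊə] v. to make certain, to guarantee; to protect, to make safe. Forms: ensures, ensuring, ensured.

26. highly [ˈhaɪlɪ] adv. extremely, very; to a high degree or level; with admiration or approval; in a high place or rank.

27. relevant [ˈrelɪvənt] adj. related, to the point; correct, appropriate; valuable, meaningful.

28. presumably [prɪˈzjuːməbəlɪ] adv. probably, as may be supposed.

29. learning [ˈlɜːnɪŋ] n. the act of learning; knowledge, scholarship. v. (present participle of learn) finding out; studying, mastering; realizing, drawing a lesson from.

30. rejoice [rɪˈdʒɔɪs] v. to be very glad, to feel great joy; (rejoice in) to have, used to draw attention to something odd, especially a name; to gladden, to delight. Forms: rejoices, rejoicing, rejoiced.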

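31. insight [ˈɪnsaɪt] n. a clear understanding; the ability to see into a situation. Also an English surname (Insight). Plural: insights.

32. license [ˈlaɪsəns] n. a licence, a permit; official permission (the British noun spelling is licence). vt. to permit; to authorize; to issue a licence to. Forms: licenses, licensing, licensed.

33. factor [ˈfæktə] n. a contributing element; a level, a coefficient; (mathematics) a factor, a divisor; a hereditary factor, a gene; (in blood) a clotting factor; an agent company, an agent; a land steward, an estate manager; a level of measurement. v. to include as a factor (factor in); to exclude as a factor (factor out); to resolve into factors; to act as agent for, to manage (an estate); to work as an agent. Also an English surname (Factor). Forms: factors, factoring, factored.

34. source [sɔːs] n. an origin, the place something comes from; the cause or root (of a problem); a person or document supplying information; the head of a river; source code; (electronics) a source terminal, a power source; (technical) a source. v. to obtain (from a place); to trace the origin of. Also a French surname (Source). Forms: sources, sourcing, sourced.

35. cite [saɪt] v. to quote, to refer to; to give as evidence or as an example; to summon to court; to commend officially. n. a citation. Forms: cites, citing, cited.

36. opt [ɒpt] v. to choose, to make a choice.

37. enforce [ɪnˈfɔːs] v. to put (a law or rule) into effect; to compel, to force; to press (a demand or argument) on someone. Forms: enforces, enforcing, enforced.

38. devise [dɪˈvaɪz] v. to design, to invent; (law) to bequeath real property by will; to plot. n. a clause in a will bequeathing real property. Also a surname (American, French). Forms: devises, devising, devised.

39. medium [ˈmiːdɪəm] n. a channel of communication, the media; a means, a method; (art) the material used; a spiritualist medium; (biology) a culture medium; surroundings, environment; a middle size; a storage or printing medium; a liquid (such as oil or water) in which pigment is mixed; a middle quality or state. adj. middle, average, moderate; (of meat) cooked medium; (of degree, strength or amount) average; (of colour) neither dark nor light; (of bowling or a bowler) medium-pace. Plural: media, mediums.

40. specify [ˈspesɪfaɪ] v. to state explicitly; to describe in detail; to include in a specification. Forms: specifies, specifying, specified.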

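41. remains [rɪˈmeɪnz] n. what is left over, remnants; a dead body, bones; ancient ruins, relics. v. (third-person singular of remain) continues to be, stays unchanged; stays behind; is left over.

42. radical [ˈrædɪkəl] adj. fundamental, thorough; politically extreme; (slang) excellent; new and different; (of an increase or decrease) sharp, large; inherent, innate; (surgery, medicine) aimed at a complete cure; of the radical wing of the nineteenth-century Liberal Party; (mathematics) of a root; (linguistics) of a word root; (botany) growing from the root. n. a political radical; (chemistry) a free radical; a word root; the radical of a Chinese character; (mathematics) a root expression; the radical sign. Plural: radicals.

43. accompany [əˈkʌmpənɪ] v. to go with, to escort; to occur together with; to play or sing an accompaniment for; to supplement. Forms: accompanies, accompanying, accompanied.

44. expectation [ˌekspekˈteɪʃən] n. anticipation, a belief about what will happen; a hope, something counted on. Plural: expectations.

45. reasonable [ˈriːznəbl] adj. sensible, justified; (of a person) fair-minded, open to reason; moderate, appropriate; (of a price) fair; fairly good, acceptable; fairly large, (of an amount) considerable.

46. dubious [ˈdjuːbɪəs] adj. suspect, unreliable; doubtful, hesitant; (of an honour or reputation) questionable, discreditable; of poor quality.

47. strain [streɪn] n. anxiety, tension; a burden, pressure; a pulling or stretching force; an injury, a sprain; a breed, a variety; a trait of character; a melody, a tune; a tone of voice; (physics) strain, deformation; lineage; a difficulty. v. to injure by overstretching; to pull tight; to make a great effort; to filter; to push to the limit, to make tense; to push or pull hard. Also an English surname (Strain). Forms: strains, straining, strained.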

No one knows for sure exactly what ChatGPT — the most famous product of artificial intelligence — and similar tools were trained on. But millions of academic papers scraped from the web are among the reams of data that have been fed into large language models (LLMs) that generate text, and similar algorithms that make images (see Nature 632, 715–716; 2024). Should the creators of such training data get credit — and if so, how? There is an urgent need for more clarity around the boundaries of acceptable use.
Few LLMs — even those described as ‘open’ — have developers who are upfront about exactly which data were used for training. But information-rich, long-form text, a category that includes many scientific papers, is particularly valuable. According to an investigation by The Washington Post and the Allen Institute for Artificial Intelligence in Seattle, Washington, material from the open-access journal families PLOS and Frontiers features prominently in a data set called C4, which has been used to train LLMs such as Llama, made by the technology giant Meta. It is also widely suspected that, just as copyrighted books have been used to train LLMs, so have non-open-access research papers.
One fundamental question concerns what is allowed under current laws. The World Intellectual Property Organization (WIPO), based in Geneva, Switzerland, says that it is unclear whether collecting data or using them to create LLM outputs is considered copyright infringement, or whether these activities fall under one of several exemptions, which differ by jurisdiction. Some publishers are seeking clarity in the courts: in an ongoing case, The New York Times has alleged that the tech firms Microsoft and OpenAI — the company that developed ChatGPT — copied its articles to train their LLMs. To avoid the risk of litigation, more AI firms are now, as recommended by WIPO, purchasing licences from copyright holders for training data. Content owners are also using code on their websites that tells tools scraping data for LLMs whether they are allowed to do so.
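The website code mentioned above is typically a robots.txt file, which lists rules per crawler name. The sketch below shows how a well-behaved scraper might check such a file with Python's standard `urllib.robotparser` module. The file contents, the domain `journal.example` and the helper `may_fetch` are illustrative assumptions, not any publisher's real configuration; `GPTBot` is used as an example crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text instead of downloading it
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://journal.example/articles/123"  # placeholder address
    print("GPTBot allowed:", may_fetch(ROBOTS_TXT, "GPTBot", url))
    print("Other bot allowed:", may_fetch(ROBOTS_TXT, "SomeSearchBot", url))
```

Such a file only expresses the owner's wishes. It works only if the scraper chooses to read and obey it, which is one reason opt-outs are hard to enforce.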
Things get much fuzzier when material is published under licences that encourage free distribution and reuse, but that can still have certain restrictions. Creative Commons, a non-profit organization in Mountain View, California, that aims to increase sharing of creative works, says that copying material to train an AI should not generally be treated as infringement. But it also acknowledges concerns about the impact of AI on creators, and how to ensure that AI that is trained on ‘the commons’ — the body of freely available material — contributes to the commons in return.
These broader questions of fairness are particularly pressing for artists, writers and coders, whose livelihoods depend on their creative outputs and whose work risks being replaced by the products of generative AI. But they are also highly relevant for researchers. The move towards open-access publishing explicitly favours the free distribution and reuse of scientific work — and this presumably applies to LLMs, too. Learning from scientific papers can make LLMs better, and some researchers might rejoice if improved AI models could help them to gain new insights.
Credit where it is due
But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright licence. In jurisdictions such as the European Union and Japan, there are exemptions to copyright rules that cover factors such as attribution: for example, for text and data mining in research, which uses automated analysis of sources to find patterns. Some scientists see data-scraping for proprietary LLMs as going well beyond what these exemptions were intended to achieve.
In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.
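A toy sketch can make the idea of retrieval-augmented generation concrete. It assumes a three-paper corpus with placeholder references, uses plain word overlap in place of a real search index, and stops at building the prompt that a language model would receive; no actual model is called. All names here (`Paper`, `retrieve`, `build_prompt`) are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    ref: str   # citation label shown to the user
    text: str  # text searched by the retriever


# Hypothetical mini-corpus; the references are placeholders, not real papers.
CORPUS = [
    Paper("Placeholder A (2021)", "open access publishing increases reuse of scientific papers"),
    Paper("Placeholder B (2022)", "copyright exemptions for text and data mining differ by jurisdiction"),
    Paper("Placeholder C (2023)", "protein structure predictions improve with larger training sets"),
]


def tokenize(text: str) -> set[str]:
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, corpus: list[Paper], k: int = 2) -> list[Paper]:
    """Rank papers by word overlap with the query (a stand-in for a real search index)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p.text)), p) for p in corpus]
    relevant = [(score, p) for score, p in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in relevant[:k]]


def build_prompt(query: str, papers: list[Paper]) -> str:
    """Assemble the model input: numbered sources first, then the question."""
    sources = "\n".join(f"[{i}] {p.ref}: {p.text}" for i, p in enumerate(papers, start=1))
    return f"Answer using only these sources and cite them by number.\n{sources}\nQuestion: {query}"


if __name__ == "__main__":
    question = "How do copyright exemptions for data mining differ?"
    found = retrieve(question, CORPUS)
    print(build_prompt(question, found))
```

The citations come from the retrieval step at question time, not from the model's training data, which is why this approach can point readers to relevant papers without apportioning credit for training.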
Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms.
Greater transparency can also play a part. The EU’s AI Act, which came into force on 1 August, requires developers to publish a summary of the works used to train their AI models. This could bolster creators’ ability to opt out, and might serve as a template for other jurisdictions. But it remains to be seen how this will work in practice.
Meanwhile, research should continue into whether there is a need for more-radical solutions, such as new kinds of licence or changes to copyright law. Generative AI tools are using a data ecosystem built by open-source movements, yet often ignore the accompanying expectations of reciprocity and reasonable use, says Sylvie Delacroix, a digital-law scholar at King’s College London. The tools also risk polluting the Internet with AI-generated content of dubious quality. By failing to redirect users to the human-made sources on which they were built, LLMs could disincentivize original creation. Without putting more power into the hands of creators, the system will come under severe strain. Regulators and companies must act.

