「经济学人」李飞飞“AI是时候该超越GPT了”「精校翻译」11月第三周

教育   2024-11-22 07:05   上海  


KEY POINTS(核心要点)

1-For computers to have the spatial intelligence of humans, they need to be able to model the world, reason about things and places, and interact in both time and 3D space. In short, we need to go from large language models to large world models.

为了让计算机拥有与人类相似的空间智能,它们需要能够模拟世界,对事物和地点进行推理,并在时间和三维空间中进行交互。简而言之,我们需要从大型语言模型发展到大型世界模型。


2-The applications are endless. Imagine robots that can navigate ordinary homes and look after old people; a tireless set of extra hands for a surgeon; or the uses in simulation, training and education. This is truly human-centred AI, and spatial intelligence is its next frontier. What took hundreds of millions of years to evolve in humans is taking just decades to emerge in computers. And we humans will be the beneficiaries.

应用前景无限。想象一下,机器人能够导航普通家庭并照顾老年人;对于外科医生来说,它们是不知疲倦的额外帮手;或者在模拟、培训和教育中的用途。这才是真正的以人为中心的人工智能,空间智能是它的下一个前沿。在人类身上演化了几亿年的东西,在计算机中只用了几十年就出现了。而我们人类将成为受益者。

Fei-Fei Li says understanding how the world works is the next step for AI

Time to look beyond language models, argues the Stanford professor and "godmother of AI"

李飞飞表示,了解世界是如何运作的是人工智能的下一步

这位斯坦福大学教授、“人工智能教母”认为,是时候超越语言模型了。


Language is full of visual aphorisms. Seeing is believing. A picture is worth a thousand words. Out of sight, out of mind. The list goes on. This is because we humans draw so much meaning from our vision. But seeing was not always possible. Until about 540m years ago, all organisms lived below the surface of the water and none of them could see. Only with the emergence of trilobites could animals, for the first time, perceive the abundance of sunlight around them. What ensued was remarkable. Over the next 10m-15m years, the ability to see ushered in a period known as the Cambrian explosion, in which the ancestors of most modern animals appeared. 

语言中充满了视觉格言。“眼见为实”、“一图胜千言”、“眼不见,心不烦”,不胜枚举。这是因为我们人类从视觉中获取了大量的意义。但“看见”并非一直可能。大约5.4亿年前,所有生物都生活在水面之下,没有一种能够看见。直到三叶虫出现,动物才第一次感知到周围充沛的阳光。随之而来的变化非同凡响。在接下来的1000万至1500万年里,视觉能力开启了被称为寒武纪大爆发的时期,大多数现代动物的祖先正是在这一时期出现的。



Today we are experiencing a modern-day Cambrian explosion in artificial intelligence (AI). It seems as though a new, mind-boggling tool becomes available every week. Initially, the generative-AI revolution was driven by large language models like ChatGPT, which imitate humans' verbal intelligence. But I believe an intelligence based on vision (what I call spatial intelligence) is more fundamental. Language is important but, as humans, much of our ability to understand and interact with the world is based on what we see.

今天,我们正在人工智能(AI)领域经历一场现代版的寒武纪大爆发。似乎每周都有新的、令人惊叹的工具问世。最初,生成式AI革命由像ChatGPT这样的大型语言模型推动,它们模仿人类的语言智能。但我认为,基于视觉的智能——我称之为空间智能——更为根本。语言很重要,但作为人类,我们理解世界并与之互动的能力,在很大程度上基于我们所看到的。


A subfield of AI known as computer vision has long sought to teach computers to have the same or better spatial intelligence as humans. The field has progressed rapidly over the past 15 years. And, guided by the core belief that AI needs to advance with human benefit at its centre, I have dedicated my career to it.

人工智能的一个子领域,即计算机视觉,长期以来一直在寻求教会计算机拥有与人类相同甚至更优越的空间智能。在过去的15年里,这个领域取得了迅速的进展。而且,基于AI需要以人类利益为中心的核心信念,我将我的职业生涯奉献给了这一领域。


No one teaches a child how to see. Children make sense of the world through experiences and examples. Their eyes are like biological cameras, taking a "picture" five times a second. By the age of three, kids have seen hundreds of millions of such pictures. 

没有人教孩子如何去看。孩子们通过经历和例子来理解世界。他们的眼睛就像生物相机,每秒拍摄五次“照片”。到三岁时,孩子们已经看过数亿张这样的“照片”。
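The arithmetic behind "hundreds of millions" is easy to check. The sketch below assumes roughly 12 waking hours a day, an illustrative figure not given in the article:

```python
# Rough sanity check: how many "pictures" does a child see by age 3?
FRAMES_PER_SECOND = 5       # eye "snapshot" rate quoted in the article
WAKING_HOURS_PER_DAY = 12   # assumption for illustration only
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365
YEARS = 3

pictures = (FRAMES_PER_SECOND * SECONDS_PER_HOUR
            * WAKING_HOURS_PER_DAY * DAYS_PER_YEAR * YEARS)
print(f"{pictures:,}")  # on the order of a few hundred million
```

Even with conservative assumptions, the total lands comfortably in the hundreds of millions.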


We know from decades of research that a fundamental element of vision is object recognition, so we began by teaching computers this ability. It was not easy. There are infinite ways to render the three-dimensional (3D) shape of a cat, say, into a two-dimensional (2D) image, depending on viewing angle, posture, background and more. For a computer to identify a cat in a picture it needs to have a lot of information, like a child does.

几十年的研究表明,视觉的一个基本要素是物体识别,因此我们首先教会计算机这种能力。这并不容易。将三维(3D)形状的物体,比如一只猫,渲染成二维(2D)图像有无数种方式,这取决于观察角度、姿势、背景等因素。为了让计算机在图片中识别出一只猫,它需要拥有大量的信息,就像孩子一样。
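The viewpoint ambiguity described above can be made concrete with a toy pinhole-camera projection. This is an illustrative sketch, not any particular vision system: the same 3D point lands at a different 2D image position under each viewing angle.

```python
import math

def project(point3d, yaw):
    """Rotate a 3D point about the vertical axis by `yaw` radians,
    then apply a simple pinhole projection onto the image plane."""
    x, y, z = point3d
    # Rotation about the y (vertical) axis.
    xr = x * math.cos(yaw) + z * math.sin(yaw)
    zr = -x * math.sin(yaw) + z * math.cos(yaw)
    f = 1.0  # focal length, arbitrary units
    return (f * xr / zr, f * y / zr)  # perspective divide

# One hypothetical point on the cat (say, the tip of its nose) projects
# to a different pixel for every viewpoint.
nose_tip = (0.1, 0.2, 2.0)
views = [project(nose_tip, math.radians(a)) for a in (0, 15, 30, 45)]
```

Sweep `yaw` continuously and you get a continuum of distinct 2D images of the same 3D shape, which is exactly why recognition from a single picture is hard.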


This was not possible until three elements converged in the mid-2000s. At that point algorithms known as convolutional neural networks, which had existed for decades, met the power of modern-day graphics processing units (GPUs) and the availability of "big data": billions of images from the internet, digital cameras and so forth.


直到2000年代中期,三个要素汇聚到一起,这一切才成为可能。那时,已经存在数十年的卷积神经网络算法,遇上了现代图形处理器(GPU)的强大算力,以及“大数据”的可得性——来自互联网、数码相机等的数十亿张图像。
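The convolution at the heart of these networks is simple to state: slide a small filter over the image and sum elementwise products. A minimal pure-Python sketch (valid mode, stride 1), shown with a tiny vertical-edge filter:

```python
def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most
    deep-learning libraries): slide `kernel` over `image`, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A [-1, 1] filter responds to the dark-to-bright boundary in this
# tiny image and is silent everywhere else.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = conv2d(img, [[-1, 1]])  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

Real networks stack millions of learned filters like this; GPUs made evaluating them at scale practical.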


My lab contributed the "big data" element to this convergence. In 2007, in a project called ImageNet, we created a database of 15m labelled images across 22,000 object categories. Then we and other researchers trained neural-network models using images and their corresponding textual labels, so that the models learned to describe a previously unseen photo using a simple sentence. Unexpectedly rapid progress in these image-recognition systems, created using the ImageNet database, helped spark the modern AI boom.


我的实验室为这次汇聚贡献了“大数据”要素。2007年,在一个名为ImageNet的项目中,我们创建了一个包含1500万张标注图像的数据库,涵盖22,000个物体类别。随后,我们和其他研究人员使用这些图像及其对应的文本标签来训练神经网络模型,使模型学会用一个简单的句子描述此前未见过的照片。这些基于ImageNet数据库构建的图像识别系统取得了出乎意料的快速进展,助推了现代人工智能的繁荣。
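The image-plus-label pairing that makes such training possible can be sketched as a toy data structure. The filenames and category names below are made up for illustration; they are not actual ImageNet entries:

```python
# A toy stand-in for an ImageNet-style dataset: each entry pairs an
# image (here just a filename placeholder) with a category label.
dataset = [
    ("img_000001.jpg", "cat"),
    ("img_000002.jpg", "dog"),
    ("img_000003.jpg", "cat"),
]

# Models train on integer class ids, so each label gets an index.
classes = sorted({label for _, label in dataset})
class_to_id = {label: i for i, label in enumerate(classes)}
training_pairs = [(path, class_to_id[label]) for path, label in dataset]
```

Scale this pattern to 15m images and 22,000 categories and you have the supervision signal that drove the early image-recognition breakthroughs.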


As technology progressed, a new generation of models, based on techniques such as transformer architectures and diffusion, brought with them the dawn of generative AI tools. In the realm of language, this made possible chatbots like ChatGPT. When it comes to vision, modern systems do not merely recognise but can also generate images and videos in response to text prompts. The results are impressive, but still only in 2D.


随着技术的进步,基于Transformer架构和扩散模型等技术的新一代模型,迎来了生成式AI工具的曙光。在语言领域,这使得像ChatGPT这样的聊天机器人成为可能。在视觉领域,现代系统不仅能识别图像,还能根据文本提示生成图像和视频。结果令人印象深刻,但仍然仅限于二维。
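Diffusion's core idea can be shown with a toy one-dimensional forward process. The decay factor and noise scale here are arbitrary illustrative choices, not a real noise schedule:

```python
import random

def forward_noise(signal, steps, decay=0.98, noise_scale=0.1):
    """Diffusion's forward process, as a 1-D toy: repeatedly shrink a
    signal and mix in Gaussian noise until it is nearly pure noise."""
    x = signal[:]
    trajectory = [x[:]]
    for _ in range(steps):
        x = [decay * v + noise_scale * random.gauss(0.0, 1.0) for v in x]
        trajectory.append(x[:])
    return trajectory

random.seed(0)  # reproducible toy run
traj = forward_noise([1.0, -1.0, 0.5], steps=50)
# Training a diffusion model amounts to learning to invert each of
# these small noising steps; generation runs the learned inverse
# chain, starting from noise and conditioned on a text prompt.
```

The generative direction is the reverse of what this sketch computes, which is why sampling from such models is iterative.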


For computers to have the spatial intelligence of humans, they need to be able to model the world, reason about things and places, and interact in both time and 3D space. In short, we need to go from large language models to large world models.


为了让计算机拥有与人类相似的空间智能,它们需要能够模拟世界,对事物和地点进行推理,并在时间和三维空间中进行交互。简而言之,我们需要从大型语言模型发展到大型世界模型。


We're already seeing glimpses of this in labs across academia and industry. With the latest AI models, trained using text, images, video and spatial data from robotic sensors and actuators, we can control robots using text prompts, asking them to unplug a phone charger or make a simple sandwich, for example. Or, given a 2D image, the model can transform it into an infinite number of plausible 3D spaces for a user to explore.


我们已经在学术界和工业界的实验室中看到了这方面的初步成果。利用最新的人工智能模型,通过文本、图像、视频以及来自机器人传感器和执行器的空间数据进行训练,我们可以使用文本提示来控制机器人——例如,让它们拔掉手机充电器或制作一个简单的三明治。或者,给定一个二维图像,模型可以将其转换成无限多可能的三维空间,供用户探索。
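The jump from one 2D image to "an infinite number of plausible 3D spaces" reflects a simple geometric fact: inverting a pinhole projection requires a depth, and every choice of depth yields a different 3D point. A toy sketch, not any particular model's method:

```python
def backproject(u, v, depth, f=1.0):
    """Invert a pinhole projection: a pixel (u, v) plus an assumed
    depth yields one 3D point; every depth gives a different one."""
    return (u * depth / f, v * depth / f, depth)

# A single pixel is consistent with infinitely many 3D points along
# its viewing ray -- a generative model must pick plausible depths
# (and fill in unseen surfaces) to produce a coherent 3D scene.
pixel = (0.2, 0.1)
candidates = [backproject(*pixel, depth=d) for d in (1.0, 2.0, 4.0)]
```

Learned priors over shapes and scenes are what let a model collapse this infinite family down to the few reconstructions a human would find plausible.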


The applications are endless. Imagine robots that can navigate ordinary homes and look after old people; a tireless set of extra hands for a surgeon; or the uses in simulation, training and education. This is truly human-centred AI, and spatial intelligence is its next frontier. What took hundreds of millions of years to evolve in humans is taking just decades to emerge in computers. And we humans will be the beneficiaries.

应用前景无限。想象一下,机器人能够导航普通家庭并照顾老年人;对于外科医生来说,它们是不知疲倦的额外帮手;或者在模拟、培训和教育中的用途。这才是真正的以人为中心的人工智能,空间智能是它的下一个前沿。在人类身上演化了几亿年的东西,在计算机中只用了几十年就出现了。而我们人类将成为受益者。

