GPT Is Dead. Long Live GPTs.
In the rapidly evolving landscape of artificial intelligence, we're witnessing a fascinating transformation in the world of GPT (Generative Pre-trained Transformer) models.
GPT is a type of artificial intelligence model designed to understand and generate human-like text. Think of it as a very advanced autocomplete system.
Just as your phone might suggest the next word when you're typing a message, a GPT model can predict and generate entire sentences or even long passages of text. It's "pre-trained" on a vast amount of text from the internet, books, and other sources, which allows it to learn patterns in language.
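To make the "advanced autocomplete" idea concrete, here is a minimal sketch of next-token generation using the open-source Hugging Face transformers library. GPT-2, a small freely available model, stands in for larger GPT systems; the prompt text is an illustrative assumption:

```python
# A minimal sketch of "advanced autocomplete": next-token prediction with an
# open GPT-style model (GPT-2), via the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "In the rapidly evolving landscape of artificial intelligence,"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: at each step the model predicts the most likely next token,
# then feeds the extended sequence back in to predict the one after that.
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same loop that suggests a single word on your phone, run repeatedly at scale, is what lets a GPT model produce whole paragraphs.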
The "generative" part means it can create new content, not just repeat what it has seen before. GPT models power many AI chatbots and writing assistants, helping them to engage in human-like conversations and produce coherent text on almost any topic. Most significantly of all, GPT has achieved the Holy Grail of AI: passing the Turing Test, where humans cannot tell that text has been produced ‘artificially’.
The advance of AI - both technologically and in terms of adoption - has been nothing short of phenomenal. According to OpenAI, more than 3 million custom GPTs have been created, and 77% of devices in use have some form of AI.
My provocative statement "GPT is dead. Long live GPTs!" encapsulates this evolution, highlighting not the demise of GPT architecture, but rather its evolution from text-only applications to even more capable and wondrous forms.
The Evolution of GPT
The early GPT models, with their singular focus on text, have given way to a new generation of more versatile and capable systems. While these text-centric models showcased remarkable abilities in natural language processing - generating human-like text, understanding context, and performing a wide array of language-related tasks with impressive accuracy - they were limited to a single modality.
So what’s dead? The notion that GPT models are confined to text alone. The GPT architecture, far from being obsolete, is very much alive and continuously evolving. As researchers and developers pushed the boundaries of what GPT could do, a transformation began to take shape. The latest iterations, exemplified by models like GPT-4, have broken free from the constraints of text-only processing. These advanced models have embraced multimodality, capable of understanding and processing both text and images.
For example, LegalGPT is a multimodal GPT that can process both text and image data. This enables the tool to handle tasks such as analyzing legal documents, including scanned images of contracts or case files, while also providing detailed text-based insights. For instance, LegalGPT can interpret complex legal documents and identify important clauses or issues, making it a versatile tool for legal professionals who often deal with both textual and visual information, such as scanned PDFs.
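To show what a text-plus-image request looks like in practice, here is a minimal sketch using the OpenAI Python SDK against a vision-capable GPT model. This is not LegalGPT's actual API; the model name, image URL, and prompt are assumptions for illustration:

```python
# A sketch of a multimodal request: text instructions plus a scanned document
# image, sent to a vision-capable GPT model via the OpenAI Python SDK.
# The model name, URL, and prompt below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this scanned contract page and flag any unusual clauses."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/contract-page-1.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The key shift from text-only GPTs is visible in the message body: a single user turn now carries both a text part and an image part.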
This leap forward represents a fundamental shift in our relationship with technology. It is no longer merely a tool; it is now an extra pair of hands, with capabilities approaching and in some cases exceeding those of human employees.
Because GPT models do a good job on the "drudge-work" within so many professions, it enables humans to concentrate fully on what we do brilliantly. It opens up new horizons for AI applications, bridging the gap between different types of data and paving the way for more sophisticated, context-aware AI systems.
The death of text-only GPT has given birth to a new era of multimodal AI, where these models can interact with and understand the world in ways that more closely mimic human cognition.
The Rise of Multimodal AI
The evolution of GPT towards multimodality is part of a broader, exciting trend in the field of artificial intelligence. Multimodal AI systems, capable of processing and generating multiple types of data simultaneously, are revolutionizing how machines understand and interact with the world.
These systems can integrate information from various sources - text, images, audio, and even video - to form a more comprehensive understanding of their environment. This enhanced perception allows for more nuanced and accurate responses, mimicking the way humans process information from multiple senses.
For example, an interesting multimodal GPT in the music industry is AIVA (Artificial Intelligence Virtual Artist). AIVA uses both text and sound as input, allowing users to generate music based on specific styles or emotions described in text form. It can interpret these text prompts and output corresponding audio, making it useful for composers or producers looking for inspiration or quick drafts. AIVA has been used in creating background scores for films, commercials, and even video games, showcasing how multimodal AI can blend creative input across text and sound.
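AIVA itself is a proprietary product, so as a stand-in for the same text-in, audio-out pattern, here is a minimal sketch using the open-source MusicGen model through the Hugging Face transformers "text-to-audio" pipeline; the prompt and output filename are illustrative:

```python
# Text-to-music with an open model, illustrating the text-in, audio-out
# pattern that tools like AIVA expose. Not AIVA's actual interface.
import scipy.io.wavfile
from transformers import pipeline

synthesizer = pipeline("text-to-audio", model="facebook/musicgen-small")
music = synthesizer(
    "a gentle, melancholic piano theme suitable for a film scene",
    forward_params={"do_sample": True},
)
# The pipeline returns the generated waveform and its sampling rate.
scipy.io.wavfile.write("draft_score.wav",
                       rate=music["sampling_rate"],
                       data=music["audio"])
```

A composer could iterate on the text prompt much like a writer iterates on a chatbot prompt, which is exactly the workflow described above.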
The implications of this shift are profound and far-reaching. In healthcare, multimodal AI that analyzes medical images alongside patient histories is already being used to drastically improve the accuracy of diagnoses. One example of a healthcare-focused multimodal GPT is Med-Gemini, developed by Google. Med-Gemini builds on the Gemini family of models and is specifically fine-tuned for medical applications. It combines text, images, and even 3D scans to assist in clinical workflows such as generating radiology reports, answering clinical questions, and offering diagnostic support.
Med-Gemini has been benchmarked on tasks like visual question-answering for chest X-rays, report generation for 3D imaging, and genomic risk prediction. These multimodal capabilities are designed to improve clinical reasoning by integrating diverse data types, making it a powerful tool in radiology, pathology, and genomics.
The applications of multimodal GPTs extend much further. Autonomous vehicles could make split-second decisions by integrating visual, auditory, and textual data. Creative industries might see an explosion of new forms of art and design, as AI assists in blending different media types. Perhaps most excitingly, multimodal AI has the potential to break down communication barriers, offering more natural and intuitive ways for humans to interact with machines. As these systems continue to develop, we stand on the brink of a new era in artificial intelligence - one where the lines between different types of data blur, and AI's understanding of the world grows ever closer to our own.
The Importance of Multimodal AI
The shift to multimodal AI represents far more than just a technological advancement; it's a paradigm shift with far-reaching implications across various fields and aspects of society. Remarkable as AI’s “autocomplete” capabilities are, you can’t solve all the world’s problems by text alone. By enhancing AI's understanding of context and nuance, multimodal systems promise to revolutionize human-AI interaction, making it more natural and intuitive. This improved interaction opens doors to solving complex, real-world challenges that were previously out of reach for single-modality AI.
Moreover, this shift unlocks new realms of creativity and innovation, potentially transforming fields like art, design, and scientific research. The ability of multimodal AI to bridge different types of information also has profound implications for accessibility, allowing for more effective communication tools for people with disabilities.
Perhaps most excitingly, by integrating visual context with language processing, multimodal AI could break down language barriers, fostering improved cross-cultural understanding and communication on a global scale. Collectively, these advancements underscore how multimodal AI is not merely an evolution in technology, but a revolutionary force that could reshape how we interact with machines, process information, and ultimately understand our world.
Companies at the Forefront
Several tech giants and innovative startups are leading the charge in developing advanced GPT models and multimodal AI:
- OpenAI: Known for its GPT models, OpenAI has made significant strides with GPT-4, which can process both text and images.
- Google DeepMind: The company's PaLM-E model integrates large language models with robotic control, showcasing the potential of multimodal AI in physical interactions.
- Anthropic and AWS: While less is publicly known about their specific joint efforts, Anthropic - with Amazon as a major investor and AWS as its primary cloud partner - has been pushing the boundaries of AI capabilities and ethical AI development.
Innovative Startups
- Hugging Face: This startup has become a hub for open-source AI models, including many GPT-based and multimodal projects.
- Adept AI Labs: Founded by former OpenAI and Google researchers, Adept is working on AI models that can interact with software interfaces.
- Stability AI: Known for its work on Stable Diffusion (which, among its many uses, enables the creation of detailed images from text prompts, as sketched below), Stability AI is pushing the boundaries of generative AI across multiple modalities.
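For a concrete sense of that text-to-image capability, here is a minimal sketch using the open-source diffusers library to run Stable Diffusion; the checkpoint name and prompt are illustrative choices:

```python
# Text-to-image with Stable Diffusion via the open-source diffusers library.
# Requires a GPU for reasonable speed; checkpoint and prompt are examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A single text prompt in, a detailed image out.
image = pipe("a detailed watercolor of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```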
Measuring the Impact of GPT Evolution
As GPT models evolve from text-only to multimodal capabilities, a crucial question emerges: How do we effectively evaluate these increasingly sophisticated AI systems? While expert benchmarks provide valuable technical insights, they may not fully capture the most important metric: end-user satisfaction.
Recognizing this gap, an open-source initiative led by Salman Paracha (an AWS alumnus, like me) and Katanemo has launched a "human" benchmark study for Large Language Models (LLMs). This study aims to measure the quality corridor that matters most to end users when interacting with LLMs, including the latest multimodal GPT models.
The Katanemo benchmark seeks to answer critical questions (a sketch of how such a threshold might be estimated follows the list):
- Is there a threshold beyond which improvements in LLM response quality no longer significantly impact user satisfaction?
- At what point does the performance of these models dip to levels that users find unacceptable?
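The study's methodology isn't spelled out here, so the following is a purely hypothetical illustration of the first question: fit a saturating curve to (response-quality, satisfaction) survey pairs and locate the point where further quality gains stop moving satisfaction. All data and the functional form are invented for demonstration and are not Katanemo's method:

```python
# Hypothetical illustration: find where user satisfaction plateaus as
# response quality improves, by fitting a saturating (logistic) curve.
# The survey data below is invented; this is not Katanemo's methodology.
import numpy as np
from scipy.optimize import curve_fit

quality = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)  # rated quality
satisfaction = np.array([1.2, 1.8, 2.9, 3.8, 4.3, 4.6, 4.7, 4.8, 4.8, 4.8])

def logistic(x, top, steepness, midpoint):
    """Satisfaction rises with quality, then saturates near `top`."""
    return top / (1.0 + np.exp(-steepness * (x - midpoint)))

(top, steepness, midpoint), _ = curve_fit(
    logistic, quality, satisfaction, p0=[5.0, 1.0, 4.0]
)

# Threshold: the quality level where the curve reaches ~95% of its plateau,
# beyond which further improvements barely move satisfaction.
threshold = midpoint + np.log(0.95 / 0.05) / steepness
print(f"Satisfaction plateaus at ~{top:.2f}; threshold quality ≈ {threshold:.1f}")
```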
By participating in a brief 30-second survey, individuals can contribute valuable data to this community effort. The results of this study will help establish a standardized measure for LLM quality, empowering researchers, developers, and consumers to make more informed decisions about which models to use or further develop.
The Road Ahead
As we move from "GPT is dead" to "Long live GPTs," we're entering an era where these models are becoming more versatile, more capable, and more integrated into our daily lives. The evolution towards multimodality represents not just a technological advancement, but a paradigm shift in how we interact with AI.
However, with great power comes great responsibility. The development of more advanced GPT models and multimodal AI raises new ethical concerns and challenges, particularly in areas like deepfakes and privacy. It will be crucial for researchers, companies, and policymakers to work together to ensure that these powerful new tools are developed and deployed responsibly.
The future of GPTs is not about the death of an old technology, but the birth of new possibilities. As these models continue to evolve and incorporate multimodal capabilities, we can expect to see applications that were once the stuff of science fiction become reality, fundamentally changing how we interact with technology and the world around us.
Follow me on Twitter or LinkedIn.