First impressions of OpenAI o1: An AI designed to overthink it

Digest   2024-09-14 17:22   Beijing







Image Credits: David Paul Morris/Bloomberg / Getty Images

OpenAI released its new o1 models on Thursday, giving ChatGPT users their first chance to try AI models that pause to “think” before they answer. There’s been a lot of hype building up to these models, codenamed “Strawberry” inside OpenAI. But does Strawberry live up to the hype?

Sort of.

Compared to GPT-4o, the o1 models feel like one step forward and two steps back. OpenAI o1 excels at reasoning and answering complex questions, but the model is roughly four times more expensive to use than GPT-4o. OpenAI’s latest model lacks the tools, multimodal capabilities, and speed that made GPT-4o so impressive. In fact, OpenAI even admits that “GPT-4o is still the best option for most prompts” on its help page, and notes elsewhere that o1 struggles at simpler tasks.

“It’s impressive, but I think the improvement is not very significant,” said Ravid Shwartz Ziv, an NYU professor who studies AI models. “It’s better at certain problems, but you don’t have this across-the-board improvement.”

For all of these reasons, it’s important to use o1 only for the questions it’s truly designed to help with: big ones. To be clear, most people are not using generative AI to answer these kinds of questions today, largely because today’s AI models are not very good at it. However, o1 is a tentative step in that direction.

Thinking through big ideas

OpenAI o1 is unique because it “thinks” before answering, breaking down big problems into small steps and attempting to identify when it gets one of those steps right or wrong. This “multi-step reasoning” isn’t entirely new (researchers have proposed it for years, and You.com uses it for complex queries), but it hasn’t been practical until recently.

“There’s a lot of excitement in the AI community,” said Workera CEO and Stanford adjunct lecturer Kian Katanforoosh, who teaches classes on machine learning, in an interview. “If you can train a reinforcement learning algorithm paired with some of the language model techniques that OpenAI has, you can technically create step-by-step thinking and allow the AI model to walk backwards from big ideas you’re trying to work through.”

OpenAI o1 is also uniquely pricey. In most models, you pay for input tokens and output tokens. However, o1 adds a hidden process (the small steps the model breaks big problems into), which adds a large amount of compute you never fully see. OpenAI is hiding some details of this process to maintain its competitive advantage. That said, you still get charged for these in the form of “reasoning tokens.” This further emphasizes why you need to be careful about using OpenAI o1, so you don’t get charged a ton of tokens for asking where the capital of Nevada is.
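To make the billing point concrete, here is a minimal sketch of how hidden reasoning tokens can inflate the cost of a trivial question. Everything in it is an assumption for illustration: the per-million-token rates are placeholders rather than OpenAI's real prices (the second model is simply set to four times the first, echoing the rough ratio mentioned above), the token counts are invented, and the sketch assumes reasoning tokens are billed at the output-token rate. The names `Pricing` and `request_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical USD rates per million tokens (placeholders, not real prices)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing, input_tokens, visible_output_tokens, reasoning_tokens=0):
    """Estimate the cost of one request in USD.

    Assumption: hidden reasoning tokens are billed like output tokens,
    even though the user never sees them in the response.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = (
        input_tokens * pricing.input_per_million
        + billed_output * pricing.output_per_million
    )
    return total / 1_000_000


# Placeholder rates: the reasoning model costs 4x the baseline per token.
baseline = Pricing(input_per_million=1.0, output_per_million=2.0)
reasoning_model = Pricing(input_per_million=4.0, output_per_million=8.0)

# A simple question: short prompt, short visible answer.
# The 500 reasoning tokens are an invented figure for illustration.
simple_baseline = request_cost(baseline, input_tokens=20, visible_output_tokens=10)
simple_reasoning = request_cost(
    reasoning_model,
    input_tokens=20,
    visible_output_tokens=10,
    reasoning_tokens=500,
)

print(f"baseline model:  ${simple_baseline:.6f}")
print(f"reasoning model: ${simple_reasoning:.6f}")
print(f"cost multiple:   {simple_reasoning / simple_baseline:.0f}x")
```

Under these made-up numbers, the per-token price difference is only part of the story: most of the bill comes from reasoning tokens that never appear in the answer, which is why a quick factual lookup is a poor fit for this kind of model.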

The idea of an AI model that helps you “walk backwards from big ideas” is powerful, though. In practice, the model is pretty good at that.

In one example, I asked ChatGPT o1 preview to help my family plan Thanksgiving, a task that could benefit from a little unbiased logic and reasoning. Specifically, I wanted help figuring out if two ovens would be sufficient to cook a Thanksgiving dinner for 11 people and wanted to talk through whether we should consider renting an Airbnb to get access to a third oven.



(Maxwell Zeff/OpenAI)



(Maxwell Zeff/OpenAI)

After 12 seconds of “thinking,” ChatGPT wrote me a 750+ word response, ultimately telling me that two ovens should be sufficient with some careful strategizing, and that this would allow my family to save on costs and spend more time together. But it broke down its thinking for me at each step of the way and explained how it considered all of these external factors, including costs, family time, and oven management.

ChatGPT o1 preview told me how to prioritize oven space at the house that is hosting the event, which was smart. Oddly, it suggested I consider renting a portable oven for the day. That said, the model performed much better than GPT-4o, which required multiple follow-up questions about what exact dishes I was bringing, and then gave me bare-bones advice I found less useful.

Asking about Thanksgiving dinner may seem silly, but you could see how this tool would be helpful for breaking down complicated tasks. 

I also asked o1 to help me plan out a busy day at work, where I needed to travel between the airport, multiple in-person meetings in various locations, and my office. It gave me a very detailed plan, but it was maybe a little bit much. Sometimes, all the added steps can be a little overwhelming.

For a simpler question, o1 does way too much — it doesn’t know when to stop overthinking. I asked where you can find cedar trees in America, and it delivered an 800+ word response, outlining every variation of cedar tree in the country, including their scientific names. It even had to consult with OpenAI’s policies at some point, for some reason. GPT-4o did a much better job answering this question, delivering me about three sentences explaining you can find the trees all over the country.

Tempering expectations

In some ways, Strawberry was never going to live up to the hype. Reports about OpenAI’s reasoning models date back to November 2023, right around the time everyone was looking for an answer about why OpenAI’s board ousted Sam Altman. That spun up the rumor mill in the AI world, leaving some to speculate that Strawberry was a form of AGI, the enlightened version of AI that OpenAI aspires to ultimately create.

Altman confirmed o1 is not AGI to clear up any doubts, not that you’d be confused after using the thing. The CEO also trimmed expectations around this launch, tweeting that “o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”

The rest of the AI world is coming to terms with a less exciting launch than expected.

“The hype sort of grew out of OpenAI’s control,” said Rohan Pandey, a research engineer with the AI startup ReWorkd, which builds web scrapers with OpenAI’s models.

He’s hoping that o1’s reasoning ability is good enough to solve a niche set of complicated problems where GPT-4 falls short. That’s likely how most people in the industry are viewing o1: as a tool for those niche problems, and not quite the revolutionary step forward that GPT-4 represented for the industry.

“Everybody is waiting for a step function change for capabilities, and it is unclear that this represents that. I think it’s that simple,” said Brightwave CEO Mike Conover, who previously co-created Databricks’ AI model Dolly, in an interview. 

What’s the value here?

The underlying principles used to create o1 go back years. Andy Harrison, a former Googler and CEO of the venture firm S32, points out that Google used similar techniques in 2016 to create AlphaGo, the first AI system to defeat a world champion of the board game Go. AlphaGo trained by playing against itself countless times, essentially self-teaching until it reached superhuman capability.

He notes that this brings up an age-old debate in the AI world.

“Camp one thinks that you can automate workflows through this agentic process. Camp two thinks that if you had generalized intelligence and reasoning, you wouldn’t need the workflow and, like a human, the AI would just make a judgment,” said Harrison in an interview.

Harrison says he’s in camp one and that camp two requires you to trust AI to make the right decision. He doesn’t think we’re there yet.

However, others think of o1 as less of a decision-maker and more of a tool to question your thinking on big decisions.

Katanforoosh, the Workera CEO, described an example where he was going to interview a data scientist to work at his company. He tells OpenAI o1 that he only has 30 minutes and wants to assess a certain number of skills. He can then work backward with the AI model to understand if he’s thinking about this correctly, and o1 will understand time constraints and whatnot.

The question is whether this helpful tool is worth the hefty price tag. As AI models continue to get cheaper, o1 is one of the first AI models in a long time that we’ve seen get more expensive.

Source: TechCrunch, “First impressions of OpenAI o1: An AI designed to overthink it,” by Maxwell Zeff


科技世代千高原
A clear view of the age of deep technologization™, in search of a desirable human future