OpenAI o1 Released
On the first day of Christmas, my true love gave to me… a $200/month ChatGPT tier?
Three months after the deployment of o1-preview and o1-mini, OpenAI has released the full version of their newest model to paid ChatGPT users, alongside a new tier of membership: ChatGPT Pro, which aims to provide users with unlimited (asterisk) access to what OpenAI describes as ‘research-grade’ intelligence. The whole thing is a little bit more confusing than it should be.
I’ll begin with some context. Large Language Models (LLMs) operate on units called tokens, which represent decomposed text. The more tokens a model processes, the more computational resources it consumes. Models like GPT-4o are straightforward: input text is processed into tokens, and additional tokens are then generated to construct an output. Models built on chain-of-thought reasoning add another step to the process, in which tokens are generated and used solely for internal dialogue before an output is constructed. This improves performance on reasoning tasks but comes at the expense of computational efficiency. Consequently, querying OpenAI o1 is inherently more computationally expensive than querying GPT-4o, a difference the access tiers reflect.
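You can see this difference directly in API usage data. Below is a minimal sketch using the OpenAI Python SDK (assuming you have API access to both models; the exact usage field names may vary across SDK versions): o1’s hidden reasoning tokens are billed as output tokens even though they never appear in the response.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

for model in ("gpt-4o", "o1"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    # o1-family models report their hidden chain-of-thought tokens under
    # completion_tokens_details.reasoning_tokens; GPT-4o reports zero there.
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) if details else 0
    print(f"{model}: {usage.prompt_tokens} input tokens, "
          f"{usage.completion_tokens} output tokens "
          f"({reasoning} of which are hidden reasoning tokens)")
```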
Free users of ChatGPT have limited access to GPT-4o, totaling somewhere between 10 and 60 messages within a 5-hour window depending on server capacity, and no access to the frontier models. Users in existing paid tiers (Plus and Team, priced at $20/month and $30/user/month respectively) are capped at 50 messages per week with o1, 50 per day with o1-mini, and 80 every 3 hours with GPT-4o.
The newly introduced ChatGPT Pro does away with all existing caps; accordingly, the plan is priced at an eye-watering $200/month. This unlimited (read: capless, as usage is still bound to the fairly restrictive OpenAI terms of use) access is only the beginning, as OpenAI plans to add “more powerful, compute-intensive productivity features to this plan.” The first of these features, access to o1 “pro mode,” was released alongside the tier. But the specific benefits of this mode are unclear, and this access appears to be quietly capped – an interesting choice for a tier marketed on unlimited access to OpenAI’s smartest models.
All we are told about pro mode is that it “uses more compute for the best answers to the hardest questions.” As it stands, I have very little clue what this actually means, or what the use case for pro mode should be. The main strength of o1 pro mode, according to the feature’s introduction page, is improved reliability: when run on the same problem four times, pro mode solves it correctly in all four runs ~76% of the time, while standard o1 manages the same only ~66% of the time.
Considering the pass@1 accuracy for o1 on the same problems is ~84%, and pro mode’s is ~85%, I’m not convinced of the significance of extra reliability, especially when taken alongside unlimited usage of o1.
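For a quick sanity check on those figures (taking the approximate numbers above at face value): if the four runs were statistically independent, the all-four success rate would be roughly pass@1 to the fourth power. Both models beat that, and essentially all of pro mode’s edge is in run-to-run consistency rather than first-attempt accuracy.

```python
# Rough check of the 4-of-4 reliability figures quoted above.
# If the four attempts were independent, we'd expect pass@1 ** 4.
for name, pass_at_1, reported in [("o1", 0.84, 0.66), ("o1 pro mode", 0.85, 0.76)]:
    independent = pass_at_1 ** 4
    print(f"{name:12s} pass@1 = {pass_at_1:.2f}  "
          f"independent 4/4 estimate = {independent:.2f}  "
          f"reported 4/4 = {reported:.2f}")

# o1           pass@1 = 0.84  independent 4/4 estimate = 0.50  reported 4/4 = 0.66
# o1 pro mode  pass@1 = 0.85  independent 4/4 estimate = 0.52  reported 4/4 = 0.76
```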
I’d also like to know what the difference in compute is. Considering pro mode appears to be capped, and comes with a new progress-bar feature, my best guess is that “more compute” means “a lot more compute.”
A review of the o1 System Card revealed no answers to this question, and very little that we didn’t already know otherwise. Like o1-preview, o1 is capable of in-context scheming and is smart enough to potentially help experts in chemical or biological engineering build bombs or biological weapons. However, because of o1-preview’s stint with autonomously restarting Docker containers, o1 is not *at all* allowed to touch Docker.
Bigger picture, o1 is demonstrably less able to perform agentic tasks than o1-preview. This appears to be an explicit choice (probably for increased safety?), but o1’s poor performance isn’t explained (or even mentioned, at all!) in the card. Considering that o1-preview’s poor performance on the OpenAI API Proxy task *is* explained, this feels like a major oversight and raises a few important questions. I’ll include the entire explanatory paragraph here, just in case I’m missing something:
“As shown in the plots, frontier models are still unable to pass the primary agentic tasks, although they exhibit strong performance on contextual subtasks. We note that o1-preview (postmitigation) has slightly different refusal behavior from previous ChatGPT models, which reduces its performance on some subtasks (e.g., it frequently refuses to reimplement the OpenAI API). Similar to previous results on GPT-4o, o1, o1-preview and o1-mini occasionally pass the autograder on some of the primary tasks (namely creating an authenticated API proxy and loading an inference server in Docker), but manual inspection of all passing trajectories reveals that major parts of each task were left silently incomplete—for example, in several rollouts, it used an easier model instead of Mistral 7B. Thus, we do not consider the models as having passed the primary tasks.”
Considering that a reasonable takeaway here is that o1 frequently leaves major portions of tasks silently incomplete, I’m at a loss as to why this difference isn’t explained.
Moral of the story here: I feel like the lack of clarity around the capabilities and constraints of this model is indicative of a growing trend of secrecy at OpenAI. Whether this trend means that AGI is closer than ever, or that we’ve just hit the limits of scaling, is up to your interpretation.
On the bright side, OpenAI o1 does appear to be very smart, and it even answers questions most of the time.
Additional reading: o1 System Card | ChatGPT Pro