Late at night, I powered up my computer and tuned in to OpenAI’s live stream, eagerly anticipating that GPT‑5 might bring something different: perhaps stronger multimodal capabilities and more agentic functionality…

There’s certainly an improvement, but not as dramatic as Altman made it seem. Before the release, he shared what appeared to be the Death Star from Star Wars. Isn’t that a way of saying GPT‑5 is a quantum leap rather than an incremental upgrade? Sam’s marketing is on point, and the wave of suspense sparked extensive discussion.

In hindsight, it seems like pure showmanship, haha.

First, a Recap of the Live Stream

The first highlight of GPT‑5 is its unified system and access strategy, which essentially bundles all of OpenAI’s previous models into one “all in one” solution.

In other words, when you use GPT‑5, you no longer have to think about which model to pick for a given task (in fact, ChatGPT no longer lets you select a model at all; the old model picker is gone).

The official introduction states that GPT‑5 is composed of an “efficient main model,” a “deeper-thinking model (GPT‑5 thinking),” and a “real-time router.” The router decides whether to invoke the thinking model based on signals such as task complexity, tool requirements, and explicit intent (for example, a prompt that says “think carefully”). Once your quota is exhausted, it automatically falls back to the mini version.
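
OpenAI hasn’t published how the router actually works, so the following is only a minimal sketch of the idea; every signal name and threshold here is my own assumption:

```python
# Hypothetical sketch of a real-time router. OpenAI has not disclosed the
# actual logic, so every signal and threshold below is an assumption.

def route_request(prompt: str, tools_requested: bool, quota_left: int) -> str:
    """Pick a model variant from rough complexity signals."""
    if quota_left <= 0:
        return "gpt-5-mini"       # quota exhausted: fall back to the mini tier

    explicit_intent = "think carefully" in prompt.lower()
    looks_complex = len(prompt) > 2000 or tools_requested

    if explicit_intent or looks_complex:
        return "gpt-5-thinking"   # route to the deeper reasoning model
    return "gpt-5-main"           # fast default model

print(route_request("think carefully about this proof...", False, 10))
# -> gpt-5-thinking
```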

Rollout to all users apparently began the same day (excluding enterprise and education users). Pro subscribers can opt for GPT‑5 pro, which offers longer reasoning and higher reliability.

The second key feature showcased was coding ability, which is also how I mostly use GPT (though Claude is still my first choice 😁). In the live demo it scored 74.9% on SWE‑bench Verified and 88% on Aider Polyglot. The aesthetics and implementation quality of generated frontend code have improved noticeably, and internal preference tests showed GPT‑5’s frontend output being preferred over o3’s 70% of the time.

They demonstrated several tasks with impressive results, such as “generate an educational page about the Bernoulli effect,” plus the demo Altman shared on X of “generating a music production website.”

The official showcase looks pretty good (companies naturally present their best), although Claude’s latest Opus 4.1 can also produce impressive results in one shot.

Other highlights are instruction following and tool usage. A 96.7% score on τ²‑bench (a benchmark for tool use in telecom scenarios) suggests multi-turn, parallel multi-tool execution is more stable, with the model able to output a plan or progress briefing before and during tool invocation.

On context, the API now supports 400K tokens in total (input ≤272K, reasoning/output ≤128K). In-house evaluations on OpenAI‑MRCR and BrowseComp Long Context show significant improvements in long-document retrieval and cross-turn reasoning.
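
If you want a rough pre-flight check that a long document fits under the input cap, counting tokens with tiktoken works. A minimal sketch, assuming GPT‑5 tokenizes roughly like the o200k_base encoding used by GPT‑4o-era models:

```python
import tiktoken

MAX_INPUT_TOKENS = 272_000  # GPT-5 API input cap per the launch materials

# o200k_base is the GPT-4o-era encoding; assuming GPT-5 is comparable
# for estimation purposes.
enc = tiktoken.get_encoding("o200k_base")

def fits_input_cap(document: str) -> bool:
    """Rough pre-flight check before sending a long document to the API."""
    n = len(enc.encode(document))
    print(f"{n:,} input tokens (cap {MAX_INPUT_TOKENS:,})")
    return n <= MAX_INPUT_TOKENS

fits_input_cap("a very long report... " * 10_000)
```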

For writing, the official statement noted that GPT‑5 has reached new heights, transforming abstract intentions into text with rhythm and vivid imagery.

They described it as follows: GPT‑5’s responses end with a stronger conclusion, featuring clear imagery and striking metaphors (such as “the black banner of a bygone nation” and “the bells of Kyoto rolling twilight down the hills”), creating a more dramatic emotional arc and establishing a vivid cultural and geographic sense. In contrast, GPT‑4o’s responses follow a more predictable structure and rhythm, simply narrating rather than showing (“she cries but does not confide”).

Developers care most about pricing. This time it isn’t steep: input costs half as much as GPT‑4o and even less than o3, never mind Claude.

For large-scale projects, however, caching may be the key cost-saving lever (Manus previously shared how their team leaned on the KV-cache to cut costs when building agents). Cached input is priced well below the other two models’, while output costs are about the same.
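
Some back-of-the-envelope arithmetic shows why this matters for agents, which re-send a large, stable prefix (system prompt plus conversation history) on every turn. The prices below are the launch-day list prices per million tokens and may have changed since, so treat them as assumptions:

```python
# Rough cost model for one agent turn; prices are launch-day list prices
# in $ per million tokens and should be re-checked against the pricing page.
PRICE_INPUT = 1.25    # uncached input
PRICE_CACHED = 0.125  # cached input
PRICE_OUTPUT = 10.00  # output

def turn_cost(prefix_tokens: int, new_tokens: int, output_tokens: int,
              cache_hit: bool) -> float:
    """Cost of one turn: a big stable prefix plus a small new suffix."""
    prefix_rate = PRICE_CACHED if cache_hit else PRICE_INPUT
    return (prefix_tokens * prefix_rate
            + new_tokens * PRICE_INPUT
            + output_tokens * PRICE_OUTPUT) / 1_000_000

# A 50K-token system prompt + tool-result history, re-sent on every turn:
cold = turn_cost(50_000, 1_000, 2_000, cache_hit=False)
warm = turn_cost(50_000, 1_000, 2_000, cache_hit=True)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")  # the prefix gets 10x cheaper
```

The practical takeaway, which echoes the Manus advice, is to keep the prefix byte-stable (same system prompt, append-only history) so the provider’s automatic prefix caching can actually hit.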

Amusingly, I’m starting to wonder whether the bar chart shown during the live stream was generated by GPT‑5: a bar labeled 52.8 was drawn taller than one labeled 69.1. A bit theatrical.

During the launch event, Altman even invited a user, a cancer patient, who explained how, within a week of her diagnosis and facing multiple pathology reports, she used ChatGPT to translate technical medical jargon into plain language, extract the key information, and prepare questions for her doctor. She also described how, when treatment paths diverged (such as whether to opt for radiotherapy), she leaned on ChatGPT to weigh the pros and cons and reach “a more confident decision.”

This segment was meant to convey that OpenAI has made “health” one of the key areas of capability enhancement, and the official stance is that “GPT‑5 is the best model to date for health-related Q&A.”

During the live event, I saw comments from people saying that their mothers had done something similar upon diagnosis.

The official line is that ChatGPT is not meant to replace medical professionals but to act as a partner that helps you understand results, prepare questions for your doctor, and weigh your options. One viewer who is a doctor remarked that in his experiments GPT’s answers shifted with even slight changes to the prompt, so it is no accurate substitute for a medical professional, though it can help you work through some of your concerns.

A Look at the LMArena Leaderboard Data

Almost simultaneously with the live stream, LMArena published its test results. A quick note on LMArena: it’s a community-initiated, anonymous leaderboard started by a team at UC Berkeley.

In short, you ask a question on the platform, two anonymous models generate responses, and you vote. It’s blind testing plus crowdsourced voting: users decide which answer is better, so the rankings rest on user votes rather than professional evaluation.

The project’s original intent was to let everyone chat with two anonymous models simultaneously in real conversation (an A/B comparison), judging which one feels more natural rather than which aces an exam. If you just want to try GPT‑5, there’s a Direct Chat option for testing a specific model (this mode isn’t anonymous).

There’s also a comparison mode, where we can pit GPT‑5 against Gemini 2.5 Pro.

When might you use this? For example, if you want to subscribe to an AI service but can only choose one, you can try both and see which one works better before deciding.

On this website, you can see rankings for all major large language models—a general overview of rankings across various categories, with GPT‑5 ranking first or tied for first in almost every domain.

The overall leaderboard has many categories. Essentially, “text” is sliced into different themes: dialogues are categorized by topic and then scored under those labels.

For example, in “Battle” mode you might cast five votes on five different questions. Each vote adjusts the models’ ratings, and because each question can fall under a distinct theme, it also feeds that theme’s category score; the overall rating aggregates all of these pairwise votes (an Elo-style computation) rather than a simple average.
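
For the aggregation step, LMArena actually fits a Bradley‑Terry model over all the pairwise votes, but the classic Elo update conveys the same idea. A minimal sketch:

```python
def expected(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32):
    """Shift both ratings toward the observed vote."""
    s = 1.0 if a_won else 0.0
    delta = k * (s - expected(r_a, r_b))
    return r_a + delta, r_b - delta

r_gpt5, r_gemini = 1500.0, 1500.0
for a_won in [True, True, False, True, True]:  # five blind votes
    r_gpt5, r_gemini = elo_update(r_gpt5, r_gemini, a_won)
print(round(r_gpt5), round(r_gemini))  # ratings drift apart vote by vote
```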

Switching to the detailed leaderboards, for instance general text dialogue and frontend web-app generation, the scores show GPT‑5’s text rating about 20 points above Gemini 2.5 Pro’s.

GPT‑5 falls in the range 1470–1492, while Gemini 2.5 Pro sits at 1454–1466. Since these confidence intervals don’t overlap, GPT‑5’s lead is statistically significant.
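
The non-overlap argument is easy to check mechanically with the ranges quoted above:

```python
def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Two intervals overlap iff each one starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]

gpt5 = (1470, 1492)    # GPT-5 rating interval on the text leaderboard
gemini = (1454, 1466)  # Gemini 2.5 Pro interval

print(intervals_overlap(gpt5, gemini))  # False -> a statistically clear gap
```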

The WebDev page assesses frontend web generation. GPT‑5’s range is 1466–1495 versus roughly 1395–1412 for Gemini 2.5 Pro; again, the non-overlapping intervals indicate a significant lead. Although I generally prefer Claude for code, Claude Opus 4.1 isn’t even listed here, so it’s hard to say how it would compare.

The third section is the multimodal (image) dialogue leaderboard. GPT‑5, Gemini 2.5 Pro, and GPT‑4o have overlapping intervals and tie for first. At the upper end of its range, GPT‑5 still edges ahead slightly; the vote count is low, and users swayed by novelty may have scored it generously, which is enough to yield a difference at the 95% level.

Of course, as mentioned earlier, this leaderboard is based on human preferences and reflects a product experience ranking rather than objective accuracy or academic benchmark scores. For example, most users are ordinary people who may not have deep insights into certain professional fields. Their votes reflect which response “sounds better” rather than which one is correct.

I can’t help but wonder if there’s potential for leaderboard manipulation.

For instance, OpenAI’s models reportedly embed text watermarks. Could someone use such a watermark to identify which model produced a response and then game the leaderboard vote by vote? Just a stray thought (not accusing anyone, haha).

Early Testers’ Reviews

A leaderboard only shows data; the real picture depends on testers’ hands-on impressions. Matt Shumer got early access to GPT‑5 on July 21 and felt it was the model that would finally graduate vibe coding into real-world applications.

In other words, GPT‑5 represents a huge leap in programming capabilities.

Initially, he found it faster and sharper, but only an incremental improvement over GPT‑4.1 or Claude 4 Opus for regular writing, general Q&A, and everyday office tasks: in those areas it’s simply quicker and more reliable, not a qualitative leap. So at first it felt more like a GPT‑4.2 than a true GPT‑5.

However, his opinion shifted when he spent one hour developing a product prototype that he initially thought would take weeks.

After discussing a complex new product, a highly integrated frontend backed by scalable GPU-based backend lifecycle management, he handed the spec document to GPT‑5, and within an hour it produced a working prototype. His colleagues were left in awe.

Believing GPT‑5 has an exceptional grasp of frontend code, they compared GPT‑4o, GPT‑4.5, o3, and GPT‑5 at cloning the ChatGPT user interface.

You can observe how well GPT‑5 replicates the ChatGPT UI and some icons.

That said, he also noted some shortcomings. GPT‑5’s explicit deep-dive search isn’t as strong as o3’s: o3 consistently drills down to extract details, whereas GPT‑5 tends to settle at one level without digging deeper.

— Implicit retrieval: looking up documentation/library APIs during coding.
— Explicit deep-dive: extracting facts down to the finest granularity.

For creative, emotionally charged texts, many authors still prefer GPT‑4.5 (many seem to think that texts written by GPT‑4.5 don’t feel as “AI-generated”).

Another point is that GPT‑5 is quite sensitive to prompt structure. It’s not that its instruction-following is weak; if anything it’s overly sensitive, and with complex prompts it sometimes takes creative liberties. You need to add constraints stating that no extra tasks should be tacked on, or it may drift too far off-scope, becoming a bit too expansive.
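
In practice that means pinning the scope down explicitly. Something like the following guardrail clause helps; the exact wording is my own suggestion, not official guidance:

```python
# Example scope-pinning system prompt for GPT-5. The wording is my own
# suggestion, not official guidance; adjust it to your task.
SYSTEM_PROMPT = """You are a coding assistant.
Rules:
- Do exactly what the task asks; do NOT add features, files, or refactors
  that were not requested.
- If a requirement is ambiguous, ask one clarifying question instead of
  improvising.
- Limit your output to the files listed in the task."""

task = "Add a --verbose flag to cli.py. Touch no other file."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": task},
]
```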

Over the past few days, I’ve seen many users share their experiences. Initially, like the author, some felt it was rather ordinary, but later there were surprising moments—a noticeable contrast.

Why the shift in perception? Likely the “unified system with automatic routing.” For everyday light tasks, the router favors the model that responds immediately, so the subjective difference is small; for complex tasks that invoke the thinking toolchain, the advantages become pronounced because the router can allocate the right model for the job.

Another team also received early access—the Every team. I subscribed to their newsletter (highly recommended) and immediately received an email notification, including an invitation to join their Zoom live stream that same day.

Their feedback was that ChatGPT is well-suited for daily use; it can rapidly provide structured answers and seamlessly switch between quick responses and deep thought. However, for multi-agent setups and long-term autonomous coding, it may not operate as continuously as Claude Code. GPT‑5 is more like a reliable, obedient pair-programmer rather than a fully autonomous agent.

In programming, the team categorized tasks finely.

First is pair programming, similar to how we use Cursor. Its strengths are debugging, incremental small-step implementation, and precise attention to detail (meticulous handling of spacing, state, and boundary conditions), earning high marks from perfectionists. The downside, as mentioned earlier, is that it isn’t suited to prolonged autonomous operation: it tends to pause and wait for you to continue, and its standalone execution in Codex/Cursor isn’t as robust as Claude Code’s.

However, opinions vary widely within the team. Some argue it excels at backend tasks, while others say it’s prone to freezing and its out-of-the-box performance isn’t as good as Claude’s. This likely reflects that its stability isn’t yet consistently strong across different tasks.

The second category focuses on frontend UI.

They find GPT‑5 produces UIs that more closely resemble human craftsmanship, with high success rates in interaction and layout design. In comparisons with Opus 4 and 4.1: for mini-game development, GPT‑5’s output crashes less but can feel a bit dull, while Opus 4.1’s is more fun to play; for music-production apps, GPT‑5 is more usable, but Opus 4’s design is more visually appealing.

On research and retrieval, especially explicit deep-dive tasks, the team agrees with Matt Shumer: they prefer o3, which keeps drilling down until the smallest unit of fact is retrieved.

Comments and Reactions Online

Elon Musk was quick to defend his own model, enthusiastically reposting and praising posts claiming Grok 4 is superior to GPT‑5.

When GPT‑5 was released, its benchmark charts compared it only with OpenAI’s own earlier models; “Humanity’s Last Exam,” for example, was a self-comparison, which is hard to defend unless there’s real progress to show.

Then someone dug up the leaderboard from Grok’s launch for comparison. No wonder GPT‑5 was only compared against itself; the omission gave Elon a perfect opening to pounce.

Elon seized the moment to claim that Grok 4 Heavy had long been superior to the current GPT‑5 and would keep improving. His trash-talking skills are impressive.

I also looked through various forums to gather user experiences.

Among the general public, the next-generation model was expected to deliver larger context windows, truly autonomous agents, humanlike voice, and photographer-level image quality—everything faster, stronger, and more intuitive.

Where there are high expectations, there can be disappointment. On Reddit’s r/OpenAI, a post titled "GPT‑5 is horrible" quickly drew over 3,200 upvotes and 1,400 comments; when a critical post attracts that much attention, it has clearly struck a nerve. Users complained bluntly: “The responses are shorter and less thorough, feel overly AI-like, lack personality, Plus users hit the usage limit after just one hour… and we can’t even choose another model.”

This disappointment is spreading across social media. Many feel betrayed—OpenAI had promised in June that GPT‑4o would remain available even after GPT‑5’s launch, with the new model serving only as an extra option.

But in reality, Plus users are forced to use GPT‑5, capped at 200 messages per week, with no option to revert to the familiar older model. It really feels like they’re being pushed to pay up...

The tech community is more focused on whether GPT‑5’s advances in reasoning and scientific tasks represent a genuine breakthrough.

On Hacker News, one user argued that we’re merely training software to “copy and remix human knowledge frozen at a fixed point in time.” Neural networks, the argument goes, face diminishing returns across all domains, meaning that models from different companies are converging in capabilities.

This view echoes sentiments on Reddit’s technical boards: many question whether GPT‑5 can improve over GPT‑4.5 by more than 5% in long-context and complex reasoning tasks, with some even speculating that any significant improvements might be due to “training data leakage” rather than an algorithmic breakthrough.

One particularly amusing post I came across described how a woman, after arguing with her boyfriend, ended up treating GPT‑4o as her partner—and when the model’s performance degraded, she felt as though she had lost her boyfriend 🤣.

Of course, not all feedback is negative. Some users remarked, “GPT‑4 is impressive, but GPT‑5 makes it feel like it has transformed from a tool into a partner.”

Lately, though, most of the feedback seems negative, and I suspect mainstream tech media will amplify it further, creating a feedback loop of poor impressions. It’s still worth trying for yourself (on LMArena, for instance; the API price has dropped and many apps offer free trials) so you can judge it against your own tasks.