8/12/2025

So, GPT-5 Stumbles on Basic Physics? Here’s Why & How We Can Help It Think Better

You’ve probably seen the headlines or maybe even experienced it yourself: the shiny new GPT-5, hailed as a major leap in AI, sometimes gets tripped up by problems you’d find in a high school physics textbook. It’s a little weird, right? You’ve got this incredibly powerful model that can write poetry, debug code, & even have surprisingly deep conversations, but ask it a seemingly simple physics question, & it might just fall flat on its face.
Honestly, it’s a fascinating problem. A physics student on Reddit shared their experience testing GPT-5 with Tsiolkovsky's rocket equation. It’s a classic, but GPT-5 made a simple sign mistake that would have sent the rocket plummeting to the ground. Even after getting a second chance, it fumbled the notation again. This isn’t an isolated incident. AI critic Gary Marcus pointed out that almost immediately after its launch, GPT-5 was showing "flawed physics explanations." Some users on platforms like Reddit have been even more blunt, with one saying it’s been "gutter water level shit" for their physics research.
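For reference, the equation the student was testing is short enough to check by hand: Δv = v_e · ln(m0 / mf), where v_e is the exhaust velocity, m0 is the rocket's initial (fueled) mass, & mf is its final (dry) mass. Because m0 is bigger than mf, the logarithm is positive & so is Δv. We don't know exactly which sign GPT-5 flipped, but invert that mass ratio (or drop a minus sign while deriving it) & you get a negative Δv, i.e. the rocket-plummets-to-the-ground answer.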
So what gives? Is GPT-5 dumber than we think? Not exactly. The problem isn’t about raw knowledge. It’s about reasoning. Here’s the thing: the way these large language models (LLMs) "think" is fundamentally different from how a human physicist thinks. & understanding that difference is the key to unlocking their true potential.

The Real Reason LLMs Like GPT-5 Flunk Physics 101

It boils down to a core concept: LLMs are masters of pattern matching, not genuine logical reasoning. They’ve been trained on an unfathomable amount of text & code from the internet. When you give them a prompt, they don’t understand the physics principles in the way a student does. Instead, they are making a highly educated guess about what word should come next based on the countless examples they’ve seen before.
Think of it like this: you can memorize every line of Shakespeare, but that doesn't mean you can analyze a new play with the same depth as a literature professor. The LLM has memorized the "lines" of physics problems, but it doesn't have an intuitive grasp of the underlying "plot" – the laws of the universe.
This leads to a few major roadblocks when it comes to physics problems:
1. They Don't Really Get the Concepts: An LLM might be able to spit out the definition of Newton's second law, but it doesn't have a mental model of forces, mass, & acceleration. It doesn't understand that a sign error in an equation means the rocket goes the wrong way. It's just manipulating symbols based on patterns. This is why a simple transitive logic problem like "Jane runs faster than Joe, & Joe runs faster than Sam. Does Sam run faster than Jane?" can trip up an LLM. It doesn't reason through the logic; it just echoes what it thinks is the most statistically likely answer. (There's a quick demo of what that sign-error failure actually looks like right after this list.)
2. They're Brittle – Small Changes Cause Big Problems: Researchers have found that if you take a math problem an LLM can solve & just slightly change the numbers or add a bit of irrelevant information, its performance can plummet. This is a huge red flag. It shows the model is relying on familiar problem structures it saw in its training data. A real physicist, on the other hand, can adapt their knowledge to a completely new scenario because they understand the core principles. This "distribution shift" problem is a known Achilles' heel for LLMs.
3. Multi-Step Reasoning is a Minefield: Physics problems often require a chain of logical steps. You need to identify the right formula, plug in the right values, solve for a variable, & then maybe use that result in another formula. Each step is a potential point of failure for an LLM. A recent study introducing a benchmark called PhysReason found that even top-performing models struggle as the number of reasoning steps increases. The study identified key bottlenecks: applying physics theorems correctly, understanding the physical process, performing calculations, & analyzing the conditions of the problem. Another benchmark focusing on university-level physics found that even the best models could only achieve about 60% accuracy, highlighting the massive challenge.
4. Overconfidence is a HUGE Issue: One of the most frustrating things users have noted about GPT-5 is its overconfidence. It will generate a completely wrong answer but present it with the swagger of a Nobel laureate. It might even suggest additional, more complex calculations when it hasn't even gotten the first part right. This is because the model is designed to be helpful & generate fluent-sounding text, not to express uncertainty or doubt.
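To make point 1 concrete, here's a tiny Python sketch of what that kind of sign slip does to the rocket equation. This is an illustration, not GPT-5's actual output, & the numbers are made up but physically plausible:

import math

def delta_v(ve, m0, mf):
    # Tsiolkovsky rocket equation: delta-v = ve * ln(m0 / mf)
    return ve * math.log(m0 / mf)

def delta_v_with_sign_slip(ve, m0, mf):
    # Same formula with the mass ratio inverted -- one "small" slip
    return ve * math.log(mf / m0)

ve, m0, mf = 3000.0, 500_000.0, 100_000.0  # exhaust velocity (m/s), wet mass (kg), dry mass (kg)
print(round(delta_v(ve, m0, mf)))                 # ~ +4828 m/s: the rocket climbs
print(round(delta_v_with_sign_slip(ve, m0, mf)))  # ~ -4828 m/s: the rocket "plummets"

A human who knows what Δv means catches the negative number instantly. A pattern-matcher will happily report it anyway, which is exactly the confidence problem in point 4.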
This is a critical point for businesses thinking about using AI. If you're building a customer-facing tool, you can't have it confidently giving out wrong information. That’s where specialized platforms come in. For instance, if a company wants to provide reliable customer support, they need an AI that's trained specifically on their own data & knowledge base. This is where a solution like Arsturn becomes pretty cool. It helps businesses create custom AI chatbots trained on their specific documentation, product info, & FAQs. This means the chatbot isn't just guessing based on the whole internet; it's providing answers based on a controlled, accurate set of data, leading to far more reliable & trustworthy interactions with customers 24/7.

So, How Do We Make These AI Brains Better at Physics?

The good news is that a ton of smart people are working on this exact problem. It's not about just throwing more data at the models. It’s about teaching them how to reason. Here are some of the most promising techniques being explored:

Better Prompting: The "Explain Your Work" Method

It turns out that how you ask the question matters. A LOT.
  • Chain-of-Thought (CoT) Prompting: This is a game-changer. Instead of just asking for the final answer, you add a simple phrase like "Let's think step by step." This simple instruction forces the model to break down the problem & show its work. It mimics how humans reason through a problem, & it dramatically reduces errors by making the process more structured.
  • Self-Consistency: This is the AI equivalent of double-checking your work. You ask the model to solve the same problem multiple times using different reasoning paths. Then, you take the most common answer as the correct one. This helps to weed out answers that were the result of a single faulty line of reasoning. (There's a rough sketch of this, layered on top of chain-of-thought prompting, right after the list.)
  • Tree of Thoughts (ToT): This is a more advanced version of CoT. It prompts the model to explore multiple reasoning paths simultaneously, like branches of a tree. It can then evaluate these different paths & decide which one is the most promising, or even backtrack if a certain path leads to a dead end. It's a much more dynamic & robust way of thinking through a problem. (A stripped-down sketch of this idea follows the list as well.)
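To make this less abstract, here's a rough sketch of self-consistency layered on top of chain-of-thought prompting. It assumes a generic ask_llm(prompt, temperature) helper that calls whatever model you're using & returns its text; that helper, the prompt wording, & the "ANSWER:" convention are placeholders, not any specific vendor's API:

import re
from collections import Counter

def extract_answer(text):
    # We ask the model to end with a line like "ANSWER: 42" -- purely a convention we pick
    match = re.search(r"ANSWER:\s*([-+]?\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None

def solve_with_self_consistency(question, ask_llm, n_samples=5):
    # Chain-of-thought: explicitly ask for step-by-step reasoning before the answer
    prompt = (
        f"{question}\n"
        "Let's think step by step, then give the final number on a line starting with 'ANSWER:'."
    )
    # Sample several independent reasoning paths at a nonzero temperature
    answers = []
    for _ in range(n_samples):
        reasoning = ask_llm(prompt, temperature=0.7)
        answer = extract_answer(reasoning)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Self-consistency: the answer reached by the most reasoning paths wins
    return Counter(answers).most_common(1)[0][0]

The interesting bit is the vote at the end: one faulty chain of reasoning gets outvoted by the chains that got it right.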
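& here's a deliberately stripped-down take on the Tree of Thoughts idea: propose a few candidate next steps, have the model score each partial path, keep the most promising ones, & repeat. Real ToT implementations are more elaborate (proper search, backtracking, task-specific evaluators), & ask_llm is the same placeholder helper as above:

def tree_of_thoughts(question, ask_llm, depth=3, branch=3, beam=2):
    states = [""]  # each state is the reasoning accumulated so far
    for _ in range(depth):
        candidates = []
        for state in states:
            for _ in range(branch):
                # Propose one more reasoning step continuing this path
                step = ask_llm(
                    f"Problem: {question}\nReasoning so far:{state}\n"
                    "Write the single next reasoning step.",
                    temperature=0.8,
                )
                new_state = state + "\n" + step
                # Have the model judge how promising the partial path looks (0-10)
                score_text = ask_llm(
                    f"Problem: {question}\nPartial reasoning:{new_state}\n"
                    "Rate how promising this is from 0 to 10. Reply with just the number.",
                    temperature=0.0,
                )
                try:
                    score = float(score_text.strip())
                except ValueError:
                    score = 0.0
                candidates.append((score, new_state))
        # Prune: keep only the best few paths (the "beam")
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        states = [state for _, state in candidates[:beam]]
    # Finish off the most promising surviving path
    return ask_llm(
        f"Problem: {question}\nReasoning:{states[0]}\nGive the final answer.",
        temperature=0.0,
    )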

Smarter Architectures: Building Better Brains

Beyond just prompting, researchers are looking at fundamentally changing how these models are built & used.
  • Generating Code: One really clever approach is to have the LLM write a small Python program to solve the problem instead of trying to do the math itself in natural language. LLMs are surprisingly good at coding. By offloading the calculation to a programming language, you eliminate simple arithmetic errors & make the logic explicit enough to actually check. This also makes the process more transparent because you can look at the code to see exactly how the model "reasoned." (There's a toy example of what that generated program might look like after this list.)
  • Retrieval-Augmented Generation (RAG): This technique gives the LLM a "cheat sheet." Before answering a question, the model first searches a trusted knowledge base (like a physics textbook or a set of research papers) for relevant information. It then uses that retrieved information to construct its answer. This helps to ground the model's response in factual data rather than just its internal, sometimes flawed, knowledge. (A bare-bones sketch of that retrieve-then-answer loop follows this list.) This is exactly the principle that makes a tool like Arsturn so effective for businesses. By building a no-code AI chatbot trained on their own data, companies are essentially using a hyper-focused version of RAG to ensure their AI is providing personalized & accurate customer experiences, which is great for boosting conversions & building trust.
  • Specialized Fine-Tuning: The "one model to rule them all" approach might not be the future. We're seeing more work on fine-tuning models on specific, high-quality datasets for particular domains, like physics or math. This involves training the model on curated sets of problems & solutions to build a more robust & specialized reasoning capability.
  • Representation Engineering: This is a bit more on the cutting edge. It involves looking inside the model's "brain" – its activation layers – to see what's happening when it's correctly solving a problem. Researchers can then create "control vectors" to nudge the model's internal state in the right direction when it's given a new problem, improving its performance without any extra training.
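To picture the code-generation approach, here's the kind of program a model might emit for a classic intro problem ("an object is dropped from 80 m; when does it land & how fast is it going?"). The problem & the code are an illustration of the idea, not actual GPT-5 output:

import math

g = 9.81   # gravitational acceleration, m/s^2
h = 80.0   # drop height, m

# h = (1/2) * g * t^2  =>  t = sqrt(2h / g)
t = math.sqrt(2 * h / g)

# Impact speed for a drop from rest: v = g * t
v = g * t

print(f"time to impact: {t:.2f} s")    # ~4.04 s
print(f"impact speed:   {v:.2f} m/s")  # ~39.6 m/s

The interpreter does the arithmetic, so arithmetic slips are off the table, & the reasoning is written down as code anyone can inspect.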
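& here's a bare-bones sketch of the retrieve-then-answer loop behind RAG. The keyword-overlap retriever is deliberately naive (production systems use embeddings & a vector store), & ask_llm is the same placeholder helper as in the prompting sketches:

def retrieve(question, documents, top_k=2):
    # Score each document by how many words it shares with the question --
    # a crude stand-in for embedding similarity
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def answer_with_rag(question, documents, ask_llm):
    context = "\n\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using ONLY the reference material below. "
        "If the material doesn't cover it, say you don't know.\n\n"
        f"Reference material:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt, temperature=0.0)

The model still writes the answer, but it's grounded in whatever trusted material you put in documents instead of whatever it half-remembers from the internet.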

The Road Ahead

The fact that GPT-5 isn’t a perfect physics genius right out of the box isn’t a failure; it’s a signpost. It shows us the limits of the current paradigm of just scaling up models. The future of AI isn't just about more data; it's about better reasoning.
We're moving from models that are just incredible mimics to models that can be guided to think in a more structured, logical, & reliable way. For those of us using AI in our daily lives or for our businesses, it’s a reminder that these are powerful tools, but they are still tools. Understanding their strengths & weaknesses is the key to using them effectively. Whether it’s through clever prompting, using specialized platforms for specific tasks, or embracing the next generation of reasoning-focused models, we're slowly but surely teaching the machine not just to talk, but to think.
Hope this was helpful & gives you a better picture of what's going on under the hood! Let me know what you think.

Copyright © Arsturn 2025