8/10/2025

How many 'b's in 'blueberry'? Why GPT-5 Still Fails at Simple Logic Puzzles

Alright, let's talk about something that's been on my mind a lot lately. We're living in this wild era of AI, with models like GPT-4 and its successors getting scarily good at writing poems, coding websites, & even acing exams. The hype around the upcoming GPT-5 is already through the roof, with whispers of it being a monumental leap in AI intelligence. But here's the thing that trips me up, & maybe you've noticed it too. Ask one of these super-advanced AIs a ridiculously simple question, like "How many 'b's are in the word 'blueberry'?", & you might get a confident, yet COMPLETELY wrong, answer.
It’s a paradox, right? How can something so "intelligent" fail at a task a first-grader could solve in seconds? It makes you question what's really going on under the hood. Honestly, it feels like we've built a world-class race car that can't handle a simple speed bump.
This isn't just a fun little quirk. It points to some deep, fundamental limitations in how these Large Language Models (LLMs) work. And as we start integrating them into everything from customer service to medical diagnostics, understanding these weak spots is more important than ever. So, let's pull back the curtain a bit & explore why these brilliant machines can be so… well, dumb.

The "Illusion of Thinking"

First off, we need to get one thing straight: LLMs don't think like we do. They are masters of pattern recognition, not genuine comprehension. An LLM is a type of deep learning neural network, most commonly built on what's called the Transformer architecture. This architecture is brilliant at processing massive amounts of text—we're talking about a significant chunk of the internet—& learning the statistical relationships between words & phrases.
When you ask it a question, it's not reasoning from first principles. It's making a highly educated guess about what the next word in the sequence should be, based on the countless examples it has seen during its training. This is why they can generate such human-like text; they've learned the patterns of our language better than anyone. But it’s also why they can be so easily fooled. A recent paper from Apple even called this the "Illusion of Thinking," noting that while these models can generate detailed thought processes, they struggle with true logical reasoning, especially as problems get more complex.
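If you want to see that next-word guessing in action, here's a tiny Python sketch using the small, open GPT-2 model as a stand-in (we obviously can't peek inside GPT-5, so treat the prompt & the exact numbers as illustrative, not a claim about any particular frontier model):

# A minimal sketch of next-token prediction with the open GPT-2 model.
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The word 'blueberry' contains"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # scores for every vocabulary token at each position

next_token_probs = logits[0, -1].softmax(dim=-1)   # distribution over the NEXT token only
top = next_token_probs.topk(5)

# The model isn't counting anything; it's ranking likely continuations.
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>12}  {float(prob):.3f}")

The output is just a ranked list of plausible next tokens. That's the whole game: likelihood, not logic.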
Think about it this way: you could memorize every single chess game ever played. That would make you an incredible opponent, able to recall the perfect move in millions of situations. But does that mean you understand the strategy of chess? Not necessarily. You're just recalling patterns. That's essentially what an LLM does. It's a grandmaster of recall, but a novice at true, flexible reasoning.
This leads to some pretty embarrassing, & sometimes serious, real-world failures. We've seen an Air Canada chatbot invent a non-existent bereavement policy, which a tribunal then ordered the airline to honor. There have been instances of lawyers using ChatGPT for legal research, only to find it had confidently cited completely fabricated legal cases. One researcher even documented an LLM explaining, in great detail, the process of metamorphosis in slugs—something that, you know, doesn't happen. These aren't just glitches; they're symptoms of a system that's built on linguistic patterns, not a grounded understanding of reality.

The Tokenization Problem: Why Counting is Harder Than Calculus

So, back to our "blueberry" problem. Why does this simple task break these complex systems? The answer, in large part, comes down to a single, technical concept: tokenization.
When we see the word "blueberry," we see a sequence of 9 individual letters. An LLM doesn't. To make processing massive amounts of text manageable, LLMs break words down into smaller chunks called "tokens." A token might be a whole word, a part of a word (a subword), or even just a punctuation mark.
For example, the word "blueberry" might be broken down into tokens like ["blue", "berry"]. The word "indivisibility" might become ["in", "divis", "ibility"]. The model then operates on these tokens, not on the individual letters. So when you ask it to count the 'b's in "blueberry," you're asking it to perform an operation at a level of detail it's not designed to "see." It's like asking someone to count all the letter 'a's in a library by only giving them a list of the book titles. They can't do it because they don't have access to the actual text inside the books.
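If you're curious what this looks like in practice, here's a tiny sketch using OpenAI's open-source tiktoken library. The exact split depends on which encoding a given model uses, so treat the pieces it prints as illustrative:

# See how a model-facing tokenizer actually splits words.
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by GPT-4-era models

for word in ["blueberry", "indivisibility"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    # Prints multi-letter chunks, not individual characters
    print(word, "->", pieces)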
This is a fundamental architectural choice. Processing text character-by-character is computationally inefficient & makes it harder for the model to grasp the semantic meaning of the text. Subword tokenization is a clever compromise, but it creates this blind spot. The model can recognize the concept of counting letters because it has seen millions of examples of people talking about it, but it can't actually perform the counting operation reliably because it lacks the character-level awareness.
Interestingly, this is also why an LLM might be able to solve a complex algebra problem but fail at counting. Algebra problems, while complex to us, are often about recognizing & manipulating well-defined patterns that appear frequently in training data. Counting letters is a procedural, algorithmic task that requires a different kind of processing—one that LLMs just aren't built for.
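For contrast, the same task is a one-liner in ordinary code, because a plain program operates on characters directly:

def count_letter(word: str, letter: str) -> int:
    # Deterministic, character-level counting; no training data required
    return word.lower().count(letter.lower())

print(count_letter("blueberry", "b"))  # 2, every single time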

Emergent Abilities & The Hope for Scale

Now, the big argument from the AI community has always been that these problems will be solved with scale. The idea is that as models get bigger—more parameters, more training data—they will develop new, "emergent abilities." These are skills that don't exist in smaller models but just sort of "pop" into existence once a certain threshold of size & complexity is reached. We've seen this happen with things like arithmetic, question answering, & summarization.
This is the big hope for models like GPT-5. The rumors suggest it will be a "unified system" with vastly improved reasoning & a significant reduction in hallucinations. The idea is that with enough scale, the model will finally move beyond simple pattern matching & start to develop a more robust, generalized understanding of the world. OpenAI's CEO, Sam Altman, has even hinted that GPT-5 will represent a significant leap in intelligence, making current models feel like "an old iPhone."
But here's the catch: emergence is unpredictable. We don't really know which abilities will emerge or when. And there's a growing body of research suggesting that some of these limitations might be fundamental to the transformer architecture itself. One study found that even the newest models exhibit a "complete accuracy collapse" beyond a certain level of complexity. Another found that while LLMs might get better at tasks with more training, they still heavily rely on memorization & fail when a problem is even slightly different from what they've seen before.
So, will GPT-5 finally be able to count the 'b's in "blueberry" reliably? Maybe. But even if it can, it might just be because it has memorized so many examples of that specific question that it can retrieve the correct answer through a more sophisticated form of pattern matching. The underlying logical deficit might still be there, waiting to be exposed by a slightly different, novel problem.

The Business Implications: When "Good Enough" Isn't

This is where things get really practical, especially for businesses. We're all excited about using AI to automate tasks, improve customer service, & drive efficiency. But if the underlying technology has these kinds of logical blind spots, we have to be incredibly careful about how we deploy it.
For a lot of businesses, the first touchpoint with AI is a chatbot on their website. The dream is to have a 24/7 virtual assistant that can answer customer questions, generate leads, & provide instant support. And honestly, this is a perfect use case for AI, as long as it's done right. This is where tools like Arsturn become so important. The reality is, you can't just plug in a generic LLM & hope for the best. You need a platform that understands these limitations & is built to work around them.
Arsturn helps businesses create custom AI chatbots trained on their own data. This is CRITICAL. It grounds the AI in the reality of your business—your products, your policies, your knowledge base. By doing this, you're not relying on the model's generalized, and often flawed, understanding of the world. You're giving it a specific, controlled environment to operate in. This dramatically reduces the risk of the chatbot making up fake policies or hallucinating incorrect product information. It’s about building a no-code AI chatbot that provides personalized customer experiences by sticking to the script you've provided.
When you're talking about lead generation or website optimization, you need a system that can reliably guide a user through a conversation & capture the right information. You can't have it getting confused by simple logic or going off on a tangent. By using a platform like Arsturn, businesses can build conversational AI that creates meaningful connections with their audience, not because it's a super-intelligent AGI, but because it's a well-trained, focused tool designed to do a specific job exceedingly well. It's about leveraging the language capabilities of LLMs while putting guardrails in place to mitigate their inherent weaknesses.
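To make the grounding idea concrete, here's a deliberately tiny sketch of the pattern (hypothetical data & a naive keyword lookup, NOT Arsturn's actual implementation): answer only from a controlled knowledge base, & decline rather than improvise when nothing matches.

KNOWLEDGE_BASE = {
    "refund policy": "Refunds are available within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(question: str) -> str | None:
    # Naive keyword lookup; a real system would use semantic search over your docs
    q = question.lower()
    for topic, answer in KNOWLEDGE_BASE.items():
        if any(word in q for word in topic.split()):
            return answer
    return None

def grounded_answer(question: str) -> str:
    context = retrieve(question)
    if context is None:
        return "I'm not sure about that one. Let me connect you with a human."
    # In a real system you'd pass `context` to an LLM with strict instructions
    # to answer ONLY from it; here we just return the grounded text directly.
    return context

print(grounded_answer("What's your refund policy?"))
print(grounded_answer("Do slugs undergo metamorphosis?"))

The point isn't the dozen lines of Python; it's the shape. The model never gets a chance to freelance outside the data you gave it.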

So, Where Do We Go From Here?

The truth is, we're in a weird, in-between phase of AI. The capabilities are undeniably impressive, but the limitations are just as stark. We have models that can write beautiful prose about the feeling of riding a bike without having any concept of balance or momentum. They are, as one paper put it, "models of language, not of reality."
Experts are divided on whether these are just growing pains or fundamental, hard limits of the current technology. Some believe that with more data, better training techniques, & new architectural tweaks, we can overcome these hurdles. Others argue that the very foundation of LLMs—predicting the next token—creates a ceiling that we're already starting to hit. They argue that true reasoning will require entirely new architectures, perhaps ones that are more explicitly designed for logic & symbolic reasoning, or that have a more grounded understanding of the world through sensory input.
For now, we have to be smart about how we use these tools. We need to be both amazed by what they can do & deeply skeptical of their limitations. We need to design systems that play to their strengths—language generation, summarization, pattern recognition—while using other tools & human oversight to handle tasks that require strict accuracy, logic, & common sense.
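One simple pattern for that: route the questions that need exact answers to deterministic code, & let the model handle the open-ended language work. Here's a rough sketch (production systems do this more robustly with function/tool calling, but the logic is the same, & the regex & placeholder LLM call here are just for illustration):

import re

def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

def ask_llm(question: str) -> str:
    # Placeholder for a real model call via whatever LLM API you're using
    return "LLM-generated answer goes here."

def route(question: str) -> str:
    # Letter-counting questions go to plain code; everything else goes to the model
    match = re.search(r"how many '?(\w)'?s? .*?'(\w+)'", question.lower())
    if match:
        letter, word = match.groups()
        return f"There are {count_letter(word, letter)} '{letter}'s in '{word}'."
    return ask_llm(question)

print(route("How many 'b's are in the word 'blueberry'?"))  # There are 2 'b's in 'blueberry'.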
So next time you hear about the incredible new feats of GPT-5 or some other next-gen AI, take it with a grain of salt. Be impressed, be excited, but also, maybe ask it how many 'b's are in "blueberry." The answer might be more revealing than any marketing announcement.
Hope this was helpful. Let me know what you think.
