8/10/2025

Why GPT-5 Still Fumbles with Information Extraction (And How We're Fixing It)

So, GPT-5 is finally out in the wild. The hype was, as always, through the roof. And while it’s undeniably a powerful piece of tech, a lot of us in the trenches are noticing something… it’s still not quite the magic bullet we hoped for, especially when it comes to a task that’s deceptively simple: information extraction.
Honestly, it’s a bit of a letdown for some. Early reports & user feedback are a mixed bag. On one hand, it's faster & better at coding. On the other, it can be less creative, refuse to give detailed answers, & sometimes the new reasoning capabilities just don't kick in when you expect them to. There was even a bug right at launch where an "autoswitcher" broke, making the model seem way less smart than it should have been.
But here's the thing: the struggles of GPT-5 with information extraction aren't really a "GPT-5" problem. It’s a large language model problem. These models, at their core, are built on an architecture that, for all its strengths, has some baked-in limitations that make pulling specific, structured data out of a messy block of text a surprisingly tough job.
Let's get into why this is such a persistent challenge & what we're actually doing to solve it.

The Core of the Problem: Unstructured Brain, Structured World

Think about how you read an article or an email. You're not just processing a string of words; you're instantly identifying key pieces of information. "Who is this email from?" "What's the main point?" "Is there a deadline?" You're extracting structured data (sender, topic, date) from unstructured text.
This is what information extraction (IE) is all about. It’s the process of pulling specific entities like names, dates, locations, & relationships out of plain text & organizing them into a neat, predictable format. It sounds simple for us, but for an AI, it’s a huge hurdle.
LLMs like GPT-5 are designed to understand & generate human-like text. They are masters of context, grammar, & style. But they think in flowing language, not in neat little boxes. The core challenge is the mismatch between their natural language output & the rigid, structured format we need for, say, a database or a spreadsheet.
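To make that concrete, here's the shape of the transformation we're after (the field names are just for illustration):

    Input:  "Hi team, Acme Corp confirmed the renewal. Please countersign by Jan. 5, 2024."
    Output: {"company": "Acme Corp", "event": "contract renewal", "deadline": "2024-01-05"}

The model has to read prose on the left & emit something machine-checkable on the right, every single time.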
Here are the big roadblocks we keep running into:

1. The Hallucination & Accuracy Dilemma

This is the big one everyone talks about. LLMs are notorious for "hallucinating" – making things up that sound plausible but are just plain wrong. When you're trying to extract critical data, like a customer's order number or a specific contract clause, "plausible" isn't good enough. It has to be PERFECT.
A primary development focus for GPT-5 was actually to reduce these hallucinations & improve reliability. But the problem is fundamental. The model is a prediction engine. It’s always trying to guess the next most likely word. Sometimes, that guess is a brilliant inference; other times, it's a confident fabrication. For tasks that demand 100% accuracy, this is a massive risk. We're constantly asking: can we trust the output? Adjudicating the differences between what a human extracts & what an LLM extracts is a huge challenge because even the definition of "correct" can vary.
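One cheap mitigation worth knowing about is a grounding check: before trusting an extracted value, verify that it actually appears in the source text. Here's a minimal Python sketch; the field names are placeholders, & verbatim matching is deliberately strict, so treat failures as "needs review" rather than "wrong":

    def grounding_check(source_text: str, extracted: dict) -> dict:
        """Flag extracted values that never appear verbatim in the source."""
        lowered = source_text.lower()
        return {field: str(value).lower() in lowered
                for field, value in extracted.items()}

    # Example: the model invented a refund amount that isn't in the email.
    email = "Customer wants a refund for order #12345, placed last Tuesday."
    extracted = {"order_id": "#12345", "refund_amount": "$49.99"}
    print(grounding_check(email, extracted))
    # {'order_id': True, 'refund_amount': False}  <- flag for human review

It won't catch every fabrication (paraphrases & normalized values will trip it too), but it turns "trust me" into a testable claim.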

2. Context is Everything… & It’s Hard

LLMs are great with context, but they can also be slaves to it. Their understanding is heavily dependent on the surrounding text. This can lead to a few problems:
  • Ambiguity: Human language is full of it. Does "Apple" refer to the company, the fruit, or Gwyneth Paltrow's kid? For an LLM, distinguishing this without a broader, almost common-sense understanding of the world is tricky.
  • Complex Relationships: Extracting that "Company A signed a deal with Company B" is one thing. Extracting the nuances—the value of the deal, the effective date, the key stakeholders, & the termination clauses, all spread across a 50-page document—is another level of complexity.
  • Variable Formats: Data isn't always presented in the same way. One document might list a date as "Jan. 5, 2024," another as "01/05/2024," & a third might say "the fifth day of the new year." LLMs can struggle to consistently recognize & normalize this kind of variable information, so you usually end up writing a normalization layer yourself (see the sketch below).
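Here's a minimal sketch of that normalization layer, assuming the python-dateutil package. Note that a genuinely free-form phrase like "the fifth day of the new year" still won't parse; that's exactly the gap the LLM has to cover:

    from dateutil import parser

    def normalize_date(raw: str) -> str | None:
        """Coerce a date string into ISO 8601, or None if unparseable."""
        try:
            return parser.parse(raw).date().isoformat()
        except (ValueError, OverflowError):
            return None

    print(normalize_date("Jan. 5, 2024"))  # 2024-01-05
    print(normalize_date("01/05/2024"))    # 2024-01-05 (month-first by default)
    print(normalize_date("the fifth day of the new year"))  # None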

3. The Black Box & The Transformer's Limits

This gets a bit more technical, but it's SUPER important. Modern LLMs are built on something called the Transformer architecture. It’s revolutionary, but it has its limits.
The core of the Transformer is a mechanism called "self-attention." It's incredible at figuring out how different words in a sentence relate to each other, even over long distances. This is why LLMs are so good at maintaining coherence. But this power comes at a cost.
  • Computational Complexity: The attention mechanism's calculations grow quadratically with the length of the text (O(n²) in the number of tokens). In simple terms, if you double the length of the document you're analyzing, the computational effort doesn't just double, it quadruples. This makes processing very long documents, like legal contracts or research papers, incredibly resource-intensive & expensive.
  • Fixed Context Windows: Because of this complexity, Transformers have a fixed "context window" – a limit on how much text they can "see" at one time. While GPT-5 expanded its window to 256,000 tokens, it's still a hard limit. If a crucial piece of information is outside that window, the model is blind to it. The standard workaround is chunking, sketched right after this list.
  • It's a Black Box: Honestly, we don't always know why a Transformer model makes a certain decision. They are so complex that tracing the exact path from input to output is almost impossible. This lack of interpretability is a huge problem for applications where you need to be able to explain the reasoning, like in legal or medical fields.
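Here's what that chunking workaround looks like in practice, as a minimal sketch assuming OpenAI's tiktoken tokenizer (the chunk sizes are arbitrary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def chunk_by_tokens(text: str, max_tokens: int = 2_000, overlap: int = 200):
        """Split a long document into overlapping token windows.

        The overlap matters for extraction: a fact that straddles a
        chunk boundary would otherwise be invisible to every chunk.
        """
        tokens = enc.encode(text)
        step = max_tokens - overlap
        for start in range(0, len(tokens), step):
            yield enc.decode(tokens[start:start + max_tokens])

Each chunk gets processed independently, then the per-chunk extractions get merged, which brings its own deduplication headaches.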

So, How Do We Actually Fix This?

Alright, so the challenges are pretty significant. But it's not all doom & gloom. The industry is getting REALLY clever about working around these limitations. It's less about waiting for a perfect "GPT-6" & more about building smart systems around the models we have.
Here's a look at the strategies that are actually working:

1. Smarter Prompt Engineering & Few-Shot Learning

This is the most immediate fix. Instead of just asking the model to "extract the key information," we give it very specific instructions & examples. This is often called "prompt engineering."
We can provide a "schema" in the prompt, telling the model EXACTLY what fields we want, e.g. {"company_name": "", "invoice_date": "", "total_amount": ""}, & the format they should be in. By giving it a few examples of the text & the desired structured output (this is called "few-shot learning"), we can dramatically improve its accuracy for a specific task. Some new techniques even involve dynamically creating the data models, which makes the process way more flexible.
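Here's a minimal sketch of that pattern, assuming the OpenAI Python SDK; the schema, the worked example, & the model name are all placeholders to swap for your own:

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT_TEMPLATE = """Extract invoice fields as JSON matching this schema:
    {"company_name": "", "invoice_date": "", "total_amount": ""}
    Use ISO 8601 dates. If a field is missing, leave it as "".

    Text: "Invoice from Globex Inc., dated Jan. 5, 2024, for $1,200.00."
    JSON: {"company_name": "Globex Inc.", "invoice_date": "2024-01-05", "total_amount": "$1,200.00"}

    Text: "<<DOCUMENT>>"
    JSON:"""

    def extract_invoice(document: str) -> dict:
        response = client.chat.completions.create(
            model="gpt-5",  # placeholder; use whatever model you have access to
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.replace("<<DOCUMENT>>", document)}],
            response_format={"type": "json_object"},  # ask for valid JSON back
        )
        return json.loads(response.choices[0].message.content)

The one worked example in the prompt makes this "one-shot"; adding two or three more, covering edge cases like missing fields, is usually where the accuracy jump comes from.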

2. Fine-Tuning on Specific Datasets

While massive pre-trained models are great generalists, they can be made into specialists. We can take a base model & "fine-tune" it on a smaller, highly specific dataset. For example, we could fine-tune a model on thousands of legal contracts to make it an expert at extracting clause information. This teaches the model the specific patterns & vocabulary of that domain, which significantly boosts its accuracy & reliability for that niche task.
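As a rough sketch, the training data for this usually ends up as JSONL chat transcripts: the raw text in, the gold-standard extraction out. Assuming the OpenAI-style fine-tuning format (the contract snippet & labels here are invented):

    import json

    # Each example pairs raw contract text with the hand-labeled extraction
    # we want the fine-tuned model to reproduce.
    labeled_examples = [
        {
            "text": "This Agreement may be terminated by either party "
                    "upon thirty (30) days' prior written notice.",
            "label": {"clause_type": "termination", "notice_period_days": 30},
        },
        # ... thousands more, covering every clause type & edge case
    ]

    with open("train.jsonl", "w") as f:
        for ex in labeled_examples:
            record = {"messages": [
                {"role": "system", "content": "Extract the clause as JSON."},
                {"role": "user", "content": ex["text"]},
                {"role": "assistant", "content": json.dumps(ex["label"])},
            ]}
            f.write(json.dumps(record) + "\n")

Label quality tends to matter more than raw quantity here; noisy annotations just teach the model to be confidently sloppy.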

3. Retrieval-Augmented Generation (RAG)

This one is pretty cool. Instead of trying to stuff an entire library of documents into the model's limited context window, we use a two-step process.
First, we use a simpler, faster method (like a vector search) to find the most relevant snippets of text from a vast database. Then, we feed only those relevant snippets to the powerful LLM. It's like having a research assistant who finds the right pages in a book for you, so you only have to read a few paragraphs instead of the whole thing. This approach helps overcome context window limitations & makes the process way more efficient.
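A minimal sketch of that retrieve-then-read pattern, assuming the sentence-transformers package for embeddings (the snippets & model choice are just illustrative):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, runs locally

    documents = [
        "Refunds are processed within 5 business days of approval.",
        "Orders ship from our Austin warehouse via ground freight.",
        "Premium support is available on the Enterprise plan only.",
    ]
    doc_vectors = model.encode(documents, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Return the k snippets most similar to the query."""
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vectors @ q  # dot product == cosine (vectors normalized)
        return [documents[i] for i in np.argsort(scores)[::-1][:k]]

    snippets = retrieve("How long do refunds take?")
    # Only these snippets, not the whole knowledge base, go to the LLM:
    prompt = "Answer using only this context:\n" + "\n".join(snippets)

Real systems swap the in-memory list for a vector database, but the shape of the pipeline is the same: embed, retrieve, then generate.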

4. Human-in-the-Loop & Multi-LLM Validation

For mission-critical applications, you can't rely on a single AI's output. The solution? Create a workflow.
One common approach is to have a human review & validate the AI's extracted data. This keeps accuracy high & also creates a feedback loop to further train the model.
Another fascinating technique is using multiple LLMs to "challenge" each other. You can have one LLM extract the data, & a second, different LLM review the output. If they disagree, the data point is flagged for human review. This leverages the different strengths & weaknesses of various models to create a more robust system.
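A minimal sketch of that cross-check; the two extractor arguments are stubs for whichever pair of models you choose:

    def cross_validate(document: str, fields: list[str],
                       extract_a, extract_b) -> dict:
        """Run two independent extractors; agreement passes automatically,
        disagreement gets routed to a human reviewer."""
        a, b = extract_a(document), extract_b(document)
        results = {}
        for field in fields:
            if a.get(field) == b.get(field):
                results[field] = {"value": a.get(field),
                                  "status": "auto-approved"}
            else:
                results[field] = {"candidates": [a.get(field), b.get(field)],
                                  "status": "needs human review"}
        return results

Because the two models tend to fail in different ways, the fields they agree on are much safer than either model's output alone, & the reviewer only ever sees the genuine disagreements.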

How Arsturn Fits Into This Puzzle

This is where things get really practical, especially for businesses. All these challenges & solutions are great in theory, but how do you implement them? This is exactly the kind of problem we're obsessed with at Arsturn.
When a business wants to provide instant, accurate answers to its customers, it's facing an information extraction problem. A customer asks, "What's the status of my order #12345?" They don't want a long-winded paragraph; they want a specific piece of data.
This is where building a custom AI chatbot becomes so powerful. Arsturn helps businesses create custom AI chatbots trained on their own data. We're not just plugging into a generic GPT model. We're building a specialized tool. Here’s how it addresses the core issues:
  • Tackling Unstructured Data: A business's knowledge base—FAQs, product docs, shipping policies—is a classic example of unstructured data. Arsturn helps businesses build no-code AI chatbots trained on this specific data. This means the chatbot isn't guessing based on the entire internet; it's retrieving information directly from the company's own, verified sources. This dramatically reduces hallucinations & improves accuracy.
  • Instant, Structured Answers: When a visitor interacts with a website, they want immediate engagement. For businesses, this is a huge opportunity for lead generation & customer support. Arsturn's platform helps businesses build meaningful connections with their audience through personalized chatbots. These bots can parse a user's question, understand the intent, extract the key entities (like a product name or issue type), & provide an instant, relevant answer or guide them to the right human, 24/7. This boosts conversions & provides a much better customer experience.
The future of information extraction isn't about one giant, all-knowing AI. It's about creating focused, specialized tools that leverage the power of LLMs while mitigating their weaknesses. It’s about building smart systems that can navigate the messy world of human language & pull out the structured data that businesses need to operate efficiently & serve their customers better.
The road to GPT-5 has been a bit bumpy, & it's clear that the pursuit of pure scale is hitting diminishing returns. The real breakthroughs are now happening in how we apply, constrain, & validate these powerful models for specific, real-world tasks.
Hope this was helpful & gives you a better sense of what's really going on under the hood with these new AI models. Let me know what you think.

Copyright © Arsturn 2025