LLMs and the Mirage of Thinking Across Problem Complexity

Pushkar Shukla, Chalapathi Rao, Yash Singh & Aashay Dhruv
November 19, 2025

We are living in an age where large language models (LLMs) can seamlessly mimic human thought, and the boundary between genuine reasoning and sophisticated imitation has become increasingly blurred. As we interact with and refine these AI systems, we're forced to confront profound questions: do these models truly 'think', or are they merely creating an illusion of thought through complex patterns?

The rise of Agentic AI further complicates our understanding, prompting us to reconsider our definitions of intelligence and autonomy. By examining the strengths and limitations of reasoning models through the lens of problem complexity, we aim to demystify how these systems operate, revealing not just the remarkable achievements of artificial intelligence, but also the critical gaps in our comprehension of its inner workings.

Subtle Patterns, Significant Questions

The challenges and limitations we’ve encountered internally have increasingly mirrored patterns emerging across the wider industry. While not always prominent, they often surfaced as small inconsistencies or subtle shifts in output: for instance, generating nearly identical responses for slightly different contractual clauses. Although minor at first, these patterns grew gradually and became difficult to ignore. They may signal early signs of context degradation, a process we discuss in further detail later in the article, in which the model appears to lose track of earlier details as the input sequence becomes more layered. These issues raise questions about how reliably the model can handle a diverse range of information.

In response, we began analysing when and how these shifts emerged, as well as what they suggest about the model’s ability to sustain reasoning under pressure. The discussion that follows presents our internal findings alongside insights from emerging research, which is beginning to uncover similar patterns.

Bigger Workloads Don’t Always Mean Better Results

A recent study by Apple [1] found that certain LLMs struggle with task prioritization. Instead of focusing on the most relevant parts of a given task, the models often treat every input equally, leading to inconsistent reasoning in more complex use cases. Taken together with our own industry experience, this highlights an important reality: we are still learning how LLMs behave across different scenarios and, in doing so, uncovering both their strengths and their limitations.

Our deployments of Agentic AI across industries such as healthcare, manufacturing, and legal services have all confirmed this trend. Performance often declines as complexity increases and the volume of data scales up.

Let us consider contracts as an example. Traditionally, businesses have relied on manual processes to review individual contracts and their numerous clauses, which is a time-consuming and often error-prone task. However, by deploying AI agents in multi-agent setups, where orchestrated LLMs take on distinct responsibilities, organisations can automate this rigorous review process.

We actively help businesses implement agentic AI solutions, ensuring a smooth transition from manual review to automated, intelligent contract analysis. When implemented, these agents assess each contract against a predefined playbook, ensuring accuracy and compliance.

Memory Refresh and Task Segmentation

During our implementation, we found that the AI agent handled the review of 4–6 clauses with ease. However, when tasked with reviewing a larger volume of around 15–20 clauses, its accuracy declined noticeably. This drop in performance was accompanied by a loss of context, particularly when the model needed to recognize subtle variations in clauses expressed with different wording styles.

To address this issue, we explored a strategy of memory refreshing: periodically resetting the model's context and reintroducing key information in a simplified format. This approach loosely parallels techniques such as retrieval-augmented generation (RAG) or sliding context windows, which are designed to manage memory and maintain relevance over extended tasks.

For example, in contract review, resetting after every 5 clauses and feeding the model concise summaries of the most recent ones proved effective. This light reset reduced the amount of context the model had to track at any one time and helped it sustain accuracy across longer, more repetitive documents, particularly in cases where small differences in wording carried significant meaning.
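To illustrate the idea, here is a minimal sketch of that reset-and-summarise loop. The prompts, the batch size, and the generic `call_llm` helper are our own illustrative assumptions standing in for whatever model client and playbook prompts a real deployment would use, not the exact implementation described above.

```python
from typing import Callable, List

def review_with_refresh(
    clauses: List[str],
    call_llm: Callable[[str], str],  # stand-in for your model client
    batch_size: int = 5,
) -> List[str]:
    """Review clauses in small batches, resetting context between batches and
    carrying forward only a concise summary of what was just reviewed."""
    results: List[str] = []
    carry_over = ""  # short summary reintroduced after each reset

    for start in range(0, len(clauses), batch_size):
        batch = clauses[start:start + batch_size]

        # Fresh context each time: instructions + carry-over summary + current batch only.
        prompt = (
            "Review each clause below against the playbook.\n"
            f"Summary of previously reviewed clauses: {carry_over or 'none'}\n\n"
            + "\n\n".join(f"Clause {start + i + 1}: {c}" for i, c in enumerate(batch))
        )
        results.append(call_llm(prompt))

        # Build the simplified summary that survives the reset.
        carry_over = call_llm(
            "Summarise the key points of these clauses in a few sentences:\n\n"
            + "\n\n".join(batch)
        )

    return results
```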

Beyond accuracy, we also observed an even subtler limitation: the model’s ability to retain and differentiate information across extended inputs. This became especially evident when reviewing clauses with overlapping language, where the AI sometimes blurred distinctions between similar but functionally different provisions.

Over time, we noticed that the model often lost clarity about what it had already processed. It sometimes blended earlier inputs with later ones or produced repetitive outputs. This became particularly evident when outputs needed to capture small but important distinctions, for example, two clauses serving different legal functions but written in similar language.

This pattern reflects what James Howard [2] describes as “Context Degradation Syndrome.” In his article, he explains how models can gradually lose track of the original “plot” as the input sequence extends, leading to a decline in both relevance and precision.

Building A Multi-Agent Workflow

As with previous cases, light memory refresh techniques, such as reintroducing summaries at regular intervals, can help the model sustain performance during longer tasks, a point Howard's article also notes. However, in situations with very high information density, we found that more structured approaches were often more effective.

One approach we explored was breaking down contracts by assigning specific ranges of clauses to individual agents, rather than giving a single agent responsibility for the entire document. For instance, clauses might be divided into batches of 1 to 4, 5 to 10, and so on, with each agent focusing only on its assigned section. This method reduced the context burden and helped preserve both clarity and consistency while still aligning with the playbook.  
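A simplified sketch of that partitioning step might look like the following. The fixed batch size, the prompts, and the generic `call_llm` helper are illustrative assumptions rather than the exact configuration we used; in practice, each range would be handed to its own agent instance with its own isolated context.

```python
from typing import Callable, Dict, List, Tuple

def split_into_ranges(num_clauses: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return inclusive 1-based clause ranges, e.g. [(1, 4), (5, 8), ...]."""
    return [
        (start, min(start + batch_size - 1, num_clauses))
        for start in range(1, num_clauses + 1, batch_size)
    ]

def review_by_section(
    clauses: List[str],
    playbook: str,
    call_llm: Callable[[str], str],  # stand-in for the model client
    batch_size: int = 4,
) -> Dict[Tuple[int, int], str]:
    """Each 'agent' (here, an isolated prompt) sees only its assigned clause range,
    so no single context has to hold the whole contract."""
    findings: Dict[Tuple[int, int], str] = {}
    for lo, hi in split_into_ranges(len(clauses), batch_size):
        section = "\n\n".join(
            f"Clause {i}: {clauses[i - 1]}" for i in range(lo, hi + 1)
        )
        prompt = (
            f"You are responsible only for clauses {lo} to {hi}.\n"
            f"Playbook:\n{playbook}\n\n"
            f"Clauses:\n{section}\n\n"
            "Flag any deviations from the playbook."
        )
        findings[(lo, hi)] = call_llm(prompt)
    return findings
```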

Interestingly, this approach reflects the research of Zhang et al. [3], who proposed the chain-of-agents framework. Their method assigns each agent a specific function, such as planning, verifying, or summarising, and the outputs are passed between agents in a structured workflow. By isolating responsibilities and breaking down large tasks, the system was better equipped to manage complexity.
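To make the hand-off structure concrete, here is a minimal sketch in the spirit of that framework: worker agents each process one chunk and pass a running summary along the chain, and a manager agent produces the final output. The prompts and the generic `call_llm` helper are our own illustrative assumptions, not the implementation from the paper.

```python
from typing import Callable, List

def chain_of_agents(
    chunks: List[str],
    question: str,
    call_llm: Callable[[str], str],  # stand-in for the model client
) -> str:
    """Worker agents each read one chunk and pass a running summary forward;
    a final manager agent turns that summary into the answer."""
    running_summary = ""
    for chunk in chunks:
        # Worker step: update the running summary using this chunk only.
        running_summary = call_llm(
            "You are a worker agent. Update the running summary with anything "
            "in this chunk that is relevant to the question.\n"
            f"Question: {question}\n"
            f"Running summary so far: {running_summary or 'empty'}\n"
            f"Chunk:\n{chunk}"
        )
    # Manager step: answer using only the accumulated summary.
    return call_llm(
        "You are the manager agent. Using only the summary below, answer the question.\n"
        f"Question: {question}\n"
        f"Summary:\n{running_summary}"
    )
```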

A case study by Evgeni Rusev [4] also demonstrated the value of this approach. In his example, a multi-agent setup streamlined complex document workflows by using an orchestrator agent to delegate tasks. For instance, one agent handled employment-related clauses while others managed compliance checks.
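The delegation pattern itself is straightforward to sketch: an orchestrator routes each clause to a specialist, and the specialists work independently. The clause categories, prompts, and generic `call_llm` helper below are hypothetical stand-ins, not details of Rusev's implementation.

```python
from typing import Callable, Dict, List

# Hypothetical specialist roles for illustration; a real playbook would define these.
SPECIALISTS: Dict[str, str] = {
    "employment": "You review employment-related clauses against the playbook.",
    "compliance": "You review regulatory and compliance clauses against the playbook.",
    "other": "You review general clauses against the playbook.",
}

def orchestrate_review(
    clauses: List[str],
    call_llm: Callable[[str], str],  # stand-in for the model client
) -> Dict[str, List[str]]:
    """An orchestrator classifies each clause, then delegates it to the matching
    specialist agent; findings are grouped by specialist."""
    findings: Dict[str, List[str]] = {name: [] for name in SPECIALISTS}
    for clause in clauses:
        # Orchestrator step: decide which specialist should own this clause.
        label = call_llm(
            "Classify this clause as one of: employment, compliance, other. "
            f"Answer with a single word.\n\nClause:\n{clause}"
        ).strip().lower()
        if label not in SPECIALISTS:
            label = "other"

        # Specialist step: the delegated agent reviews the clause in isolation.
        findings[label].append(call_llm(f"{SPECIALISTS[label]}\n\nClause:\n{clause}"))
    return findings
```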

This type of workflow has the potential to reduce document review time by up to 80% without sacrificing accuracy. It underscores the value of distributing responsibility across specialised agents, a direction that may prove critical as we rethink how to build scalable, reliable AI systems.

Powering LLMs Through Structured Environments

As LLMs continue to evolve and integrate more deeply into workflows, we’ve found that performance improvements may depend not only on scaling but also on the environments we build around them. Based on our observations, even subtle design shifts may offer more stability when navigating complex, high-stakes scenarios. Across the industry, we’re seeing momentum build toward more capable, smarter models.

Rather than solving complexity through scale alone, the future may lie in how precisely we shape these systems to reason more clearly and how closely we examine the finer details of how they operate.

References

1. Apple Machine Learning Research (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. https://machinelearning.apple.com/research/illusion-of-thinking

2. Howard, J. (2024). Context Degradation Syndrome: When Large Language Models Lose the Plot. https://jameshoward.us/2024/11/26/context-degradation-syndrome-when-large-language-models-lose-the-plot

3. Zhang, Y., et al. (2024). Chain of Agents: Large Language Models Collaborating on Long-Context Tasks. https://arxiv.org/abs/2406.02818

4. Rusev, E. Multi-Agent AI with Hybrid Search: Cutting Document Review Time by 80%. https://medium.com/%40evgeni.n.rusev/multi-agent-ai-with-hybrid-search-cutting-document-review-time-by-80-f7367a9b1361

5. Business Insider (2025). https://www.businessinsider.com/big-law-top-10-firms-ai-overhaul-use-cases-2025-7
