A research-backed look at how the AI systems in widest use today actually compare - and what the evidence says about the path forward.

Gartner surveyed IT application leaders in mid-2025 and found that only 15% were considering, piloting, or deploying fully autonomous AI agents [1]. A separate Gartner forecast predicts that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls [2].

That gap - between the marketing reality and the deployment reality - starts with a terminology problem. The word "agent" has been attached to almost everything. A chatbot that remembers your name is now an agent. A chatbot with a search tool is now an agentic AI system. A workflow that routes a form submission through three automated steps has been rebranded as an autonomous agent.

When everything is called an agent, nothing is. Buyers cannot tell which systems can actually run a workflow end-to-end without someone watching, and which ones will confidently produce the wrong answer and wait for the human to catch it.

This post builds a working taxonomy of the AI systems in widespread use today, grounded in the definitions the research community has spent thirty years developing. Then it examines why even the most sophisticated autonomous agents in production fall short - not in theory, but in measured benchmark data - and what the evidence says about the right architecture path forward.

What Actually Makes Something an Agent?

The academic definition of an agent predates the LLM era by several decades. In their foundational 1995 paper, Michael Wooldridge and Nick Jennings defined an intelligent agent as a system exhibiting four properties: autonomy (operating without direct human intervention), social ability (interacting with other agents and systems), reactivity (perceiving and responding to environmental changes), and proactiveness (exhibiting goal-directed behavior rather than waiting for instructions) [3].

Russell and Norvig's canonical textbook Artificial Intelligence: A Modern Approach formalized this as the perceive-decide-act loop: an agent senses its environment, makes decisions, and executes actions that change that environment [4]. The PEAS framework - Performance measure, Environment, Actuators, Sensors - provides a structured way to characterize any agent system.
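The perceive-decide-act loop is easier to see in code than in prose. Below is a minimal sketch using a hypothetical thermostat as the agent; the class and function names are illustrative assumptions and do not come from the textbook. In PEAS terms, the performance measure is closeness to the target temperature, the environment is the room, the actuator is `act`, and the sensor is `perceive`.

```python
from dataclasses import dataclass


@dataclass
class Room:
    """Toy environment: a room whose temperature the agent can change."""
    temperature: float


def perceive(room: Room) -> float:
    # Sensor: read the current temperature.
    return room.temperature


def decide(temperature: float, target: float = 21.0) -> str:
    # Decision rule: pick an action from the current percept alone.
    if temperature < target - 0.5:
        return "heat"
    if temperature > target + 0.5:
        return "cool"
    return "idle"


def act(room: Room, action: str) -> None:
    # Actuator: the chosen action changes the environment.
    if action == "heat":
        room.temperature += 1.0
    elif action == "cool":
        room.temperature -= 1.0


def run_agent(room: Room, steps: int) -> list[str]:
    """Run the loop for a fixed number of steps and return the actions taken."""
    actions = []
    for _ in range(steps):
        action = decide(perceive(room))
        act(room, action)
        actions.append(action)
    return actions


room = Room(temperature=18.0)
history = run_agent(room, steps=5)
```

No human approves any individual step here: the loop senses, decides, and acts on its own, which is the property the definitions above treat as essential.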

Under these definitions, autonomy is not a feature - it is the defining characteristic. For the purposes of this taxonomy, a system that requires human intervention at every decision point is not an agent. It is a tool.

The problem is that LLMs have made every tier of the spectrum feel more capable than it structurally is. A session assistant that generates polished prose on command feels autonomous because the output looks intentional. But if the conversation ends and the slate is wiped clean - and if the system cannot take action in the world without a human copying the output somewhere - it has not met the Wooldridge-Jennings bar.

A Working Taxonomy

The following five tiers describe the AI systems in widest use today. Each tier adds a meaningful architectural capability that the one below it lacks.

These tiers are not a simple ladder of business value or maturity. They describe distinct architectural modes - particularly around persistence, autonomy, and how decisions are constrained. A deterministic Tier 4 workflow may deliver more business value than a Tier 5 autonomous agent for many tasks. The distinction here is structural, not a ranking.

Tier 1: Session Assistants

Examples: ChatGPT (web), Claude.ai, Gemini, Gemini Gems

Session assistants are the most common AI systems deployed today. They perceive text - and increasingly images, audio, and documents - generate responses, and in some cases can call a limited set of tools within a conversation window.

What they are not is persistent or autonomous. When the session ends, they retain no memory of what happened. They cannot act on the world between conversations. They cannot schedule work, monitor conditions, or execute multi-step tasks without a human actively managing every exchange.

Gemini Gems and similar "customized GPT" products add a persona layer and system prompt but do not change the fundamental architecture. A Gem is a better-configured session assistant. It is not an agent.

The structural constraint: No persistence across sessions. No action without human initiation. Every task requires a human to open a window, write a prompt, receive a response, and decide what to do with it.

Under the Wooldridge-Jennings definition, session assistants demonstrate reactivity but limited proactiveness, and no meaningful autonomy between interactions.

Tier 2: Copilots

Examples: Claude Cowork (Anthropic), GitHub Copilot Chat, Microsoft 365 Copilot

Copilots advance the model in one important way: they are embedded in the user's actual operating environment. Claude Cowork runs on the desktop with access to files, applications, and screen state. GitHub Copilot lives inside the IDE. Microsoft 365 Copilot is embedded in Word, Excel, and Outlook.

This integration means the copilot can take contextual actions, not just generate text. But the interaction model remains reactive and human-driven. The copilot waits to be invoked, executes what is asked, and reports back. The human decides what happens next and what gets done with the result.

Copilots reduce the friction of human work rather than removing the need for a human to direct it.

The structural constraint: Human-initiated and session-scoped. The human provides direction; the copilot executes within the session.

Tier 3: Task Agents

Examples: Claude Code, Devin (Cognition AI), OpenAI Operator

Task agents represent a genuine step toward autonomy. They receive an objective - not just a prompt - and pursue it across multiple steps, using tools and making intermediate decisions without the human approving each one.

Claude Code is the clearest current example: given a codebase and a goal, it reads files, writes code, runs tests, debugs failures, and iterates - often completing multi-step engineering work end-to-end without the human involved in intermediate steps.

The constraint is scope and memory. Task agents operate within a bounded domain for the duration of a bounded task. When the task ends, the agent's context does not carry forward meaningfully. There is no accumulating knowledge about the business, the preferences of the person they work with, or the history of past decisions. Each task starts largely from scratch.

SWE-bench, a benchmark using real GitHub issues from major software repositories, measured task agent performance on realistic software engineering tasks. On the full test set without scaffolding, the top model at paper publication resolved 1.96% of issues. With sophisticated agent scaffolding, top systems reached approximately 20-43% on the benchmark's "lite" subset - a curated sample of more tractable tasks [5]. These numbers are not directly comparable: base model performance, scaffolded agent performance, and lite-subset performance measure different conditions. Taken together, they show the gap between impressive demos and consistent production performance is measurable and significant.

The structural constraint: Bounded domain, non-persistent memory, limited by task horizon.

Tier 4: Workflow Automation

Examples: n8n, Zapier, Make (formerly Integromat)

This tier is frequently mislabeled as agentic - and it is worth being precise about why. Workflow automation platforms route data between systems according to predefined logic. They can include AI nodes: an n8n workflow might send text to an LLM for classification and branch based on the output. But the structure is fundamentally deterministic. A human built the graph. The system follows it.

This is not a limitation to apologize for. Deterministic workflow automation is reliable, auditable, and predictable in ways that stochastic LLM agents currently are not. But it is not autonomous action in the Wooldridge-Jennings sense. The system cannot adapt to conditions it was not programmed to handle. It cannot make judgment calls outside its predefined paths.
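A small sketch makes the distinction concrete. It assumes a hypothetical ticket-routing workflow with one classification node; the keyword classifier stands in for an LLM call, and the route names are invented for illustration. Whatever label the classifier returns, the workflow can only follow a path a human already defined.

```python
from typing import Callable

# Fixed routing table built by a human; the workflow never leaves these paths.
ROUTES: dict[str, str] = {
    "billing": "finance_queue",
    "bug": "engineering_queue",
}
FALLBACK = "manual_review"


def keyword_classifier(text: str) -> str:
    # Stand-in for an LLM classification node (hypothetical logic).
    lowered = text.lower()
    if "invoice" in lowered or "charge" in lowered:
        return "billing"
    if "error" in lowered or "crash" in lowered:
        return "bug"
    return "other"


def route(text: str, classify: Callable[[str], str] = keyword_classifier) -> str:
    """Branch on the classifier's label using only the predefined routes."""
    return ROUTES.get(classify(text), FALLBACK)
```

A request the graph was not built for does not trigger new behavior; it falls through to `manual_review`, which is exactly the predictability this tier trades autonomy for.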

The structural constraint: Deterministic, not adaptive. Strong reliability; no genuine autonomy. Proactiveness and long-horizon planning are absent by design.

Tier 5: Autonomous Agents

Examples: OpenClaw and similar long-horizon autonomous agent frameworks

In this taxonomy, autonomous agents are persistent, long-running systems that maintain context across sessions, take actions in the world on an ongoing basis, and adapt their behavior based on accumulated operational experience. They do not wait to be invoked. They monitor conditions, prioritize work, and execute as a matter of ongoing operation - structurally closer to an employee than a tool.

This is the most capable tier in the taxonomy - and also, currently, the tier with the most significant production reliability challenges.

The structural constraint: The capabilities that make this tier powerful - LLM-driven decision-making, flexible multi-step action, persistent autonomous operation - also make it the hardest to keep reliable at scale.

The Production Reliability Gap

Here is the uncomfortable reality: even the best autonomous agents available today achieve task success rates that would be unacceptable in any other software system.

WebArena (Carnegie Mellon, 2023) evaluated GPT-4-based agents on realistic web tasks - booking, information retrieval, form completion, e-commerce workflows. The best agents achieved a 14.41% end-to-end task success rate. Human performance on the same tasks was 78.24% [6].

OSWorld (NeurIPS 2024) evaluated multimodal agents on real computer tasks - using GUI interfaces, navigating applications, completing file operations. The best model achieved 12.24%. Human performance was 72.36% [7].

AgentBench (Tsinghua University, ICLR 2024) found that even top commercial models fail on multi-step agent tasks primarily due to "poor long-term reasoning, decision-making, and instruction following abilities." Open-source models under 70B parameters scored near zero on demanding environments [8].

tau-bench (2024, ICLR 2025) measured something more revealing than single-trial success: it ran eight trials per task and measured whether agents produced consistent results. GPT-4o solved only 35.2% of tasks on the harder airline domain; the pass^8 metric - which asks whether an agent reliably gets the same task right across eight attempts - fell below 25% for all tested models on retail tasks [9].

That last metric matters. It is not just that agents fail. It is that they fail inconsistently - which means you cannot predict or design around the failure. An agent that reliably fails is at least auditable. An agent that randomly fails is unpredictable in production.
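To make the consistency metric concrete, here is a sketch of how a pass^k score could be computed from repeated trials. It assumes the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged across tasks; the trial counts are made up for illustration and are not benchmark data.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k independent trials of a task all succeed.

    For a task with c successes out of n trials, the per-task estimate is
    C(c, k) / C(n, k); the overall score is the mean over tasks.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: successes out of 8 trials for four tasks.
results = [8, 6, 4, 0]
pass_1 = pass_hat_k(results, n_trials=8, k=1)  # average single-trial success
pass_8 = pass_hat_k(results, n_trials=8, k=8)  # all eight trials must succeed
```

In this toy data the single-trial score is 0.5625 while pass^8 is 0.25: only the task solved on every attempt counts, so an agent that is right most of the time can still score poorly on consistency.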

The Error Compounding Problem

The mathematical structure of why this happens is straightforward. If step failures are independent, a pipeline with ten sequential steps, each completed correctly 90% of the time, has a roughly 35% end-to-end success rate (0.9^10 = 0.349). At 95% per-step accuracy - a generous assumption by current benchmark standards - a ten-step process succeeds end-to-end roughly 60% of the time.
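The arithmetic can be checked in a few lines. This is a minimal sketch that assumes each step succeeds or fails independently of the others; the function names are illustrative.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


def required_per_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target ** (1 / steps)


# End-to-end success for a ten-step pipeline at three per-step accuracies.
rates = {p: end_to_end_success(p, 10) for p in (0.90, 0.95, 0.99)}

# Per-step accuracy a ten-step pipeline needs to succeed 99% of the time.
needed = required_per_step_accuracy(0.99, 10)
```

Running the calculation in reverse is the sobering part: under the same independence assumption, a ten-step pipeline that must succeed 99% of the time needs each step to be right about 99.9% of the time.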

A 2026 paper formalizing this problem, "The Six Sigma Agent," models LLM pipelines using defect-per-step rates and shows that single-agent systems with a 5% per-step error rate achieve approximately 60% reliability on a ten-step pipeline - while consensus voting across five parallel agents reduces that error rate to 0.11% [10].
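The effect of voting can be approximated with a simple binomial model. The sketch below is an assumption-laden simplification, not the paper's own formulation: it treats each agent as independently wrong with the same probability and counts the vote as failed whenever a strict majority is wrong.

```python
from math import comb


def majority_vote_error(p_error: float, n_agents: int) -> float:
    """Probability that a strict majority of n independent agents is wrong."""
    majority = n_agents // 2 + 1
    return sum(
        comb(n_agents, wrong) * p_error**wrong * (1 - p_error) ** (n_agents - wrong)
        for wrong in range(majority, n_agents + 1)
    )


single = majority_vote_error(0.05, 1)  # one agent: the raw 5% error rate
five = majority_vote_error(0.05, 5)    # five agents voting on the same step
```

Under these assumptions the five-agent error comes out on the order of 0.1%, the same range as the figure cited above. The independence assumption carries the result: agents that share a model and a prompt tend to make correlated mistakes, which voting cannot remove.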

Toby Ord's 2025 analysis of agent "half-life" extends this mathematically: success rates decline exponentially as task length increases, with each agent having a characteristic task duration at which its success rate drops to 50% [11]. The longer the workflow, the lower the probability of end-to-end success - regardless of which model you use.
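The half-life framing can be written as a simple exponential decay. The sketch below assumes a constant failure rate per unit of task time, so that success probability halves with every additional half-life; the 60-minute half-life is a made-up value for illustration, not a figure from the paper.

```python
import math


def success_probability(task_length: float, half_life: float) -> float:
    """Success rate for a task of the given length under exponential decay."""
    return 0.5 ** (task_length / half_life)


def horizon_for_success(target: float, half_life: float) -> float:
    """Longest task length at which the success rate is still `target`."""
    return half_life * math.log(target) / math.log(0.5)


HALF_LIFE_MINUTES = 60.0  # hypothetical agent
at_one_half_life = success_probability(60.0, HALF_LIFE_MINUTES)
at_two_half_lives = success_probability(120.0, HALF_LIFE_MINUTES)
horizon_80 = horizon_for_success(0.80, HALF_LIFE_MINUTES)
```

The practical consequence is that a reliability target shrinks the usable task length sharply: in this model an agent with a 60-minute half-life reaches 80% success only on tasks of roughly 19 minutes.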

This is not a model quality problem. It is a structural problem with using probabilistic systems for deterministic multi-step execution. MetaGPT researchers named it precisely: "logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs" [12]. Each step in a sequence introduces new uncertainty; the output looks coherent but contains accumulated errors that are difficult to detect without reviewing intermediate states.

A 2026 paper on agentic confidence calibration formalized why this is so difficult to detect and correct: "an early low-confidence decision can 'poison' the entire subsequent execution path, leading an agent to hold high confidence in incorrect results" [13]. The model does not know it has gone wrong. Downstream steps proceed on a corrupted foundation.

Why This Is an Architecture Problem, Not a Model Problem

The intuitive response is to wait for better models. The research evidence suggests this is the wrong frame.

At ICML 2024, Subbarao Kambhampati and colleagues published "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks" [14]. The core argument: auto-regressive language models cannot independently perform reliable multi-step planning or self-verification. This is not a capability gap that scales away with more parameters - it is a property of the architecture. Stochastic next-token prediction is not a reliable foundation for deterministic sequential execution.

The evidence from Rabanser et al. (2026), surveying 14 models across two benchmarks, is direct: "recent capability gains have only yielded small improvements in reliability" [15]. Model capability and model reliability are not the same dimension. A smarter model is not necessarily a more reliable operator.

Kambhampati's proposed solution - LLM-Modulo Frameworks - combines LLMs with external symbolic verifiers that can check decision correctness independently of the model that produced the decision. This is neuro-symbolic architecture: the LLM contributes reasoning and language understanding; the symbolic system contributes verifiable logic and deterministic structure.
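The generate-and-verify pattern can be sketched as a loop in which a proposer suggests a plan and an independent checker accepts or rejects it. Everything below is a hypothetical illustration rather than the framework's actual interface: `stub_llm` stands in for a model call, and the verifier rules are invented for the example.

```python
from typing import Callable, Optional

Plan = list[str]


def verify_plan(plan: Plan) -> list[str]:
    """Symbolic verifier (hypothetical rules): return a list of violations."""
    problems = []
    if "verify_identity" not in plan:
        problems.append("missing verify_identity")
    elif "issue_refund" in plan and plan.index("issue_refund") < plan.index("verify_identity"):
        problems.append("issue_refund before verify_identity")
    return problems


def generate_and_test(
    propose: Callable[[list[str]], Plan],
    max_rounds: int = 3,
) -> Optional[Plan]:
    """Request plans until the verifier accepts one or the rounds run out."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        plan = propose(feedback)
        feedback = verify_plan(plan)
        if not feedback:
            return plan  # accepted by the verifier, not by the proposer itself
    return None  # caller should escalate rather than trust an unverified plan


def stub_llm(feedback: list[str]) -> Plan:
    # Stand-in for an LLM call: the first draft is flawed, the revision is not.
    if not feedback:
        return ["issue_refund", "verify_identity"]
    return ["verify_identity", "issue_refund"]


accepted = generate_and_test(stub_llm)
```

The design choice that matters is where acceptance happens: the proposer never certifies its own output, and a plan that cannot pass the checker within the round limit is returned as `None` instead of being executed.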

MetaGPT arrived at the same conclusion through a different path, encoding Standardized Operating Procedures (SOPs) into multi-agent workflows to constrain where LLMs operate [12]. When the structure of the work is fixed - the what, the sequence, the handoffs, the verification checkpoints - the LLM can focus on judgment at the decision points that genuinely require it, rather than also deciding how to sequence and verify its own work.

The ReAct framework (Yao et al., ICLR 2023) demonstrated empirically that interleaving reasoning and action through external tool calls - rather than relying on LLM chain-of-thought alone - overcomes "issues of hallucination and error propagation prevalent in chain-of-thought reasoning" on question answering and fact verification. On the interactive decision-making benchmarks ALFWorld and WebShop, it outperformed imitation and reinforcement learning baselines by 34% and 10% in absolute success rate, respectively [16].

Microsoft's AutoGen framework, published in 2023 by a multi-institutional team, supports the multi-agent approach: breaking complex tasks into specialized conversational agents with explicit role separation outperformed single generalist agents on the complex tasks the authors evaluated [17].

What Deterministic + Reasoning Actually Means in Practice

The path forward is not choosing between pure LLM autonomy and pure workflow automation. It is a principled combination of both.

Deterministic structure governs the workflow: what steps happen, in what order, with what handoffs between specialized agents, and what verification checkpoints must pass before the next step begins. This layer is not LLM-driven - it is architected. The LLM cannot go off-script here.

LLM reasoning operates within that structure at the points that genuinely require judgment: classifying ambiguous input, generating content, making decisions that cannot be reduced to rules. The language model adds value in the spaces the deterministic structure leaves open - not in deciding how to sequence the work, but in doing the parts of the work that require language and reasoning.

Multi-agent specialization allows each agent in the system to be narrow and reliable within its domain. Research on multi-agent specialization (Mieczkowski et al., 2025) finds that specialist teams outperform generalist ones when environmental constraints limit task parallelizability - when bottlenecks prevent interchangeable agents from simply running subtasks concurrently - a condition common in business workflows built around shared resources and sequential handoffs [18].

Under this architecture, reliability stops being a property of the underlying model and becomes a property of the system: the guardrails, the verification checkpoints, the handoff protocols, and the escalation paths that catch errors before they compound into downstream failures.

What this looks like in practice:

A customer support escalation workflow illustrates the architecture clearly:

Deterministic layer: classify request type, verify account tier, check refund policy eligibility, route by issue category, require human approval above refund threshold.

LLM layer: summarize conversation context, draft customer response, identify ambiguity that requires clarification before proceeding.

Guardrail layer: no refund execution without policy match, no outbound message without approval above threshold, no account changes without identity verification.

In this structure, the LLM never decides whether a refund is approved - that is a deterministic rule. The LLM decides how to communicate the outcome in natural language - that is judgment. The architecture separates what can be automated from what requires language understanding, and puts structural controls around both.
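The separation described above can be sketched in a few functions. This is an illustrative outline under stated assumptions, not a production design: the approval threshold, outcome names, and canned replies are all hypothetical, and `draft_reply` stands in for a model call that would word the message.

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD = 100.00  # hypothetical amount above which a human must approve


@dataclass
class RefundRequest:
    amount: float
    within_policy: bool       # result of a deterministic policy lookup
    identity_verified: bool


def decide_refund(req: RefundRequest) -> str:
    """Deterministic layer plus guardrails: rules, not the LLM, set the outcome."""
    if not req.identity_verified:
        return "blocked_identity"      # guardrail: no account action without identity
    if not req.within_policy:
        return "denied"                # guardrail: no refund without a policy match
    if req.amount > APPROVAL_THRESHOLD:
        return "needs_human_approval"  # escalation above the threshold
    return "approved"


def draft_reply(outcome: str) -> str:
    """LLM layer stand-in: a real system would ask a model to word the message."""
    templates = {
        "approved": "Your refund has been approved.",
        "denied": "This purchase is not eligible for a refund under our policy.",
        "needs_human_approval": "Your request has been passed to a specialist for review.",
        "blocked_identity": "Please verify your identity so we can continue.",
    }
    return templates[outcome]


def handle(req: RefundRequest) -> tuple[str, str]:
    """Decide first, then communicate: the wording can never change the outcome."""
    outcome = decide_refund(req)
    return outcome, draft_reply(outcome)
```

Because `draft_reply` receives only the already-decided outcome, swapping in a different model changes how the customer is addressed but cannot change whether the refund is granted.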

NeMo Guardrails (NVIDIA, EMNLP 2023) established the practical tooling for this approach: user-defined, model-independent "rails" that enforce topical constraints, dialogue paths, and safety restrictions at inference time rather than training time [19]. The key insight is that rails are architecturally separate from the model - they can be changed without retraining, and their enforcement does not depend on the model's own judgment.

This is the gap between where most autonomous agents are today and where they need to be for production deployment. Gartner's prediction that over 40% of agentic AI projects will be canceled by the end of 2027 is not a statement about the technology's ceiling - it is a statement about what happens when organizations deploy probabilistic autonomy into workflows that require deterministic reliability, and then discover the gap.

The organizations that understand this distinction and build toward deterministic + reasoning architectures will deploy AI operators that actually close work without a human watching every step. The organizations that treat model capability as the only variable will keep running into the same reliability ceiling - the 14%, the 12%, the sub-25% consistency scores - regardless of which model they choose next.

What to ask when evaluating an AI agent:

When a vendor claims their product is an autonomous agent, the architecture questions that matter are:

Persistence: Does it maintain memory and context between sessions, or does each interaction start from scratch?
Bounded vs. ongoing autonomy: Does it complete discrete tasks on request, or does it operate continuously across an ongoing workload?
Verification checkpoints: Where does the system check its own outputs before proceeding to the next step?
Escalation paths: What happens when the system encounters a decision outside its design envelope? Does it fail silently, or does it surface the decision to a human?
Auditability: Can you trace why the system made a specific decision at a specific point?
Failure containment: When something goes wrong, how far does the error propagate before it is caught?

A system with honest answers to these questions is either genuinely capable or genuinely limited in a way you can design around. A system that deflects these questions is likely applying the word "agent" to something that does not carry the weight.

GetLatest AI builds AI operators designed around deterministic + reasoning architecture - systems that run business workflows end-to-end without requiring a human in the loop at every step. To see how this works in practice, book a call with our team.

References

[1] Gartner. (2025, September 30). Gartner Survey Finds Just 15 Percent of IT Application Leaders Are Considering, Piloting, or Deploying Fully Autonomous AI Agents. Gartner Press Release. https://www.gartner.com/en/newsroom/press-releases/2025-09-30-gartner-survey-finds-just-15-percent-of-it-application-leaders-are-considering-piloting-or-deploying-fully-autonomous-ai-agents

[2] Gartner. (2025, June 25). Gartner Predicts Over 40 Percent of Agentic AI Projects Will Be Canceled by End of 2027. Gartner Press Release. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027

[3] Wooldridge, M., & Jennings, N.R. (1995). Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2), 115-152.

[4] Russell, S.J., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.

[5] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can language models resolve real-world GitHub issues? arXiv:2310.06770. In Proceedings of ICLR 2024.

[6] Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2023). WebArena: A realistic web environment for building autonomous agents. arXiv:2307.13854. In Proceedings of ICLR 2024.

[7] Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., & Yu, T. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv:2404.07972. In Proceedings of NeurIPS 2024.

[8] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., ... Tang, J. (2023). AgentBench: Evaluating LLMs as agents. arXiv:2308.03688. In Proceedings of ICLR 2024.

[9] Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv:2406.12045. In Proceedings of ICLR 2025.

[10] Patel, K., Surendira, S., George, J., & Kapale, S. (2026). The Six Sigma agent: Achieving enterprise-grade reliability in LLM systems through consensus-driven decomposed execution. arXiv:2601.22290.

[11] Ord, T. (2025). Is there a half-life for the success rates of AI agents? arXiv:2505.05115.

[12] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S.K.S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., & Schmidhuber, J. (2023). MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv:2308.00352.

[13] Agentic Confidence Calibration. (2026). arXiv:2601.15778.

[14] Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., & Murthy, A. (2024). LLMs can't plan, but can help planning in LLM-Modulo frameworks. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna.

[15] Rabanser, S., Kapoor, S., Kirgis, P., Liu, K., Utpala, S., & Narayanan, A. (2026). Towards a science of AI agent reliability. arXiv:2602.16666.

[16] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629. In Proceedings of ICLR 2023.

[17] Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A.H., White, R.W., Burger, D., & Wang, C. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155. Microsoft Research.

[18] Mieczkowski, E., Mon-Williams, R., Bramley, N., Lucas, C.G., Velez, N., & Griffiths, T.L. (2025). Predicting multi-agent specialization via task parallelizability. arXiv:2503.15703.

[19] Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C., & Cohen, J. (2023). NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. arXiv:2310.10501. In Proceedings of EMNLP 2023 - Demo Track.

Why Most AI Systems Aren't Actually Agents: A Taxonomy of What Qualifies - and the Architecture Gap That Explains the Reliability Problem