Top 5 Unsaturated Evals to Run Before GPT-5 Arrives
GPT-5 is coming. Are your benchmarks ready? Discover the top 5 unsaturated evals like AgentBench and SWE-bench that truly test the limits of AI reasoning and planning.
Dr. Alex Carter
AI researcher and ML engineer focused on model evaluation and next-generation capabilities.
The AI world is buzzing with anticipation for GPT-5. While we wait for what promises to be another monumental leap in AI capability, a familiar problem is resurfacing for developers and researchers: benchmark saturation. Leaderboards for staples like MMLU are crowded with frontier models clustered around 90%, making it difficult to measure true, groundbreaking progress.
To truly understand the next generation of AI, we need to move beyond the solved problems and focus on the frontiers where current models still struggle. These "unsaturated" evaluations test the complex reasoning, planning, and multi-modal skills that separate mere pattern-matching from genuine intelligence. Running them now will give you a crucial baseline to appreciate just how far GPT-5 has come when it finally arrives.
1. AgentBench: Testing True Digital Assistants
The dream of AI has always been about more than just conversation; it’s about creating capable agents that can perform tasks on our behalf. This is where most current models fall short, and where AgentBench shines a harsh but necessary light.
What is AgentBench?
AgentBench is a comprehensive benchmark designed to evaluate LLMs as agents across a diverse set of eight environments. It moves beyond simple Q&A to test a model's ability to operate software, interact with websites, and even play games. For example, a task might involve using a mock web browser to find specific information and book a flight, a process requiring multi-step planning, tool use, and error correction.
Why It's Unsaturated
GPT-4 and its contemporaries are notoriously bad at long-term planning and self-correction without heavy scaffolding. They can get stuck in loops, misunderstand tool outputs, or simply "hallucinate" a successful outcome. AgentBench’s multi-turn, interactive nature exposes these weaknesses brutally. A model can't just generate a plausible-sounding answer; it has to execute. Current top models often struggle to achieve success rates above 30-40% on many of its tasks, leaving massive room for improvement.
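To make the interactive setup concrete, here is a minimal sketch of the kind of agent-environment loop such evals score: the model proposes an action, the environment returns an observation, and success only counts if the task actually completes within a turn budget. The `Environment` class and `call_model` stub below are illustrative stand-ins, not AgentBench's actual API.

```python
# A minimal sketch of a multi-turn agent loop in the style of agent benchmarks.
# Environment and call_model are hypothetical stand-ins, not AgentBench's API.
from dataclasses import dataclass, field

@dataclass
class Environment:
    """Toy task: the agent must issue the command 'submit 42' within max_turns."""
    max_turns: int = 5
    history: list = field(default_factory=list)

    def step(self, action: str) -> tuple[str, bool]:
        self.history.append(action)
        if action.strip() == "submit 42":
            return "Task solved.", True
        return f"Unrecognized action '{action}'. Try again.", False

def call_model(observation: str, turn: int) -> str:
    # Stand-in for an LLM call; a real harness would send the full transcript.
    return "look around" if turn == 0 else "submit 42"

def run_episode(env: Environment) -> bool:
    observation = "You see a terminal. Goal: submit the answer 42."
    for turn in range(env.max_turns):
        action = call_model(observation, turn)
        observation, done = env.step(action)
        if done:
            return True   # success only if the task actually completes
    return False          # ran out of turns: counted as a failure

if __name__ == "__main__":
    print("success:", run_episode(Environment()))
```

In a real harness, the full transcript of prior observations and actions is fed back to the model on every turn, which is exactly where long-horizon coherence tends to break down.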
2. SWE-bench: Beyond FizzBuzz and Into Real-World Code
LLMs are great at generating isolated functions or solving simple LeetCode problems. But can they function as a junior software engineer? SWE-bench (Software Engineering Benchmark) was created to answer that question, and the answer, for now, is a resounding "not yet."
What is SWE-bench?
SWE-bench tasks models with solving real-world GitHub issues from popular open-source projects like Django and Matplotlib. This isn't about writing a function from scratch. It's about navigating a large, unfamiliar codebase, understanding the context of a bug report, identifying the relevant files, writing a patch, and ensuring it passes the existing test suite. It’s a holistic test of software engineering capability.
Why It's Unsaturated
The core challenge is context. A model needs to understand the intricate dependencies and architecture of a project with hundreds of files. Even models with million-token context windows struggle to effectively utilize that information to pinpoint a specific bug. The best autonomous agents, including those based on GPT-4, solve less than 15% of the issues in the benchmark, demonstrating just how far we are from AI-powered autonomous software engineers.
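For intuition, here is a rough sketch of how a SWE-bench-style harness might score a single instance: check out the pinned buggy commit, apply the model's patch, and run the tests that are required to pass. The repository path, commit hash, and test IDs are illustrative placeholders, not real benchmark data.

```python
# A rough sketch of scoring one SWE-bench-style instance: apply the model's
# patch to the pinned repo commit, then run the tests that must pass.
# Paths, commit, and test ids below are illustrative placeholders.
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str,
                   fail_to_pass_tests: list[str]) -> bool:
    def run(*cmd: str) -> subprocess.CompletedProcess:
        return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)

    run("git", "checkout", "-f", base_commit)      # pin the buggy revision
    applied = run("git", "apply", patch_file)      # apply the model's diff
    if applied.returncode != 0:
        return False                               # malformed patch: automatic fail

    # The instance is resolved only if every previously failing test now passes.
    result = run("python", "-m", "pytest", "-q", *fail_to_pass_tests)
    return result.returncode == 0

# Hypothetical usage:
# evaluate_patch("/tmp/django", "abc1234", "model_patch.diff",
#                ["tests/forms_tests/test_widgets.py::TestSelect::test_render"])
```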
Evaluation Landscape at a Glance
| Evaluation | Primary Skill Tested | Modality | Why It's Hard |
|---|---|---|---|
| AgentBench | Planning & Tool Use | Text, Simulated Env. | Requires multi-step execution, error correction, and long-term goal coherence. |
| SWE-bench | Complex Code Gen & Debugging | Code, Text | Involves understanding large, existing codebases and context-aware problem-solving. |
| MATH | Abstract Reasoning | Text, LaTeX | Demands novel, multi-step logical deduction and symbolic manipulation. |
| MM-Vet | Compositional Visual Reasoning | Image, Text | Tests the ability to combine multiple visual concepts and spatial relationships. |
| CausalQA | Causal Inference | Text | Forces the model to distinguish cause-and-effect from simple correlation. |
3. The MATH Dataset: The Ultimate Test of Pure Reasoning
If you want to measure raw intelligence and problem-solving ability, look no further than mathematics. The MATH dataset, compiled by Hendrycks et al., serves as one of the most challenging benchmarks for pure reasoning.
What is the MATH Dataset?
This isn't your grade-school arithmetic. The MATH dataset consists of 12,500 problems from high school math competitions like the AMC 10/12 and AIME. These problems cover subjects like algebra, geometry, number theory, and precalculus, and are designed to require creative, multi-step reasoning. Each problem ships with a full step-by-step solution; models are typically prompted to reason through the solution and are then graded on the final answer they produce, usually the expression inside a `\boxed{...}` at the end.
Why It's Unsaturated
Despite significant progress, even the most advanced models like GPT-4 struggle mightily with the MATH dataset, with top scores hovering around 50-60% with complex prompting techniques. The problems require a level of symbolic manipulation, logical deduction, and novel problem-solving strategies that LLMs, trained on statistical patterns, find incredibly difficult. A significant jump on this benchmark by GPT-5 would be a clear signal of a fundamental improvement in its reasoning core.
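Grading on MATH typically comes down to checking the model's final boxed answer against the reference. The sketch below shows that idea in miniature; real harnesses use far more careful LaTeX normalization and answer-equivalence checks than this.

```python
# A minimal sketch of final-answer scoring for MATH-style problems: pull the
# \boxed{...} expression from the model's solution and compare it, after light
# normalization, to the reference answer.
import re

def extract_boxed(solution: str) -> str | None:
    # Grab the contents of the last \boxed{...}. This naive regex does not
    # handle nested braces (e.g. \boxed{\frac{3}{4}}); real graders parse braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def normalize(ans: str) -> str:
    # Light normalization only; real harnesses canonicalize LaTeX far more aggressively.
    return ans.replace(" ", "").replace("\\left", "").replace("\\right", "")

def is_correct(model_solution: str, reference_answer: str) -> bool:
    predicted = extract_boxed(model_solution)
    return predicted is not None and normalize(predicted) == normalize(reference_answer)

print(is_correct(r"The primes are 2, 3 and 5, so the answer is \boxed{10}.", "10"))  # True
```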
4. MM-Vet: Can LLMs *Really* See and Reason?
Multi-modal models like GPT-4V can describe images with impressive accuracy. But describing is not the same as understanding. MM-Vet (Multi-modal Vet) is a benchmark designed to probe the deeper compositional reasoning capabilities of these vision-language models (VLMs).
What is MM-Vet?
MM-Vet evaluates VLMs on their ability to handle questions that require integrating multiple visual and textual concepts. Instead of asking, "What's in this image?", it asks things like, "What is the color of the shirt worn by the person standing to the left of the small dog?". This tests a model's grasp of spatial relations, attribute binding, and compositional understanding—the building blocks of true visual reasoning.
Why It's Unsaturated
Current VLMs often fail on these compositional tasks. They might correctly identify a person and a dog but fail to link their spatial relationship to the question. They struggle to parse complex instructions and apply them to the visual scene. The benchmark uses GPT-4 itself to judge the correctness of responses, and even top open-source models score poorly, highlighting a major gap between perception and cognition in today's VLMs.
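Because the answers are open-ended, scoring has to be done by a judge model rather than string matching. The sketch below shows that pattern in spirit; the prompt template and the `judge()` stub are illustrative, not MM-Vet's actual grading prompt.

```python
# A hedged sketch of LLM-as-judge scoring in the spirit of MM-Vet: the judge
# sees the question, the reference answer, and the VLM's answer, and returns a
# correctness score between 0 and 1. Template and judge() are illustrative.
JUDGE_TEMPLATE = """Compare the prediction to the ground-truth answer.
Question: {question}
Ground truth: {ground_truth}
Prediction: {prediction}
Reply with a single correctness score between 0.0 and 1.0."""

def judge(prompt: str) -> str:
    # Stand-in for a call to a strong judge model (e.g. via an API client).
    return "1.0"

def score_response(question: str, ground_truth: str, prediction: str) -> float:
    prompt = JUDGE_TEMPLATE.format(question=question,
                                   ground_truth=ground_truth,
                                   prediction=prediction)
    return float(judge(prompt))

print(score_response("What color is the shirt left of the dog?",
                     "blue", "The shirt is blue."))
```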
5. CausalQA: Moving From Correlation to Causation
This is perhaps the most intellectually challenging and fundamental evaluation on the list. LLMs are masters of correlation—they know that roosters crowing and sunrises happen together. But do they understand that one doesn't cause the other? CausalQA is designed to find out.
What is CausalQA?
CausalQA poses questions that require genuine causal inference. The questions are structured to have a confounding factor, where a simple correlational model would get the answer wrong. For example, given a text about a study showing that people who drink coffee live longer, a question might be, "Does drinking coffee *cause* a longer life?". A good model must infer from the text that this might be a correlation, as coffee drinkers might also have healthier lifestyles (the confounder).
Why It's Unsaturated
This is a known, deep weakness of current LLM architectures. They are trained to predict the next token based on co-occurrence in their training data, which is the very definition of learning correlation, not causation. Models often fall into the trap of stating the correlation as a causal link. Progress on CausalQA would indicate a monumental shift in a model's world understanding, moving it closer to a human-like ability to reason about cause and effect.
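To see what such a probe looks like in practice, here is a toy example of the correlation-vs-causation trap: the passage reports an association plus an obvious confounder, and a model that reads correlation as causation fails. The passage, gold label, and `ask_model()` stub are invented for illustration, not taken from the CausalQA data.

```python
# A toy causal-inference probe: the passage reports a correlation with a
# plausible confounder, so a correlational reasoner answers "yes" and fails.
# The passage, gold label, and ask_model() stub are illustrative only.
PASSAGE = ("A study found that people who drink coffee daily live longer on "
           "average. Coffee drinkers in the study also exercised more and "
           "smoked less than non-drinkers.")
QUESTION = ("Based on the passage alone, does drinking coffee cause longer "
            "life? Answer yes, no, or uncertain.")
GOLD = "uncertain"  # the confounders block a causal conclusion

def ask_model(passage: str, question: str) -> str:
    # Stand-in for an LLM call; a purely correlational reasoner would say "yes".
    return "uncertain"

def grade(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold

print(grade(ask_model(PASSAGE, QUESTION), GOLD))
```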
Key Takeaways: Preparing for the Next Frontier
As we stand on the cusp of GPT-5, the conversation around evaluation needs to mature. Focusing on these five unsaturated benchmarks offers a more insightful path forward:
- Move Beyond Rote Memorization: These evals test skills—planning, deep reasoning, and causal inference—not just stored knowledge.
- Set a Meaningful Baseline: Running these tests on today's best models (like GPT-4, Claude 3, and Gemini 1.5) will provide a concrete, quantitative measure of how significant the GPT-5 leap truly is.
- Focus on What Matters: Ultimately, we want AI that can solve real problems, write real code, and understand the world on a deeper level. These benchmarks are proxies for the very capabilities that will unlock the next generation of AI applications.
The next major AI model isn't just a point on a leaderboard; it's a new tool with a new set of capabilities and limitations. By stress-testing the current generation on the problems they can't solve, you'll be perfectly positioned to understand, leverage, and build upon the breakthroughs of tomorrow.