AI & Machine Learning

My 3-Step Plan for Unsaturated Evals Before GPT-5 (2025)

With GPT-5 on the horizon, our old AI benchmarks are failing. Discover a 3-step plan to create unsaturated, future-proof evaluations for next-gen models.

Dr. Alistair Finch

AI safety researcher and ML evaluation specialist focused on future-proofing model benchmarks.

6 min read

The entire AI community is holding its collective breath for GPT-5. But as the hype cycle spins up, a more fundamental question looms: how will we even know if it represents a true paradigm shift? Our current yardsticks are cracking under the weight of today's models, and they're about to become completely obsolete.

The End of an Era: Why Our Current Evals Are Failing

For years, benchmarks like MMLU (Massive Multitask Language Understanding) have been our North Star. They gave us a quantifiable way to track progress, crowning new state-of-the-art (SOTA) models on public leaderboards. But we're rapidly approaching the point of benchmark saturation.

When top models like GPT-4, Claude 3, and Gemini consistently score 90% or higher on a benchmark, that benchmark loses its descriptive power. The remaining 10% is often a mix of ambiguous questions, flawed data, or edge cases that don't represent a meaningful gap in general capability. It's like using a yardstick to measure a molecule; the tool is no longer fit for the task.

Worse, we face the pervasive issue of data contamination. These benchmarks are static and public. It's almost certain that their data has been hoovered up into the massive training sets of next-generation models. When a model has seen the test questions (or close paraphrasings) during training, its high score isn't a sign of reasoning—it's a sign of memorization. We're not testing intelligence; we're grading a take-home exam where the student had the answer key all along.
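
One practical response is to check for verbatim overlap before trusting a headline score. Below is a minimal sketch of the kind of n-gram overlap check commonly used to flag contamination; it is an illustration, not any lab's actual pipeline, and `benchmark_items` and `training_corpus` are hypothetical placeholders.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(benchmark_items: list[str],
                      training_corpus: list[str],
                      n: int = 13) -> list[int]:
    """Return indices of benchmark items that share any n-gram with the corpus.

    Long verbatim overlaps (13 words is a common, conservative choice) are far
    more likely to indicate memorized test data than coincidence.
    """
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & corpus_ngrams]
```

Checks like this are only a lower bound on leakage, since close paraphrases slip through, which is exactly why high scores on public, static benchmarks deserve skepticism.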

My 3-Step Plan for Future-Proofing AI Evaluation

To truly understand the capabilities of GPT-5 and beyond, we need to move past these saturated, static tests. We need a new evaluation philosophy. Here's my three-step plan to get there before 2025.


Step 1: Sunset the Saturated Benchmarks

The first step is a hard one: we must consciously de-emphasize and retire the benchmarks that have served their purpose. This doesn't mean they were useless; it means they've graduated. Continuing to chase the last few percentage points on MMLU or HellaSwag is an inefficient use of research effort and compute.

Instead of treating them as the primary indicator of SOTA, we should reframe them as foundational competency checks. Does a new model clear the 90% bar on these tests? Great, it has the basic knowledge and language skills. Now, let's move on to the real challenges.
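
In practice, this reframing can be encoded as a simple pass/fail gate that runs before any frontier evaluation. The snippet below is a minimal sketch; the benchmark names, scores, and the 0.90 threshold are illustrative placeholders, not a proposed standard.

```python
FOUNDATIONAL_THRESHOLD = 0.90  # placeholder bar for "basic competency"


def passes_foundational_checks(scores: dict[str, float],
                               threshold: float = FOUNDATIONAL_THRESHOLD) -> bool:
    """Treat saturated benchmarks as a gate to clear, not a leaderboard to climb."""
    return all(score >= threshold for score in scores.values())


# Hypothetical scores for a new model, expressed as fractions rather than percentages.
new_model_scores = {"MMLU": 0.91, "HellaSwag": 0.95, "ARC": 0.93}

if passes_foundational_checks(new_model_scores):
    print("Foundational competency confirmed; proceed to frontier evals.")
else:
    print("Basic gaps remain; frontier results would be hard to interpret.")
```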

This requires a cultural shift in the AI community. We need to resist the allure of a single, simple leaderboard score and embrace the messiness of more complex, qualitative evaluations. The goal is no longer to inch up a leaderboard but to map the vast, unknown territory of a model's true capabilities.

Step 2: Discover and Prioritize the Frontier

If the old benchmarks are retired, what replaces them? We need to focus on unsaturated evals—tasks where current models still struggle significantly. These are the new frontiers where true progress can be measured.

I group these frontier evals into three main categories:

  1. Complex Reasoning & Agentic Tasks: These go beyond simple Q&A. They require multi-step planning, tool use, and adaptation. Think of benchmarks like GAIA, which poses questions that require web browsing, document analysis, and logical deduction. Or SWE-bench, which tasks models with solving real-world GitHub issues in a codebase. Success here isn't about recalling a fact; it's about demonstrating a workflow.
  2. Long-Context & Multimodal Coherence: Can a model maintain a coherent train of thought across millions of tokens of context? We need evals that test for this specifically, not just as a side effect. This means feeding a model a full-length movie, three dense research papers, and a podcast transcript, then asking it to synthesize a novel insight that connects all of them. The evaluation isn't just about the final answer, but the logical consistency and absence of hallucination throughout its reasoning.
  3. Creative & Subjective Generation: This is perhaps the hardest frontier. How do you score a poem, a piece of music, or a business strategy? Static benchmarks fail here. The future of evaluation in this domain is pairwise comparison with human preference (the "chatbot arena" model), but with a twist: using domain experts. We don't just need to know which response is "better," but why. Is the code more efficient? Is the marketing copy more persuasive? Is the joke actually funny? A sketch of how such pairwise expert judgments can be aggregated follows this list.
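
To make the third category concrete, here is a minimal sketch of turning pairwise expert judgments into Elo-style ratings. It is a generic illustration of the "chatbot arena" idea rather than any platform's actual implementation, and the model names, domains, and judgments are made up.

```python
from collections import defaultdict


def update_elo(ratings: dict[str, float], winner: str, loser: str,
               k: float = 32.0) -> None:
    """Standard Elo update: the preferred model takes rating points from the other."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)


# Hypothetical expert judgments: (preferred model, rejected model, expert's domain).
judgments = [
    ("model_a", "model_b", "radiology"),
    ("model_b", "model_a", "contract_law"),
    ("model_a", "model_b", "marketing"),
]

ratings: dict[str, float] = defaultdict(lambda: 1000.0)  # everyone starts at 1000
for preferred, rejected, _domain in judgments:
    update_elo(ratings, preferred, rejected)

print(dict(ratings))
```

In a real system each judgment would also carry the expert's written rationale and be tracked per domain, because knowing why an output wins is the whole point of using domain experts.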

Here’s how these new frontier evals stack up against the old guard:

| Evaluation Type | Saturated Benchmarks (The Old Guard) | Unsaturated Evals (The Frontier) |
| --- | --- | --- |
| Example | MMLU, GLUE, SQuAD | GAIA, SWE-bench, AgentBench |
| Primary Skill Tested | Knowledge Recall & Language Understanding | Reasoning, Planning, Tool Use, Synthesis |
| Data Freshness | Static, prone to contamination | Dynamic, often based on live, unseen data |
| Success Metric | Single accuracy score (%) | Task completion rate, human preference, qualitative review |
| Ceiling | Low (models are at 90%+) | High (models are often below 30-40%) |

Step 3: Build a Living Evaluation System

The final, and most crucial, step is to stop thinking of evaluation as a static set of tests. We need to build a dynamic, human-in-the-loop evaluation framework—a living system that evolves alongside the models it measures.

This system has three components:

  • Continuous Generation of New Problems: Instead of a fixed dataset, the framework should constantly generate novel problems. For coding, this could mean pulling new issues from GitHub every day. For reasoning, it could involve using LLMs themselves to generate complex, multi-step logic puzzles that are guaranteed not to be in any training set. A sketch of this kind of daily problem harvesting appears after this list.
  • Expert Human Feedback Loops: For subjective and high-stakes tasks, automated metrics are not enough. We need a system for routing model outputs to verified human experts. A model's attempt to diagnose a medical scan should be reviewed by a radiologist. Its legal contract analysis should be checked by a lawyer. The feedback from these experts—both the score and the qualitative reasoning—becomes the new ground truth.
  • Adversarial Testing (Red-Teaming) as an Eval: Evaluation shouldn't just be about what a model can do, but also what it shouldn't do. A core part of the evaluation framework must be a continuous red-teaming effort, where humans and other AIs are incentivized to find and document model failures, biases, and vulnerabilities. Success is measured by how robust a model is to these adversarial attacks.
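
As a concrete, deliberately simplified example of the first component, the sketch below harvests issues opened in the last 24 hours from a public GitHub repository to form a never-before-seen problem set. The repository name and the `enqueue_for_evaluation` hook are hypothetical, and a production system would add authentication, pagination, and quality filtering.

```python
from datetime import datetime, timedelta, timezone

import requests  # third-party: pip install requests

REPO = "example-org/example-project"  # hypothetical repository


def fetch_fresh_issues(repo: str, hours: int = 24) -> list[dict]:
    """Fetch issues created in the last `hours` hours via the public GitHub REST API."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/issues",
        params={"since": cutoff, "state": "open", "per_page": 50},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    # The endpoint also returns pull requests and filters `since` by *update* time,
    # so keep only true issues that were actually created inside the window.
    return [item for item in resp.json()
            if "pull_request" not in item and item["created_at"] >= cutoff]


def enqueue_for_evaluation(issue: dict) -> None:
    """Hypothetical hook: hand the problem to the model, then to an expert reviewer."""
    print(f"Queued issue #{issue['number']}: {issue['title']}")


if __name__ == "__main__":
    for issue in fetch_fresh_issues(REPO):
        enqueue_for_evaluation(issue)
```

The same pattern extends to the other two components: route each model attempt into an expert review queue, and feed documented red-team failures back in as permanent regression tests.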

Key Takeaways: Preparing for the Next Wave

As we stand on the precipice of the next major AI leap, our ability to measure it is what matters most. Without robust evaluation, we're flying blind.

Here’s the plan in a nutshell:

  1. Retire the Old Guard: Acknowledge that benchmarks like MMLU are saturated. Reframe them as basic competency checks, not the ultimate prize.
  2. Focus on the Frontier: Shift all new evaluation efforts to unsaturated areas like complex agentic tasks, long-context synthesis, and expert-judged subjective generation.
  3. Build a Living System: Move from static datasets to a dynamic framework that incorporates continuously generated problems, expert human feedback, and adversarial testing.

The arrival of GPT-5 won't be marked by a new score on an old leaderboard. It will be demonstrated through capabilities we are only now building the tools to measure. By adopting this forward-looking evaluation plan, we can ensure we're ready to not just witness the future of AI, but to actually understand it.
