What Cheating Reveals
The phrase “if you ain’t cheating, you ain’t trying” is ugly in sports, but I think it points to something real in research.
In sports, the rules are constitutive. They are not an inconvenience layered on top of the activity; they define the activity. If you cut a corner in a race, use a hidden motor in a bike, or steal signals in a game where sign secrecy is part of the contest, you are not discovering a deeper version of the sport. You are eroding the thing itself.
But in open-ended work, the situation often feels different. In research, engineering, and science, the goal is usually not to preserve the sanctity of the current procedure. The goal is to get to a better understanding of the world, or to solve a problem more effectively. This is part of why the language of cheating can be misleading in research. Within a Kuhnian period of “normal science,” a field has a shared sense of which questions matter and which methods are legitimate. A move that falls outside that frame can look like a shortcut or a violation. But sometimes those illegitimate-looking moves are exactly what expose the limits of the current paradigm. In that sense, the pressure to “cheat” can be informative: it may reveal not just a weakness in the benchmark, but the outline of a better problem setting.
A Gradient Hidden in the Rules
I think there is a useful heuristic here.
Within a fixed problem setting, ask: what is the cheapest and most tempting way to cheat?
That question often reveals a gradient in the space of problem formulations. It tells you what the current setup is making unnecessarily expensive. If the same “forbidden” move keeps reappearing, that may be evidence that the field is suppressing a more productive interface to the real task.
A simple example is arithmetic. On some exams, using a calculator is cheating. That makes sense if the point of the exam is to test unaided fluency with manual computation. But as soon as the real goal shifts from arithmetic drill to solving quantitative problems, the calculator stops looking like corruption and starts looking like infrastructure. Scale that same move up a bit further and you get spreadsheets, symbolic algebra systems, numerical software, and programming. The original prohibition makes sense locally, but the forbidden action points toward a more useful problem setting.
Something similar is happening now with writing and coding. In a university course, using ChatGPT to draft your assignment or write your code may be cheating because the course is trying to measure individual competence under a specific set of constraints. But modern software engineering already relies on a dense stack of external cognitive tools: compilers, debuggers, search engines, documentation, copied snippets, test harnesses, CI, linters, static analyzers, and now language models. From the perspective of the classroom benchmark, some of this looks like outsourcing the work. From the perspective of the field, it is just the way the work is done.
I think machine learning has many examples of the same pattern. There are settings in which manually adjusting hyperparameters feels like illegitimate hand-holding. But once the field acknowledges that tuning is part of the real optimization problem, the “cheat” gets absorbed into the legitimate method. Then we get hyperparameter search, Bayesian optimization, neural architecture search, data curation pipelines, retrieval systems, and increasingly elaborate test-time optimization loops. The field redefines the task so that what used to be an illicit intervention becomes an explicit degree of freedom.
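To make the absorption concrete, here is a minimal sketch of the simplest version of this move: exhaustive grid search over hyperparameters. The `evaluate` function is a hypothetical stand-in for training and scoring a model; everything else (the search space, the names `lr` and `depth`) is illustrative, not any particular library's API. The point is that what once looked like illicit hand-tuning becomes an explicit, enumerable degree of freedom.

```python
import itertools

def evaluate(config):
    # Hypothetical stand-in for "train a model and return validation score".
    # This toy objective peaks at lr=0.01, depth=4.
    lr, depth = config["lr"], config["depth"]
    return -((lr - 0.01) ** 2) * 1e4 - (depth - 4) ** 2

def grid_search(space):
    """Exhaustively evaluate every configuration in the search space.

    The tuning 'cheat' is now a declared part of the method: the space
    and the number of evaluations are explicit and can be priced.
    """
    names = list(space)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

space = {"lr": [0.001, 0.01, 0.1], "depth": [2, 4, 8]}
best, score = grid_search(space)  # best is {"lr": 0.01, "depth": 4}
```

Bayesian optimization and architecture search are more sophisticated versions of the same legalization: they replace exhaustive enumeration with a model of the objective, but the structural move, turning a hidden intervention into a budgeted search, is identical.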
Historical Examples
A lot of scientific and technical progress has this flavor.
Before calculators and computers, many scientific workflows, astronomy especially, required long stretches of hand computation. Mechanical aids might once have looked like a kind of dilution of mathematical seriousness. But the purpose of astronomy was never to preserve hand calculation as a craft. Once machines could do the routine manipulation more reliably, the center of gravity of the field shifted upward.
Chess offers a clean finite/infinite game contrast. Using an engine during a tournament game is plainly cheating, because tournament chess is a finite game with explicit rules. But using engines to analyze positions, prepare openings, or discover endgame structure is not cheating in the broader intellectual project of chess. It is now a central way that chess knowledge is produced. The same tool that would invalidate the finite game can accelerate the infinite one around it.
There are older examples in mathematics as well. Computer-assisted proofs were initially met with discomfort because they violated a certain picture of what a proof ought to feel like: surveyed end-to-end by a human mind. But that discomfort was productive. It forced mathematicians to clarify what they actually required from proof, verification, and trust. The method first looked like a breach of the old standard and then became part of a larger understanding of what mathematical verification can be.
Even laboratory science often advances by absorbing what first looks like illegitimate convenience. Automation, high-throughput screening, simulation, standardized software pipelines, and statistical packages all reduce the amount of direct artisanal manipulation by the scientist. At some earlier point, each of these could have been described as a weakening of the “real” skill. In practice, they often make it possible to work on more important questions.
This is not because all shortcuts are good. It is because some shortcuts reveal the next abstraction.
Finite and Infinite Games
I think the distinction between finite and infinite games helps here.
In a finite game, the rules are part of the objective. The point is to win under those rules. Sports, exams, and some forms of certification work like this. The boundary conditions are not accidental. They are the thing being tested.
In an infinite game, the rules are much more provisional. They are scaffolding. We create them because they help organize work, compare methods, or make progress legible. But we are allowed to replace them if they cease to serve the deeper objective.
Benchmarks in machine learning often live in an uncomfortable middle ground. They are treated rhetorically like finite games, but strategically like infinite ones. We talk as though the benchmark itself is sacred, while everyone knows the real objective is broader: better systems, more robust methods, stronger understanding, more useful tools. This is part of why benchmark culture so often oscillates between useful discipline and sterile gaming.
If you take the finite-game view too seriously, you end up fetishizing the benchmark and confusing compliance with progress. If you abandon it entirely, you get contamination, leakage, and reward hacking. The hard part is knowing when a “cheat” is destroying the task and when it is pointing to a better one.
When the Signal Is Real
I think the distinction is roughly this.
Some forms of cheating exploit artifacts of the evaluation while severing the connection to the real objective. Training on test labels, leaking benchmark answers, cherry-picking runs, or tuning specifically to quirks of the leaderboard are all in this category. These are not glimpses of the future. They are just failures of measurement.
Other forms of cheating preserve the real objective while violating a local convention about how the objective is supposed to be pursued. Those are much more interesting. They often reveal hidden resources that the benchmark has been pretending away: compute, tools, external memory, prior data, human intervention, or test-time adaptation. Once those resources are made explicit and priced correctly, the cheating often disappears as a moral category. It becomes just another algorithmic choice.
This is why I think “cheating pressure” can be a useful research lens. It points to where the current interface is wrong.
A Research Heuristic
One way to turn this into a concrete methodology would be:
- Start with the benchmark or standard problem setting.
- Ask what the most efficient way to cheat would be.
- Separate benchmark exploits from genuine interface improvements.
- For the latter, define a new problem setting that makes the hidden resource explicit.
- Raise the standard again from there.
This feels close to how many fields actually evolve. We ban a shortcut, notice that everyone wants it, realize that it is compressing something important, legalize it, and then move the benchmark upward. Calculators become acceptable, then expected. Search becomes acceptable, then expected. Simulation becomes acceptable, then expected. LLM assistance may be going through the same transition now in many domains.
So I do think there is something interesting here. The claim is not that cheating is good. It is that the desire to cheat is often informative. In open-ended fields, it can reveal where a benchmark is misaligned with reality, where a workflow is ripe for automation, or where the next problem setting should be.
Sometimes the right response is to harden the rules.
Sometimes the right response is to change the game.