the claims at the end.
When preparing for a math exam, one might watch the lecture recordings, spend some time staring at a theorem until it starts to make sense, and perhaps work through some past questions. However, we usually ask our AI agents to work on whatever task comes up without any comparable preparation. Whether we want an agent to do a deep-research-style literature review or to code with a new Python library, we never give it time to study; we just expect it to work. (If it doesn’t, we wait for the next release from OpenAI/Anthropic.)
Now let’s look at studying carefully. The interesting thing about studying is that it just works before we know the exact exam question. It’s a process that turns preparation into expertise! We want to introduce the same setup for agentic systems, and we call it Machine Studying, in parallel to Machine Learning. Machine Learning asks how a system can improve from data when we know what signal it is supposed to learn from. Machine Studying asks what a system should do when it is given a declarative corpus without a downstream task yet.
We view current AI agents as agentic systems, in the sense that they are much more than a transformer model. A typical product such as Claude Code can search the web or local sources, use different tools, and interact with external assets such as a codebase. So we consider the complete agentic system as the unit of analysis. Specifically, an agentic system is composed of the model itself; its context, which is also configurable and usually contains detailed system prompts; a specific harness such as a ReAct loop; assets that the model can interact with, such as a codebase, datasets, or retrieval indices; neural auxiliaries such as adapters or fast and slow weights, for example LoRA and KV cache; and a set of tools that the model has access to during inference. With this definition, we have a clearer picture of where Machine Studying could land. After a system studies, it does not have to only be a model weight update. It could also update any component of the system, such as a more detailed prompt in the context, a new retrieval index, or a specifically evolved harness.
The formal version of this setup is included below for reference.
We represent a language model system as an m-chant tuple $\Sigma = (\mathbf{M}, \mathbf{C}, \mathbf{H}, \mathbf{A}, \mathbf{N}, \mathbf{T})$, where $\mathbf{M}$ is the underlying model, $\mathbf{C}$ the available context, $\mathbf{H}$ the harness (e.g., ReAct scaffolds, RLM control flow, language model programs such as DSPy, and inference policies that regulate deliberation style), $\mathbf{A}$ non-neural assets such as corpora and indices, $\mathbf{N}$ neural auxiliaries coupled to the base model but external to its core parameters (e.g., modular adapters and prefix-tuning states), and $\mathbf{T}$ the tool set.
Let $\mathbf{D}$ denote a declarative corpus. We assume $\mathbf{D}$ exceeds the system's high-quality context window and lacks labels, rewards, or an explicit task distribution. A study procedure is what an agentic system does to itself before evaluation, with access to $\mathbf{D}$ but no advance knowledge of the test. For procedure $\pi$, we write
$$\Sigma^{\pi}_{\mathbf{D}} = \pi(\Sigma, \mathbf{D})$$
for the system after studying; the procedure may rewrite any component of the m-chant tuple. A degenerate no-study procedure leaves $\Sigma$ unchanged and shifts all adaptation to test time.
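To make the tuple and the study map concrete, here is a minimal sketch in Python. It is only an illustration of the definition above: the class name `AgenticSystem`, the field names, and the toy procedure `index_and_note` are all hypothetical, and each component is reduced to a string or a tuple of strings so the example stays self-contained.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class AgenticSystem:
    """Toy stand-in for the tuple (M, C, H, A, N, T); every field is a placeholder."""
    model: str                          # M: underlying model
    context: str                        # C: available context, e.g. system prompt
    harness: str                        # H: e.g. a ReAct loop
    assets: tuple[str, ...] = ()        # A: non-neural assets such as indices
    neural_aux: tuple[str, ...] = ()    # N: adapters, prefix states, ...
    tools: tuple[str, ...] = ()         # T: tool set


# A study procedure maps (system, corpus) to a new system, with no task in sight.
StudyProcedure = Callable[[AgenticSystem, list[str]], AgenticSystem]


def no_study(system: AgenticSystem, corpus: list[str]) -> AgenticSystem:
    """Degenerate procedure: leave the system unchanged."""
    return system


def index_and_note(system: AgenticSystem, corpus: list[str]) -> AgenticSystem:
    """Hypothetical procedure that edits C and A but leaves the weights alone."""
    note = f"\n[study note] the corpus contains {len(corpus)} documents"
    return replace(
        system,
        context=system.context + note,
        assets=system.assets + ("corpus_index",),
    )


base = AgenticSystem(model="base-model", context="You are a coding agent.",
                     harness="react", tools=("grep", "glob", "read_file"))
studied = index_and_note(base, ["doc one", "doc two"])
```

The point of the sketch is only that the output of studying is another system of the same type, and that the change can land in any component rather than in the model weights alone.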
Suppose you buy the setup. A very natural objection to Machine Studying is: why do we need the model to study if these agentic systems are able to call tools and interact with the codebase at test time, spending lots of agentic steps during test time? Perhaps all we need is just a harness that could spend more tokens at inference time. I would usually talk about this by having people recall the pre-reasoning-model era. Back then, a prevalent belief was that we should try to have language models remember only some axioms and have really strong reasoning capability, and thus they could just deduce many things in real time. This would largely solve some problems like hallucination. Now most people would be with me if I say that reasoning and knowledge are not so separable. Thus, I would argue the same with harnesses, or tool use in general. Sometimes I call this the offloading fallacy. Indeed it might be easier to argue in the tool-use case than in the reasoning one. After all, the model has to decide what to search for, which file to open, which tool to call, what counts as relevant, and when to stop. These are not at all neutral operations. They are produced by the model’s current state, which could carry the wrong bias. This is why a study phase can matter before the agent ever starts acting.
And then I have a very simple demo on this point with one of my all-time favorite models, Sonnet 4.6. I asked it to write me code using transformers to load the Qwen 3.6 model, which was released this year. Very interestingly, instead of simply searching for Qwen 3.6, it likely believed that I made a typo and searched for Qwen3, and found a Qwen3-0.6B model. This is a model that is very capable and can search, but at the point where it searches, it searches based on its own bias. And yes, this is an explicit case, but you could see how it can happen at a much larger scale, and in much more implicit places.
To study Machine Studying in a concrete setting, we built several mini-benchmarks. The first two are programming benchmarks targeting DSPy and OpenClaw. DSPy is xxx. We have 30 questions. OpenClaw is xxx. We have 20 questions.
The questions are generated with GPT-5.4 in Codex at xhigh effort. To make the questions grounded, we provide the model with the complete documentation for the library, and ask it to generate both questions and reference answers. To make the questions reflect real user needs, we also seed the generation with real user questions. For DSPy, the seed questions come from community forums. For OpenClaw, they come from GitHub issues. We filter these raw questions to keep only genuine and useful ones, and cluster them by topic using Qwen3-Embedding-8B. With this procedure, we are able to generate grounded, realistic, and diverse programming problems. For this first study, we only use a tiny subset of the generated problems because of cost and time constraints. Each task comes with a weighted grading rubric. Core checks carry most of the weight and test whether the solution actually exercises the library’s real abstractions and APIs. Supplemental checks pick up supporting details, such as formatting or edge cases. Below is one random sample from each benchmark.
I've got a dspy.ReAct agent that answers arithmetic word problems and is given a Python add function as a tool. I want an evaluation harness that gives an example credit ONLY when the final answer is correct AND the agent genuinely used the calculator to get there. The reason I care: some of these agents just blurt out the right number and immediately finish without ever calling the tool, and those runs must score zero — a correct answer alone is not enough.
My current metric pulls the tool calls off the prediction (I look at pred.tool_calls like I would with native function-calling / dspy.ToolCalls) and checks whether add is among them, but it never awards credit — even on runs where I can see from the logs that the agent clearly did call add. So my whole devset scores 0%. Give me a small, runnable harness (devset of a couple of arithmetic examples, a custom metric, and a run over the devset that prints an overall percentage plus per-example scores) that scores this correctly. Use DummyLM so it runs offline with no API key.
| id | type | weight | what it checks |
|---|---|---|---|
| 1 | core | 70 | Metric gives credit only if the answer matches and the agent used the tool, detected by scanning pred.trajectory for a tool_name_* whose value is 'add'. Must not read pred.tool_calls (ReAct never populates it), and must not pass on a finish-only trajectory. |
| 2 | core | 10 | The program is a dspy.ReAct(..., tools=[add]), so predictions carry trajectory (not an OpenAI-style tool_calls). |
| 3 | core | 12 | Runs dspy.evaluate.Evaluate over a devset of dspy.Example(...).with_inputs('question'). |
| 4 | supp | 8 | Reads per-example scores from result.results and the overall from result.score. |
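The core check in row 1 can be illustrated with a small metric that works on any prediction object carrying a `trajectory` dictionary. This is a sketch of the idea the rubric describes, not the benchmark's reference answer: the helper name `used_tool` and the exact-match comparison on `answer` are assumptions, and the `SimpleNamespace` objects stand in for real predictions so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def used_tool(trajectory: dict, tool_name: str) -> bool:
    """True if any tool_name_<i> entry in the trajectory equals tool_name."""
    return any(
        key.startswith("tool_name_") and value == tool_name
        for key, value in trajectory.items()
    )


def tool_grounded_metric(example, pred, trace=None) -> float:
    """Give credit only when the answer matches AND the add tool was called."""
    answer_ok = str(pred.answer).strip() == str(example.answer).strip()
    trajectory = getattr(pred, "trajectory", None) or {}
    return 1.0 if answer_ok and used_tool(trajectory, "add") else 0.0


# Stand-ins for an example and three predictions (shapes are assumed).
example = SimpleNamespace(question="What is 2 + 3?", answer="5")
used_calculator = SimpleNamespace(
    answer="5",
    trajectory={"thought_0": "add the numbers", "tool_name_0": "add",
                "tool_args_0": {"a": 2, "b": 3}, "observation_0": 5,
                "tool_name_1": "finish"},
)
blurted = SimpleNamespace(answer="5", trajectory={"tool_name_0": "finish"})
wrong = SimpleNamespace(answer="6", trajectory={"tool_name_0": "add"})
```

Here only `used_calculator` earns credit; `blurted` has the right answer but a finish-only trajectory, which is exactly the case the question asks to score as zero.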
I've got a set of isolated cron jobs, each scheduled independently. Most should use my usual model-fallback chain, but a couple of high-priority jobs I want to pin to a single model with no fallbacks at all — if that model is down, just fail rather than silently drift onto a cheaper backup. So on those jobs I cleared out the fallback list (saved it as an empty list on the job) and left the rest untouched.
My understanding of how the per-job resolution is supposed to work: if a job doesn't carry its own fallback list, it inherits the agent's configured fallbacks; if it does carry one, we use that. Since "empty" is the natural way to express "no list here," I expected the empty-list jobs to behave like the un-pinned ones and just inherit — but I actually want the opposite for them, and I'm second-guessing whether clearing the list is even the right signal. Write the per-job resolver that produces the effective fallback chain for one of these scheduled jobs, so that a job which deliberately specifies "no backups" is honored as exactly that, while jobs that simply never set a list keep getting the normal agent-level chain. Match how the rest of this codebase already distinguishes those two states.
| id | type | weight | what it checks |
|---|---|---|---|
| 1 | core | 42 | (outcome) An explicit empty fallbacks array → returned as-is ("no backups"); only a truly absent (undefined) list inherits. Must not collapse [] into "unset" via .length / || / truthiness. |
| 2 | core | 28 | (mechanism) Decide unset-vs-set by array presence — Array.isArray(...) (or === undefined) + ?? fallthrough, so [] survives as a real value. |
| 3 | core | 15 | When absent, inherit by calling the repo's real resolveEffectiveModelFallbacks({cfg, agentId}) — not a hand-rolled default. |
| 4 | supp | 9 | Returns the job's own non-empty payload.fallbacks verbatim when kind === "agentTurn". |
| 5 | supp | 6 | Derives hasSessionModelOverride from a non-empty trimmed payload.model; gates on kind === "agentTurn". |
One might naturally ask: if GPT-5.4 in Codex can generate the benchmark, why is the benchmark not already solved? It seems like one of two things must be true. Either the generated questions are low quality, in which case the benchmark is not meaningful. Or the generated questions are good, in which case GPT-5.4 in Codex must already be very good at utilizing a new library, and maybe Machine Studying is not a real problem after all. I think this is not the right conclusion! Question generation and question answering are highly asymmetric tasks in our setup. During question generation, we intentionally use almost everything available to make the questions high quality, such as documentation, seed user questions, deterministic checkers, repeated feedback loops, optimized prompts, manual inspection, and a long iterative process for refining both the question and the answer. In fact, if one thinks about it carefully, generating a good repo-grounded coding task is itself a process that requires studying: the system has to understand the library well enough to ask a meaningful question. Here we compensate for that with extra context, tool use, and human judgment. (Thus a single question can take up to an hour to produce.) During evaluation, and in many real-life cases, however, the model does not have access to the documentation or the seed questions. It receives the codebase and has to solve the task through the allowed interface.
There is another reason we have to be careful here. Even after this asymmetry is clear, the benchmark score alone is still not the full object we care about. For corpus-grounded benchmarks, a score is not meaningful by itself unless we also account for how much work was used to get it. We would of course like our agent to score higher, but a sufficiently aggressive system can often buy accuracy by spending more work at test time. For example, we could force the agent to inspect every file in the repository before answering any question. Or, if we do not care how new data is incorporated or how much it costs, the degenerate solution to continual learning is simple: whenever new data arrives, grab the old training set, append the new data, and train the whole model again. If I am taking an open-book medical exam and you give me twenty years, I can go to medical school, practice as a doctor, and then come back to answer the question. That entire process is my “harness.” This is also why we can’t just casually “assume infinite inference-time compute…” **In fact, if a system needs enormous inference-time compute or an enormous runtime scaffold to make progress, then the benchmark is telling us something about the cost of not studying.**
We evaluated GPT5.4-mini, GPT5.1, and Qwen3.5-9B. For each model, we evaluate four inference-time budgets in terms of ReAct loops:
- zero-shot: The model receives only the problem description. This measures the model’s performance from parametric knowledge alone.
- ReAct-5: ReAct is a three-stage agentic loop. We provide three tools: grep, glob, and read_file, which allow the model to search for specific functions, search for file names, and inspect the details of a file. The model is capped at 5 iterations.
- ReAct-20: The same setup as ReAct-5, but the agent is capped at 20 iterations.
- ReAct-20 with no-early-return: The same setup as ReAct-20, but the agent must take at least 20 iterations before answering, with an invisible hard cap at 30 iterations.

The grading largely consists of two parts. The first is a set of deterministic checks for syntax errors, API hallucinations, and whether the program compiles. Failing these deterministic checks gives an automatic zero. The second part uses GPT5.4 as the judge to grade the answer based on the rubric. Missing core items are also automatically zeroed. Below are the results for the two GPT models.
We choose GPT5.1 and GPT5.4-mini with the expectation that the two models have roughly similar capabilities. On some benchmarks, such as tau2-telecom, GPT5.1 outperforms GPT5.4-mini by a small margin, so we do not expect GPT5.4-mini to simply dominate because it is a stronger model. On DSPy, however, GPT5.4-mini beats GPT5.1 at every inference-time budget. At zero iterations, both sit at 0, since neither model can write a correct program without taking a single step. One thing to notice is the knowledge cutoff: GPT5.4-mini has a documented cutoff of Aug 31, 2025, while GPT5.1 has a cutoff of Sep 30, 2024. Given that DSPy became much more popular after 2024, we suspect that GPT5.4-mini simply knows more about DSPy than GPT5.1. In other words, the difference here may not be raw capability, but expertise: familiarity with how DSPy actually wants to be used.
For OpenClaw, both models score quite low. This is expected in a different way. OpenClaw is a newer library that neither model has likely seen during training, so there is much less prior familiarity to lean on. In absolute terms, both tasks are difficult. The best DSPy number is still under 40%, and OpenClaw barely clears 10% no matter how long we let the models think.
There is also an interesting difference between allowing more inference-time budget and forcing the model to use it. On DSPy, the repository is quite small (around 240 files). If we allow the model up to 20 ReAct turns, it often stops much earlier. But when we force the model to use 20 ReAct turns, performance improves substantially. This suggests that the issue is not only the size of the inference-time budget but also the model’s own decision about whether it has searched enough, and clearly that decision is very often wrong.
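The difference between a capped budget and a forced budget can be sketched as a loop with both a maximum and a minimum number of iterations. This is a hypothetical skeleton, not the harness used in the experiments: the `step` callback, the choice to count a rejected finish as an iteration, and returning `None` when the cap is hit are all assumptions made for illustration.

```python
from typing import Callable, Optional


def run_loop(
    step: Callable[[int, list], tuple],
    max_iters: int,
    min_iters: int = 0,
) -> tuple[Optional[str], int]:
    """Run an agent loop; `step(i, history)` returns ("act", obs) or ("finish", answer).

    A finish attempt is only accepted once at least `min_iters` iterations
    have been used; earlier attempts are rejected and the loop continues.
    """
    history: list = []
    for i in range(max_iters):
        kind, payload = step(i, history)
        if kind == "finish":
            if i + 1 >= min_iters:
                return payload, i + 1
            # Forced-budget condition: refuse to stop early (assumed behavior).
            history.append("finish rejected: keep searching")
            continue
        history.append(payload)
    return None, max_iters  # cap reached without an accepted answer


def eager_agent(i: int, history: list) -> tuple:
    """Toy policy that wants to answer as soon as it has taken two actions."""
    if i >= 2:
        return ("finish", f"answer after seeing {len(history)} notes")
    return ("act", f"observation {i}")


capped = run_loop(eager_agent, max_iters=20)                 # stops at iteration 3
forced = run_loop(eager_agent, max_iters=30, min_iters=20)   # held until iteration 20
```

With the same policy, the capped run ends as soon as the agent decides it is done, while the forced run overrides that decision, which is the contrast the paragraph above describes.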
Before moving to smaller models, it is useful to look at the outcome distribution.
Missing core items dominate the errors across all four inference settings. When we inspect the trajectories, the pattern is often not that the model finds nothing. It usually lands on some plausible solution, and then engineers heavily around that suboptimal solution instead of continuing to search and doing things the way a repo expert would. Another fun observation is that GPT5.1 suffers much more than GPT5.4-mini from compile checks. Initially, it looked like the model was hallucinating nonexistent APIs. But after inspecting the outputs, I realized that it was often using deprecated APIs, probably something in its memory from the 2024 era.
Next, we evaluate Qwen3.5-9B. It is a good model given its size, but if we use the same grading rubric, we do not see many interesting distinctions because the scores are very low. In fact, it is surprising how poor the performance is, given how capable these models look on many standard benchmarks. So in this case, we also report a more lenient scoring condition where we remove syntax and compile checks, and we do not automatically zero the answer when the model misses core items. This lets us see whether the model is at least moving in the right direction, even when it fails the stricter version of the task.
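The strict and lenient scoring conditions can be sketched as one function with two gates. This is an assumed reading of the grading described above, not the actual grader: it treats the score as the passed fraction of rubric weight, and the names `RubricItem` and `grade` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    kind: str       # "core" or "supp"
    weight: float   # rubric weight, e.g. out of 100
    passed: bool    # judge's verdict for this item


def grade(items: list[RubricItem], deterministic_ok: bool, strict: bool = True) -> float:
    """Return a score in [0, 1] under the strict or the lenient condition."""
    if strict and not deterministic_ok:
        return 0.0  # syntax / hallucinated API / compile failure
    if strict and any(item.kind == "core" and not item.passed for item in items):
        return 0.0  # any missing core item zeroes the answer
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if item.passed)
    return earned / total if total else 0.0


# Weights taken from the DSPy sample rubric; the pass/fail pattern is made up.
items = [
    RubricItem("core", 70, True),
    RubricItem("core", 10, True),
    RubricItem("core", 12, False),
    RubricItem("supp", 8, True),
]
strict_score = grade(items, deterministic_ok=True, strict=True)
lenient_score = grade(items, deterministic_ok=True, strict=False)
```

In this made-up case the answer gets most of the rubric right but misses one core item, so it scores zero under the strict condition and 0.88 under the lenient one, which is the kind of partial progress the lenient condition is meant to expose.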
The benchmark is also meant to test study algorithms, not only base models. Once we fix the model, the next question is what kind of preparation we can actually run before evaluation, and how much each preparation helps. Under the Machine Studying setup, this is constrained in a very particular way: we do not have gold downstream data, a reward function, or a known task distribution. So we test a few existing methods that can be made to fit this setting, using Qwen3.5-9B as the base model.
Continual pre-training. A very direct baseline is to continue training the model on the corpus itself. This is not obviously the right method for a post-trained model, but it is the simplest thing one might try. The objective is next-token prediction over unannotated raw data with cross-entropy loss. The risk is that this kind of update can damage capabilities that were acquired during post-training, such as instruction following, reasoning, and coding behavior. To make the baseline more reasonable, we restrict the update to a low-rank adapter with LoRA rank 128. We also mix in supervised coding trajectories generated by the original model itself, including reasoning traces, as an anchor so the model does not drift too far while adapting to the corpus. We test two variants based on the corpus material used for the next-token objective.
- CPT(code): continual pre-training on the DSPy codebase, which contains 459k tokens, mixed with SFT on MBPP traces.
- CPT(doc): continual pre-training on the DSPy documentation, which contains 160k tokens, mixed with SFT on MBPP traces.

The two data sources are shuffled together during training.

Supervised fine-tuning + on-policy self-distillation. Another way to use the corpus is to turn it into synthetic supervision. Since we do not have gold QA data, we generate question-answer pairs conditioned only on the DSPy codebase. The generated data contains two kinds of questions: deterministic questions derived from AST or syntax structure, which mostly ask about code organization and API structure, and more free-form questions about the library. To improve the quality of the questions, we use a larger model, DeepSeek-V4-Flash, as the question generator. This is not a fully self-contained study procedure, since it uses an external stronger model, but it gives a useful baseline for how far synthetic supervision can go.
Due to time constraints, this large-scale training is done with reasoning turned off. Since we still want the resulting model to be comparable as an instruction-following and reasoning model, we add a recovery stage afterward. Following the Thinking Machines Lab recipe, we run on-policy distillation with reverse KL on a roughly 60K-example mixture from Tulu3, OpenThoughts, and MBPP. The goal of this stage is to recover general reasoning, instruction following, and coding ability after the synthetic-supervision stage.
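For readers unfamiliar with the term, the per-token quantity behind “reverse KL” can be sketched on a toy vocabulary. This is only an illustration of the divergence itself, KL(student || teacher), computed from two small logit vectors; it is not the training code of the recipe, and the function names are illustrative.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: list[float], teacher_logits: list[float]) -> float:
    """KL(student || teacher): the expectation is taken under the student."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


# Toy next-token distributions over a three-token vocabulary.
student = [2.0, 0.5, -1.0]
teacher = [0.0, 2.0, -1.0]
loss = reverse_kl(student, teacher)
```

Because the expectation is taken under the student's own distribution, the penalty falls on tokens the student actually prefers, which is why this direction pairs naturally with sampling trajectories from the student.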
These are not meant to be the answer to Machine Studying, and as we will see, they do not solve the problem: existing ways of turning a corpus into expertise do not work very well yet.
The results are largely what we should expect from the objectives. Continual pre-training does not give the model much useful expertise on the benchmark. Whether we train on the DSPy codebase or on the DSPy documentation, we do not see meaningful gains under the stricter grading. This is not very surprising. The objective is next-token prediction over the corpus, so it may help the model become more familiar with the surface form of the code or documentation, but the benchmark asks for something more specific: use the library correctly inside an agentic coding task.
The synthetic SFT + on-policy distillation baseline behaves more like what we would expect. It does learn something. In the no-tool setting, the model gains a lot of performance, which means the training is not doing nothing. The model is able to remember some facts and patterns from the synthetic QA data, and it can still use the harness and reason roughly as before. In this sense, the recipe is working. The recovery stage also seems to do what it is supposed to do, namely keep the model from losing too much of its instruction-following and reasoning behavior.
However, under the stricter grading, this still does not translate into much higher performance. More importantly, when we put the SFT + OPD model back into the ReAct harness, the improvement does not really compound with inference-time search. This is the key observation. The model becomes better in the way the training objective predicts, but not in the way we would want from studying. It can answer more things closed-book, but it does not reliably become better at using the corpus through the agentic loop.
Looking at trajectories makes this clearer. The trained model does behave differently. Compared with the base model, it searches more often for terms that appear inside the DSPy codebase, and it sometimes guesses names or concepts that are more corpus-specific.
This also helps explain why many context-distillation-style methods may be limited in this setting. A lot of recent methods differ in mechanism: some use LoRA, some use KV-cache, some use hypernetworks to generate more LoRA. But if we look at the objective, many of them are secretly asking for a similar thing. They create an expert trajectory by putting useful context into the prompt, and then train another system to imitate that behavior without the same context. That objective is not wrong. In simpler settings, it can work very well.
However, the task becomes much harder here, because the model is asked to internalize more than a fixed prompt or a style. It has to internalize the right context conditional on the question. A toy version makes this easier to see. Suppose one context says “the answer to question one is A,” and another context says “the answer to question two is B.” If we remove the context later and ask question one, perfect context distillation would require the model to recall the question-one sentence, not the question-two sentence, and recall it correctly. If it recalls the wrong sentence, or recalls the right sentence with the wrong answer, the output fails.
There are two requirements hiding here. First, the model needs very high-fidelity memorization of whatever it has internalized. If the internalized statement is wrong by even a small amount, the answer can be wrong. Second, the model needs compositional generalization over when to recall which internalized piece. If the question asks something indirect, or requires combining multiple pieces of internalized context, the model has to retrieve the right pieces internally and compose them correctly. In a codebase, this becomes much harder than the toy example. The model is not memorizing ten clean sentences. It is trying to absorb hundreds of thousands of tokens of code and use them in many possible downstream tasks.
From this perspective, the performance of synthetic SFT is not surprising. If a model goes from very low accuracy to noticeably higher no-tool accuracy, that is already evidence that it internalized something. But asking it to reach robust repo-level expertise through synthetic QA is asking for both near-perfect memorization and strong compositional generalization. That is a much stronger requirement than ordinary fine-tuning. It is also why the scale starts to look strange. In our setup, we generated around 350k synthetic question-answer pairs over a DSPy corpus of roughly 460k tokens. That is roughly 0.76 synthetic questions per corpus token. This sounds like an enormous amount of supervision, and yet it still does not give the kind of expertise we want. The problem is not simply that we need to generate “higher quality” questions. In Machine Studying, we do not know the downstream task distribution. We only have the declarative corpus. We can ask a model to generate plausible questions from the corpus, and we can try to make those questions diverse and grounded, but we cannot know in advance all the ways future users will ask the system to use the code.
This does not mean synthetic supervision is useless. It clearly moves the model. But it suggests that the obvious weight-based recipes are not enough. Continual pre-training can expose the model to the corpus. Synthetic SFT can teach the model to answer many generated questions. Context distillation can compress some privileged-context behavior. But none of these, at least in this form, reliably turns the corpus into agentic expertise. They do not give the model the stable ability to search, select, and compose the right pieces when the task is new.
A second baseline we study is the cheatsheet: the model first explores the codebase and writes itself a short script capturing the structure, then is tested with that script in context. Plotting score against inference compute (tokens), the cheatsheet shifts the DSPy frontier up and to the left—higher score for less compute—while on OpenClaw the effect is muted, which makes it a useful baseline for a studying method.
The next benchmark studies the task of writing literature reviews. This is a useful case study because many people already use AI systems for research, and I personally think AI for research is already extremely powerful. But I also think a lot of this capability comes from the model having already internalized a large amount of very recent scientific literature during pretraining. So the natural question is what happens when the literature moves past the model’s knowledge boundary. If we take a model whose knowledge cutoff is around September 2024, and ask it to write a literature review for papers from 2026, how much can it recover with tools? Can search compensate for being behind the current literature?
The setup is as follows. We collect papers from recent machine learning conferences (ICLR, CVPR, ICML, NeurIPS) from 2018 to 2025 and use them as the searchable literature corpus. We then take a set of target papers from ICLR 2026, and ask the model to write a related-work style literature review for each target paper based on its title and abstract. The model can interact with the literature corpus through BM25 search. Each search query returns up to 20 unique papers, with titles and abstracts. The model can search for up to 20 turns, inspect the returned papers, and then write a literature review.
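For readers who have not used BM25, the search tool can be pictured with a small, self-contained scorer. This is a generic Okapi BM25 sketch with whitespace tokenization and common default parameters (`k1=1.5`, `b=0.75`), which are assumptions; it is not the index used in the experiments, and the class name `BM25Index` is illustrative.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class BM25Index:
    """Minimal Okapi BM25 over a non-empty list of documents."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(counts.values()) for counts in self.tf]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for counts in self.tf for term in counts)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_terms: list[str], i: int) -> float:
        norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
        total = 0.0
        for term in query_terms:
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query: str, top_k: int = 20) -> list[int]:
        """Return indices of up to top_k documents with a positive score."""
        terms = tokenize(query)
        scored = [(self.score(terms, i), i) for i in range(len(self.tf))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [i for _, i in ranked[:top_k]]


papers = [
    "prompt optimization for language models",
    "image segmentation with vision transformers",
    "preference optimization with reinforcement learning",
]
index = BM25Index(papers)
hits = index.search("prompt optimization")
```

Note that the ranking is purely lexical: what the model types into the query decides what comes back, so the query itself already carries the model's prior view of the field.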
For each target paper, there are two sets of papers we care about. The first set is the papers that the target paper actually cites and that also appear in our corpus. This measures whether the model can recover literature that the authors themselves used. The second set is a set of must-cite papers, constructed using an external labeling procedure. (cite that paper) This is meant to capture papers that should appear in a good literature review even if the exact citation list is noisy. In both cases, the model’s job is not only to write fluent prose. It has to search the literature, decide what matters, and select the right papers.
A useful distinction here is between reach and selection. Across 20 search turns, the model may see many papers. Since each query returns up to 20 papers, the loose upper bound is around 400 papers, although in practice the number of unique papers reached is around 230. We call the set of papers the model has seen its reached set. We can ask how many relevant papers ever appear in this reached set. This measures whether the model can find the right papers at all. Then, at the end, we ask the model to select up to 100 papers to use in the final literature review. Recall@100 measures which papers it actually chooses to keep. This measures judgment after retrieval.
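The two quantities can be written down in a few lines. This is a minimal sketch under the assumption that papers are identified by string IDs and that both metrics are fractions of the relevant set; the function names and the toy IDs are illustrative.

```python
def reached_set(turn_results: list[list[str]]) -> set[str]:
    """Union of all papers returned across search turns."""
    seen: set[str] = set()
    for results in turn_results:
        seen.update(results)
    return seen


def reach(turn_results: list[list[str]], relevant: set[str]) -> float:
    """Fraction of relevant papers that ever appeared in a search result."""
    if not relevant:
        return 0.0
    return len(reached_set(turn_results) & relevant) / len(relevant)


def recall_at_k(selected: list[str], relevant: set[str], k: int = 100) -> float:
    """Fraction of relevant papers among the first k selected papers."""
    if not relevant:
        return 0.0
    return len(set(selected[:k]) & relevant) / len(relevant)


# Toy trajectory: two search turns, then a final selection.
turns = [["p1", "p2", "p3"], ["p3", "p4"]]
relevant = {"p2", "p4", "p9", "p10"}
selected = ["p1", "p2"]

reach_score = reach(turns, relevant)              # p2 and p4 were seen
selection_score = recall_at_k(selected, relevant)  # only p2 was kept
```

In the toy case the model saw two of the four relevant papers but kept only one, so reach is 0.5 and Recall@100 is 0.25; the gap between the two numbers is the selection failure.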
In preliminary results, GPT5.1 and GPT5.5 reach a surprisingly similar fraction of the relevant literature. Even though GPT5.1 has an older knowledge cutoff, BM25 search lets it bring many of the right papers into view. For both the citation set and the must-cite set, the reach can be around 60%. So the older model is not simply failing because it cannot retrieve the relevant papers. The papers are often in front of it.
The difference appears at selection time. When the model has to choose the final 100 papers, GPT5.5 does much better than GPT5.1. The older model is especially worse at selecting papers from 2024 and 2025.
This also shows up qualitatively. In one example, we ask the model to write a literature review for a recent prompt-optimization paper. The older model drifts toward older optimization literature such as DPO, PPO, GRPO, and other reinforcement-learning or preference-optimization methods. This is understandable. Around the model’s training era, when one talked about optimization for language models, those were highly salient directions. But for the newer target paper, that is not necessarily the right map of the field. The model has access to search, but its judgment is still anchored in the literature it already knows.
This is the same pattern as the coding benchmark, but in a research setting. The model can retrieve many relevant items, and sometimes it can even reach the same papers as a newer model. The failure is not simply access. The failure is selection, emphasis, and organization. A model that is behind the literature may find the right papers but still not know which ones are the right backbone for a literature review. This is exactly where Machine Studying should matter. We would like a system to spend time with the new literature before the task arrives, so that when it later writes a literature review, it does not only search well. It also knows what the recent field has come to consider important.
At the beginning, I defined Machine Studying, its goal, and its potential forms. Now I want to define the two derived evaluation objects that correspond to this setup.
So far, a benchmark score hides how much work was needed to obtain it. So instead of treating performance as a single point, we look at the complete performance-inference-compute frontier. For a fixed task, we can draw performance as a function of inference-time compute, usually with compute shown on a log scale. Broadly, a system is more ‘expert’ when this curve is shifted up and to the left: it reaches higher performance at the same compute, or the same performance with less compute.
To be able to compare different systems, we need to turn the curve into a number. One way of doing this is to calculate a weighted area under the curve (WAUC):
$$\text{Expertise} = \int \text{performance at an inference budget} \times \text{importance of that budget}$$
The importance weight is larger for budgets we actually care about and smaller for budgets that are technically possible but too expensive to matter much. A system that only becomes good after enormous search may have high performance, but it should not have high expertise.
This figure shows four imaginary curves. The ordinary curve rises gradually as more inference compute is spent. The expert curve is shifted up and left, showing a system that is better at nearly every budget. The SFT-like curve starts strong but flattens, reflecting what we saw earlier when training models on synthetic DSPy questions: such supervision can buy familiarity and improve low-compute behavior, but may not keep improving as the agent gets more time to search. The brute-force curve may represent a very aggressive harness such as ‘read-every-single-file-before-answering’. It starts weak and only rises far to the right, showing a system that can eventually solve the task but only by spending too much test-time compute. With a function defining how much each inference-time budget matters, we can convert these curves into numbers.
Of course, making the math work requires a more rigorous definition. Here is one way to state it more formally.
The weighted-area score is just a weighted average over the compute curve. In the continuous version,
$$\mathcal{E}(\Sigma; \mathbf{D}) = \int p_{\Sigma,\mathbf{D}}(x)\, w(x)\, dx, \qquad w(x) \geq 0, \quad \int w(x)\, dx = 1.$$
Here $p_{\Sigma,\mathbf{D}}(x)$ is the performance of system $\Sigma$ on domain $\mathbf{D}$ at position $x$ on the log-compute axis. The normalization makes the score stay on the same scale as performance: if $p_{\Sigma,\mathbf{D}}(x) \in [0,1]$, then $\mathcal{E}(\Sigma; \mathbf{D}) \in [0,1]$. The weight $w$ says which compute budgets matter. (For the illustrative curves here, one natural choice is a decreasing weight on the log-compute axis, such as an exponential decay; this lets the curve extend over $(0,\infty)$ in raw compute while still giving a finite score, because the far-right tail receives vanishing weight.)
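As a rough numerical illustration of the continuous definition, the sketch below integrates a performance curve against an exponential-decay weight on the log-compute axis. Everything specific here is an assumption made for illustration: the decay rate, the logistic shape of the performance curve, the restriction to $x \geq 0$, and the function names `exp_decay_weight`, `logistic_performance`, and `continuous_expertise`. None of these come from a real benchmark.

```python
import math


def exp_decay_weight(x, rate=0.5):
    """Hypothetical weight w(x) = rate * exp(-rate * x) on x >= 0.

    It is non-negative and integrates to 1 over [0, inf).
    """
    return rate * math.exp(-rate * x)


def logistic_performance(x, midpoint=5.0, steepness=1.5):
    """Hypothetical performance curve in (0, 1) along the log-compute axis."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def continuous_expertise(perf, weight, x_max=40.0, steps=40_000):
    """Trapezoidal approximation of the integral of perf(x) * weight(x) on [0, x_max].

    The tail beyond x_max is dropped; with a decaying weight it is negligible.
    """
    h = x_max / steps
    total = 0.5 * (perf(0.0) * weight(0.0) + perf(x_max) * weight(x_max))
    for i in range(1, steps):
        x = i * h
        total += perf(x) * weight(x)
    return total * h


# A curve shifted left (midpoint 3) versus one that only rises later (midpoint 5).
early = continuous_expertise(lambda x: logistic_performance(x, midpoint=3.0), exp_decay_weight)
late = continuous_expertise(lambda x: logistic_performance(x, midpoint=5.0), exp_decay_weight)
print(f"early riser: {early:.3f}, late riser: {late:.3f}")
```

Because the weight is normalized, a system with perfect performance at every budget scores approximately 1, and a curve that rises earlier on the log-compute axis scores higher than the same curve shifted to the right.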
In a more practical, finite evaluation we only run a handful of budgets, so the WAUC becomes a finite sum. Writing $b_i$ for the weight placed on the $i$-th budget,
$$\mathcal{E}(\Sigma; \mathbf{D}) = \sum_i b_i\, p_{\Sigma,\mathbf{D}}(x_i), \qquad b_i \geq 0, \quad \sum_i b_i = 1.$$
As an example, suppose the raw inference budgets are
$$1\text{k},\;10\text{k},\;100\text{k},\;1\text{M},\;10\text{M}$$
tokens. On a base-10 log-compute axis, these correspond to positions
$$3,\;4,\;5,\;6,\;7.$$
Then a weighted expertise score can be
$$\mathcal{E}(\Sigma; \mathbf{D}) = 0.35\,p_{\Sigma,\mathbf{D}}(3) + 0.30\,p_{\Sigma,\mathbf{D}}(4) + 0.20\,p_{\Sigma,\mathbf{D}}(5) + 0.10\,p_{\Sigma,\mathbf{D}}(6) + 0.05\,p_{\Sigma,\mathbf{D}}(7).$$
This weighting emphasizes very cheap performance. A different benchmark might treat tool use as a requirement and consider no-tool-call performance too brittle to reward. It could define the weights as
$$(0.10,\;0.25,\;0.35,\;0.20,\;0.10),$$
which puts most of the score around the 100k-token region.
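The finite sum is easy to compute directly. The sketch below applies the two weight vectors from the example to two made-up performance profiles. The profiles `sft_like` and `brute_force` are hypothetical numbers chosen to mimic the shapes of the imaginary curves, not measurements, and `weighted_expertise` is just an illustrative helper name.

```python
import math


def weighted_expertise(performance, weights):
    """Discrete WAUC: sum_i b_i * p(x_i), with b_i >= 0 and sum_i b_i = 1."""
    if len(performance) != len(weights):
        raise ValueError("performance and weights must have the same length")
    if any(b < 0 for b in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(b * p for b, p in zip(weights, performance))


# Inference budgets in tokens and their positions on a base-10 log axis.
budgets_tokens = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
log_positions = [round(math.log10(b)) for b in budgets_tokens]  # [3, 4, 5, 6, 7]

# The two weightings from the text.
cheap_weights = [0.35, 0.30, 0.20, 0.10, 0.05]
mid_weights = [0.10, 0.25, 0.35, 0.20, 0.10]

# Hypothetical performance at each budget (illustrative only).
sft_like = [0.50, 0.55, 0.57, 0.58, 0.58]     # strong early, then flat
brute_force = [0.05, 0.10, 0.30, 0.70, 0.90]  # weak early, strong late

for name, curve in [("sft_like", sft_like), ("brute_force", brute_force)]:
    cheap = weighted_expertise(curve, cheap_weights)
    mid = weighted_expertise(curve, mid_weights)
    print(f"{name}: cheap-weighted={cheap:.4f}, mid-weighted={mid:.4f}")
```

With these made-up numbers, the SFT-like profile scores about 0.541 under the cheap-budget weighting and 0.561 under the mid-budget one, while the brute-force profile moves from about 0.2225 to 0.365. The choice of weights changes how far apart the two systems look even though the underlying curves are unchanged.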
The intelligence score (which will be introduced next) uses the same construction one level up.
Studying gives a second curve. For each amount of study compute, we first evaluate the resulting system’s full inference-time curve and convert that curve into an expertise score. Then we draw expertise as a function of study compute. The WAUC of this second curve is the (studying) intelligence:
\[\text{Intelligence} = \int \text{expertise after a study budget} \times \text{importance of that budget}\]
Expertise measures how much useful performance appears across the inference budgets we care about. Intelligence measures how efficiently study compute moves that expertise curve upward.
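In a finite evaluation, this nesting amounts to applying the same weighted sum twice: once across inference budgets to get an expertise score per study budget, and once across study budgets to get intelligence. The sketch below assumes three hypothetical study budgets, a hypothetical grid of performance numbers, and a hypothetical set of study weights. The helper name `wauc` and all values are illustrative, not results.

```python
import math


def wauc(values, weights):
    """Weighted sum of values with non-negative weights that sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * v for w, v in zip(weights, values))


# Importance of each inference budget (1k, 10k, 100k, 1M, 10M tokens).
inference_weights = [0.35, 0.30, 0.20, 0.10, 0.05]

# Importance of each study budget, smallest first (hypothetical).
study_weights = [0.5, 0.3, 0.2]

# Hypothetical performance grid: one row per study budget,
# one column per inference budget.
performance_by_study_budget = [
    [0.10, 0.20, 0.40, 0.60, 0.70],  # after the smallest study budget
    [0.20, 0.40, 0.60, 0.70, 0.80],  # after a medium study budget
    [0.40, 0.60, 0.70, 0.80, 0.80],  # after the largest study budget
]

# Inner WAUC: one expertise score per study budget.
expertise_curve = [wauc(row, inference_weights) for row in performance_by_study_budget]

# Outer WAUC: intelligence is the weighted area under the expertise curve.
intelligence = wauc(expertise_curve, study_weights)

print("expertise per study budget:", [round(e, 4) for e in expertise_curve])
print("intelligence:", round(intelligence, 4))
```

For this made-up grid the expertise scores come out to roughly 0.27, 0.42, and 0.58, and the intelligence score to about 0.377. Placing more weight on small study budgets, as here, rewards methods whose expertise rises early.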
This figure shows that different studying methods spend compute in very different ways. CPT-like exposure is cheap but nearly flat: the model sees the corpus, but does not turn it into much usable agentic expertise. Synthetic SFT should appear farther to the right, because generating and training on hundreds of thousands of question-answer pairs is expensive relative to the size of the corpus. It can raise expertise, especially at low inference budgets, but it may plateau if the learned behavior does not compound with search. A cheatsheet-like method should rise earlier because it gives the agent a compact structure that helps it search and select during inference. Retraining from scratch should sit far to the right: it may eventually produce high expertise, but only after pretraining-scale compute. The desired Machine Studying curve is the missing object: it rises early and keeps rising, because it converts a corpus into reusable expertise rather than shallow familiarity.