
Measuring Active Intelligence

Measuring skill acquisition efficiency via Active Querying

ARC · Benchmark · Active Inference

The Abstraction and Reasoning Corpus (ARC) was introduced by Chollet as an exemplar benchmark aligned with his proposed definition of intelligence: the efficiency with which an agent acquires new skills at tasks drawn from an unknown distribution, given prior knowledge and limited experience. ARC has been profoundly influential in the study of reasoning-centric AI, particularly because it resists rote pattern recognition and requires abstract, compositional reasoning. Yet, despite its impact, ARC in its current form exhibits several limitations that weaken its capacity to serve as a faithful operationalization of the philosophy behind On the Measure of Intelligence.

The first limitation concerns the sparsity of the benchmark’s signal. Performance is recorded as binary task success or failure, aggregated across tasks. As a result, small improvements in reasoning ability are invisible until they cross the threshold of solving a task. This creates a low-resolution, noisy measure that fails to capture incremental scientific progress. Researchers may invest substantial effort in developing new inductive biases, only to find no visible improvement on the leaderboard until a discontinuous jump occurs. For a benchmark intended to steer a research field, such coarse resolution is unsatisfactory.

The second limitation lies in the measurement of efficiency. In principle, ARC was designed to highlight efficiency of skill acquisition, yet the current evaluation metrics emphasize raw success rate and, occasionally, computational cost measured in FLOPs or dollars. These quantities do not correspond to the true scarcity in modern AI problems: the cost of experience. Many real-world tasks, from robotics to scientific discovery, are constrained by the availability of informative examples rather than by raw computation. ARC encodes scarcity implicitly by providing only a handful of demonstrations, but it does not quantify how many experiences an agent requires to succeed. Thus, it cannot distinguish between a learner that extracts maximal information from a few examples and a data-hungry learner that would, in principle, require millions of interactions to achieve the same outcome.

The third limitation concerns passivity. ARC provides fixed demonstrations; the agent cannot choose what experience to seek. Yet one of the hallmarks of intelligence is precisely the ability to ask the right questions: to design experiments that invalidate incorrect hypotheses and compress the search space of possible rules. From the scientific method to everyday learning, intelligence manifests as the ability to probe the world actively. In its current form, ARC cannot measure progress along this axis: a system that could strategically design sharp queries is indistinguishable from one that merely memorizes patterns, apart from occasional abrupt performance jumps caused by the scarcity of training data.

To address these shortcomings, we propose Active-ARC, a modification of the benchmark that introduces active querying while preserving the symbolic, abstract nature of the original tasks. In Active-ARC, the agent is initially presented with a single input-output example of a task. It then enters an interactive phase where it may propose new input grids of its own design. For each query, the environment returns the output under the hidden transformation rule. The agent decides when it has gathered sufficient evidence and transitions to test mode, at which point it must produce the correct output for a held-out test input. Crucially, performance is not judged merely on correctness but also on the number and complexity of queries made before submission.
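
To make the protocol concrete, the sketch below shows one way the interaction loop could be exposed to an agent. All names (`ActiveArcTask`, `query`, `submit`) and the tuple-of-tuples grid encoding are illustrative assumptions rather than a fixed API; the essential point is that the ground-truth rule lives server-side and is only ever observed through query outputs and the final verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Grid = Tuple[Tuple[int, ...], ...]  # immutable ARC-style grid of colour indices


@dataclass
class ActiveArcTask:
    """Hypothetical server-side wrapper around one hidden transformation rule."""
    rule: Callable[[Grid], Grid]     # ground-truth program, never exposed to the agent
    demo: Tuple[Grid, Grid]          # the single input-output pair shown up front
    test_input: Grid                 # held-out input the agent must eventually solve
    query_log: List[Grid] = field(default_factory=list)

    def query(self, grid: Grid) -> Grid:
        """Interactive phase: the agent proposes an input, the oracle returns its output."""
        self.query_log.append(grid)
        return self.rule(grid)

    def submit(self, prediction: Grid) -> bool:
        """Test mode: one final answer, judged for exact correctness."""
        return prediction == self.rule(self.test_input)
```

An agent would alternate between refining its hypothesis and calling query, then call submit exactly once when it judges the evidence sufficient; the logged queries feed directly into the scoring described next.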

The evaluation metric is designed to reflect skill acquisition efficiency more directly. Each query incurs a cost, composed of a fixed term and an additional penalty proportional to the description length of the query input. This discourages trivial strategies such as filling the grid with exhaustive noise and incentivizes parsimonious, hypothesis-driven experimentation. The final score aggregates, across tasks, the mean and variance of the total query cost incurred before submission. This provides a continuous, high-resolution signal that reveals small improvements in reasoning and can highlight qualitative jumps from passive to active learning.
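
A minimal scoring sketch follows, under the assumption that description length can be approximated by the compressed size of the flattened grid (any principled MDL proxy would do) and that the fixed and proportional terms are free parameters of the benchmark.

```python
import statistics
import zlib


def query_cost(grid, fixed_cost=1.0, length_penalty=0.01):
    """Cost of a single query: a fixed term plus a penalty proportional to the
    query's description length, here approximated by the zlib-compressed size
    of the flattened grid (an illustrative proxy, not a prescribed choice)."""
    flat = bytes(cell for row in grid for cell in row)
    return fixed_cost + length_penalty * len(zlib.compress(flat))


def task_cost(query_log):
    """Total query cost an agent accrued on one task before submitting."""
    return sum(query_cost(g) for g in query_log)


def benchmark_score(per_task_costs):
    """Aggregate signal across tasks: mean and (population) variance of cost."""
    return statistics.mean(per_task_costs), statistics.pvariance(per_task_costs)
```

A compression-based proxy also penalizes the noise-flooding strategy mentioned above, since high-entropy grids compress poorly and therefore cost more per query.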

Implementing Active-ARC raises technical challenges. The most immediate is the construction of reliable oracles. While existing DSL-based solutions to ARC tasks can serve as ground-truth programs, careful engineering is required to prevent leakage of internal representations. Equally challenging is the treatment of invalid queries. If a transformation is defined in terms of holes in objects, for example, then an input without holes may produce an undefined state. Whether the oracle should return a null output, the input unchanged, or a special marker requires theoretical justification, as each choice carries the risk of inadvertently leaking information. Another challenge lies in protecting the benchmark from gaming. While adversarially crafted queries could, in principle, exploit weaknesses in the oracle, a more fundamental danger is that anti-gaming mechanisms themselves become secondary oracles that provide unintended signals. In line with lessons from computer security, the safest path is to favor simplicity: minimalistic interfaces, blind testing against hidden tasks, and server-side evaluation, rather than increasingly complex safeguards that expand the attack surface.
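
One possible shape for such an oracle wrapper is sketched below. The sentinel value, the optional precondition check, and the blanket exception handling are all design assumptions; the point is only that every failure mode collapses into a single uninformative marker so that error behaviour cannot become a side channel.

```python
UNDEFINED = None  # sentinel returned when the hidden rule does not apply to the query


def safe_oracle(rule, grid, validator=None):
    """Wrap a ground-truth DSL program so that malformed or out-of-domain queries
    yield one uninformative marker instead of exceptions or partial state.

    `validator` is a hypothetical precondition check (e.g. "the input contains at
    least one object with a hole"); both it and the sentinel policy are design
    choices that need care, since each can leak information about the rule."""
    try:
        if validator is not None and not validator(grid):
            return UNDEFINED
        return rule(grid)
    except Exception:
        # Collapse all failure modes into the same marker so that error types
        # cannot be used as a secondary signal about the hidden program.
        return UNDEFINED
```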

From an information-theoretic perspective, Active-ARC reframes ARC as a problem of experimental design. Each query constitutes a channel through which information about the hidden rule flows to the agent. The measure of intelligence becomes the efficiency with which the agent reduces its uncertainty about the hypothesis space per unit cost. This aligns with both modern information theory and the epistemology of science, where progress is made not by passive observation but by sharp falsification of hypotheses. In this view, Active-ARC restores fidelity to Chollet’s definition by quantifying exactly how efficiently an agent converts scarce experience into new skill.
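
The objective this implies can be pinned down in an idealised, enumerable hypothesis space: score a candidate query by its expected information gain divided by its cost. The sketch below assumes hypotheses are deterministic programs over hashable grids and that a finite prior is available, neither of which holds for real ARC-scale hypothesis spaces; it is meant only to make the quantity being optimised explicit.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_info_gain_per_cost(hypotheses, prior, query, cost):
    """Score a candidate query by the expected reduction in uncertainty over a
    finite hypothesis set, per unit query cost. `hypotheses` maps names to
    callables predicting the oracle's output for the query; `prior` maps the
    same names to probabilities summing to one."""
    # Group hypotheses by the output they predict for this query.
    groups = defaultdict(list)
    for name, h in hypotheses.items():
        groups[h(query)].append(prior[name])
    # Expected posterior entropy: for each possible answer, the entropy of the
    # prior renormalised over the hypotheses still consistent with that answer.
    expected_posterior = 0.0
    for weights in groups.values():
        p_outcome = sum(weights)
        expected_posterior += p_outcome * entropy([w / p_outcome for w in weights])
    info_gain = entropy(list(prior.values())) - expected_posterior
    return info_gain / cost
```

Read this way, the aggregate query cost of Active-ARC acts as a budget-constrained proxy for the same quantity: agents that consistently choose high-information, low-cost queries finish tasks with lower total cost.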

Human performance provides an essential baseline. Empirical observations suggest that humans typically require only a few examples—often three to five—to grasp the rule underlying an ARC task. Measuring human query efficiency within the Active-ARC protocol would ground the metric in cognitive realism and highlight the gap between artificial and natural learners. While such studies are resource-intensive, even small-scale experiments would substantially strengthen the legitimacy of the benchmark. For an initial proof-of-concept, simulated baselines may suffice, but eventual human evaluation will be critical to ensure validity.

Inevitably, criticisms will arise. Some may argue that Active-ARC is merely a marginal extension of ARC. We contend that it introduces a qualitatively new dimension: the ability to assess not only whether an agent solves a task but how efficiently it arrives there. Others may question tractability. Yet by leveraging existing DSL programs, a proof-of-concept implementation is well within reach. Concerns about hackability are real, but simplicity and blind testing offer a pragmatic path forward. Finally, skeptics may ask whether Active-ARC is truly better than ARC-3, which also introduces interaction. We argue that ARC-3 shifts the focus toward sequential planning in observable state spaces, losing the abstract, implicit transformations central to ARC’s philosophy. Active-ARC, in contrast, preserves the symbolic essence of ARC-1/2 while extending it into the active learning regime.

In conclusion, Active-ARC offers a benchmark that is more tightly aligned with the philosophy articulated in On the Measure of Intelligence. By operationalizing skill acquisition efficiency through active queries, it provides a smoother, more informative research signal, better discriminates between passive and active learners, and remains faithful to the spirit of ARC as a measure of abstraction and reasoning. We do not claim it as a final measure of intelligence, but, like ARC itself, as an exemplar benchmark—an illustrative milestone—that may guide the community toward agents capable of learning as efficiently and inquisitively as humans.