v4.2.0 // 2026.04 // The Intelligence Issue

INTERRUPT

Dispatches from the stack.

The Thinking Machine: How Large Language Models Rewired Cognition

01 The Architecture of Intelligence
02 Tokens All the Way Down
03 The Multimodal Turn
The Intelligence Issue April 2026 ISSN 2987-4419
Table of Contents

The Architecture of Intelligence

Inside every large language model lies an architecture that was, at its moment of invention, deeply unfashionable. The transformer was proposed in 2017 by a team at Google who were trying to solve a narrow problem in machine translation — sequence-to-sequence mapping — and had no idea they were handing the field its most consequential abstraction in decades. Nearly a decade later, the transformer underlies almost every serious AI system in deployment, and the race to understand, extend, and eventually surpass it has become the central engineering problem of the age.

The core insight is deceptively simple. Previous sequence models — recurrent neural networks, LSTMs, GRUs — processed tokens one at a time, maintaining a hidden state that carried information from earlier in the context. This made them slow to parallelise and prone to forgetting long-range dependencies. The transformer replaces this sequential bottleneck with an operation called attention: every token in a sequence can directly attend to every other token simultaneously. The computational cost increases quadratically with context length, but the parallelism makes it tractable on modern hardware, and the direct connections mean no information has to survive the journey through a compressed hidden state.
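The operation itself fits in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention — a single head, with no masking and no learned query/key/value projections, all of which a real transformer layer would add:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every other."""
    d_k = Q.shape[-1]
    # The (seq, seq) score matrix is the quadratic cost in sequence length.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors -- no hidden
    # state has to carry information across the sequence.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.standard_normal((seq_len, d_model))
out = attention(x, x, x)  # self-attention: Q, K, V from the same sequence
print(out.shape)          # (5, 8)
```

Note that every position is computed independently of the others, which is exactly what makes the operation parallelise across modern accelerators.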

"The transformer did not just change how we compute. It changed what we thought was worth computing on."
— from a talk at NeurIPS 2025

What followed the vanilla transformer was a period of intense architectural experimentation. Researchers tried sparse attention patterns, linear approximations to the quadratic attention computation, state-space models that maintained continuous dynamical systems as memory, and hybrid architectures mixing transformers with convolutional layers. Some of these alternatives showed genuine promise on specific tasks or efficiency regimes. But none displaced the transformer from its position as the default architecture for anyone training a new large model from scratch. The reasons are partly empirical — transformers simply performed better at scale — and partly path-dependent: the entire ecosystem of training infrastructure, optimisation techniques, and architectural intuitions had built up around them.

The current frontier is concerned with a different set of constraints. Raw parameter count has ceased to be the primary bragging right; training data quality and composition, inference-time compute strategies, and context length have moved to the centre of the conversation. Reasoning models — systems trained to generate extended chains of thought before producing a final answer — represent a qualitative shift in how the transformer architecture is being used, even if the underlying mathematics is unchanged. The question that looms over 2026 is whether the next leap requires a fundamentally new architecture or whether the transformer still has unspent inheritance that careful engineering can unlock.

The honest answer is that nobody knows. What is clear is that the field has become genuinely humble about its predictive power. The researchers who built the transformer were not trying to create general intelligence. The researchers who scaled it to billions of parameters were not certain it would generalise. The researchers who found that scaling laws predicted smooth improvements with compute were surprised that the curve held as long as it did. At each stage, the field has been simultaneously more successful and less predictive than anticipated. The Architecture of Intelligence, in other words, is not a solved problem. It is a live research programme whose endpoint remains genuinely uncertain.

Tokens All the Way Down

Before a language model can process text, that text must be converted into a sequence of numbers. This is tokenisation: the process of breaking raw input into discrete units that the model can manipulate mathematically. It sounds like a preprocessing footnote. In practice, tokenisation decisions ripple through every behaviour the model exhibits — including some of its most consequential failure modes.

The most widely used tokenisation schemes today are subword-based. Rather than treating each character as its own token (computationally expensive and semantically impoverished) or each word as a token (which cannot handle out-of-vocabulary terms and is inefficient for morphologically rich languages), subword tokenisers learn a vocabulary of between roughly 30,000 and 100,000 tokens from a large training corpus. Common words might occupy a single token; rarer words are decomposed into subword fragments. The word "intelligence" might tokenise as ["int", "elligence"], while "unprefixed" becomes ["un", "##prefixed"].
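The learning procedure behind most of these vocabularies is byte-pair encoding: repeatedly merge the most frequent adjacent pair of symbols in the corpus. The sketch below is a deliberately simplified BPE learner — no byte-level fallback, no special tokens, and a four-word toy corpus invented for illustration, not any production tokeniser:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy corpus of words."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 3))
```

Frequent character sequences fuse into single tokens early; everything rare stays fragmented — which is precisely the asymmetry the next paragraph describes.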

A language model is only as good as the alphabet it learns to think in — and that alphabet is not letters, not words, but tokens.
— INTERRUPT, Issue 3

This seemingly innocuous choice has observable consequences. Languages that tokenise less efficiently — Chinese and Japanese text, for instance, is often split into more tokens per unit of meaning by vocabularies trained on predominantly English corpora — tend to require more tokens to express the same semantic content as English, and models often perform worse on them not because of architectural bias but because of this tokenisation asymmetry. Code presents another revealing case: many tokenisers were not trained on substantial code corpora, leading to strange behaviours where semantically related code tokens are split differently from their natural language counterparts, and mathematical expressions may be tokenised in ways that obscure their structural similarity.

The move from discrete tokens to continuous representations — the embedding layer that converts each token into a high-dimensional vector — is where the semantic magic happens. Each token occupies a point in a space of several thousand dimensions, and the geometry of this space encodes statistical regularities about how tokens relate to each other in the training data. "King" minus "Man" plus "Woman" approximates "Queen" not because the model understands gender, but because the training corpus happened to contain enough instances of those relationships that the vector arithmetic emerged naturally. Whether this constitutes genuine understanding or an extremely sophisticated form of autocomplete remains, after six years of debate, unresolved.
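The arithmetic can be made concrete with hand-built toy embeddings. The four 4-dimensional vectors below are invented for illustration — real embeddings are learned, live in thousands of dimensions, and have no human-readable axes:

```python
import numpy as np

# Toy 4-d "embeddings", hand-built for illustration only. The dimensions
# loosely stand for (royalty, gender, person-ness, misc).
emb = {
    "king":  np.array([0.9,  0.8, 0.7, 0.1]),
    "queen": np.array([0.9, -0.8, 0.7, 0.1]),
    "man":   np.array([0.1,  0.8, 0.9, 0.2]),
    "woman": np.array([0.1, -0.8, 0.9, 0.2]),
}

def nearest(vec, vocab):
    """Return the vocabulary word whose embedding is most cosine-similar."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

target = emb["king"] - emb["man"] + emb["woman"]
candidates = {w: v for w, v in emb.items() if w != "king"}
print(nearest(target, candidates))  # queen
```

In this toy space the arithmetic lands exactly on "queen" because the vectors were built that way; in a learned space it lands merely nearby, which is the whole point of the debate the paragraph above describes.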

What has changed is the ambition with which researchers are now probing these representations. Tools for mechanistic interpretation — methods that trace how information flows through specific attention heads or MLP layers — have matured considerably, and are beginning to produce actionable insights about where models store factual associations, how they perform multi-step reasoning, and why they sometimes generate confident errors. The token is not the meaning. But understanding what the model does with its tokens is becoming a prerequisite for understanding the model at all.

The model absorbed the internet. Now what?

The Multimodal Turn

The canonical language model ingests tokens and emits tokens. This framing served the field well through its first decade, but it always undersold the ambition. Intelligence in biological systems is not modality-specific; perception, reasoning, and language are deeply intertwined from early stages of processing. The multimodal turn in AI — the integration of vision, audio, and other sensory modalities into unified models — represents an acknowledgment that the field had been asking a narrower question than it needed to.

The technical approaches to multimodal integration vary considerably. Early systems used separate encoders for each modality — a vision transformer processing images, a separate language backbone processing text — connected by a learned projection layer that mapped the visual representations into the textual embedding space. This approach worked well enough to produce impressive demos, but carried an architectural assumption that vision was a secondary modality to be translated into language rather than a co-equal representational system. Later models — beginning with GPT-4V and continuing through systems like Gemini and Claude's vision capabilities — trained end-to-end on mixed modality data, allowing the model to develop joint representations that are not decomposable into a vision part and a language part.
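The early projection-layer approach is easy to sketch in shapes alone. The dimensions below (768-d vision features, 4096-d text embeddings, 16 image patches) are hypothetical, and the random matrices stand in for weights that would actually be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a vision encoder emitting 768-d patch features,
# a language backbone whose token embeddings live in 4096 dimensions.
d_vision, d_text, n_patches = 768, 4096, 16

# The learned projection is just a linear map (random here, for shape only).
W_proj = rng.standard_normal((d_vision, d_text)) * 0.02

patch_features = rng.standard_normal((n_patches, d_vision))  # from the vision encoder
visual_tokens = patch_features @ W_proj                      # now in text-embedding space

# The projected patches are prepended to the text token embeddings, and the
# language model processes the combined sequence with ordinary attention.
text_tokens = rng.standard_normal((10, d_text))
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (26, 4096)
```

The architectural assumption the paragraph describes is visible in the code: vision only ever flows one way, through `W_proj`, into a space the language model already owns.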

The implications for capability are significant. Models trained jointly on images and text develop visual reasoning abilities that are not easily extracted from text-only training, even when the text describes the same visual content in detail. They can interpret charts, reason about spatial relationships in diagrams, read handwriting, and — increasingly — perform tasks that require fine-grained visual discrimination that would be laborious or impossible to specify in language alone. The question of whether these abilities constitute genuine visual understanding or a very sophisticated pattern-matching over tokenised image patches remains genuinely contested.

Audio integration has proceeded more slowly, in part because the temporal structure of audio creates different challenges than the relatively static structure of images. Spoken language is not the same problem as text; it includes paralinguistic information — prosody, emphasis, emotional tone — that is not easily captured in transcription. The most capable audio models today can distinguish speaker emotion, identify individual voices, transcribe overlapping speech, and understand the acoustic context of a scene. Video understanding — temporal sequences of both audio and visual information — remains the hardest unsolved problem in multimodal AI, and the one where the gap between impressive demos and reliable deployment is widest.

The deeper significance of the multimodal turn may be philosophical. If intelligence requires a model of the world that is grounded in something other than language — if the meaning of "cat" cannot be fully captured by its textual distribution — then language-only models are structurally limited in ways that more training or more parameters cannot overcome. Multimodal models are, in part, an empirical bet that this concern is overstated. That the statistical regularities in text are rich enough to build something that functions like understanding, even if it does not function in exactly the same way biological minds do. The bet is unresolved. But the models being built to test it are becoming genuinely remarkable.

Model Capability Profile
Comparative capability assessment across eight dimensions, normalised to the leading model in each category. Based on aggregate benchmark data through Q1 2026.

Benchmark Wars: The Numbers That Moved Markets

Every serious AI company now has a preferred benchmark suite, a set of standardised tests that their model is designed to excel at and that they publish results for when the numbers are favourable. This is not a neutral evaluation methodology — it is a communications strategy, and it has become one of the primary mechanisms through which the industry signals capability, attracts investment, and positions itself relative to competitors. Understanding the benchmark wars is understanding the political economy of AI in 2026.

The MMLU benchmark — Massive Multitask Language Understanding, covering 57 subjects from elementary mathematics to professional law — became the field's consensus evaluation standard around 2023, and has maintained that position through successive model generations despite persistent concerns about its suitability as a general intelligence measure. Its appeal is practical: it is easy to run, covers a wide surface area of knowledge, and produces a single percentage score that is easily compared across models. Its weakness is equally well-documented: it is a multiple-choice test, which means it rewards familiarity with the format and the ability to recognise correct answers, not the ability to generate novel solutions or reason through genuinely novel problems.
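The mechanics behind that single percentage score are simple, which is much of the appeal. A common harness pattern is to score each candidate answer's likelihood under the model and pick the argmax; in the sketch below, `log_prob` is a toy stand-in for a real model call (here it just counts word overlap with the prompt so the example runs):

```python
def log_prob(prompt: str, continuation: str) -> float:
    """Placeholder for a real model's log P(continuation | prompt).
    This toy version scores by word overlap with the prompt."""
    return float(sum(w in prompt.lower() for w in continuation.lower().split()))

def answer(prompt: str, choices: list[str]) -> int:
    """Multiple-choice scoring: likelihood of each option, pick the best."""
    scores = [log_prob(prompt, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

q = "Q: What does attention compute? Answer:"
choices = [" a weighted mix of value vectors", " a fixed lookup", " nothing"]
print(answer(q, choices))  # index of the highest-scoring choice
```

Nothing in this loop requires the model to generate anything: recognising the best of four strings is a strictly easier task than producing one, which is exactly the weakness described above.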

When a benchmark becomes a communications tool rather than an evaluation tool, the relationship between the score and actual capability becomes complicated.
— Mara Voss, Editor in Chief

HumanEval, introduced by OpenAI in 2021 to evaluate code generation ability, spawned an entire cottage industry of code-focused benchmarks — LiveCodeBench, BigCodeBench, EvalPlus — each adding layers of contamination controls, additional test cases, or domain-specific variants. The proliferation of code benchmarks is both a sign of the task's importance and a symptom of the evaluation instability that occurs when benchmarks become targets: as models train specifically on problems similar to those in the benchmark, the scores inflate even when general capability is not improving at the same rate. Recent work on "hardened" evaluations, which introduce adversarial variants or novel problem structures, consistently finds substantial gaps between reported benchmark scores and performance on genuinely novel coding problems.
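HumanEval's headline metric, pass@k, has a closed-form unbiased estimator, introduced in the original HumanEval paper: generate n samples per problem, count the c that pass the tests, and compute the probability that at least one of k randomly drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct.
    P(at least one of k draws passes) = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=20, k=1))   # 0.1 -- the per-sample pass rate
print(pass_at_k(n=200, c=20, k=10))
```

Note that pass@1 reduces to the raw pass rate c/n, while larger k rewards breadth of attempts — one reason reported k values deserve as much scrutiny as the scores themselves.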

The financial stakes have made benchmark performance consequential in ways that are visible in the research literature. In 2024 and 2025, it became common for companies to commission independent evaluations of their models from third-party evaluators before publishing results — not because the internal evaluations were unreliable, but because the reputational cost of a significant gap between internal and external scores had become material. Several high-profile benchmark score discrepancies in 2025 prompted renewed discussion about evaluation methodology and the conditions under which benchmark scores can be taken at face value. The consensus that has emerged is uncomfortable: benchmark scores are useful directional signals, poor absolute capability measures, and deeply unreliable as the sole basis for comparing systems.

What benchmarks do reliably measure is resource allocation. The correlation between benchmark performance and venture capital investment in AI infrastructure is not perfect, but it is striking: the companies that consistently top the leaderboards are also the companies that have raised the most money to build the compute infrastructure necessary to train the models that top the leaderboards. Whether this virtuous cycle produces genuinely better AI or better-marketed AI is a question the benchmarks themselves cannot answer.

Benchmark Scores by Model — 2025–2026

What Comes After the Frontier?

The scaling era in AI is not over, but its character has changed. For roughly five years, the dominant research strategy was straightforward: take a transformer architecture, increase the number of parameters, increase the amount of training data, increase the compute budget, and observe that the resulting models were better across nearly every evaluation dimension. This relationship — smoother, more predictable, and more generous than anyone had a right to expect — made scaling a reliable path to capability improvements and gave the field a clarity of direction that now looks, in retrospect, unusually fortunate.

The complications that have emerged are not failures of scaling so much as successors to it. Test-time compute — the idea that a model should be able to spend more processing resources at inference time to produce better answers, rather than requiring all reasoning to happen during training — has become the primary frontier of research for reasoning-focused models. Systems like OpenAI's o-series and DeepSeek's R1 demonstrated that explicitly training models to generate extended reasoning traces, then using reinforcement learning to reward good chains of thought, produced dramatically better performance on problems that required multi-step reasoning. The key insight was that the amount of computation devoted to a problem did not have to be fixed at training time.
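One of the simplest forms of test-time compute is self-consistency: sample several reasoning traces at non-zero temperature and majority-vote over their final answers. In the toy sketch below, `sample_answer` simulates a model that is right on 7 of every 10 samples; a real system would call the model and extract the answer from each trace:

```python
from collections import Counter

def sample_answer(problem: str, i: int) -> str:
    """Stand-in for one sampled reasoning trace. This toy version is
    deterministic and correct on 7 of every 10 samples."""
    return "42" if i % 10 < 7 else str(i % 10)

def self_consistency(problem: str, n_samples: int) -> str:
    """Spend more compute at inference: sample n traces, majority-vote."""
    answers = [sample_answer(problem, i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?", n_samples=25))  # 42
```

The appeal is that `n_samples` is a dial turned at inference time, per problem — exactly the decoupling of compute from training that the paragraph describes. The cost is that inference compute scales linearly with the dial.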

The next leap may not require a new architecture. It may require rethinking what computation is for.
— INTERRUPT, Vol. IV, No. 2

Mixture-of-experts architectures represent a different response to the same underlying constraint: that making a model larger makes it slower and more expensive to run, even when most of its parameters are inactive for any given input. An MoE model maintains a large bank of specialist sub-networks ("experts") and uses a routing mechanism to activate only a small fraction for each input. The result is a model whose capability — measured in terms of what it can do — scales with the total parameter count, while its inference cost scales with the active parameter count. This architectural choice has become near-universal among the largest training runs, and is beginning to propagate into smaller models where the trade-off is less extreme but still favourable.
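The routing mechanism is a small piece of machinery. Below is a single-token NumPy sketch with hypothetical sizes (8 experts, top-2 routing, experts reduced to bare weight matrices); real MoE layers add load-balancing losses and batched expert dispatch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a small feed-forward weight matrix; the router is linear.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
W_router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route one token to its top-k experts; only those experts run."""
    logits = x @ W_router                # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]    # indices of the k best-scoring experts
    gate = np.exp(logits[top])
    gate /= gate.sum()                   # softmax over the selected experts only
    # Active compute: top_k expert matmuls instead of n_experts.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # (16,)
```

The bank of `n_experts` matrices is the total parameter count; the `top_k` that actually run are the active parameter count — the gap between the two is the whole trade the architecture makes.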

Whether these engineering refinements are sufficient to sustain the rate of improvement that the field has become accustomed to is the defining question of the current moment. The optimists point to the continued absence of any clear ceiling — every time a new architecture or training strategy has been proposed as a potential bottleneck, subsequent work has found ways around it. The pessimists point to the growing evidence that the easiest gains from scaling have been harvested, and that the remaining capability gaps — robust physical reasoning, reliable long-horizon planning, genuine generalisation from limited examples — are precisely those that are least amenable to the brute-force approaches that have driven progress so far.

The honest answer is that the field has arrived at a region of capability space where its predictive track record is, by necessity, less reliable than it was during the scaling era. Nobody has a clear model of what the ceiling looks like or how to reach it efficiently. What is clear is that the problems that remain are not the same problems that scaling has solved, and that solving them will require ideas whose content we do not yet know. That is, in the assessment of many researchers, the most interesting situation a field can be in.

Masthead
Mara Voss
Editor in Chief
Former senior editor at MIT Technology Review. Covers AI research and the ethics of large-scale automated systems. Based in Berlin.
Theo Blackwood
Senior Correspondent
Ten years covering Silicon Valley for the Financial Times. Specialises in AI policy, compute economics, and the semiconductor supply chain.
Selin Çelik
Research Lead
PhD in computational linguistics from ETH Zurich. Writes about transformer architectures, representation learning, and interpretability research.
Dae-Jung Oh
Technical Writer
Previously at Anthropic and Google Brain. Covers multimodal models, vision-language systems, and the intersection of perception and language.
Priya Anand
Data Visualisation
Creates the charts, diagrams, and interactive data visualisations that appear throughout INTERRUPT. Former data journalist at Reuters.
Felix Drury
Copy & Fact
Responsible for editorial accuracy and style consistency across all published work. Background in academic philosophy and science communication.