Codex Spark Is OpenAI's Bet That Speed Matters More Than Smarts
OpenAI just dropped a coding model that runs at 1,000 tokens per second on Cerebras hardware — not Nvidia. It's smaller, faster, and the first real crack in the GPU monopoly.
Yesterday, OpenAI released GPT-5.3-Codex-Spark. The name is a mouthful, but the pitch is simple: it’s a coding model that’s fast enough to feel like autocomplete.
We’re talking 1,000+ tokens per second. Fifteen times faster than the standard Codex model. Fifty percent less time waiting for the first token to appear. If you’ve ever used an AI coding tool and felt that awkward pause — the one that breaks your concentration while the model thinks — Spark is designed to kill it.
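To see why those numbers matter, here's a back-of-the-envelope sketch of total wait time for a medium-sized reply. The token counts and the 1-second baseline time-to-first-token are illustrative assumptions; the speeds are derived from the article's "15x faster" and "50% less" claims.

```python
# Back-of-the-envelope: how generation speed changes perceived latency.
# All figures are illustrative, derived from the "15x faster, 50% lower
# time-to-first-token" framing -- not measured numbers.

def response_time(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    """Total wait = time to first token + time to stream the rest."""
    return ttft_s + tokens / tokens_per_s

reply_tokens = 500  # a typical medium-sized code edit (assumption)

# Standard model: assume ~1s TTFT and 1000/15 ~= 67 tok/s.
standard = response_time(ttft_s=1.0, tokens=reply_tokens, tokens_per_s=1000 / 15)
# Spark: half the TTFT, 1000+ tok/s.
spark = response_time(ttft_s=0.5, tokens=reply_tokens, tokens_per_s=1000)

print(f"standard: ~{standard:.1f}s, spark: ~{spark:.1f}s")
# -> standard: ~8.5s, spark: ~1.0s
```

Under these assumptions the whole reply arrives in about a second instead of eight and a half, which is the difference between "waiting" and "reading".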
But the speed isn’t the real story. The real story is what’s running it.
Not Nvidia
Codex Spark is OpenAI’s first production model that doesn’t run on Nvidia hardware.
Let that register for a second. OpenAI — the company most responsible for turning Nvidia into a $3 trillion juggernaut — just shipped a product on a competitor’s silicon. Specifically, Cerebras Systems and their Wafer-Scale Engine 3, a chip that looks like something out of science fiction: a single processor the size of a dinner plate, packed with hundreds of thousands of AI cores and enough on-chip memory to hold the entire model without shuffling data back and forth.
That last part matters. GPU-based decoding is bottlenecked by memory bandwidth: for every new token, the model’s weights have to be streamed from off-chip memory into the compute units, over and over. Cerebras sidesteps that by holding the weights in on-chip memory spread across one massive chip, so there’s no off-chip round trip. The result: tokens come out fast enough that coding with it feels qualitatively different from every other AI tool on the market.
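A standard rule of thumb makes the bottleneck concrete: in single-stream autoregressive decoding, each new token requires reading all model weights once, so throughput is roughly bandwidth divided by model size. The hardware and model figures below are illustrative assumptions for the sketch, not published specs.

```python
# Rule-of-thumb for autoregressive decode: each new token reads all model
# weights once, so single-stream throughput is approximately
#     tokens/sec ~= usable memory bandwidth / model size in bytes.
# The figures below are illustrative assumptions, not vendor specs.

def decode_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 60.0        # e.g. a ~30B-parameter model at 16-bit (assumption)
hbm_gb_s = 3_350.0     # off-chip HBM bandwidth, H100-class GPU (approx.)
sram_gb_s = 200_000.0  # on-chip bandwidth, wafer-scale class (illustrative)

print(f"off-chip bound: ~{decode_tokens_per_s(hbm_gb_s, model_gb):.0f} tok/s")
print(f"on-chip bound:  ~{decode_tokens_per_s(sram_gb_s, model_gb):.0f} tok/s")
```

Whatever the exact numbers, the shape of the result is the point: keeping weights on-chip raises the bandwidth ceiling by orders of magnitude, which is why 1,000+ tokens per second is plausible on this architecture.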
OpenAI signed a $10 billion deal with Cerebras in January. They’ve also cut a six-gigawatt deal with AMD and a co-development agreement with Broadcom. The message to Nvidia is clear: we’re diversifying, and your margin is our opportunity.
What It Actually Does
Strip away the hardware drama and Codex Spark is a distilled version of GPT-5.3-Codex — the full-size model that launched last week and currently sits at or near the top of most coding benchmarks.
Spark trades some of that raw capability for speed. On SWE-Bench Pro, it scores 56.8% — solid but well behind the full Codex model. For rough context, Claude Opus 4.6 scores 80.8% on the separate SWE-bench Verified benchmark, so those two numbers aren’t directly comparable. On Terminal-Bench 2.0, Spark hits 77.3%. It beats GPT-5.1-Codex-mini across the board while completing tasks in a fraction of the time.
The context window is 128K tokens — generous, but roughly an eighth of Opus 4.6’s 1M token window. For most real-time coding tasks, 128K is plenty. For the kind of sprawling multi-file refactoring where you need the model to hold an entire codebase in its head, you’d want the full Codex or Claude.
Where Spark shines is the stuff you do fifty times an hour: editing a function, tweaking CSS, fixing a type error, asking “what does this do,” revising a plan, trying a different approach. The tasks where waiting three seconds for a response genuinely hurts your flow, because you were going to type something else but now you’ve lost the thread.
OpenAI’s framing is explicit about this. They’re positioning Spark for “conversational coding, not slow batch-style agents.” The full Codex model handles the heavy lifting — background agents that run for minutes, crunching through large refactors or multi-step engineering tasks. Spark handles the back-and-forth. It’s the difference between sending an email and having a conversation.
The Two-Tier Future
This dual approach — a big slow model for deep work, a small fast model for real-time interaction — is probably where the entire industry lands.
Think about how developers actually work with AI coding tools today. Sometimes you need Claude or Codex to spend five minutes thinking through a complex architecture change. But most of the time, you need it to quickly edit a file, answer a question, or suggest a fix. Those are fundamentally different workloads, and it makes sense to run them on fundamentally different hardware.
Anthropic seems to be heading the same direction. Claude Opus 4.6 is the deep thinker; Haiku is the fast responder. Google has a similar tiered approach with Gemini. OpenAI is just the first to make the hardware story explicit — purpose-built silicon for purpose-built models.
The implication for developers is that the tools are going to get noticeably more responsive in the next six months. Not because the models got smarter, but because they got faster. And in an interactive coding context, speed and intelligence aren’t always distinguishable. A model that gives you a good-enough answer in 200 milliseconds often beats a model that gives you a perfect answer in 8 seconds.
The Nvidia Question
The business story here is arguably bigger than the product story.
Nvidia has had a near-monopoly on AI training and inference hardware. Every major AI company runs on their GPUs, and the demand is so intense that H100s and B200s are perpetually backordered. Nvidia’s data center revenue hit $115 billion last year. Their margins are the envy of the semiconductor industry.
OpenAI just demonstrated, publicly, that you can build a competitive production AI product on non-Nvidia hardware. And not just competitive — faster on the specific workload it targets.
This doesn’t mean Nvidia is in trouble. GPUs are still the default for training, for throughput-optimized inference, and for the vast majority of AI workloads. OpenAI itself said GPUs “remain our priority for cost-sensitive and throughput-first use cases across research and inference.”
But it opens a crack. If Cerebras can win the latency-sensitive inference workload, and AMD can win some of the cost-sensitive training workload, and Broadcom can win custom accelerator designs — suddenly Nvidia’s position looks less like a monopoly and more like a market share leader in a fragmenting market. That’s a very different investment thesis.
Who Should Care
If you’re a ChatGPT Pro subscriber ($200/month): You can try Spark right now in the Codex app, CLI, and VS Code extension. It’s a research preview, so expect rough edges.
If you’re deciding between AI coding tools: Spark doesn’t change the fundamental comparison. For deep coding work — complex bugs, multi-file refactors, architectural decisions — Claude Opus 4.6 and the full GPT-5.3-Codex are still the heavyweights. Spark’s advantage is narrower: real-time interaction speed. If you spend most of your time in rapid edit-test cycles and the latency of current tools bothers you, Spark is worth trying.
If you care about the AI hardware market: This is the most interesting development in months. A top-3 AI company shipping a production model on non-Nvidia silicon validates an entire category of specialized inference hardware. Watch for other companies to follow.
The Takeaway
Codex Spark isn’t the smartest coding model available. It doesn’t pretend to be. It’s a deliberate trade: less raw capability for dramatically more speed, running on purpose-built hardware that can deliver tokens faster than you can read them.
That trade-off reveals something about where AI coding tools are heading. The era of “one model does everything” is ending. The future is a stack: fast models for real-time work, powerful models for deep work, specialized hardware matched to each. OpenAI is building both layers. So is Anthropic. The question isn’t which model wins — it’s which combination of models and hardware gives developers the best experience across every type of coding task they encounter in a day.
And quietly, beneath all of that, a chip company in Sunnyvale just proved that there’s life beyond Nvidia. That might end up being the most consequential part of this entire announcement.
Sources:
- OpenAI — Introducing GPT-5.3-Codex-Spark
- Cerebras — OpenAI GPT-5.3-Codex-Spark Powered by Cerebras
- VentureBeat — OpenAI deploys Cerebras chips
- TechRepublic — OpenAI Taps Cerebras
- Tom’s Hardware — First production deployment away from Nvidia
- WebProNews — Codex Spark Gambit Analysis
- NxCode — Codex Spark Guide