by VibecodedThis

GitHub Copilot's New 'Rubber Duck' Feature Uses GPT to Review Claude's Work

GitHub added an experimental feature to Copilot CLI called Rubber Duck that routes a Claude coding session through a GPT-5.4 reviewer at key moments. In GitHub's tests, it closed 74.7% of the performance gap between Claude Sonnet and Opus on SWE-Bench Pro.

GitHub added an experimental feature to Copilot CLI this month called Rubber Duck. The concept is straightforward: when a Claude model is handling your coding session, Rubber Duck brings in a GPT-5.4 model as an independent reviewer at specific moments during execution. The two models come from different AI families with different architectures, which means they fail in different ways.

In GitHub’s own testing, Claude Sonnet 4.6 plus Rubber Duck closed 74.7% of the performance gap between Sonnet and Opus on SWE-Bench Pro, a benchmark for real-world coding tasks.

How It Works

The name comes from rubber duck debugging, the practice of explaining a problem out loud (to a rubber duck, a colleague, or anyone who’ll listen) as a way to catch flaws in your own reasoning. The idea is that explaining forces you to be explicit about assumptions you’d otherwise skip.

Rubber Duck applies that to agent execution. Claude handles the coding session. At three specific points, GPT-5.4 reviews what Claude has done or is about to do:

  1. After plan drafting, before any code changes begin
  2. After complex implementations, before moving to the next step
  3. Before test execution

You can also trigger a manual review at any point during the session.
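The checkpoint pattern described above can be sketched in a few lines. This is an illustrative sketch only, not GitHub's implementation: the `Review` dataclass and `independent_review` function are hypothetical stand-ins for a call out to a second-family reviewer model.

```python
# Sketch of the checkpoint-review pattern: an orchestrating agent pauses at
# three fixed points and asks an independent reviewer to approve or reject.
# All names here are hypothetical, not part of Copilot CLI's actual API.
from dataclasses import dataclass, field

@dataclass
class Review:
    approved: bool
    notes: list = field(default_factory=list)

def independent_review(artifact: str, stage: str) -> Review:
    # Stand-in for a call to a reviewer from a different model family.
    # Here we only flag an empty artifact, to keep the sketch runnable.
    if not artifact.strip():
        return Review(False, [f"{stage}: nothing to review"])
    return Review(True)

def run_session(plan: str, implement, run_tests):
    # Checkpoint 1: after plan drafting, before any code changes.
    if not independent_review(plan, "plan").approved:
        return "revise plan"
    code = implement(plan)
    # Checkpoint 2: after implementation, before moving to the next step.
    if not independent_review(code, "implementation").approved:
        return "revise implementation"
    # Checkpoint 3: before test execution.
    if not independent_review(code, "pre-test").approved:
        return "revise before tests"
    return run_tests(code)
```

The point of the structure is that the reviewer sits outside the loop that produced the work, so a rejection at any checkpoint forces a revision before the session proceeds.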

The reviewer looks for things the orchestrating model tends to miss: architectural problems; subtle bugs such as off-by-one errors or silently overwritten dictionary keys; and cross-file conflicts, like a Redis dependency broken by a change in another file. These are the kinds of errors that look locally correct but fail at integration time.
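The overwritten-key case is a good miniature of "locally correct, fails at integration." A hypothetical example (the route names are invented for illustration): two files each register a handler under the same key, and a dict merge silently drops one of them.

```python
# Each registration is correct in isolation; the conflict only exists
# once the two dicts are merged.
routes_a = {"/health": "module_a.health"}   # defined in one file
routes_b = {"/health": "module_b.health"}   # defined in another file

# Python's dict merge keeps the last value for a duplicate key, with no
# warning, so module_a's handler vanishes at integration time.
routes = {**routes_a, **routes_b}
print(len(routes))          # 1 route, not 2
print(routes["/health"])    # module_b.health
```

Neither file's author sees anything wrong in their own diff, which is exactly why a reviewer looking across files catches it when a self-review often won't.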

What the Tests Found

GitHub’s benchmark numbers are worth unpacking. SWE-Bench Pro scores agents on their ability to close real GitHub issues, the kind that require understanding multi-file codebases, writing tests, and not breaking existing behavior. The gap between Sonnet and Opus represents a meaningful capability difference in practice.

Closing 74.7% of that gap using a secondary reviewer is not a small improvement. It suggests the bottleneck isn’t the underlying model so much as how the agent checks its own work. Self-reflection, where a model reviews its own output, has limited value because the same biases that produced the error are also reviewing it. A second model from a different family doesn’t share those biases.

Access and Availability

Rubber Duck is in experimental mode. To enable it, type /experimental in a Copilot CLI session. It only works if you have a Claude model selected as your primary and have access to GPT-5.4, which GitHub uses as the reviewer.

The feature is currently US-only and requires a Copilot plan that includes access to the relevant models. GitHub hasn’t said when or whether it will move beyond experimental status.


Sources: htek.dev Copilot CLI weekly, Neowin, MEXC News
