Grok Build Ships Autonomous Execution: xAI Agent Now Plans, Runs, and Verifies

Tech Times

The /goal mode uses a two-model pipeline and self-checks before any task is done.

xAI shipped a new mode called /goal inside Grok Build, its terminal-based autonomous coding agent, on June 22, giving developers a way to hand off a complex implementation task and step back entirely -- no prompting, no manual verification, no back-and-forth -- while the agent plans, executes, and confirms its own work from start to finish. The launch marks the most consequential architectural addition to Grok Build since its May debut, and it draws a sharper competitive line in a market where Claude Code and OpenAI's Codex CLI have held the initiative for more than a year.

The core change is behavioral. Standard Grok Build sessions still follow the familiar coding-agent rhythm: the developer prompts, the agent acts, the developer reviews. /goal removes the developer from that loop for the duration of the task. The agent accepts a single natural-language objective, builds a visible progress checklist, executes each item in sequence, and continues running until it can verify that the work is actually done -- not just written.

What Makes the Verify Step Different

That verification pass is the specific addition that distinguishes /goal from a longer-running standard session. According to xAI's official announcement, the agent can verify its own output through three mechanisms: reviewing the code it produced, inspecting web pages to confirm runtime behavior, or executing scripts to test the result directly. The test runs before the task is marked complete, not after a developer checks it.

This addresses a well-documented failure mode in first-generation coding agents: an agent edits files, reports success, and moves on -- but the change does not work. The code compiles without error yet breaks the application in ways that only a running test would catch. /goal's architecture pushes that test inside the agent's own execution loop, where it can trigger additional fixes before surfacing results to the developer.
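The loop described above can be sketched in a few lines of Python. This is an illustration of the general plan, run, and verify pattern, not xAI's implementation: the `Step` class, the `run_goal` function, and the `max_attempts` limit are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist item (hypothetical structure, for illustration only)."""
    description: str
    run: Callable[[], None]     # performs the work, e.g. edits files
    verify: Callable[[], bool]  # returns True only if the work checks out
    done: bool = False


def run_goal(steps: List[Step], max_attempts: int = 3) -> bool:
    """Execute each item in order, retrying until its check passes."""
    for step in steps:
        for _ in range(max_attempts):
            step.run()
            if step.verify():
                step.done = True
                break
        if not step.done:
            return False  # surface the failure instead of reporting success
    return True


# Usage: an edit that only passes its check on the second attempt.
state = {"attempts": 0}


def apply_fix() -> None:
    state["attempts"] += 1


def check_fix() -> bool:
    return state["attempts"] >= 2


checklist = [Step("apply fix", apply_fix, check_fix)]
finished = run_goal(checklist)
```

The point of the sketch is the ordering: an item is marked done only after its check passes, so a change that was written but does not work triggers another attempt rather than a success report.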

Developers retain control during a run. They can send additional instructions to the agent while it works, open the live progress panel, and pause, resume, or cancel execution with dedicated commands. When the objective is satisfied, the panel flips to a "Complete" state with every checklist item marked done.

The Two-Model Pipeline Behind /goal

/goal does not run on a single model. According to xAI, the mode natively uses both Composer 2.5 and Grok Build 0.1 to handle different stages of the pipeline -- separating the cognitive work of planning and instruction-following from the lower-level work of code generation and execution.

Composer 2.5, added to the Grok Build CLI on June 1, is designed specifically for long-running tasks and complex instruction sequences. Grok Build 0.1 is xAI's purpose-built agentic coding model -- trained from scratch on programming content and real-world pull requests, running at over 100 tokens per second, and carrying a 256,000-token context window suited for holding large codebases in memory across multi-step tasks.

The architectural question this multi-model approach raises -- but that xAI has not yet answered publicly -- is whether the model handling verification is genuinely independent from the model handling generation. AI agent architecture research consistently identifies this as the deciding factor in whether a critic pass produces meaningful quality improvement: a verifier trained similarly to the generator tends to share the same blind spots, producing shallow self-agreement rather than genuine evaluation. Whether Composer 2.5 and Grok Build 0.1 are independent enough to catch each other's failure modes is something developers testing the feature in production will determine.
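A toy model makes the shared-blind-spot concern concrete. In the Python sketch below, each model is reduced to the set of failure types it tends to miss, and a failure escapes review only when both the generator and the verifier miss it. The function name and the failure categories are invented for illustration and say nothing about how Composer 2.5 or Grok Build 0.1 actually behave.

```python
from typing import Set


def escaped_failures(generator_blind: Set[str], verifier_blind: Set[str]) -> Set[str]:
    """Failure types that slip through: missed by the generator AND the verifier."""
    return generator_blind & verifier_blind


# Hypothetical failure categories, chosen only for the example.
generator = {"race condition", "timezone bug", "unescaped input"}
similar_verifier = {"race condition", "timezone bug", "flaky network"}
independent_verifier = {"flaky network", "locale formatting"}

missed_by_similar = escaped_failures(generator, similar_verifier)
missed_by_independent = escaped_failures(generator, independent_verifier)
```

Under these assumptions the similarly trained verifier lets two of the generator's three weak spots through, while the independent one lets none through, even though it has blind spots of its own.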

Where /goal Fits in the Competitive Landscape

Grok Build entered the autonomous coding agent market in May -- roughly a year behind Claude Code and OpenAI's Codex CLI, both of which have accumulated substantial developer adoption and track records in production engineering workflows. Google's Gemini Code Assist Enterprise is a fourth serious contender targeting enterprise teams.

The benchmark picture is unflattering for xAI's current model. Grok Build's underlying coder scored 70.8% on SWE-Bench Verified -- the industry-standard coding benchmark -- a figure drawn from the earlier grok-code-fast-1 model. Claude Code on Opus 4.7 scores 87.6% on the same benchmark; Codex CLI posts comparably strong numbers. xAI has not yet published a SWE-Bench Verified score for the production grok-build-0.1 model. Elon Musk acknowledged the gap publicly in April, saying it would take until June "to match and maybe exceed" Claude's coding performance.

The architectural argument xAI is making with /goal is not that its underlying model has closed that benchmark gap -- it has not demonstrably done so yet -- but that long-horizon autonomous execution changes which metric matters. When an agent runs until it verifies its own output, the score that counts is not "best single attempt" but "does the final deliverable work." That argument is coherent. It is also unproven at production scale, and the developer community will stress-test it in the days ahead.

Mitch Ashley, VP and practice lead for software lifecycle engineering at The Futurum Group, summarized the market dynamic: "Coding agents are becoming the procurement front where AI labs compete to own the developer workflow. Multi-agent parallelism with built-in evaluation, paired with local-first execution, reflects vendors racing to differentiate on orchestration architecture and execution environment guarantees."

One architectural differentiator Grok Build offers that the launch coverage often underemphasizes: all code runs on the developer's machine. Nothing in the codebase is transmitted to xAI's servers during a working session. For engineering teams in regulated industries -- financial services, healthcare, government -- that local-first design is a concrete security argument that cloud-based coding tools cannot match.

Arena Mode, Grok Build's planned feature for running multiple agents in parallel against the same problem and auto-selecting the best output, was confirmed in code traces in February and included in the public launch announcement -- but it is not yet live in the current beta. If it ships as described, it would make the 17-point benchmark gap less decisive in practice, since selecting the best of multiple independent outputs is more forgiving of per-attempt model weaknesses than committing to a single result.
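The arithmetic behind that claim can be sketched under strong simplifying assumptions: each attempt succeeds independently with probability p, and the selector always picks a working output when one exists. Neither assumption is confirmed for Arena Mode. Attempts at the same problem are usually correlated, and treating a SWE-Bench Verified score as a per-attempt success rate is itself a simplification, so the result is best read as an upper bound.

```python
def at_least_one_success(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


# Benchmark figures from the article, used here as stand-in success rates.
single_attempt_strong = at_least_one_success(0.876, 1)
three_attempts_weaker = at_least_one_success(0.708, 3)

print(f"one attempt at 87.6%:    {single_attempt_strong:.3f}")
print(f"three attempts at 70.8%: {three_attempts_weaker:.3f}")
```

With three idealized attempts, the weaker per-attempt rate yields roughly a 97.5% chance that at least one output works, which is why best-of-N selection is more forgiving of per-attempt weakness than committing to a single result.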

How to Access /goal

/goal is available now inside Grok Build. The CLI installs with a single command and requires authentication with a SuperGrok or X Premium Plus subscription. Access tiers include SuperGrok Heavy at $300 per month, SuperGrok at $30 per month, and X Premium Plus at $40 per month -- xAI expanded access beyond the initial SuperGrok Heavy-only requirement on May 25. At $30 per month on the base SuperGrok tier, Grok Build is the lowest-priced full-featured terminal coding agent in the category.

xAI -- now operating as a division of SpaceX following the February 2026 acquisition -- continues to iterate on Grok Build at a rapid pace. The three months from Grok Build's initial beta to the /goal launch have included the addition of Composer 2.5, the Grok Build Plugin Marketplace, and now autonomous long-running execution. Whether the next iteration closes the benchmark gap against established tools is the question the developer community is now positioned to test.

Frequently Asked Questions

What does /goal actually do differently from a standard Grok Build session?

A standard Grok Build session still follows a turn-by-turn pattern: the developer prompts, the agent executes, and the developer reviews before the next step. /goal removes the developer from that loop for the duration of a bounded task. The agent builds its own checklist, executes each item, and runs a verification pass -- checking the code it wrote, inspecting pages for runtime behavior, or running test scripts -- before marking the task complete. The developer can send additional instructions mid-run or pause and resume execution, but is not required to approve each step.

How does the two-model pipeline work, and does it actually improve output quality?

/goal uses Composer 2.5 for planning and complex instruction-following and Grok Build 0.1 for code generation and execution. xAI says the separation is intended to bring higher intelligence to each stage of the pipeline. Whether this constitutes genuine independent verification depends on how differently the two models were trained -- agent architecture research recommends using a verifier model that is independent enough from the generator to catch its blind spots rather than mirror them. xAI has not yet published technical details establishing that independence. Developer testing in production will be the first real signal.

How does Grok Build's /goal compare to Claude Code and Codex CLI on long-running tasks?

Claude Code and Codex CLI both support multi-step task execution, but neither ships a named autonomous mode that self-verifies before marking tasks complete in the same explicit way /goal does. Claude Code runs on a single deep-reasoning agent with a 1-million-token context window and scores 87.6% on SWE-Bench Verified. Grok Build's underlying coder has a 256,000-token context window and scored 70.8% on SWE-Bench Verified on an earlier model; a score for the current production model has not been published. The honest comparison is: Claude Code offers greater proven depth; Grok Build is betting that parallelism and built-in verification change which metric matters in real development workflows.

What subscription is required, and does /goal run on my local machine?

/goal requires a SuperGrok or X Premium Plus subscription. SuperGrok is $30 per month; X Premium Plus is $40 per month; SuperGrok Heavy is $300 per month with higher usage allocations. All code execution in Grok Build happens on the developer's local machine -- nothing in the codebase is transmitted to xAI's servers during a session.

Originally published by Tech Times
