
Microsoft is using Anthropic's Claude to grade OpenAI's GPT homework inside Copilot. On March 30, the company announced Copilot Cowork through its Frontier early access program, alongside a new Critique feature that pits the two AI models against each other to improve research quality.
Two parallel developments underpin the launch: Copilot Cowork, which delegates long-running, multi-step tasks using Anthropic's Claude, and a Critique feature where Claude reviews GPT-generated research before it reaches users. According to Microsoft's January 2026 earnings disclosure, only 15 million paid Copilot seats exist across 450 million commercial Microsoft 365 users, a 3.3% adoption rate that underscores the pressure to demonstrate tangible value from AI tools.
Inside Copilot's Researcher agent, the new Critique feature separates generation from evaluation. GPT drafts responses to research queries, and Claude then reviews them for accuracy, completeness, and citation quality before delivery. Rather than relying on a single model to both produce and assess its own output, Microsoft applies the same principle as academic peer review: an independent second opinion from a fundamentally different system.
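The pattern Microsoft describes, one model drafting and a second independently reviewing before delivery, can be sketched in a few lines. This is a minimal illustration, not Microsoft's implementation: `call_drafter` and `call_critic` are hypothetical stand-ins for the GPT and Claude calls, and the scoring dimensions and threshold are assumptions drawn from the article's description.

```python
# Sketch of a draft-then-critique pipeline. `call_drafter` and
# `call_critic` are hypothetical placeholders for two independent
# model backends; they are not Microsoft's API.

def call_drafter(query: str) -> str:
    # Placeholder for the drafting model (the article names GPT).
    return f"Draft answer to: {query}"

def call_critic(query: str, draft: str) -> dict:
    # Placeholder for the reviewing model (the article names Claude),
    # scoring the dimensions the article lists plus revision notes.
    return {
        "accuracy": 0.9,
        "completeness": 0.8,
        "citation_quality": 0.7,
        "revision_notes": "Add sources for the second claim.",
    }

def research_with_critique(query: str, threshold: float = 0.75) -> str:
    draft = call_drafter(query)
    review = call_critic(query, draft)
    # Deliver the draft only if every reviewed dimension clears the
    # bar; otherwise attach the critic's notes for a revision pass.
    scores = [v for k, v in review.items() if k != "revision_notes"]
    if min(scores) >= threshold:
        return draft
    return draft + "\n[Needs revision: " + review["revision_notes"] + "]"
```

The structural point is that the critic never sees its own draft: the two roles are filled by different systems, so an error pattern systematic to one model is not automatically invisible to its reviewer.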
Critique will become the default experience in Researcher when users select Auto in the model picker, embedding multi-model review into the standard workflow. Copilot Cowork itself operates as an orchestrator for long-running workflows within Microsoft 365. Users can initiate multiple tasks simultaneously and manage them through a new dashboard, handling everything from monthly budget reviews to calendar management and meeting preparation.
Microsoft has described Cowork as Claude Code for knowledge workers, running in a sandboxed cloud environment that keeps enterprise data within Microsoft's security boundaries. Unlike a traditional chatbot interaction, Cowork can execute tasks that unfold over hours or days, checking back with users at key decision points rather than requiring constant input.
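The hours-or-days execution model with periodic check-ins can be sketched with a generator that pauses at each decision point. The task steps, prompts, and checkpoint mechanism here are illustrative assumptions, not Cowork's actual design.

```python
# Sketch of a long-running agent that pauses at decision points rather
# than demanding constant input. Each `yield` is a checkpoint: the
# agent surfaces a question and resumes with the user's answer. The
# budget-review steps are invented for illustration.

def budget_review_task():
    data = "collected spend data"               # autonomous step
    choice = yield "Flag only over-budget categories, or all?"
    report = f"Report ({choice}) from {data}"   # uses the decision
    approve = yield f"Draft ready: {report}. Send to finance?"
    return "sent" if approve == "yes" else "saved as draft"

def run_with_decisions(task, decisions):
    # Drive the task, answering each checkpoint from queued decisions.
    gen = task()
    prompt = next(gen)
    try:
        for decision in decisions:
            prompt = gen.send(decision)
    except StopIteration as done:
        return done.value
```

In a real deployment the loop would block for hours awaiting a human reply at each checkpoint; the queued-decisions driver simply makes the control flow visible.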
Nicole Herskowitz, Corporate Vice President for Microsoft 365, noted that having multiple AI vendors in Copilot is only the starting point. Making the models collaborate rather than simply offering users a choice between them, she said, represents the real differentiator for the platform.
In contrast, a single-model approach leaves evaluation to the same system that produced the output. Separating the roles of drafter and critic creates a structural check that catches errors one model might consistently miss, positioning multi-model review as a quality floor rather than an optional enhancement.
According to Microsoft, the multi-model approach delivers measurable gains on research quality. Researcher with Critique turned on scores 57.4 on the DRACO benchmark, an industry standard for deep research quality based on 100 complex tasks across 10 domains.
Microsoft's internal testing places that score above Claude Opus 4.6 at 42.7 and Perplexity Deep Research at 50.4. "It is this multi-model advantage that makes Copilot different," Charles Lamanna, President of Business Applications and Agents at Microsoft, said at Cowork's initial announcement.
According to Microsoft's blog post, the Critique approach yields a 13.8% improvement over the previous single-model configuration. The largest gains fall in breadth and depth of analysis, followed by presentation quality and factual accuracy.
However, no independent third party has verified these results. DRACO evaluations were scored using GPT-5.2 as an automated judge model across five independent runs per question, raising questions about whether an OpenAI-built evaluator judging a system built partly on OpenAI technology introduces systematic bias.
Microsoft also reported statistically meaningful improvements in eight of ten DRACO domains, with paired t-tests yielding p-values below 0.0001. Until independent researchers replicate these results using neutral evaluation models, however, the benchmark numbers remain a marketing claim rather than an industry-validated finding.
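The scoring protocol described above, an automated judge rating each task across five independent runs, amounts to averaging a noisy evaluator. A minimal sketch, with `judge_score` as a hypothetical stand-in for the judge-model call and the simulated noise purely illustrative:

```python
# Sketch of judge-based scoring averaged over independent runs.
# `judge_score` is a hypothetical placeholder for an LLM-judge call
# (the article names GPT-5.2); deterministic noise stands in for
# run-to-run variation in the judge's ratings.
import random
import statistics

def judge_score(task: str, answer: str, seed: int) -> float:
    # Placeholder: a real call would ask the judge model for a rating.
    rng = random.Random(seed)
    return 50.0 + rng.uniform(-5.0, 5.0)

def score_task(task: str, answer: str, runs: int = 5) -> float:
    # Average over independent judge runs to damp evaluator variance.
    return statistics.mean(
        judge_score(task, answer, seed) for seed in range(runs)
    )
```

Averaging reduces run-to-run variance in the judge's ratings, but it cannot remove a systematic bias in the judge itself, which is exactly the concern raised about an OpenAI-built evaluator scoring a partly OpenAI-built system.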
Separately, the company is rolling out a Model Council feature that runs Anthropic and OpenAI models simultaneously, producing standalone reports with an automated judge evaluating where the responses agree and diverge. Council gives users direct control over model comparison, letting them see how different systems approach the same research query before choosing which output to use.
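The council pattern, running several models on the same query and having a judge summarize agreement and divergence, can be sketched as below. All three model functions are hypothetical placeholders; the real feature's interfaces are not public.

```python
# Sketch of a "council" pattern: run member models on the same query
# concurrently, then have a judge compare the outputs. The model
# functions are invented placeholders, not Microsoft's API.
from concurrent.futures import ThreadPoolExecutor

def model_a(query: str) -> str:
    return f"A's report on {query}"   # placeholder, e.g. a Claude call

def model_b(query: str) -> str:
    return f"B's report on {query}"   # placeholder, e.g. a GPT call

def judge(query: str, reports: list) -> str:
    # Placeholder: a real judge model would compare claims; here we
    # only note whether the reports are identical.
    verdict = "agree" if len(set(reports)) == 1 else "diverge"
    return f"Reports {verdict} on: {query}"

def council(query: str) -> dict:
    # Run the member models concurrently, then judge the outputs.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(m, query) for m in (model_a, model_b)]
        reports = [f.result() for f in futures]
    return {"reports": reports, "judgment": judge(query, reports)}
```

The design difference from Critique is who decides: Critique folds the second model's review into one delivered answer, while a council surfaces both standalone reports and leaves the choice to the user.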
Only 3.3% of Microsoft's commercial user base pays for Copilot, and the company needs features compelling enough to justify the $99 per user per month E7 AI subscription tier. Copilot Cowork is currently available as an opt-in experimental feature through the Frontier program before broader rollout.
Previously limited to a small group of users in Research Preview, Cowork has now expanded to the wider Frontier audience. Microsoft 365 customers with eligible licenses can opt in through their admin portal to test Cowork and the new Critique capabilities ahead of general availability.
Capital Group, one of the world's largest investment management firms, is among the early enterprise adopters currently testing Cowork in a regulated financial services environment.
"This isn't about generating content or answers," Warner said. "It's about taking real action, connecting steps, coordinating tasks, and following through across everyday workflows. Because Cowork operates on our enterprise data and within our security and risk boundaries, we can experiment, learn, and scale with confidence. That allows us to move faster and focus AI in places where it actually delivers value."
Warner's emphasis on security boundaries reflects a broader enterprise concern: AI tools that operate outside an organization's data governance create compliance risks that outweigh productivity gains. By running Cowork within Microsoft's tenant architecture, the company positions it as a safer alternative to standalone AI agents that require local system access.
For regulated industries like financial services, this containment model could prove more persuasive than raw capability benchmarks.

Jared Spataro, Chief Marketing Officer for Microsoft AI at Work, characterized the launch as a shift from Copilot as an assistant to Copilot as an autonomous agent capable of executing multi-step workflows independently.
As a result, for enterprise buyers weighing the E7 tier, the central question is whether agentic task delegation and multi-model research verification deliver enough measurable productivity gains to justify the premium subscription cost at scale across large global organizations.
Copilot Cowork is built on technology from Anthropic's Claude Cowork, which Anthropic launched for mainstream users in January 2026. Microsoft first added Claude as an OpenAI alternative in Microsoft 365 Copilot in September 2025, then deepened the partnership through a multi-billion-dollar alliance with Nvidia and Anthropic to scale Claude on Azure two months later.
Copilot Cowork represents the deepest integration yet between the two companies, embedding Claude's reasoning engine directly into core productivity workflows rather than merely offering it as an alternative model choice within the existing interface.
However, Copilot Cowork does not yet match the standalone Claude Cowork's full capabilities. It lacks local computer use, cannot interact directly with local files or applications, and has no native integrations with third-party tools outside Microsoft 365. For organizations that rely on non-Microsoft productivity tools, those gaps limit Cowork's ability to serve as a true end-to-end workflow agent.
Meanwhile, Microsoft has acknowledged the constraints and says it expects Critique to eventually run in reverse as well, with Claude drafting and GPT critiquing, giving users the choice of direction.
Microsoft has not given a firm public timeline for availability beyond the invite-only Frontier program. Whether expanded access arrives fast enough to move the adoption needle beyond 3.3% will ultimately determine whether the multi-model strategy becomes a lasting competitive advantage or remains an expensive experiment.