Code Security

The Rise of Mythos: Exposing the Remediation Gap

I don’t think there is anyone left who hasn’t heard of Anthropic’s (now famous, but not yet released) frontier model, Mythos. While it has caused a tidal wave of hysteria inside the cybersecurity community, it is not a cybersecurity product. That distinction matters, and most of the coverage around it has missed the point entirely.

Anthropic built Mythos as an advanced frontier reasoning model. The cybersecurity implications are a consequence of what it can do, not the purpose it was designed for.

When a model reaches a certain threshold of reasoning capability (the ability to hold complex hypotheses, trace logic across large codebases, and iterate autonomously toward a confirmed result), it becomes extraordinarily useful for finding vulnerabilities. Not because it was trained to find them. Because finding them is a reasoning problem, and this class of model reasons better than anything that came before it.

Two capabilities define the step change. The first is exploit chain construction. A real attack rarely uses a single vulnerability. It chains primitives (a memory bug here, an input validation gap there, a logic flaw that connects them) into a working exploit path. Mythos can take lower-severity findings that would ordinarily sit invisible in a backlog and reason about how to combine them into something critical. The second capability is proof generation. Finding a bug and proving it’s exploitable are two different things.

Given the security implications of a model this advanced, access has been heavily restricted. Rather than a general release, Anthropic started Project Glasswing to give key cybersecurity vendors and critical infrastructure companies early access for defensive testing. Results from several of those companies are beginning to surface in early-look reports, and they look very promising. A report from XBOW found that in under three weeks, the model accomplished the equivalent of a full year of penetration testing effort. A year of work, compressed into three weeks, at scale, across an entire codebase. The implication isn’t that security teams can be smaller. The implication is that attackers who gain access to comparable capability can move through your attack surface at a speed your existing defenses were never designed to handle.

Mythos isn’t alone. OpenAI moved quickly with their own comparable model release and equivalent early access program called “Trusted Access for Cyber”. While it might take a few additional months, similar advances are expected from other major AI labs. The guardrails that exist today will not contain this capability indefinitely.

An Accelerated Wave of Attacks

Here’s what makes this moment different from every previous “AI changes security” announcement: the wave of attacks is already here. Mythos isn’t publicly available. Broad access to these capabilities hasn’t happened yet. And supply chain attacks are already accelerating.

The last sixty days have seen a meaningful uptick in compromises targeting open source libraries and developer AI supply chains. Attackers have found a playbook that seems to be working well. Compromise the tool a developer uses every day, and you don’t need to breach the perimeter. The developer brings you inside.

This is happening without Mythos. While other available models might be helpful in accelerating these attacks, the velocity at which these compromises are happening shows no sign of slowing down. The organizations that haven’t adjusted their security programs for the rate at which this is all moving will face a threat they weren’t structurally designed to handle. The supply chain is an attack surface. The developer toolchains are an attack surface. And right now, most organizations have no systematic inventory of what’s running in either.

The uncomfortable truth is that the sophistication of the incoming threat doesn’t actually change what needs to be done first. It just dramatically reduces the time organizations have to close the gaps.

A Need for Change

For most of the history of application security, one of the hardest problems was detection. Finding vulnerabilities in production code, at scale, across large and complex systems: that was where the investment went, where the tooling matured, and where the industry built its expertise. Execution was always a known struggle. Prioritizing what to fix. Getting developers to act on findings at all. Scaling remediation beyond the handful of critical issues that could command enough organizational attention to move.

Mythos and GPT-5.4-Cyber break this framework completely, and they break it in both directions simultaneously.

Discovery becomes cheap. Effectively unlimited. The time required to find code security issues or run a skilled penetration test can be reduced to weeks or even days. That compression will continue. The bottleneck that defined application security for a generation (“we can’t find enough risk fast enough”) is gone. What replaces it is a bottleneck your program almost certainly wasn’t built for: remediation.

Context is now the prerequisite for fixing anything, not the finding itself. What’s hard is understanding enough about where that finding lives to actually close it. Consider these common elements:

  • Blast radius: if this component is compromised, what else does it touch?
  • Ownership: who is responsible for this code right now, not when it was written?
  • Architecture: does a fix here break something downstream?
  • Test coverage: is there a regression suite that will catch a bad patch before it ships?

These questions used to sit behind the harder problem of finding the issue in the first place. Now they are the problem.

Remediation also requires an engineering foundation that most security programs never had to heavily invest in, because fixing things at the speed this moment demands requires more than a good model and a willing developer.

It requires automated tests that catch regressions. Documented architecture that a model can reason against. Clear ownership that doesn’t require a forensic exercise to establish. Cloudflare learned this firsthand: they let Mythos write its own patches during their Project Glasswing testing and watched several go out that fixed the original vulnerability while quietly breaking something the code depended on. A fast patch that introduces a new flaw isn’t remediation. It’s a different incident.
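To make the blast-radius question from the list above concrete, here is a minimal sketch that walks a dependency map in reverse to find everything that directly or transitively depends on a compromised component. The component names and the DEPENDS_ON map are hypothetical; in practice this data would come from your build system or service catalog.

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "checkout-api": ["auth-lib", "payments-sdk"],
    "admin-ui": ["auth-lib"],
    "payments-sdk": ["http-client"],
    "auth-lib": ["http-client"],
    "http-client": [],
}


def blast_radius(component, depends_on):
    """Return every component that directly or transitively depends on `component`."""
    # Invert the edges so we can walk from a dependency to its consumers.
    consumers = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, set()).add(name)

    seen, queue = set(), deque([component])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


print(sorted(blast_radius("http-client", DEPENDS_ON)))
```

A flaw in a leaf library such as the hypothetical http-client touches every component in this toy graph, which is the kind of answer a fix owner needs before a patch ships.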

Here’s the severity threshold problem: most organizations have gates designed to keep critical and high-severity vulnerabilities out of production. Those gates were built for a world where findings are evaluated individually. Frontier model reasoning operates differently. It can chain together low-severity findings, the ones your gate lets through, the ones that have been sitting in your backlog for years, into critical exploit paths. Suddenly two medium issues and a low become a chained exploit path. Your governance program never saw it coming because it was never designed to look for this pattern.

Suddenly the entire backlog is in scope. Every finding you deferred, de-prioritized, or accepted as low risk because it sat below your severity threshold is now potentially part of a chain an attacker can follow. The question isn’t whether your critical vulnerabilities are patched. It’s whether your accumulated technical debt is exploitable in combination, and the honest answer for most organizations is: you don’t know.
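As a rough sketch of the difference, the example below runs a classic per-finding gate next to a chain-aware check that groups sub-threshold findings sharing an entry point. The request_path field and the grouping rule are assumptions for illustration. Sharing a path is only a heuristic that findings might chain, so the result is a prompt for combined review, not proof of exploitability.

```python
from collections import defaultdict
from dataclasses import dataclass

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass(frozen=True)
class Finding:
    id: str
    severity: str
    request_path: str  # hypothetical field: entry point the flaw is reachable from


def individual_gate(findings, block_at="high"):
    """Classic gate: block only findings at or above the threshold."""
    limit = SEVERITY_RANK[block_at]
    return [f for f in findings if SEVERITY_RANK[f.severity] >= limit]


def chain_candidates(findings, block_at="high", min_links=2):
    """Group sub-threshold findings that share a request path (heuristic only)."""
    limit = SEVERITY_RANK[block_at]
    by_path = defaultdict(list)
    for f in findings:
        if SEVERITY_RANK[f.severity] < limit:
            by_path[f.request_path].append(f)
    return {path: group for path, group in by_path.items() if len(group) >= min_links}


backlog = [
    Finding("F-1", "medium", "/upload"),
    Finding("F-2", "medium", "/upload"),
    Finding("F-3", "low", "/upload"),
    Finding("F-4", "low", "/health"),
]

print(individual_gate(backlog))           # the per-finding gate blocks nothing
print(sorted(chain_candidates(backlog)))  # paths that deserve a combined review
```

In this toy backlog the per-finding gate passes everything, while the grouped view flags the two mediums and a low that sit on the same path.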

A Model Without a Harness

XBOW distilled their Mythos evaluation into a single line that captures the problem better than most: “A model is a brain without a body”. Source code audits are mostly a brain activity. Everything that follows (live-site validation, orchestrated testing, safe proof-of-concept execution, governed remediation, compliance documentation) requires a body whose skill and control can match the brain’s power.

When Cloudflare received access to Mythos, their instinct was the obvious one. Point a coding agent at a repository, ask it to find vulnerabilities, see what it returns. It returned findings. What it didn’t return was meaningful coverage. A single-stream agent working against a hundred-thousand-line repository covers roughly a tenth of a percent of the attack surface before the context window fills up and compaction kicks in, potentially discarding earlier findings that would have mattered. You get output. You don’t get confidence.

The signal-to-noise problem compounds this. Ask a model to find bugs and it will find them, regardless of whether the code has any. Findings come back hedged with “possibly,” “potentially,” “could in theory”: qualifications that are reasonable for an exploratory tool and catastrophic for a triage queue, where every speculative finding consumes human attention and tokens to dismiss, and that cost compounds across thousands of results. Cloudflare showed how they built their harness to deliberately over-report at the discovery stage, accepting more noise in exchange for less missed coverage, which meant their triage workload was significant.

XBOW’s evaluation added a nuance that the raw benchmark numbers don’t fully surface: Mythos’s judgment is mixed. It can be too literal and too conservative, rejecting valid findings because the evidence didn’t formally satisfy its criteria, or because the intended rule was broader than the written one. It also tends to overstate the practical relevance of what it does surface. Strong reasoning doesn’t automatically produce reliable security outcomes. It produces reliable reasoning. To get from reasoning to outcomes, you need precise prompts, explicit threat models, and validation infrastructure that the model alone cannot provide.

There’s also the cost dimension, which has taken on the name “tokenomics”. Mythos is priced at roughly five times the cost of a previous-generation frontier model. XBOW’s analysis found that for many tasks, running a less expensive model longer produces better results at lower cost. That’s not a knock on Mythos; it’s a practical observation about the economics of running security operations at scale. The right model for discovery may not be the right model for triage. The right model for triage may not be the right model for fix generation. Model selection isn’t a one-time architectural decision; it’s a task-level one, and the harness has to be capable of making it.
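Here is a minimal sketch of what task-level model selection might look like inside a harness. The model names and the routing table are placeholders of mine, not a recommendation. The only number taken from the discussion above is the rough five-to-one cost ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    relative_cost: float  # cost per token relative to the previous-generation model


# Placeholder names; the 5x ratio mirrors the rough figure quoted above.
FRONTIER = ModelOption("frontier-model", 5.0)
PREVIOUS_GEN = ModelOption("previous-gen-model", 1.0)

# Hypothetical routing table: which model handles which task.
ROUTES = {
    "discovery": FRONTIER,
    "triage": PREVIOUS_GEN,
    "fix_generation": PREVIOUS_GEN,
}


def pick_model(task):
    """Choose a model per task rather than once for the whole pipeline."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no route configured for task {task!r}") from None


def relative_spend(task, tokens):
    """Spend in 'previous-generation token' units for a task of this size."""
    return pick_model(task).relative_cost * tokens


# Same token budget, very different bill depending on the route.
print(relative_spend("discovery", 1_000_000))
print(relative_spend("triage", 1_000_000))
```

Keeping the routing in data rather than in code means the table can change as prices and capability profiles shift, without touching the pipeline itself.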

And beyond discovery, a frontier model cannot, on its own, do any of the following:

  • Scan, fix, and rescan in a coordinated loop
  • Enforce policy or compliance gates
  • Govern its own behavior
  • Generate an audit trail that satisfies a regulator
  • Measure the impact of its work on your AppSec program over time

These aren’t limitations of Mythos, or any model, specifically; they’re structural properties of what a language model is. It reasons. It does not enforce. It generates. It does not govern. Pointing it at your codebase and expecting it to close your security gap is like buying the best engine in the world and expecting it to drive itself to work.
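To illustrate the division of labor, the sketch below shows a bare-bones control loop in which the model would only sit behind one callable (propose_fix), while the harness owns scanning, the policy gate, patch application, and the audit trail. Every function name here is hypothetical, and the toy stand-ins exist only so the example runs end to end.

```python
from datetime import datetime, timezone


def remediation_loop(scan, propose_fix, policy_allows, apply_patch, max_rounds=3):
    """Scan, fix, and rescan until clean or out of rounds, logging every decision."""
    audit_log = []

    def record(event, **details):
        audit_log.append(
            {"at": datetime.now(timezone.utc).isoformat(), "event": event, **details}
        )

    for round_no in range(1, max_rounds + 1):
        findings = sorted(scan())
        record("scan", round=round_no, open_findings=findings)
        if not findings:
            return True, audit_log
        for finding in findings:
            patch = propose_fix(finding)  # the only step a model would perform
            if policy_allows(finding, patch):
                apply_patch(finding, patch)
                record("patch_applied", finding=finding, patch=patch)
            else:
                record("patch_blocked", finding=finding, patch=patch)

    remaining = sorted(scan())
    record("final_scan", open_findings=remaining)
    return not remaining, audit_log


# Toy stand-ins so the sketch runs end to end.
open_issues = {"sqli-login", "xss-profile"}


def scan():
    return set(open_issues)


def propose_fix(finding):
    return f"patch-for-{finding}"


def policy_allows(finding, patch):
    # Pretend the regression suite failed for one of the patches.
    return finding != "xss-profile"


def apply_patch(finding, patch):
    open_issues.discard(finding)


ok, log = remediation_loop(scan, propose_fix, policy_allows, apply_patch, max_rounds=2)
print(ok)
print([entry["event"] for entry in log])
```

The blocked patch stays open and shows up in the log on every round, so the record of what the gate refused is as complete as the record of what shipped.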

Unbounded discovery without a control layer doesn’t improve your security posture. It improves your inventory of exposure you can’t yet act on.

Building the Harness

Cloudflare, XBOW, and every team that ran Mythos at operational scale reached the same conclusion independently, through different paths. The model doing the wrong job isn’t a model problem. It’s an architecture problem. Once they stopped trying to make a reasoning model behave like a coordinated security pipeline, they started building the structure that made it useful.

Cloudflare’s production harness, built across their runtime, edge data path, protocol stack, and control plane, ran eight stages: Recon, Hunt, Validate, Gapfill, Dedupe, Trace, Feedback, and Report. Each stage is a distinct agent with a distinct scope. The Recon stage produces an architecture document (trust boundaries, entry points, likely attack surface) that gives every downstream agent shared context and eliminates the wandering problem. The Hunt stage runs roughly fifty agents concurrently, each working a single attack class paired with a scope hint. The Validate stage sends findings to an independent agent with a different prompt, a different model, and no ability to generate findings of its own, deliberately creating disagreement to catch what a self-reviewing agent would miss. Trace turns “there is a flaw” into “there is a reachable vulnerability” by following confirmed findings through every consumer repository that depends on the affected component. Feedback closes the loop by converting reachable traces into new hunt tasks, so the pipeline improves as it runs.
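The Hunt fan-out described above can be sketched roughly as follows. This is not Cloudflare’s implementation: the recon structure, the attack class names, and run_hunt_agent (a stub where a model call would go) are all assumptions made for illustration. Only the shape comes from the description: one narrow task per attack class and scope hint, run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical Recon output: shared context handed to every downstream agent.
recon = {
    "entry_points": ["http_parser", "tls_handshake"],
    "trust_boundaries": ["edge -> control plane"],
}
ATTACK_CLASSES = ["memory_safety", "input_validation", "auth_bypass"]


def build_hunt_tasks(recon, attack_classes):
    """One task per (attack class, scope hint) pair, so each agent stays narrow."""
    return [
        {"attack_class": cls, "scope_hint": entry}
        for cls, entry in product(attack_classes, recon["entry_points"])
    ]


def run_hunt_agent(task):
    """Stub for a model call; a real agent would return candidate findings."""
    return {"task": task, "findings": []}


def hunt(recon, attack_classes, max_agents=50):
    """Run every narrow task concurrently, capped at `max_agents` workers."""
    tasks = build_hunt_tasks(recon, attack_classes)
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        return list(pool.map(run_hunt_agent, tasks))


results = hunt(recon, ATTACK_CLASSES)
print(len(results))
```

Three attack classes against two entry points gives six narrow tasks here; the same cross product is what produces dozens of concurrent agents on a real codebase.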

Four operational lessons emerged from running this at scale, and each one holds regardless of which model sits inside the harness.

  1. Narrow scope produces better findings. Telling the model to find vulnerabilities in a repository produces wandering. The prompt is a precision instrument, not an open invitation.
  2. Adversarial review reduces noise. A second agent whose only job is to disprove the first agent’s findings catches a meaningful fraction of the noise the first agent would miss reviewing its own work. Deliberate disagreement between agents is more effective than asking one agent to be careful.
  3. Splitting the chain across agents produces better reasoning. The model answers each question in the chain better when the questions are asked separately, because each one is narrower than the combined version.
  4. Parallel narrow tasks beat one exhaustive agent. Many agents working tightly scoped, parallel hypotheses with deduplication afterward produce better coverage than one agent trying to be comprehensive. Coverage of the right surface is the right goal.
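Lessons 2 and 4 can be sketched together: an independent skeptic whose only power is to reject, followed by deduplication of whatever survives. The finding fields and toy_skeptic are hypothetical stand-ins. In a real harness the skeptic would be a second agent with a different prompt and, ideally, a different model.

```python
def validate(findings, skeptic):
    """Keep only findings the independent skeptic fails to disprove.

    `skeptic(finding)` returns True when it disproves the finding. It can
    reject, but it has no way to add findings of its own.
    """
    confirmed, rejected = [], []
    for finding in findings:
        (rejected if skeptic(finding) else confirmed).append(finding)
    return confirmed, rejected


def dedupe(findings):
    """Collapse findings that point at the same location and attack class."""
    unique = {}
    for finding in findings:
        key = (finding["file"], finding["line"], finding["attack_class"])
        unique.setdefault(key, finding)  # keep the first report of each issue
    return list(unique.values())


# Two parallel agents reported the same parser bug; a third reported a hunch.
raw = [
    {"file": "parser.c", "line": 88, "attack_class": "memory_safety", "evidence": "crash input"},
    {"file": "parser.c", "line": 88, "attack_class": "memory_safety", "evidence": "crash input"},
    {"file": "auth.py", "line": 12, "attack_class": "auth_bypass", "evidence": None},
]


def toy_skeptic(finding):
    # Toy rule: anything without concrete evidence counts as disproved.
    return finding["evidence"] is None


confirmed, rejected = validate(raw, toy_skeptic)
unique = dedupe(confirmed)
print(len(confirmed), len(rejected), len(unique))
```

Keeping the rejected list, rather than discarding it, is what lets the harness later explain why a finding never reached the triage queue.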

XBOW’s operational conclusion aligns with Cloudflare’s from the live application direction: the ideal detection pattern combines source code analysis to find a lead, live application interaction to understand how the weakness is reflected in actual deployment, and then exploit construction from that confirmed understanding. Neither dimension alone is sufficient. Source code shows you what the code says. The live application shows you what the system does. Security lives in the gap between them.

What does a harness actually require? Context management, security policies, engineering documentation, codebase ownership, architecture design, and test coverage; this isn’t a simple “prompt and fix”.

Memory that tracks what has been found, attempted, rejected, and why, so the system doesn’t repeat itself across the same terrain. Internal nuance that reflects organizational conventions, wrappers, and patterns, because a fix that ignores how your team actually builds software isn’t a fix. Traditional scan engines as a foundation, because SAST, SCA, and secrets scanning provide the deterministic baseline that model-assisted remediation needs to operate against; frontier model findings extend this foundation, they don’t replace it. And model selection by task, because the economics and capability profiles of different models mean the right choice for discovery is rarely the right choice for every step that follows.
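As one illustration of the memory requirement, here is a minimal sketch of a ledger that records what was found, attempted, or rejected, and why. The class name, the status values, and the (target, attack class) key are my assumptions; a production harness would persist this somewhere durable instead of holding it in memory.

```python
from dataclasses import dataclass, field

VALID_STATUSES = {"found", "attempted", "rejected"}


@dataclass
class HuntMemory:
    """Tracks what was found, attempted, or rejected, and why."""

    entries: dict = field(default_factory=dict)

    def record(self, target, attack_class, status, reason):
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status {status!r}")
        self.entries[(target, attack_class)] = {"status": status, "reason": reason}

    def lookup(self, target, attack_class):
        """Return the earlier entry for this terrain, or None if unexplored."""
        return self.entries.get((target, attack_class))


memory = HuntMemory()
memory.record("parser.c", "memory_safety", "rejected", "bounds check exists upstream")

print(memory.lookup("parser.c", "memory_safety"))
print(memory.lookup("parser.c", "auth_bypass"))  # None: not yet explored
```

Storing the reason alongside the status matters as much as the status itself, because a rejection that no longer holds (the upstream check was removed, say) should reopen the terrain.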

Building this is not a configuration exercise. It requires ongoing operational maintenance, policy definitions that evolve as the threat does, feedback loops from fixes that get rejected in review, and human judgment at the gates that actually matter. The human in the loop at PR review is the current floor. The harness is what makes that floor mean something, rather than a rubber stamp on output nobody fully reads.

How to Get Started

The Mythos announcements prompted a predictable response from most security teams: how do we get access to this capability? That’s the wrong first question.

The right question is whether you’re structured to absorb what it finds.

Access to frontier AI vulnerability discovery without the infrastructure to act on what it surfaces doesn’t improve your security posture. It improves your inventory of exposure you can’t close (at a very high price point, I might add). Before you evaluate tooling, four things are worth understanding clearly about where you stand today.

Know your actual exposure across critical applications. Pick your five highest-value applications, the ones where a breach has immediate business, regulatory, or reputational consequence, and pull every open vulnerability across them right now. Not a dashboard summary. The actual list, with severity, age, and named owner. Most teams that do this exercise find it uncomfortable. That discomfort is useful information. If a frontier model ran against those applications today, this is roughly what it would surface to an attacker, plus the chains connecting your lower-severity findings into paths you haven’t accounted for.
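If your tracker can export open findings, the list itself takes only a few lines to build. The sketch below assumes a hypothetical export format (the field names and records are invented) and produces the severity, age, and owner view described above, oldest first, with missing owners made explicit.

```python
from datetime import date

# Hypothetical tracker export; field names and records are illustrative.
open_vulns = [
    {"id": "V-101", "app": "billing", "severity": "low", "opened": date(2022, 3, 1), "owner": "team-payments"},
    {"id": "V-102", "app": "billing", "severity": "critical", "opened": date(2024, 11, 5), "owner": None},
    {"id": "V-103", "app": "portal", "severity": "medium", "opened": date(2023, 6, 20), "owner": "team-web"},
]


def exposure_report(vulns, today):
    """Return the raw list with age in days and an explicit marker for missing owners."""
    rows = [
        {
            **v,
            "age_days": (today - v["opened"]).days,
            "owner": v["owner"] or "UNOWNED",
        }
        for v in vulns
    ]
    # Oldest first: long-lived low-severity findings are the chain candidates.
    return sorted(rows, key=lambda row: row["age_days"], reverse=True)


for row in exposure_report(open_vulns, today=date(2025, 1, 1)):
    print(row["id"], row["app"], row["severity"], row["age_days"], row["owner"])
```

Sorting by age rather than severity is a deliberate choice for this exercise: it puts the deferred findings at the top, where a dashboard summary usually hides them.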

Audit your AppSec gating policy for the chaining problem. Your current severity thresholds were designed to evaluate findings individually. They were not designed for a world where three low-severity findings chain into a critical exploit path. Look at your policy and ask honestly: does it account for this? Does your gate consider combinations, or only individual severity scores? If the answer is no, that gap exists today regardless of what AI capability your adversaries have access to. Fix the policy before you add more discovery.

Measure your real MTTR and be honest about what it includes. Mean time to remediation (MTTR) is easy to report and easy to game. The number that matters is the time from confirmed vulnerability to patch in production, including review cycle, regression run, and deployment. One major security organization, after their Mythos testing, set a two-hour SLA for patch deployment and immediately discovered it was structurally incompatible with their existing regression testing requirements. That’s exactly the kind of honest reckoning organizations need right now.
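A small worked example of both points, using invented timestamps and stage durations: MTTR measured from confirmation to production deploy, and a sanity check on whether an SLA can even fit the mandatory pipeline steps. Only the two-hour SLA comes from the anecdote above. The review, regression, and deployment durations are assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical remediation records: confirmed vulnerability -> patch in production.
records = [
    {"confirmed": datetime(2025, 1, 6, 9, 0), "deployed": datetime(2025, 1, 6, 17, 0)},
    {"confirmed": datetime(2025, 1, 7, 9, 0), "deployed": datetime(2025, 1, 8, 13, 0)},
]


def mttr(records):
    """Mean time from confirmation to production deploy, as a timedelta."""
    durations = [(r["deployed"] - r["confirmed"]).total_seconds() for r in records]
    return timedelta(seconds=mean(durations))


def sla_is_feasible(sla, review, regression_run, deployment):
    """An SLA shorter than the mandatory pipeline steps cannot be met."""
    return review + regression_run + deployment <= sla


print(mttr(records))

# Assumed stage durations; only the two-hour SLA comes from the text.
print(sla_is_feasible(
    sla=timedelta(hours=2),
    review=timedelta(minutes=30),
    regression_run=timedelta(hours=3),
    deployment=timedelta(minutes=15),
))
```

With these assumed durations the regression run alone exceeds the SLA, which is the structural incompatibility the anecdote describes.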

Assess your remediation infrastructure, not just your detection capability. A frontier model surfacing vulnerabilities faster than your engineering team can process them isn’t a security program upgrade. It’s a backlog problem with better inputs. Before adding AI-assisted discovery to your pipeline, ask whether the foundation that makes remediation work actually exists:

  • Are your critical applications covered by automated tests that catch regressions?
  • Is there documented architecture that a model can reason against when generating a fix?
  • Is there clear, current ownership for every component that will need to be patched?

If the answer to any of these is no, that’s where the work starts. Discovery without remediation is just a more detailed inventory of your exposure.
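The three questions above translate directly into a per-component check. The sketch below assumes a hypothetical component catalog (the names, fields, and values are invented) and reports which parts of the remediation foundation are missing for each entry.

```python
# Hypothetical component catalog; names, fields, and values are illustrative.
components = {
    "auth-lib": {
        "has_regression_tests": True,
        "architecture_doc": "docs/auth.md",
        "owner": "team-identity",
    },
    "legacy-batch": {
        "has_regression_tests": False,
        "architecture_doc": None,
        "owner": None,
    },
}


def readiness_gaps(component):
    """List which parts of the remediation foundation are missing."""
    gaps = []
    if not component.get("has_regression_tests"):
        gaps.append("no automated regression tests")
    if not component.get("architecture_doc"):
        gaps.append("no documented architecture")
    if not component.get("owner"):
        gaps.append("no current owner")
    return gaps


for name, info in components.items():
    gaps = readiness_gaps(info)
    print(name, "ready" if not gaps else gaps)
```

Running this across the same five applications used for the exposure exercise shows where a finding could be fixed quickly and where it would stall.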

The teams best positioned for what comes next aren’t the ones who got earliest access to the most capable model. They’re the ones who already know what they’re carrying, have a harness that can close issues, and have built the structure connecting discovery to remediation to governance. That work starts before the frontier model arrives. For most organizations, it starts with those four questions, and it starts now.