An attack class that passes every current LLM filter

/r/netsec - Information Security News & Discussion

30 March 2026 at 14:42

I opened OWASP issue #807 a few weeks ago proposing a new attack class. The paper is published today following coordinated disclosure to Anthropic, OpenAI, Google, xAI, CERT/CC, OWASP, and agentic framework maintainers.

Here is what I found.

Ordinary language buried in prior context shifts how a model reasons about a consequential decision before any instruction arrives. No adversarial signature. No override command. The model executes its instructions faithfully, just from a different starting angle than the operator intended.

I know that sounds like normal context sensitivity. It isn't, or at least the effect size is much larger than I expected. Matched control text of identical length and semantic similarity produced significantly smaller directional shifts. This specific class of language appears to be modeled differently. I documented binary decision reversals with paired controls across four frontier models.

The distinction from prompt injection: there is no payload. Current defenses scan for facts disguised as commands. This is frames disguised as facts. Nothing for current filters to catch.

In agentic pipelines it gets worse. Posture installs in Agent A, survives summarization, and by Agent C reads as independent expert judgment. No phrase to point to in the logs. The decision was shaped before it was made.

If you have seen unexplained directional drift in a pipeline and couldn't find the source, this may be what you were looking at. The lens might give you something to work with.

I don't have all the answers. The methodology is black-box observational, no model internals access, small N on the propagation findings. Limitations are stated plainly in the paper. This needs more investigation, larger N, and ideally labs with internals access stress-testing it properly.

If you want to verify it yourself, demos are at https://shapingrooms.com/demos - run them against any frontier model. If you have a production pipeline that processes retrieved documents or passes summaries between agents, it may be worth applying this lens to your own context flow.

Happy to discuss methodology, findings, or pushback on the framing. The OWASP thread already has some useful discussion from independent researchers who have documented related patterns in production.

GitHub issue: https://github.com/OWASP/www-project-top-10-for-large-language-model-applications/issues/807

submitted by /u/lurkyloon
[link] [comments]

Normal view