/r/netsec - Information Security News & Discussion
The paper analyzes trust between stages in LLM and agent toolchains. If intermediate representations are accepted without verification, models may treat structure and format as implicit instructions, even when no explicit imperative appears. I document 41 mechanism level failure modes.
Scope
- Text-only prompts, provider-default settings, fresh sessions.
- No tools, code execution, or external actions.
- Focus is architectural risk, not operational attack recipes.
Selected findings
- Β§8.4 Form-Induced Safety Deviation: Aesthetics/format (e.g., poetic layout) can dominate semantics -> the model emits code with harmful side-effects despite safety filters, because form is misinterpreted as intent.
- Β§8.21 Implicit Command via Structural Affordance: Structured input (tables/DSL-like blocks) can be interpreted as a command without explicit verbs (βrun/executeβ), leading to code generation consistent with the structure.
- Β§8.27 Session-Scoped Rule Persistence: Benign-looking phrasing can seed a latent session rule that re-activates several turns later via a harmless trigger, altering later decisions.
- Β§8.18 Data-as-Command: Fields in data blobs (e.g., config-style keys) are sometimes treated as actionable directives -> the model synthesizes code that implements them.
Mitigations (paper Β§10)
- Stage-wise validation of model outputs (semantic + policy checks) before hand-off.
- Representation hygiene: normalize/label formats to avoid βformat -> intentβ leakage.
- Session scoping: explicit lifetimes for rules and for the memory
- Data/command separation: schema aware guards
Limitations
- Text-only setup; no tools or code execution.
- Model behavior is time dependent. Results generalize by mechanism, not by vendor.
submitted by
/u/Solid-Tomorrow6548 [link] [comments]