❌

Reading view

r/netsec monthly discussion & tool thread

Questions regarding netsec and discussion related directly to netsec are welcome here, as is sharing tool links.

Rules & Guidelines

  • Always maintain civil discourse. Be awesome to one another - moderator intervention will occur if necessary.
  • Avoid NSFW content unless absolutely necessary. If used, mark it as being NSFW. If left unmarked, the comment will be removed entirely.
  • If linking to classified content, mark it as such. If left unmarked, the comment will be removed entirely.
  • Avoid use of memes. If you have something to say, say it with real words.
  • All discussions and questions should directly relate to netsec.
  • No tech support is to be requested or provided on r/netsec.

As always, the content & discussion guidelines should also be observed on r/netsec.

Feedback

Feedback and suggestions are welcome, but don't post it here. Please send it to the moderator inbox.

submitted by /u/albinowax
[link] [comments]
  •  

I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

I built an independent benchmark with 20 real CVEs across 15 CWE categories, 5 models (3 OpenAI, 2 Poolside Laguna), three prompt conditions: full advisory, behavioral description only, and location only (file and function, no description of the flaw).

I have three findings worth sharing:

  • No model reliably fixes real vulnerabilities. The best solve rate (gpt-5.5) is 50% overall and 60% under the most favorable condition. The failure modes (e.g, wrong-search drift, budget exhaustion mid-implementation, plausible-but-incomplete patches that pass every visible test) are structured and repeatable across models and tasks.
  • Token cost varies 4x for equivalent outcomes. The Laguna models consume 3–4x more tokens than OpenAI models of the same capability tier, with no improvement in solve rate.
  • The locate condition is the benchmark's sharpest instrument. Give a model only a file and function (no description of the flaw). Every model drops. The differences between models are within noise at this scale, but it's the condition that most closely resembles what a security researcher actually does: reading code cold and recognizing independently that something is wrong.

Benchmark code and evaluation traces are open sourced.

submitted by /u/Fickle-Box1433
[link] [comments]
  •  
❌