New research confirms what we suspected: every LLM tested can be exploited

By: /u/CortexVortex1

Just finished reading ActiveFence’s emerging threats assessment of 7 major models across hate speech, disinfo, fraud, and CSAM-adjacent prompts.

Key findings: 44% of outputs were rated risky, 68% of the unsafe outputs were hate-speech-related, and only one model landed in the safe range.

What really jumps out is how much vendor behavior varies by abuse area: fraud looks relatively well covered, while hate speech and child safety really don’t.

For those of you running your own evals/red teaming: are you seeing similar per-category gaps? Has anyone brought in an external research partner like ActiveFence to track emerging threats over time?
