Just finished reading ActiveFence's emerging-threats assessment of 7 major models across hate speech, disinfo, fraud, and CSAM-adjacent prompts.
Key findings: 44% of outputs were rated risky, 68% of the unsafe outputs were hate-speech-related, and only one model landed in the safe range.
What really jumps out is how much coverage varies by abuse area (fraud looks relatively well-covered; hate speech and child safety really don't).
For those of you doing your own evals/red teaming: are you seeing similar per-category gaps? Has anyone brought in an external research partner like ActiveFence to track emerging threats over time?
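If you want a quick read on your own per-category gaps, a minimal sketch along these lines works. The category names and data below are purely illustrative, not ActiveFence's schema; swap in your own eval output.

```python
from collections import Counter

# Hypothetical eval results as (abuse_category, verdict) pairs from your own
# red-team runs; "unsafe" marks an output your rater flagged as risky.
results = [
    ("hate_speech", "unsafe"), ("hate_speech", "safe"),
    ("fraud", "safe"), ("fraud", "safe"),
    ("disinfo", "unsafe"), ("child_safety", "unsafe"),
]

totals = Counter(cat for cat, _ in results)
unsafe = Counter(cat for cat, verdict in results if verdict == "unsafe")

# Per-category unsafe rate: the number that surfaces gaps like the ones above.
for cat in sorted(totals):
    rate = unsafe[cat] / totals[cat]
    print(f"{cat:>12}: {rate:.0%} unsafe ({unsafe[cat]}/{totals[cat]})")
```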