Lightweight Safety Classification Using Pruned Language Models

Layer Enhanced Classification (LEC) is a novel technique that outperforms current industry leaders like GPT-4o, LlamaGuards 1 and 8B, and deBERTa v3 Prompt Injection v2 for content safety and prompt injection tasks.

We prove that the intermediate hidden layers in transformers are robust feature extractors for text classification.

On content safety, LEC models achieved a 0.96 F1 score vs GPT-4o's 0.82 and Llama Guard 8B's 0.71.The LEC models were able to outperform the other models with only 15 training examples for binary classification and 50 examples for multi-class classification across 66 categories.

On prompt injection,LEC models achieved a 0.98 F1 score vs GPT-4o's 0.92 and deBERTa v3 Prompt Injection v2's 0.73. LEC models were able to outperform deBERTa with only 5 training examples and GPT-4o with only 55 training examples.

Read the full paper and our approach here: https://arxiv.org/abs/2412.13435


Comments URL: https://news.ycombinator.com/item?id=42463943

Points: 6

# Comments: 0

https://arxiv.org/abs/2412.13435

Creată 1mo | 19 dec. 2024, 19:40:17


Autentifică-te pentru a adăuga comentarii

Alte posturi din acest grup

Show HN: Voice Cloning and Multilingual TTS in One Click (Windows)

We've created an open-source alternative to Eleven Labs for voice cloning and multilingual TTS. Key features:

- Clone voices from 15-second samples - 50+ pre-trained celebrity voice models - Sup

27 ian. 2025, 04:20:10 | Hacker news
Making a live-mode test payment to yourself = a payment processor ToS violation?

To me it seemed like common sense that before you push the thing into the real world, no matter how much testing you do, you'd fire a couple test payments on the live version.

Apparently, after

27 ian. 2025, 04:20:08 | Hacker news