We’ve just open-sourced SemHash, a lightweight package for semantic text deduplication. It lets you effortlessly clean up your datasets and avoid pitfalls caused by duplicate samples in semantic search, RAG, and machine learning.
Main Features:
- Fast and hardware-friendly: Deduplicate datasets with millions of records in minutes, on a CPU.
- Flexible: Works on single or multiple datasets (e.g., train/test deduplication), and multi-column data (e.g., Question-Answering datasets).
- Lightweight: Minimal dependencies (largest is NumPy).
- Explainable: Easily inspect duplicates and what caused them, and view the lowest-similarity duplicates to tune the threshold for your dataset (see the usage sketch below).
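To give a feel for what this looks like in practice, here is a minimal self-deduplication sketch. The method and attribute names (`from_records`, `self_deduplicate`, `.deduplicated`, `.duplicates`) and the toy sentences are illustrative assumptions; see the README in the repo for the exact API.

```python
from semhash import SemHash

# A small toy corpus with one exact duplicate and one semantic near-duplicate
texts = [
    "It's dangerous to go alone!",
    "It's dangerous to go alone!",        # exact duplicate
    "It's not safe to go by yourself!",   # semantic near-duplicate
    "The weather is nice today.",
]

# Build an index over the records and deduplicate the set against itself
semhash = SemHash.from_records(records=texts)
result = semhash.self_deduplicate()

# Attribute names are assumptions: kept records vs. removed records
print(result.deduplicated)  # records kept after removing (near-)duplicates
print(result.duplicates)    # removed records, for inspecting what triggered removal
```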
We found that text deduplication is more complex than it appears, so we built SemHash to simplify the process. Duplicate samples can skew model training, reduce generalization, and cause train/test leakage, all of which lead to unreliable results. Techniques like MinHash handle exact or near-exact duplicates, but semantic deduplication also catches semantically redundant samples, which we believe is an important part of deduplication. Furthermore, with MinHash it is not trivial to see why a sample was removed, and we think that explainability matters too. We have already found some interesting results on well-known datasets in our benchmarks, which are included in the repo.
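To make the train/test leakage case concrete, here is a similar sketch for cross-dataset deduplication, removing test records that duplicate training records. Again, the method name `deduplicate`, the result attributes, and the `ag_news` dataset are illustrative assumptions rather than a definitive usage guide.

```python
from datasets import load_dataset
from semhash import SemHash

# Illustrative dataset; any pair of train/test text splits works the same way
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Index the training records, then remove test records that (semantically) duplicate them
semhash = SemHash.from_records(records=train_texts)
result = semhash.deduplicate(records=test_texts)

clean_test = result.deduplicated  # test records with no (near-)duplicate in train (assumed attribute)
removed = result.duplicates       # leaked records, kept around so you can inspect the matches
print(len(clean_test), len(removed))
```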
We are curious to hear your feedback! Do you currently deduplicate your datasets before training, and what techniques do you use?
Comments URL: https://news.ycombinator.com/item?id=42674627
Points: 12
# Comments: 2