Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG

I built Chonkie because I was tired of rewriting chunking code for RAG applications. Existing libraries were either too bloated (80MB+) or too basic, with no middle ground.

Core features:

- 21MB default install, versus 80-171MB for alternatives

- 33x faster token chunking than popular alternatives

- Supports multiple chunking strategies: token, word, sentence, and semantic

- Works with all major tokenizers (transformers, tokenizers, tiktoken)

- Zero external dependencies for basic functionality
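
For readers new to the idea, token chunking with overlap can be sketched in a few lines of dependency-free Python. This is only an illustration of the general technique, not Chonkie's implementation or API: the function name `chunk_tokens`, its parameters, and the whitespace split standing in for a real tokenizer are all assumptions made for this example.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most chunk_size tokens.

    Hypothetical helper for illustration; consecutive windows share
    `overlap` tokens so context is not lost at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# A whitespace split stands in for a real tokenizer here (assumption).
tokens = "a b c d e f g".split()
print(chunk_tokens(tokens, chunk_size=3, overlap=1))
```

With a real tokenizer, the same windowing would run over token IDs and each window would be decoded back to text.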

Technical optimizations:

- Uses tiktoken with multi-threading for faster tokenization

- Implements aggressive caching and precomputation

- Running mean pooling for efficient semantic chunking

- Modular dependency system (install only what you need)
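
One plausible reading of "running mean pooling" is sketched below: keep an incrementally updated mean of the sentence embeddings in the current chunk, and start a new chunk when the next sentence drifts too far from that mean. This is a guess at the general idea, not the library's actual code; `semantic_groups`, `cosine`, the `threshold` value, and the toy 2-D embeddings are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_groups(embeddings: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Group consecutive sentence indices by similarity to a running mean.

    The mean is updated in O(dim) per sentence instead of being
    recomputed over the whole chunk each time.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    mean: List[float] = []
    for i, vec in enumerate(embeddings):
        # Close the current chunk when the next sentence is too dissimilar.
        if current and cosine(mean, vec) < threshold:
            groups.append(current)
            current, mean = [], []
        n = len(current)
        if n == 0:
            mean = list(vec)
        else:
            # Incremental mean update: new_mean = mean + (vec - mean) / (n + 1)
            mean = [m + (v - m) / (n + 1) for m, v in zip(mean, vec)]
        current.append(i)
    if current:
        groups.append(current)
    return groups


# Toy 2-D embeddings (assumption): two sentences on one topic, two on another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_groups(toy, threshold=0.5))
```

In practice the embeddings would come from a sentence-embedding model, and the threshold would need tuning per model and corpus.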

Benchmarks and code: https://github.com/bhavnicksm/chonkie

Looking for feedback on the architecture and performance optimizations. What other chunking strategies would be useful for RAG applications?


Comments URL: https://news.ycombinator.com/item?id=42100819

Points: 51

# Comments: 18


Posted 2d ago | Nov 10, 2024 at 19:20:11

