Show HN: Kreuzberg – Modern async Python library for document text extraction

I'm excited to showcase Kreuzberg!

Kreuzberg is a modern Python library built from the ground up with async/await, type hints, and optimized I/O handling.

It provides a unified interface for extracting text from documents (PDFs, images, office files) without external API dependencies.

Key technical features:
- Built with modern Python best practices (async/await, type hints, functional-first)
- Optimized async I/O with anyio for multi-loop compatibility
- Smart worker process pool for CPU-bound tasks (OCR, doc conversion)
- Efficient batch processing with concurrent extractions
- Clean error handling with context-rich exceptions
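
To illustrate the worker-pool and batch-processing ideas in the feature list above, here is a minimal stdlib-only sketch of the general pattern: CPU-bound work is handed to an executor so the event loop stays free, and several extractions are awaited concurrently. It does not use Kreuzberg's API. The names `cpu_bound_extract` and `extract_many` are hypothetical, and plain `asyncio` stands in for anyio here.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor


def cpu_bound_extract(path: str) -> str:
    """Hypothetical stand-in for a CPU-heavy step such as OCR."""
    return f"text extracted from {path}"


async def extract_many(paths: list[str], executor: Executor) -> list[str]:
    """Run one blocking extraction per path in the executor, concurrently."""
    loop = asyncio.get_running_loop()
    # Each call is scheduled in the pool; the event loop is not blocked.
    futures = [loop.run_in_executor(executor, cpu_bound_extract, p) for p in paths]
    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*futures)


if __name__ == "__main__":
    # A process pool suits CPU-bound work because it sidesteps the GIL.
    with ProcessPoolExecutor(max_workers=2) as pool:
        for text in asyncio.run(extract_many(["a.pdf", "b.png"], pool)):
            print(text)
```

Passing the executor in as a parameter keeps the sketch easy to test with a thread pool, while a real deployment would likely size a process pool to the available cores.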

I built this after struggling with existing solutions that were either synchronous-only, required complex deployments, or had poor async support. The goal was to create something that works well in modern async Python applications, can be easily dockerized or used in serverless contexts, and relies only on permissive OSS.

Key advantages over alternatives:
- True async support with optimized I/O
- Minimal dependencies (much smaller than alternatives)
- Perfect for serverless and async web apps
- Local processing without API calls
- Built for modern Python codebases with rigorous typing and testing

I would love feedback!

The library is MIT licensed and open to contributions.

Here is the repo: https://github.com/Goldziher/kreuzberg

Starring is caring

Comments URL: https://news.ycombinator.com/item?id=43057375

Points: 10

# Comments: 5


Created 5mo ago | 15. 2. 2025 11:50:08
