Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
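To make the multi-stage idea concrete, here is a minimal sketch of routing layout regions to per-type recognizers and emitting JSON and Markdown. It is not the repository's actual code: the `Region` class, the stage functions, and the `STAGES` table are hypothetical stand-ins, and no real layout-detection, OCR, or vision API is called.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One layout region, as a layout detector might report it (hypothetical)."""
    kind: str     # "text", "math", "table", or "figure"
    content: str  # stand-in for the cropped image a real stage would receive


# Hypothetical stage stubs. A real pipeline would call an OCR, math,
# or vision service here instead of echoing the input.
def recognize_text(region: Region) -> str:
    return region.content


def recognize_math(region: Region) -> str:
    return f"${region.content}$"


def describe_visual(region: Region) -> str:
    return f"[{region.kind}: {region.content}]"


# Assumed routing table: one recognizer per region kind.
STAGES: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "math": recognize_math,
    "table": describe_visual,
    "figure": describe_visual,
}


def run_pipeline(regions: List[Region]) -> List[dict]:
    """Route each region to the stage registered for its kind."""
    return [{"kind": r.kind, "output": STAGES[r.kind](r)} for r in regions]


def to_markdown(records: List[dict]) -> str:
    """Join stage outputs in reading order, one block per region."""
    return "\n\n".join(rec["output"] for rec in records)


page = [
    Region("text", "Solve for x."),
    Region("math", "x^2 - 4 = 0"),
    Region("figure", "parabola crossing the x-axis at -2 and 2"),
]
records = run_pipeline(page)
print(json.dumps(records, ensure_ascii=False, indent=2))
print(to_markdown(records))
```

Keeping one record per region means the same list can be serialized as JSON for dataset generation or flattened to Markdown for RAG preprocessing.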

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program


Comments URL: https://news.ycombinator.com/item?id=43590998

Points: 16

# Comments: 1


Created 16h ago | Apr 5, 2025, 06:50:06

