Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
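To illustrate the multi-stage idea, here is a minimal sketch of how detected layout regions might be routed to different recognizers and then emitted as JSON and Markdown. It is not the repository's actual code: the `Region` type, the handler functions, and the `ROUTES` table are all hypothetical stand-ins, and the real backends (DocLayout-YOLO, Google Vision, MathPix, Gemini Pro Vision) are replaced by stubs so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """A detected layout region (hypothetical type, not from the repo)."""
    kind: str     # "text", "math", "table", or "figure"
    content: str  # stands in for the cropped image of the region


# Stub recognizers. In a real pipeline each would call an external
# service; here they only transform the placeholder string.
def recognize_text(region: Region) -> str:
    return region.content


def recognize_math(region: Region) -> str:
    return f"${region.content}$"  # wrap as inline LaTeX


def recognize_table(region: Region) -> str:
    return region.content  # assume the content is already a Markdown table


def describe_figure(region: Region) -> str:
    return f"[Figure: {region.content}]"


# Routing table: region kind -> recognizer (assumed design).
ROUTES: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "math": recognize_math,
    "table": recognize_table,
    "figure": describe_figure,
}


def run_pipeline(regions: List[Region]) -> List[dict]:
    """Route each region to its recognizer and collect JSON-ready records."""
    records = []
    for index, region in enumerate(regions):
        handler = ROUTES.get(region.kind, recognize_text)  # fall back to plain text
        records.append({"index": index, "kind": region.kind, "text": handler(region)})
    return records


def to_markdown(records: List[dict]) -> str:
    """Join the recognized pieces into one Markdown document, in reading order."""
    return "\n\n".join(record["text"] for record in records)


if __name__ == "__main__":
    page = [
        Region("text", "Solve for x."),
        Region("math", "x^2 = 4"),
        Region("figure", "cell diagram"),
    ]
    records = run_pipeline(page)
    print(json.dumps(records, ensure_ascii=False, indent=2))
    print(to_markdown(records))
```

Keeping the routing in a plain dictionary means a new region type or a different backend could be swapped in without touching the loop; `ensure_ascii=False` keeps Japanese and Korean text readable in the JSON output.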

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

Comments URL: https://news.ycombinator.com/item?id=43590998

Points: 16

Number of comments: 1

Created 21d | Apr 5, 2025, 6:50:06 AM
