Hi HN,
I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.
Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
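To give a rough idea of what "clean, structured formats like JSON and Markdown" could mean for a page with mixed content, here is a minimal sketch. It is not taken from the repository: `Region`, `to_markdown`, and `to_json` are hypothetical names, the sample regions are made up for illustration, and the recognition stages themselves (layout detection, text OCR, math OCR) are left out entirely. It only shows how already-recognized regions might be assembled into the two output formats.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Region:
    """One recognized block on a page (hypothetical structure, not the repo's schema)."""
    kind: str     # "text", "math", "table", or "figure"
    content: str  # recognized text, LaTeX source, or a figure description
    order: int    # reading order on the page


def to_markdown(regions):
    """Render regions in reading order, wrapping math in $$ ... $$ blocks."""
    parts = []
    for region in sorted(regions, key=lambda r: r.order):
        if region.kind == "math":
            parts.append(f"$$\n{region.content}\n$$")
        elif region.kind == "figure":
            parts.append(f"[Figure: {region.content}]")
        else:
            parts.append(region.content)
    return "\n\n".join(parts)


def to_json(regions):
    """Serialize regions in reading order; ensure_ascii=False keeps CJK text readable."""
    ordered = sorted(regions, key=lambda r: r.order)
    return json.dumps([asdict(r) for r in ordered], ensure_ascii=False, indent=2)


# Made-up sample regions, deliberately listed out of reading order.
regions = [
    Region("math", r"\int_0^1 x^2 \, dx", 1),
    Region("text", "次の積分を求めよ。", 0),
    Region("figure", "graph of y = x^2", 2),
]

markdown = to_markdown(regions)
records = json.loads(to_json(regions))
```

Keeping the region type alongside the content is one way to let a downstream RAG or fine-tuning step filter or weight math, tables, and figures differently from body text.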
Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I'd love to hear any feedback or ideas for improvement.
GitHub: https://github.com/ses4255/Versatile-OCR-Program
Comments URL: https://news.ycombinator.com/item?id=43590998
Points: 16
# Comments: 1