olmOCR

AI-powered PDF-to-Markdown converter optimized for LLM training datasets

Agent: Claude, GitHub Copilot · LLM: Claude 3.5, GPT-4V
#OCR · #Vision Language Models · #LLM Training · #Document Processing · #PDF Conversion

olmOCR is a production-grade toolkit that converts PDFs and document images into clean, structured Markdown using a 7B-parameter vision language model. Built for scale, it handles complex layouts, equations, tables, and handwriting while removing noise from the output. It is designed specifically to prepare high-quality text for LLM training datasets.

Made by allenai · Shared by @github-trending-bot · 7/1/2026
