contextgem
docext
Metric | contextgem | docext
---|---|---
Mentions | 3 | 5
Stars | 1,237 | 1,448
Growth | 17.1% | 66.9%
Activity | 8.6 | 9.5
Latest commit | 4 days ago | 15 days ago
Language | Python | Python
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
contextgem
- Transform DOCX into LLM-ready data
As part of work on my open-source project ContextGem, I've built a native, zero-dependency DOCX converter that transforms Word documents into LLM-ready data.
This custom-built converter processes Word XML directly, extracts content comprehensively, and covers what other open-source tools often miss or do not support:
- Rich paragraph and sentence metadata for enhanced context
- Misaligned tables
- Comments, footnotes, and textboxes
- Embedded images
The converted document can then be easily used in ContextGem's LLM extraction workflows.
It is a good fit for developers building contract intelligence applications where precision matters. The converter preserves document structure and relationships, which helps LLMs understand and analyze document content.
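To make "directly processes Word XML" concrete, here is a minimal sketch using only the Python standard library. It is not ContextGem's `DocxConverter` and does not use its API; the helper names `extract_paragraphs` and `make_minimal_docx` are hypothetical. It relies only on the fact that a DOCX file is a ZIP archive whose main body lives in `word/document.xml`, and it ignores everything the post lists beyond plain paragraphs (tables as structures, comments, footnotes, textboxes, images).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_bytes: bytes) -> list[str]:
    """Return the text of each non-empty paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every w:t node in it.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(paragraphs: list[str]) -> bytes:
    """Build a toy archive holding only word/document.xml (illustration only).

    Real DOCX files contain more parts; the input text is not XML-escaped here.
    """
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_minimal_docx(["Clause 1", "Clause 2"])
    for line in extract_paragraphs(sample):
        print(line)
```

For a real file, `extract_paragraphs(open(path, "rb").read())` would apply the same logic. A production converter has to handle far more of the format than this sketch does.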
Try it, or share it with your dev team, and see the difference in your document processing pipeline!
GitHub: https://github.com/shcherbak-ai/contextgem
All DocxConverter features: https://contextgem.dev/converters/docx.html
- I Built an Open-Source Framework to Make LLM Data Extraction Dead Simple
After getting tired of writing endless boilerplate to extract structured data from documents with LLMs, I built ContextGem - a free, open-source framework that makes this radically easier.
- ContextGem: Easier and faster way to build LLM extraction workflows
docext
- Open-source 3B param model better than Mistral OCR
Key Features:
- LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
- Image Descriptions for LLMs: Describes embedded images using structured `<img>` tags. Handles logos, charts, plots, and so on.
- Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in `<signature>` blocks.
- Watermark Extraction: Extracts watermark text and stores it within a `<watermark>` tag for traceability.
- Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
- Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
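As a rough illustration of what "reliable parsing in downstream apps" could look like, the sketch below pulls tagged spans and checkbox states out of model output with plain regular expressions. Everything here is an assumption made for the example: the tag names `<watermark>` and `<signature>`, the symbols ☐, ☑ and ☒ and the meanings assigned to them, the sample text, and the helper names `extract_tagged` and `parse_checkboxes`. It does not call the model or docext.

```python
import re

# Assumed mapping from checkbox symbols to states (illustrative only).
CHECKBOX_STATES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}


def extract_tagged(text: str, tag: str) -> list[str]:
    """Return the stripped contents of every <tag>...</tag> span in text."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(text)]


def parse_checkboxes(text: str) -> list[tuple[str, str]]:
    """Return (label, state) pairs for lines that start with a checkbox symbol."""
    results = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped[0] in CHECKBOX_STATES:
            label = stripped[1:].strip()
            results.append((label, CHECKBOX_STATES[stripped[0]]))
    return results


if __name__ == "__main__":
    # Hypothetical output, written by hand for this example.
    sample = (
        "<watermark>CONFIDENTIAL</watermark>\n"
        "☑ I agree to the terms\n"
        "☐ Subscribe to updates\n"
        "<signature>J. Doe</signature>"
    )
    print(extract_tagged(sample, "watermark"))
    print(extract_tagged(sample, "signature"))
    print(parse_checkboxes(sample))
```

Regular expressions are enough here because the tags are assumed to be flat and non-nested; nested or malformed markup would call for a proper parser.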
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
- Show HN: Open-source 3B param model better than Mistral OCR
- Gemini-2.5-pro-preview-06-05 performance on IDP Leaderboard
- Show HN: Open-source, cross platform document data extraction with no OCR
- Show HN: On-premises unstructured data extraction with 4 lines of code
What are some alternatives?
sdk - Lightfeed SDK to search and filter web data
doctor - A microservice for document conversion at scale
validex - Simplifies the retrieval, extraction, and training of structured data from various unstructured sources.
table-transformer - 🔍 Table Extraction Tool: A powerful open-source solution combining OCR and computer vision for extracting structured tabular data from images. Ideal for LLM preprocessing, data analysis, and automation. 🚀
dn-institute - Distributed Networks Institute
awesome-document-understanding - A curated list of resources for Document Understanding (DU) topic