Docext Alternatives
Similar projects and alternatives to docext
table-transformer
🔍 Table Extraction Tool: A powerful open-source solution combining OCR and computer vision for extracting structured tabular data from images. Ideal for LLM preprocessing, data analysis, and automation. 🚀 (by Sudhanshu1304)
extractous
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
pdf-craft
PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.
MindsDB
AI's query engine - Platform for building AI that can answer questions over large scale federated data. - The only MCP Server you'll ever need
pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
docext discussion
docext reviews and mentions
Open-source 3B param model better than Mistral OCR
Key Features:
- LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
- Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
- Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
- Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.
- Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
- Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
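As a rough illustration of the "reliable parsing in downstream apps" point, the sketch below pulls signatures, watermarks, and checkbox states out of a piece of model output. It assumes the output wraps those items in `<signature>` and `<watermark>` tags and marks checkboxes with ☐, ☑, and ☒ at the start of a line. The function name `parse_ocr_output`, the symbol-to-state mapping, and the sample text are all hypothetical and not taken from docext or the model itself.

```python
import re

# Assumed tag names; the exact markup the model emits is not confirmed here.
SIGNATURE_RE = re.compile(r"<signature>(.*?)</signature>", re.DOTALL)
WATERMARK_RE = re.compile(r"<watermark>(.*?)</watermark>", re.DOTALL)

# Assumed meaning of each checkbox symbol.
CHECKBOX_STATES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}


def parse_ocr_output(text):
    """Collect tagged signatures, watermarks, and checkbox lines from text."""
    signatures = [m.strip() for m in SIGNATURE_RE.findall(text)]
    watermarks = [m.strip() for m in WATERMARK_RE.findall(text)]

    checkboxes = []
    for line in text.splitlines():
        stripped = line.strip()
        # A checkbox line is assumed to start with one of the symbols.
        if stripped and stripped[0] in CHECKBOX_STATES:
            label = stripped[1:].strip()
            checkboxes.append((label, CHECKBOX_STATES[stripped[0]]))

    return {
        "signatures": signatures,
        "watermarks": watermarks,
        "checkboxes": checkboxes,
    }


# Hypothetical sample output, made up for this example.
sample = """<watermark>CONFIDENTIAL</watermark>
☑ I agree to the terms
☐ Subscribe to newsletter
<signature>Jane Doe</signature>
"""

result = parse_ocr_output(sample)
```

Because the markers are plain tags and single Unicode characters, a few regular expressions and a line scan are enough here; a real pipeline would likely need to handle nested or malformed markup more carefully.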
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
- Show HN: Open-source 3B param model better than Mistral OCR
- Gemini-2.5-pro-preview-06-05 performance on IDP Leaderboard
- Show HN: Open-source, cross platform document data extraction with no OCR
- Show HN: On-premises unstructured data extraction with 4 lines of code
Stats
NanoNets/docext is an open source project licensed under the MIT License, which is an OSI-approved license.
The primary programming language of docext is Python.