Top 12 text-extraction Open-Source Projects

sumy

1 3,428 6.3 Python

Module for automatic summarization of text documents and HTML pages.
trafilatura

13 2,898 8.7 Python

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
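To give a sense of the "extra code" the comment refers to, here is a minimal sketch of just one item from that list, URL de-duplication, using only the Python standard library. It is not trafilatura's implementation: the function names `normalize_url` and `deduplicate` and the `TRACKING_PARAMS` set are hypothetical, chosen purely for illustration, and a real crawler would likely need more normalization rules than this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to ignore (an assumption for
# this sketch, not taken from trafilatura).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form so trivially different links compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Treat "/post" and "/post/" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest so order does not matter.
    query = urlencode(
        sorted((k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS)
    )
    # The fragment (e.g. "#comments") is discarded.
    return urlunsplit((scheme, host, path, query, ""))


def deduplicate(urls):
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    links = [
        "https://Example.com/post/",
        "https://example.com/post#comments",
        "https://example.com/post?utm_source=x",
        "https://example.com/other?b=2&a=1",
        "https://example.com/other?a=1&b=2",
    ]
    for link in deduplicate(links):
        print(link)
```

Even this small piece involves several judgment calls (which parameters to strip, whether trailing slashes matter), which is the commenter's point about the effort a dedicated tool saves.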

unipdf

24 2,375 5.6 Go

Golang PDF library for creating and processing PDF files (pure Go)

Project mention: PDF Annotations and Collaboration with Golang PDF Library | dev.to | 2023-09-01

But how do we elevate our interaction with these files? How do we make annotations, edits, and collaboration more seamless? Here's where the Golang PDF Library swoops in like a superhero for your PDF woes.

tika-python

4 1,420 2.2 Python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
datashare

4 547 9.8 Java

A self-hosted search engine for documents.
pdftools

1 499 4.6 C++

Text Extraction, Rendering and Converting of PDF Documents
srt

1 429 3.5 Python

A simple library and set of tools for parsing, modifying, and composing SRT files. (by cdown)
hotpdf

1 167 9.7 Python

hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents.

Project mention: Show HN: Hotpdf – Search and Extract text within PDFs | news.ycombinator.com | 2024-02-27

PDFIO.jl

1 122 3.5 Julia

PDF Reader Library for Native Julia.
cat

0 88 3.8 Go

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
docwire

1 51 9.8 C++

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality

Project mention: DocWire SDK: Award-winning modern data processing in C++17/20 | news.ycombinator.com | 2024-01-04

ApyHub

13 5 1.5 TypeScript

ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.

Project mention: Exploring the Top SERP (Search Engine Result Pages) APIs | dev.to | 2024-04-03

Most of the providers let you test the APIs right from a terminal in the platform. Brightdata also provides an interactive API Playground for developers to test and check SERP Ranking for multiple search engines. ApyHub also has an API Playground where you can test the APIs in the UI, plus you can see the snippet of the provided input in cURL. DataForSEO and Oxylabs don’t provide any internal tool for API tests but provide cURL request code to test on API clients.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

text-extraction related posts

Trafilatura: Python tool to gather text on the Web
3 projects | news.ycombinator.com | 14 Aug 2023

I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
2 projects | /r/SideProject | 6 Mar 2023

Testing fast installation in tear-down environment
1 project | /r/learnpython | 6 Jul 2022

Advice on standard design pattern for comparison test script
1 project | /r/learnpython | 24 May 2022

Leaks! how to organize them?
4 projects | /r/OSINT | 2 May 2022

Journalism/research/information linking/graph db
4 projects | /r/selfhosted | 21 Apr 2022

Automate dependency installation
1 project | /r/learnpython | 9 Apr 2022

Index

What are some of the best open-source text-extraction projects? This list will help you:

#	Project	Stars
1	sumy	3,428
2	trafilatura	2,898
3	unipdf	2,375
4	tika-python	1,420
5	datashare	547
6	pdftools	499
7	srt	429
8	hotpdf	167
9	PDFIO.jl	122
10	cat	88
11	docwire	51
12	ApyHub	5
