Top 12 text-extraction Open-Source Projects
- trafilatura
  Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
- tika-python
  Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
- hotpdf
  hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents.
- docwire
  DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality.
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
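To give a feel for what "writing it yourself" involves, here is a minimal standard-library sketch of just two of the pieces mentioned above: URL de-duplication and a crude main-content heuristic. It is not how trafilatura works internally. The function names (`normalize_url`, `dedupe_urls`, `extract_main_text`), the list of skipped tags, and the `min_words` threshold are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

# Assumption: these tags usually hold page chrome rather than article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls):
    """Return URLs in first-seen order, skipping normalized duplicates."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


class ParagraphCollector(HTMLParser):
    """Collect text from <p> elements that sit outside the skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html: str, min_words: int = 5) -> str:
    """Naive heuristic: keep paragraphs with at least `min_words` words."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(p for p in parser.paragraphs if len(p.split()) >= min_words)
```

Even this toy version ignores encodings, malformed markup, articles that don't use `<p>` tags, robots.txt, rate limiting, sitemaps, and feeds, which is roughly the point the comment is making.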
But how do we elevate our interaction with these files? How do we make annotations, edits, and collaboration more seamless? Here's where the Golang PDF Library swoops in like a superhero for your PDF woes.
Project mention: Show HN: Hotpdf – Search and Extract text within PDFs | news.ycombinator.com | 2024-02-27
Project mention: DocWire SDK: Award-winning modern data processing in C++17/20 | news.ycombinator.com | 2024-01-04
Most of the providers let you test the APIs right from a terminal in the platform. Brightdata also provides an interactive API Playground where developers can test and check SERP ranking for multiple search engines. ApyHub also has an API Playground where you can test the APIs in the UI and see the provided input as a cURL snippet. DataForSEO and Oxylabs don't provide an internal tool for API tests, but they do provide cURL request code for testing in API clients.
text-extraction related posts
- Trafilatura: Python tool to gather text on the Web
- I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
- Testing fast installation in tear-down environment
- Advice on standard design pattern for comparison test script
- Leaks! how to organize them?
- Journalism/research/information linking/graph db
- Automate dependency installation
Index
What are some of the best open-source text-extraction projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | sumy | 3,428 |
2 | trafilatura | 2,898 |
3 | unipdf | 2,375 |
4 | tika-python | 1,420 |
5 | datashare | 547 |
6 | pdftools | 499 |
7 | srt | 429 |
8 | hotpdf | 167 |
9 | PDFIO.jl | 122 |
10 | cat | 88 |
11 | docwire | 51 |
12 | ApyHub | 5 |