Top 5 Python text-extraction Projects
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
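To give a sense of the extra code the comment above refers to, here is a minimal standard-library sketch of just two items from that list: sitemap parsing and URL de-duplication. This is not how trafilatura implements them; the function names (`normalize_url`, `sitemap_urls`, `deduplicate`) and the normalization rules are assumptions chosen for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lower-case scheme and host,
    drop the fragment, and strip a trailing slash from the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def deduplicate(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


# Illustrative sitemap; example.com is a placeholder domain.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/a/</loc></url>
  <url><loc>https://EXAMPLE.com/b#section</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in deduplicate(sitemap_urls(SAMPLE)):
        print(url)
```

Even this small sketch leaves out crawl politeness, feed parsing, download queues, and main-content heuristics, which is the point the comment makes about building everything on top of BeautifulSoup.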
-
tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
-
hotpdf
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting text and finding text within PDF documents.
Project mention: Show HN: Hotpdf – Search and Extract text within PDFs | news.ycombinator.com | 2024-02-27
Python text-extraction related posts
- Trafilatura: Python tool to gather text on the Web
- I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
- Testing fast installation in tear-down environment
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
-
A note from our sponsor - WorkOS
workos.com | 29 Mar 2024
Index
What are some of the best open-source text-extraction projects in Python? This list will help you:
# | Project | Stars
---|---|---
1 | sumy | 3,401
2 | trafilatura | 2,656
3 | tika-python | 1,395
4 | srt | 421
5 | hotpdf | 158