Top 3 Python article-extractor Projects
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
-
Google-Docs-To-Clean-HTML
A Google Docs HTML Cleaner: This program transforms messy HTML from Google Docs into clean code primarily using LXML with a modular mixin design pattern.
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
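To give a sense of what just one item on that list involves, here is a minimal standard-library sketch of a main-content heuristic. It is not trafilatura's algorithm: the `SKIP_TAGS` set, the `min_chars` threshold, and the names `ParagraphCollector` and `extract_main_text` are all assumptions made up for this illustration. It keeps `<p>` text found outside obvious boilerplate containers and drops very short paragraphs.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption for illustration,
# not trafilatura's actual rule set).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Collect the text of <p> elements that sit outside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a SKIP_TAGS element
        self.in_paragraph = False
        self.buffer = []           # text chunks of the current paragraph
        self.paragraphs = []       # finished, whitespace-normalized paragraphs

    def _flush(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.paragraphs.append(text)
        self.buffer = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            if self.in_paragraph:  # previous <p> was never explicitly closed
                self._flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self._flush()

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.buffer.append(data)


def extract_main_text(html, min_chars=40):
    """Return kept paragraphs joined by newlines (hypothetical helper)."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if parser.in_paragraph:        # document ended inside an open <p>
        parser._flush()
    return "\n".join(p for p in parser.paragraphs if len(p) >= min_chars)
```

Even this toy version ignores content outside `<p>` tags, malformed markup, metadata, and comments, which is the commenter's point: the remaining features would each need similar hand-written code.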
Index
What are some of the best open-source article-extractor projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | trafilatura | 2,778 |
2 | sneakpeek | 101 |
3 | Google-Docs-To-Clean-HTML | 5 |