Top 18 Web Content Extracting Open-Source Projects
-
newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
-
python-readability
Fast Python port of arc90's readability tool, updated to match the latest readability.js.
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
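To give a feel for just one item in that list, here is a minimal sketch of a main-content heuristic using only the Python standard library. It is not how trafilatura or python-readability work internally. It simply keeps `<p>` blocks that contain at least a few words, on the assumption that navigation and footer text tends to be short. The names `ParagraphCollector`, `extract_main_text`, and `min_words`, and the sample HTML, are all hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element, ignoring <script>/<style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # text chunks of the <p> being read, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)


def extract_main_text(html, min_words=8):
    """Naive heuristic (an assumption, not a library algorithm):
    keep only paragraphs with at least `min_words` words."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    kept = [p for p in parser.paragraphs if len(p.split()) >= min_words]
    return "\n\n".join(kept)


# Hypothetical sample page: short nav/footer paragraphs, longer article paragraphs.
SAMPLE = """
<html><body>
<nav><p>Home | About | Login</p></nav>
<article>
<p>Web content extraction tries to keep the article body and drop navigation, ads, and footers.</p>
<p>Dedicated libraries add crawling, metadata, and language detection on top of that step.</p>
</article>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

text = extract_main_text(SAMPLE)
print(text)
```

Even this toy version needs a parser subclass and a tunable threshold, and it would misfire on pages with short article paragraphs or long boilerplate, which is the kind of gap the dedicated libraries aim to cover.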
Project mention: Show HN: I made a tool to clean and convert any webpage to Markdown | news.ycombinator.com | 2024-04-14
This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability
Web Content Extracting related posts
-
Add Thumbnails to your project links for better SEO
-
How to customize link previews on social networks in Next.js
-
Building an SEO-friendly responsive i18n website using Vite-SSG + Vuetify3
-
Java virtual threads caused a deadlock in TPC-C for PostgreSQL
-
Is there a reason why cover art is not showing up?
-
What is an open graph? You must know this feature in web development.
-
Making Dynamic Website Thumbnail
-
Index
What are some of the best open-source Web Content Extracting projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | newspaper | 13,737 |
2 | python-goose | 3,942 |
3 | textract | 3,784 |
4 | toapi | 3,462 |
5 | sumy | 3,419 |
6 | trafilatura | 2,853 |
7 | python-readability | 2,568 |
8 | html2text | 1,664 |
9 | Goose3 | 765 |
10 | micawber | 622 |
11 | lassie | 600 |
12 | inscriptis | 233 |
13 | opengraph | 224 |
14 | Haul | 157 |
15 | htmldate | 107 |
16 | sanitize | 64 |
17 | JSONPATH | 37 |
18 | Data Extractor | 27 |