Top 14 Python Web Content Extracting Projects
- newspaper: newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.
- trafilatura: A Python and command-line tool for gathering text on the Web, covering web crawling/scraping and extraction of text, metadata, and comments.
- python-readability: A fast Python port of arc90's readability tool, updated to match the latest readability.js.
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
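To give a sense of how much code even one of those pieces involves, here is a minimal standard-library sketch of URL de-duplication. It is not how trafilatura does it; the `normalize_url` and `deduplicate` helpers and the `TRACKING_PARAMS` set are hypothetical names chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to ignore (an assumption, not taken from any library).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of the URL used only as a de-duplication key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "/a" and "/a/" as the same page; keep "/" for an empty path.
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


def deduplicate(urls):
    """Keep the first occurrence of each URL, comparing by normalized form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

A real crawler would also need to handle redirects, canonical link tags, and per-site quirks, which is part of why a dedicated library saves so much work.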
Project mention: Show HN: I made a tool to clean and convert any webpage to Markdown | news.ycombinator.com | 2024-04-14
This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability
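As a rough illustration of the kind of non-AI heuristic the comment refers to, the sketch below keeps paragraphs that are reasonably long and not dominated by link text. It is a deliberately simplified stand-in, not the readability algorithm or python-readability's implementation; the class name, function name, and thresholds are assumptions.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and the number of characters inside links."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of (text, link_chars) tuples
        self._in_p = False
        self._in_a = False
        self._chunks = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []
            self._link_chars = 0
        elif tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.paragraphs.append((text, self._link_chars))
            self._in_p = False
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)
            if self._in_a:
                self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=40, max_link_density=0.5):
    """Return paragraphs that look like article text (thresholds are arbitrary)."""
    parser = ParagraphCollector()
    parser.feed(html)
    kept = []
    for text, link_chars in parser.paragraphs:
        if len(text) < min_chars:
            continue  # too short: likely a caption, byline, or menu item
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text: likely navigation
        kept.append(text)
    return "\n\n".join(kept)
```

Production extractors score whole DOM subtrees, use class and id hints, and handle malformed markup, so this sketch only conveys the general idea.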
Project mention: How to customize link previews on social networks in Next.js | dev.to | 2024-03-20
Python Web Content Extracting related posts
- How to customize link previews on social networks in Next.js
- Building an SEO-friendly responsive i18n website using Vite-SSG + Vuetify3
- Is there a reason why cover art is not showing up?
- What is an open graph? You must know this feature in web development.
- Trafilatura: Python tool to gather text on the Web
- python-readability – extract and clean up HTML main body text and title
- Displaying your full-sized YouTube thumbnail or a custom OG image in a Twitter card
Index
What are some of the best open-source Web Content Extracting projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | newspaper | 13,720 |
2 | toapi | 3,462 |
3 | sumy | 3,417 |
4 | trafilatura | 2,778 |
5 | python-readability | 2,563 |
6 | html2text | 1,655 |
7 | micawber | 622 |
8 | inscriptis | 233 |
9 | opengraph | 224 |
10 | Haul | 157 |
11 | htmldate | 106 |
12 | sanitize | 64 |
13 | JSONPATH | 37 |
14 | Data Extractor | 27 |