Web Content Extracting

Open-source projects categorized as Web Content Extracting

Top 18 Web Content Extracting Open-Source Projects

  • newspaper

    newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:

  • python-goose

    HTML Content / Article Extractor, web scraping library in Python

  • textract

    extract text from any document. no muss. no fuss.

  • toapi

    Every web site provides APIs.

  • sumy

    Module for automatic summarization of text documents and HTML pages.

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

  • Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

    The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

    Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.

  • python-readability

    fast python port of arc90's readability tool, updated to match latest readability.js!

  • Project mention: Show HN: I made a tool to clean and convert any webpage to Markdown | news.ycombinator.com | 2024-04-14

    This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability

  • html2text

    Convert HTML to Markdown-formatted text. (by Alir3z4)

  • Goose3

    A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html

  • micawber

    a small library for extracting rich content from urls

  • lassie

    Web Content Retrieval for Humans™

  • inscriptis

    A Python-based HTML-to-text conversion library, command-line client, and Web service.

  • opengraph

    A python module to parse the Open Graph Protocol

  • Project mention: Add Thumbnails to your project links for better SEO | dev.to | 2024-05-01

    OpenGraph docs

  • Haul

    An Extensible Image Crawler (by vinta)

  • htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line

  • sanitize

    Bringing sanity to the world of messed-up data

  • JSONPATH

    A query expression for extracting data from JSON. (by linw1995)

  • Data Extractor

    Combine XPath, CSS Selectors and JSONPath for Web data extracting.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
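
The Hacker News comment quoted under trafilatura mentions "heuristics for extracting just the main article content" as one of the things a plain HTML parser does not give you. As a rough illustration of what such a heuristic can look like, here is a minimal standard-library sketch. It is not the algorithm used by trafilatura, python-readability, or any other project on this list; the names `ParagraphCollector` and `extract_main_text`, the list of skipped tags, and both thresholds are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Tags whose content is rarely part of the article body (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element and count how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a SKIP_TAGS element
        self.in_paragraph = False
        self.in_link = False
        self.text_parts = []
        self.link_chars = 0
        self.paragraphs = []       # list of (text, link_chars) tuples

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self._flush()          # a new <p> implicitly closes an unclosed one
            self.in_paragraph = True
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.text_parts.append(data)
            if self.in_link:
                self.link_chars += len(data.strip())

    def _flush(self):
        """Store the current paragraph (if any) and reset the per-paragraph state."""
        if self.in_paragraph:
            text = " ".join("".join(self.text_parts).split())
            if text:
                self.paragraphs.append((text, self.link_chars))
        self.in_paragraph = False
        self.text_parts = []
        self.link_chars = 0


def extract_main_text(html, min_chars=40, max_link_ratio=0.5):
    """Keep paragraphs that are reasonably long and not dominated by link text."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()             # pick up a trailing unclosed <p>
    kept = [
        text
        for text, link_chars in collector.paragraphs
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(kept)


sample = """
<html><body>
<nav><p>Home | About | Contact | A long navigation paragraph that should be ignored</p></nav>
<p><a href="/a">Related story one</a> <a href="/b">Related story two is here</a></p>
<p>Short.</p>
<p>The main article text is usually found in longer paragraphs that contain few links.</p>
<script>var p = "<p>not content</p>";</script>
</body></html>
"""

print(extract_main_text(sample))
```

On the sample page, only the long, link-free paragraph survives: the navigation block is skipped by tag, the "related stories" paragraph is dropped for being almost entirely link text, and the one-word paragraph is too short. Real pages need far more than this (nested markup, comments sections, metadata, encodings), which is the point the quoted comment makes about using a dedicated library.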

Web Content Extracting related posts

  • Add Thumbnails to your project links for better SEO

    2 projects | dev.to | 1 May 2024
  • How to customize link previews on social networks in Next.js

    1 project | dev.to | 20 Mar 2024
  • Building an SEO-friendly responsive i18n website using Vite-SSG + Vuetify3

    1 project | dev.to | 17 Mar 2024
  • Java virtual threads caused a deadlock in TPC-C for PostgreSQL

    4 projects | news.ycombinator.com | 15 Jan 2024
  • Is there a reason why cover art is not showing up?

    1 project | /r/AO3 | 8 Dec 2023
  • What is an open graph? You must know this feature in web development.

    1 project | dev.to | 23 Oct 2023
  • Making Dynamic Website Thumbnail

    4 projects | dev.to | 21 Sep 2023

Index

What are some of the best open-source Web Content Extracting projects? This list will help you:

#   Project             Stars
1   newspaper           13,737
2   python-goose         3,942
3   textract             3,784
4   toapi                3,462
5   sumy                 3,419
6   trafilatura          2,853
7   python-readability   2,568
8   html2text            1,664
9   Goose3                 765
10  micawber               622
11  lassie                 600
12  inscriptis             233
13  opengraph              224
14  Haul                   157
15  htmldate               107
16  sanitize                64
17  JSONPATH                37
18  Data Extractor          27
