Self-hosted web scraper?

This page summarizes the projects mentioned and recommended in the original post on /r/selfhosted.

  • Huginn

    Create agents that monitor and act on your behalf. Your agents are standing by!

  • You didn't say which features are important or what about changedetection.io didn't work for you, but maybe try ArchiveBox or Huginn.

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • Trilium Notes

    Build your personal knowledge base with Trilium Notes

  • If you just want to scrape the text, images, and formatting of a web page, you can use Trilium Notes with its web clipper browser plugin. The web clipper can copy the whole page as it is, images and all, to your local Trilium instance.

  • crawlab

    Distributed web crawler admin platform for spider management, regardless of language or framework.

  • I haven't tried it, but this project looks promising: https://github.com/crawlab-team/crawlab

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
