Looking for open source software to scrape webpages but also make them searchable with a webui. (locally hosted)

This page summarizes the projects mentioned and recommended in the original post on /r/DataHoarder.

  • awesome-datahoarding

    List of data-hoarding related tools

  • You might also be interested in this list. The alternatives listed there are really great, and some support the WARC format (which my program doesn't).

  • Collect

    A server to collect & archive websites that also supports video downloads (by xarantolus)

  • I created Collect a few years ago and still use it today.

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  • I use grab-site to crawl a website and pack it into a WARC archive, then feed that archive into pywb.

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • I landed on an open source project called ArchiveBox. It's pretty amazing (basically like a locally hosted Wayback Machine and crawler): https://github.com/ArchiveBox/ArchiveBox It also captures in different ways to ensure data integrity and can schedule captures! Thanks everyone for your input and apps for me to research!
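
    The grab-site and pywb workflow mentioned in the list above (crawl a site into a WARC, then load it into pywb for browsing) could look roughly like the sketch below. It only builds the command lines and does not run them. The function name, the collection name, and the WARC file names are hypothetical, and grab-site's actual output file names will differ; `grab-site URL`, `wb-manager init`, `wb-manager add`, and `wayback` are the basic invocations of those tools.

```python
def build_archive_commands(url, collection, warc_paths):
    """Return command lines for a crawl-then-replay workflow.

    Nothing is executed here; the caller decides how to run the commands.
    `collection` and `warc_paths` are illustrative placeholders.
    """
    commands = [
        # Crawl the site; grab-site writes its WARC files into a new directory.
        ["grab-site", url],
        # Create an empty pywb collection to hold the archive.
        ["wb-manager", "init", collection],
    ]
    # Add each WARC produced by the crawl to the pywb collection.
    for warc in warc_paths:
        commands.append(["wb-manager", "add", collection, str(warc)])
    # Start the pywb replay server for local browsing.
    commands.append(["wayback"])
    return commands


if __name__ == "__main__":
    for cmd in build_archive_commands(
        "https://example.com/", "my-archive", ["example.warc.gz"]
    ):
        print(" ".join(cmd))
```

    Keeping the commands as lists makes it straightforward to pass each one to `subprocess.run` once the WARC paths from a real crawl are known.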

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Ask HN: How can I back up an old vBulletin forum without admin access?

    9 projects | news.ycombinator.com | 29 Jan 2024
  • An Introduction to the WARC File

    7 projects | news.ycombinator.com | 29 Jan 2024
  • Best practices for archiving websites

    2 projects | /r/datacurator | 6 Dec 2023
  • How to mirror multiple websites correctly?

    2 projects | /r/DataHoarder | 12 May 2022
  • Stack Overflow Developer Story Data Dump (10 whole MB !)

    2 projects | /r/DataHoarder | 1 Apr 2022