Looking for open source software to scrape webpages but also make them searchable with a webui. (locally hosted)

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/DataHoarder

  • awesome-datahoarding

    List of data-hoarding related tools

    You might also be interested in this list; the alternatives it includes are really great and better, and some support the WARC format (which my program doesn't).

  • Collect

    A server to collect & archive websites that also supports video downloads (by xarantolus)

    I created Collect a few years ago and still use it today.

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

    I use grab-site to crawl a website and pack it into a WARC archive, then feed that archive into pywb.
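
    The grab-site → pywb workflow described above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the collection name `my-archive` and the WARC path are placeholders (grab-site writes its output into a timestamped directory named after the crawled site).

    ```shell
    # Crawl the site; grab-site writes WARC files into a new
    # directory named after the site plus a timestamp.
    grab-site https://example.com

    # Create a pywb collection and add the finished WARC to it
    # (the path below is a placeholder; use the file grab-site produced).
    wb-manager init my-archive
    wb-manager add my-archive path/to/crawl.warc.gz

    # Serve the replay web UI, locally hosted (default port 8080).
    wayback
    ```

    The result is browsable at http://localhost:8080/my-archive/ and searchable through pywb's collection index.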

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

    I landed on an open-source project called ArchiveBox. It's pretty amazing (basically a locally hosted Wayback Machine and crawler): https://github.com/ArchiveBox/ArchiveBox. It also captures pages in several formats to help ensure data integrity, and it can run on a schedule! Thanks everyone for your input and for the apps to research!
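
    A basic ArchiveBox session, as hinted at in the comment above, might look like the following. This is a hedged sketch of common CLI usage; the URLs are placeholders, and exact flags may vary between ArchiveBox versions.

    ```shell
    # Create a new archive in the current (empty) directory.
    archivebox init

    # Archive a single URL; depending on configured extractors this
    # saves HTML, a PDF, a screenshot, WARC output, media, and more.
    archivebox add 'https://example.com'

    # Start the locally hosted web UI for browsing and search
    # (default: http://127.0.0.1:8000).
    archivebox server

    # Re-import a feed of URLs on a schedule (placeholder feed URL).
    archivebox schedule --every=day 'https://example.com/feed.rss'
    ```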

NOTE: The number of mentions on this list reflects mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
