Ask HN: How can I back up an old vBulletin forum without admin access?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • grab-site

    The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  • The format you want is WARC. Even the Library of Congress uses it. There are many WARC scrapers; I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott: https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are), but I found that in less than 15 seconds of searching, so do your own due diligence.

  • detectorist-scraper

    A scrapy spider to extract post, thread, and user information from a vBulletin forum to a MongoDB database.

  • > I'm just not sure what the intermediate steps would be to get something usable like a vBulletin…

    Once you have a crawl, you'll likely want to convert that unstructured data to structured data. For example, if I look at https://www.vbulletin.org/forum/portal.php, the thread title and hierarchy are in one HTML element, posts are in another, etc. I see an old project (https://github.com/IanLondon/detectorist-scraper) that did this and may be a useful place to start, and I imagine there are others.

    Once you have the structured data, you can decide whether to use it to build a static site, to import it into another forum, etc.
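
As a rough illustration of the crawl-to-structured-data step described above, here is a minimal sketch using only Python's standard library. The class name `postbody` is a hypothetical placeholder, not something taken from vBulletin's real markup; you would need to inspect the crawled pages to find the elements that actually hold thread titles and posts.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class name "postbody" is an assumption made for
# illustration. Inspect the real forum's HTML to find the right element.
POST_CLASS = "postbody"


class PostExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains POST_CLASS."""

    def __init__(self):
        super().__init__()
        self.posts = []     # finished post texts, in document order
        self._depth = 0     # > 0 while we are inside a post <div>
        self._chunks = []   # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1        # nested <div> inside a post
        elif POST_CLASS in classes:
            self._depth = 1         # entering a post
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:    # leaving the post
                self.posts.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._chunks.append(data.strip())


def extract_posts(html):
    """Return a list of post texts found in one HTML page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

From there, each list of posts could be written out as JSON, rendered into static pages, or fed to an importer for another forum, depending on which route you pick.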

  • warctools

    Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) (by internetarchive)

  • You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems though with wanting to browse interconnected links in a forum format, but try this as a first step.

    One potential option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them in Python using the warctools module, perhaps together with BeautifulSoup, and parse/extract all of the data embedded in the WARC archives into your own "fresh" HTML webserver.

    https://github.com/internetarchive/warctools
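
To give a feel for what "opening the WARC files in Python" involves, here is a minimal standard-library sketch. It deliberately does not use the warctools API (check that project's documentation for its actual interface); instead it hand-parses the basic WARC record layout: a `WARC/` version line, header lines, a blank line, then `Content-Length` bytes of payload. The function names are my own, and the sketch assumes an uncompressed stream with no folded header lines.

```python
def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    `stream` must be a binary file-like object. Header names are lowercased.
    """
    while True:
        line = stream.readline()
        if not line:
            return                  # end of file
        if not line.strip():
            continue                # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break               # end of this record's header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        yield headers, stream.read(length)


def iter_response_bodies(stream):
    """Yield (url, body_bytes) for every 'response' record in the stream."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response record's payload is the raw HTTP response: status line
        # and HTTP headers, a blank line, then the body. Chunked or
        # compressed bodies are NOT decoded by this sketch.
        _http_head, _, body = payload.partition(b"\r\n\r\n")
        yield headers.get("warc-target-uri"), body
```

For a file on disk you would pass something like `open("crawl.warc", "rb")`; for `.warc.gz` files, wrapping the file with `gzip.open(path, "rb")` should work, since Python's gzip module reads concatenated gzip members. The extracted bodies can then be handed to an HTML parser such as BeautifulSoup.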

  • replayweb.page

    Serverless replay of web archives directly in the browser

  • vbulletin

    Discontinued Ruby Gem to fetch data from vBulletin Forums

  • > This is a gem to help extract data from vBulletin Forums, specifically those which you have no control over.

    https://github.com/lloydpick/vbulletin

    This is a very old tool, so it’s hard to say whether it will work; then again, it seems very relevant, so in the worst case it could provide inspiration.

  • forum-dl

    Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

  • You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl

    It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.

  • ArchiveBox

    🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

  • I guess your best chance is to use something like https://archivebox.io/.
