The format you want is WARC. Even the Library of Congress uses it. There are many WARC scrapers; I'd look at what the Internet Archive recommends. A quick search turned up this from Archive Team and Jason Scott: https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are), but I found that in less than 15 seconds of searching, so do your own diligence.
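For what it's worth, starting a crawl is basically a one-liner. Here's a minimal sketch of driving it from Python via subprocess; per grab-site's README the basic usage is `grab-site <URL>` (the dashboard is a separate `gs-server` process), and the forum URL here is just a placeholder.

```python
import subprocess

# Minimal sketch: per grab-site's README, basic usage is `grab-site <URL>`,
# which writes the crawl out as WARC into a timestamped directory.
# (The dashboard is a separate `gs-server` process.)
# The forum URL below is just a placeholder.
subprocess.run(
    ["grab-site", "https://forum.example.com/"],
    check=True,  # raise CalledProcessError if the crawler fails
)
```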
> I'm just not sure what the intermediate steps would be to get something usable like a vBulletin…
Once you have a crawl, you'll likely want to convert that unstructured data to structured data. For example, if I look at https://www.vbulletin.org/forum/portal.php, the thread title and hierarchy live in one set of HTML elements, the posts in another, and so on. I see an old project (https://github.com/IanLondon/detectorist-scraper) that did this and may be a useful place to start, and I imagine there are others. Once you have the structured data, you can decide whether to use it to build a static site, to import it into another forum, etc.
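As a rough illustration of that extraction step, here's a minimal BeautifulSoup sketch. The CSS selectors are made up for illustration; real vBulletin markup varies by version and theme, so you'd inspect the page source and adjust them.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = open("thread_page.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")

# Hypothetical selectors -- inspect your forum's actual markup and adjust.
thread_title = soup.select_one("h1.threadtitle")
posts = []
for post in soup.select("li.postcontainer"):
    author = post.select_one("a.username")
    body = post.select_one("div.postcontent")
    posts.append({
        "author": author.get_text(strip=True) if author else None,
        "body": body.get_text(" ", strip=True) if body else None,
    })

print(thread_title.get_text(strip=True) if thread_title else "no title found")
print(f"extracted {len(posts)} posts")
```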
You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems, though, with browsing interconnected links in a forum format, but try it as a first step.
One potential option, though definitely a bit more work: once you have all the WARC files downloaded, you can open them in Python using the warctools module (and maybe BeautifulSoup) and parse/extract all of the data embedded in the WARC archives into your own "fresh" HTML web server.
https://github.com/internetarchive/warctools
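If you go that route, something like the sketch below could pull the HTML responses back out of a WARC. It assumes warctools' hanzo API (`WarcRecord.open_archive` / `read_records` yielding `(offset, record, errors)` tuples), which is what the repo's bundled warcdump tool uses; double-check against the current source, and note the crude header split is just for illustration.

```python
from hanzo.warctools import WarcRecord  # pip install warctools

# Sketch under the assumption that warctools' hanzo API works as in its
# bundled warcdump tool: read_records() yields (offset, record, errors).
fh = WarcRecord.open_archive("crawl.warc.gz", gzip="auto")
for offset, record, errors in fh.read_records(limit=None):
    if record is None or record.type != WarcRecord.RESPONSE:
        continue
    content_type, body = record.content
    # Response records hold the raw HTTP exchange; crudely split off the
    # HTTP headers to get at the HTML payload.
    _, _, payload = body.partition(b"\r\n\r\n")
    if b"html" in (content_type or b""):
        print(record.url, len(payload))
fh.close()
```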
> This is a gem to help extract data from vBulletin Forums, specifically those which you have no control over.
https://github.com/lloydpick/vbulletin
This is a very old tool, so it's hard to say whether it still works; then again, it seems very relevant, so worst case it could provide inspiration.
You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl
It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.
I guess your best chance is to use something like https://archivebox.io/.