Git scraping: track changes over time by scraping to a Git repository

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • queensland-traffic-conditions

    A scraper that tracks changes to the published queensland traffic incidents data

  • I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

    A fun way to track how people are using this is with the git-scraping topic on GitHub:

    https://github.com/topics/git-scraping?o=desc&s=updated

    That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

    As I write this, repos that updated within just the last minute include:

    https://github.com/drzax/queensland-traffic-conditions

    https://github.com/jasoncartwright/bbcrss

    https://github.com/jackharrhy/metrobus-timetrack-history

    https://github.com/outages/bchydro-outages
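The pattern behind all of these repos can be sketched in a few lines of shell. This is a hedged illustration, not any particular repo's setup: the remote fetch is simulated with a local file so the sketch runs anywhere (a real scraper would curl a live URL, typically from a scheduled GitHub Actions workflow), and a commit is only created when the content actually changed.

```shell
#!/bin/sh
# Minimal git-scraping loop: "fetch" a resource, commit only when it changed.
# A real scraper would replace the cp below with something like:
#   curl -fsSL "$DATA_URL" -o data.json
set -e
workdir=$(mktemp -d)
source="$workdir/source.json"      # stand-in for the remote resource

git init -q "$workdir/repo"
cd "$workdir/repo"

scrape() {
  cp "$source" data.json           # simulated fetch
  git add data.json
  # Commit only if the staged content differs from HEAD. The ! also fires
  # on the very first run, when HEAD does not exist yet.
  if ! git diff --cached --quiet 2>/dev/null; then
    git -c user.name=scraper -c user.email=scraper@example.com \
      commit -q -m "Latest data: $(date -u +%FT%TZ)"
  fi
}

echo '{"incidents": 1}' > "$source"; scrape   # first run: commits
echo '{"incidents": 1}' > "$source"; scrape   # unchanged: no new commit
echo '{"incidents": 2}' > "$source"; scrape   # changed: commits again

git rev-list --count HEAD    # 2 commits: the history is the dataset
```

The commit history is the whole point: each change to the upstream data becomes one commit, timestamped for free.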

  • bbcrss

    (Discontinued) Scrapes the headlines from the BBC News indexes every five minutes

  • metrobus-timetrack-history

    Tracking Metrobus location data

  • bchydro-outages

    Track BCHydro Outages via Git history

  • gh-action-data-scraping

    Shows how to use GitHub Actions to do periodic data scraping

  • I do this as a demo: https://github.com/swyxio/gh-action-data-scraping

    Conveniently, it also serves as a way to track the downtime of GitHub Actions, which used to be bad but seems to have been fine for the last couple of months: https://github.com/swyxio/gh-action-data-scraping/assets/676...

  • Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.

    I think it's fine to use the term "scraping" to refer to downloading a JSON file.

    These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.

    I do run Git scrapers that process HTML as well. A couple of examples:

    scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.

    scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
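The grab-the-JSON-directly idea can be sketched as below. Everything here is a local stand-in: the page, the script id, and the field names are invented, and the sed extraction is purely for illustration (real pages need a proper parser or a browser automation tool such as shot-scraper):

```shell
#!/bin/sh
# Many "HTML" pages embed their data as a JSON blob that a client-side app
# renders. Pulling that blob out directly avoids parsing the markup.
set -e
tmp=$(mktemp -d)
cat > "$tmp/page.html" <<'EOF'
<html><body><div id="app"></div>
<script id="data" type="application/json">{"items":[{"id":2},{"id":1}]}</script>
</body></html>
EOF

# Extract the embedded JSON and normalize it with `jq -S` (sorted keys,
# pretty-printed) so successive snapshots diff cleanly line by line.
sed -n 's:.*<script id="data" type="application/json">\(.*\)</script>.*:\1:p' \
  "$tmp/page.html" | jq -S . > "$tmp/data.json"

jq '.items | length' "$tmp/data.json"    # 2
```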

  • scrape-hacker-news-by-domain

    Scrape HN to track links from specific domains

  • shot-scraper

    A command-line utility for taking automated screenshots of websites

  • carbon-intensity-forecast-tracking

    The reliability of the National Grid's Carbon Intensity forecast

  • I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. I now have several months of data about the quality of the model and forecast, published here: https://carbonintensity.org.uk/. Thanks for the inspiration!

    https://github.com/nmpowell/carbon-intensity-forecast-tracki...

  • mastodon-scraping

    Repository for scraping public information from Mastodon

  • Thanks for linking to the topic, that was interesting.

    As a heads up to anyone trying this stunt, please be mindful that git diff is ultimately a line-oriented operation (yes, yes, "git stores snapshots").

    For example, https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all but unreadable, because git sees essentially a single "first line" that changed.

    However, had the author normalized instances.json with something like "jq -S", the result would have been a more reasonable 1,736 textual changes, which GitHub would almost certainly have rendered:

      diff -u \
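A self-contained illustration of that normalization point (file names and values here are made up):

```shell
#!/bin/sh
# A single-line JSON snapshot diffs as one opaque blob, but after `jq -S .`
# (sorted keys, one value per line) the same change shows up as ordinary
# line-level edits that git and GitHub render fine.
set -e
tmp=$(mktemp -d)
printf '{"b":2,"a":1}' > "$tmp/old.json"
printf '{"a":1,"b":3}' > "$tmp/new.json"

jq -S . "$tmp/old.json" > "$tmp/old.norm.json"
jq -S . "$tmp/new.json" > "$tmp/new.norm.json"

# Only the value that actually changed appears in the diff now.
diff -u "$tmp/old.norm.json" "$tmp/new.norm.json" || true
```

Running `jq -S .` as the last step of every scrape, before committing, keeps the whole history diffable this way.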

  • masscan_as_a_service

    masscan as a service

  • I use this approach for monitoring open ports in our infrastructure: run masscan and commit the results to a git repo. If there are changes, open a merge request for review. During the review, one investigates the actual server to find out why the set of open ports changed.

    https://github.com/bobek/masscan_as_a_service
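A hedged sketch of that flow, with the masscan run simulated by static port lists (a real scan needs root and a target network; the file names and messages are illustrative):

```shell
#!/bin/sh
# Commit each port scan; any change in open ports becomes a commit that a
# human is asked to review. A real run would replace the printf below with
# something like: masscan -p1-65535 "$SUBNET" -oL scan.txt
set -e
tmp=$(mktemp -d)
git init -q "$tmp/ports"
cd "$tmp/ports"

record_scan() {
  printf '%s\n' "$@" | sort > scan.txt   # stand-in for the masscan run
  git add scan.txt
  if ! git diff --cached --quiet 2>/dev/null; then
    git -c user.name=scan -c user.email=scan@example.com \
      commit -q -m "scan $(date -u +%F)"
    echo "ports changed: open a merge request for review"
  fi
}

record_scan 22/tcp 443/tcp            # first run: baseline commit
record_scan 22/tcp 443/tcp            # unchanged: silent
record_scan 22/tcp 443/tcp 8080/tcp   # new open port: flagged
```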

  • torvenyek

    Git repo of Hungarian laws (Magyar törvények)

  • hun_law_rs

    Tool for parsing Hungarian laws (Rust version)

  • hun_law_py

    Tools for parsing Hungarian legal documents

  • github-actions

    Information and tips regarding GitHub Actions (by TomasHubelbauer)

  • The commits get the right icon and a clickable username, and it is as simple as using this email and name. You or someone else might like to do this too, so here's me sharing this neat trick I found.

    https://github.com/TomasHubelbauer/github-actions#write-work...
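The trick amounts to setting a specific author identity before committing from a workflow. The name/email pair below is the commonly used identity for the github-actions bot; the demo commits in a throwaway local repo:

```shell
#!/bin/sh
# Commits made with this name/email pair show up on GitHub with the
# github-actions bot avatar and a clickable username.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

git config user.name "github-actions[bot]"
git config user.email "41898282+github-actions[bot]@users.noreply.github.com"

echo "scraped data" > data.txt
git add data.txt
git commit -q -m "Automated update"

git log -1 --format='%an <%ae>'
```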

  • Geo-IP-Database

    Automatically updated tree-formatted database from MaxMind database

  • I have a couple of similar scrapers as well. One is a private repo in which I collect visa information from Wikipedia (for Visalogy.com) and GeoIP information from the MaxMind database (used with their permission).

    https://github.com/Ayesh/Geo-IP-Database/

    It downloads the database and dumps the data, split by the first 8 bytes of the IP address, into individual JSON files. Each scraper run creates a new tag and pushes it as a package, so dependents can simply update with their dependency manager.
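A hedged sketch of the shard-and-tag idea. For simplicity this shards by the leading IP octet rather than an 8-byte prefix, and the record layout and file names are invented:

```shell
#!/bin/sh
# Split a JSON list of records into one file per leading IP octet, commit,
# and tag the run so dependents can pin a version via their package manager.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/geo"
cd "$tmp/geo"
cat > records.json <<'EOF'
[{"ip":"1.2.3.4","cc":"US"},{"ip":"1.9.9.9","cc":"DE"},{"ip":"8.8.8.8","cc":"US"}]
EOF

mkdir -p shards
for octet in $(jq -r '.[].ip | split(".")[0]' records.json | sort -u); do
  jq --arg o "$octet" '[ .[] | select(.ip | startswith($o + ".")) ]' \
    records.json > "shards/$octet.json"
done

git add .
git -c user.name=geo -c user.email=geo@example.com commit -q -m "update"
git tag "run-$(date -u +%Y%m%d%H%M%S)"   # one tag per scraper run

jq length shards/1.json    # two records share the 1.x.x.x prefix
```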

  • gesetze-im-internet

    Archive of German legal acts (weekly archive of gesetze-im-internet.de) (by jandinter)

  • https://github.com/jandinter/gesetze-im-internet

    Parsing the legal acts with the tools you mention looks very interesting! Currently, I simply collect the published XML files, whose structure is optimized for laying out the text rather than for representing a hierarchy of sections and subsections.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • 2024-03-01 listening in on the neighborhood

    5 projects | news.ycombinator.com | 2 Mar 2024
  • A command-line utility for taking automated screenshots of websites

    1 project | news.ycombinator.com | 15 Dec 2023
  • Web Scraping via JavaScript Runtime Heap Snapshots (2022)

    1 project | news.ycombinator.com | 8 Aug 2023
  • Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next

    1 project | /r/webscraping | 8 May 2023
  • Need help with downloading a section of multiple sites as pdf files.

    2 projects | /r/webscraping | 25 Mar 2023