Git scraping: track changes over time by scraping to a Git repository

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • queensland-traffic-conditions

    A scraper that tracks changes to the published queensland traffic incidents data

  • I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

    A fun way to track how people are using this is with the git-scraping topic on GitHub:

    https://github.com/topics/git-scraping?o=desc&s=updated

    That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

    As I write this, repos that updated within just the last minute include:

    https://github.com/drzax/queensland-traffic-conditions

    https://github.com/jasoncartwright/bbcrss

    https://github.com/jackharrhy/metrobus-timetrack-history

    https://github.com/outages/bchydro-outages
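The pattern behind all of these repos can be sketched in a few lines of shell. This is a hedged illustration, not any particular repo's setup: the remote fetch is simulated with a local file so the sketch runs anywhere (a real scraper would curl a live URL, typically from a scheduled GitHub Actions workflow), and a commit is only created when the content actually changed.

```shell
#!/bin/sh
# Minimal git-scraping loop: "fetch" a resource, commit only when it changed.
# A real scraper would replace the cp below with something like:
#   curl -fsSL "$DATA_URL" -o data.json
set -e
workdir=$(mktemp -d)
source="$workdir/source.json"      # stand-in for the remote resource

git init -q "$workdir/repo"
cd "$workdir/repo"

scrape() {
  cp "$source" data.json           # simulated fetch
  git add data.json
  # Commit only if the staged content differs from HEAD. The ! also fires
  # on the very first run, when HEAD does not exist yet.
  if ! git diff --cached --quiet 2>/dev/null; then
    git -c user.name=scraper -c user.email=scraper@example.com \
      commit -q -m "Latest data: $(date -u +%FT%TZ)"
  fi
}

echo '{"incidents": 1}' > "$source"; scrape   # first run: commits
echo '{"incidents": 1}' > "$source"; scrape   # unchanged: no new commit
echo '{"incidents": 2}' > "$source"; scrape   # changed: commits again

git rev-list --count HEAD    # 2 commits: the history is the dataset
```

The commit history is the whole point: each change to the upstream data becomes one commit, timestamped for free.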

  • bbcrss

    (Discontinued) Scrapes the headlines from the BBC News indexes every five minutes

  • metrobus-timetrack-history

    Tracking Metrobus location data

  • bchydro-outages

    Track BCHydro Outages via Git history

  • gh-action-data-scraping

    Shows how to use GitHub Actions to do periodic data scraping

  • I do this as a demo: https://github.com/swyxio/gh-action-data-scraping

    Conveniently, it also serves as a way to track the downtime of GitHub Actions, which used to be bad but seems to have been fine for the last couple of months: https://github.com/swyxio/gh-action-data-scraping/assets/676...

  • Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.

    I think it's fine to use the term "scraping" to refer to downloading a JSON file.

    These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.

    I do run Git scrapers that process HTML as well. A couple of examples:

    scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.

    scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
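The grab-the-JSON-directly idea can be sketched as below. Everything here is a local stand-in: the page, the script id, and the field names are invented, and the sed extraction is purely for illustration (real pages need a proper parser or a browser automation tool such as shot-scraper):

```shell
#!/bin/sh
# Many "HTML" pages embed their data as a JSON blob that a client-side app
# renders. Pulling that blob out directly avoids parsing the markup.
set -e
tmp=$(mktemp -d)
cat > "$tmp/page.html" <<'EOF'
<html><body><div id="app"></div>
<script id="data" type="application/json">{"items":[{"id":2},{"id":1}]}</script>
</body></html>
EOF

# Extract the embedded JSON and normalize it with `jq -S` (sorted keys,
# pretty-printed) so successive snapshots diff cleanly line by line.
sed -n 's:.*<script id="data" type="application/json">\(.*\)</script>.*:\1:p' \
  "$tmp/page.html" | jq -S . > "$tmp/data.json"

jq '.items | length' "$tmp/data.json"    # 2
```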

  • scrape-hacker-news-by-domain

    Scrape HN to track links from specific domains

  • shot-scraper

    A command-line utility for taking automated screenshots of websites

  • carbon-intensity-forecast-tracking

    The reliability of the National Grid's Carbon Intensity forecast

  • I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. I now have several months of data about the quality of the model and forecast, published here: https://carbonintensity.org.uk/. Thanks for the inspiration!

    https://github.com/nmpowell/carbon-intensity-forecast-tracki...

  • mastodon-scraping

    Repository for scraping public information from Mastodon

  • Thanks for linking to the topic, that was interesting.

    As a heads up to anyone trying this stunt, please be mindful that git diff is ultimately a line-oriented operation (yes, yes, "git stores snapshots").

    For example, https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all but unreadable, because git sees essentially a single "first line" that changed.

    However, had the author normalized instances.json with something like "jq -S", the result would have been a more reasonable 1,736 textual changes, which GitHub would almost certainly have rendered:

      diff -u \
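A self-contained illustration of that normalization point (file names and values here are made up):

```shell
#!/bin/sh
# A single-line JSON snapshot diffs as one opaque blob, but after `jq -S .`
# (sorted keys, one value per line) the same change shows up as ordinary
# line-level edits that git and GitHub render fine.
set -e
tmp=$(mktemp -d)
printf '{"b":2,"a":1}' > "$tmp/old.json"
printf '{"a":1,"b":3}' > "$tmp/new.json"

jq -S . "$tmp/old.json" > "$tmp/old.norm.json"
jq -S . "$tmp/new.json" > "$tmp/new.norm.json"

# Only the value that actually changed appears in the diff now.
diff -u "$tmp/old.norm.json" "$tmp/new.norm.json" || true
```

Running `jq -S .` as the last step of every scrape, before committing, keeps the whole history diffable this way.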

  • masscan_as_a_service

    masscan as a service

  • I use this approach for monitoring open ports in our infrastructure: run masscan and commit the results to a git repo. If there are changes, open a merge request for review. During the review, one investigates the actual server to find out why the set of open ports changed.

    https://github.com/bobek/masscan_as_a_service
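A hedged sketch of that flow, with the masscan run simulated by static port lists (a real scan needs root and a target network; the file names and messages are illustrative):

```shell
#!/bin/sh
# Commit each port scan; any change in open ports becomes a commit that a
# human is asked to review. A real run would replace the printf below with
# something like: masscan -p1-65535 "$SUBNET" -oL scan.txt
set -e
tmp=$(mktemp -d)
git init -q "$tmp/ports"
cd "$tmp/ports"

record_scan() {
  printf '%s\n' "$@" | sort > scan.txt   # stand-in for the masscan run
  git add scan.txt
  if ! git diff --cached --quiet 2>/dev/null; then
    git -c user.name=scan -c user.email=scan@example.com \
      commit -q -m "scan $(date -u +%F)"
    echo "ports changed: open a merge request for review"
  fi
}

record_scan 22/tcp 443/tcp            # first run: baseline commit
record_scan 22/tcp 443/tcp            # unchanged: silent
record_scan 22/tcp 443/tcp 8080/tcp   # new open port: flagged
```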

  • torvenyek

    Git repo of Hungarian laws (Magyar törvények)

  • hun_law_rs

    Tool for parsing Hungarian laws (Rust version)

  • hun_law_py

    Tools for parsing Hungarian legal documents

  • github-actions

    Information and tips regarding GitHub Actions (by TomasHubelbauer)

  • The commits get the right icon and a clickable username, and it is as simple as using this email and name. You or someone else might like to do this too, so here's me sharing this neat trick I found.

    https://github.com/TomasHubelbauer/github-actions#write-work...
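The trick amounts to setting a specific author identity before committing from a workflow. The name/email pair below is the commonly used identity for the github-actions bot; the demo commits in a throwaway local repo:

```shell
#!/bin/sh
# Commits made with this name/email pair show up on GitHub with the
# github-actions bot avatar and a clickable username.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

git config user.name "github-actions[bot]"
git config user.email "41898282+github-actions[bot]@users.noreply.github.com"

echo "scraped data" > data.txt
git add data.txt
git commit -q -m "Automated update"

git log -1 --format='%an <%ae>'
```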

  • Geo-IP-Database

    Automatically updated tree-formatted database from MaxMind database

  • I have a couple of similar scrapers as well. One is a private repo in which I collect visa information from Wikipedia (for Visalogy.com) and GeoIP information from the MaxMind database (used with their permission).

    https://github.com/Ayesh/Geo-IP-Database/

    It downloads the database and dumps the data, split by the first 8 bytes of the IP address, into individual JSON files. Each scraper run creates a new tag and pushes it as a package, so dependents can simply update with their dependency manager.
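A hedged sketch of the shard-and-tag idea. For simplicity this shards by the leading IP octet rather than an 8-byte prefix, and the record layout and file names are invented:

```shell
#!/bin/sh
# Split a JSON list of records into one file per leading IP octet, commit,
# and tag the run so dependents can pin a version via their package manager.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/geo"
cd "$tmp/geo"
cat > records.json <<'EOF'
[{"ip":"1.2.3.4","cc":"US"},{"ip":"1.9.9.9","cc":"DE"},{"ip":"8.8.8.8","cc":"US"}]
EOF

mkdir -p shards
for octet in $(jq -r '.[].ip | split(".")[0]' records.json | sort -u); do
  jq --arg o "$octet" '[ .[] | select(.ip | startswith($o + ".")) ]' \
    records.json > "shards/$octet.json"
done

git add .
git -c user.name=geo -c user.email=geo@example.com commit -q -m "update"
git tag "run-$(date -u +%Y%m%d%H%M%S)"   # one tag per scraper run

jq length shards/1.json    # two records share the 1.x.x.x prefix
```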

  • gesetze-im-internet

    Archive of German legal acts (weekly archive of gesetze-im-internet.de) (by jandinter)

  • https://github.com/jandinter/gesetze-im-internet

    Parsing the legal acts with the tools you mention looks very interesting! Currently, I simply collect the published XML files, whose structure is optimized for laying out the text rather than for representing a hierarchy of sections and subsections.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • 2024-03-01 listening in on the neighborhood

    5 projects | news.ycombinator.com | 2 Mar 2024
  • A command-line utility for taking automated screenshots of websites

    1 project | news.ycombinator.com | 15 Dec 2023
  • Web Scraping via JavaScript Runtime Heap Snapshots (2022)

    1 project | news.ycombinator.com | 8 Aug 2023
  • Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next

    1 project | /r/webscraping | 8 May 2023
  • Need help with downloading a section of multiple sites as pdf files.

    2 projects | /r/webscraping | 25 Mar 2023