Top 23 git-scraping Open-Source Projects

github-stats

Mentions: 5 | Stars: 2,722 | Activity: 9.5 | Language: Python

Better GitHub statistics images for your profile, with stats from private repos too

Project mention: Ask HN: How to Do a GitHub Wrapped? | news.ycombinator.com | 2023-12-19

I have done similar work using the GitHub APIs before. I recommend using their GraphQL explorer to develop your queries interactively. You may need to fall back on the REST API instead of the GraphQL one for certain stats.
https://docs.github.com/en/graphql/overview/explorer
You can also refer to my code here, which may already collect some of the statistics you're interested in.
https://github.com/jstrieb/github-stats/blob/master/github_s...
I predict the most annoying part of this project will be dealing with authentication. There are a handful of ways to do it, and the permissions can be finicky depending on what data you are fetching.
Best of luck!
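As a rough illustration of the GraphQL approach described in the comment above, the sketch below builds (but does not send) an authenticated request to GitHub's GraphQL endpoint. The query, the `build_request` helper, and the choice of fields are assumptions made for illustration; they are not taken from the github-stats code.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query (an assumption, not the project's own): list the
# authenticated user's repositories together with their star counts.
QUERY = """
query($count: Int!) {
  viewer {
    login
    repositories(first: $count, ownerAffiliations: OWNER) {
      nodes { name stargazerCount isPrivate }
    }
  }
}
"""


def build_request(token: str, count: int = 10) -> urllib.request.Request:
    """Build a POST request for the GraphQL endpoint without sending it."""
    body = json.dumps({"query": QUERY, "variables": {"count": count}})
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body.encode("utf-8"),
        headers={
            # A personal access token is assumed; the scopes it needs
            # depend on which data is being fetched.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send the query. Keeping request construction separate from sending makes it easier to inspect the payload while iterating on a query in the explorer.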

nyt-2020-election-scraper

Mentions: 9 | Stars: 1,761 | Activity: 0.0 | Language: HTML
factbook.json

Mentions: 7 | Stars: 965 | Activity: 7.8

World Factbook Country Profiles in JSON - Free Open Public Domain Data - No API Key Required ;-)
spotify-playlist-archive

Mentions: 10 | Stars: 382 | Activity: 0.0 | Language: Python

Daily snapshots of public Spotify playlists

Project mention: Git Scraping Spotify | news.ycombinator.com | 2023-08-11

csv-diff

Mentions: 1 | Stars: 273 | Activity: 0.0 | Language: Python

Python CLI tool and library for diffing CSV and JSON files
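To make the idea of a keyed CSV diff concrete, here is a minimal standard-library sketch. It is not csv-diff's implementation or API; the helper names `load_rows` and `diff_rows` and the sample data are hypothetical.

```python
import csv
import io


def load_rows(text: str, key: str) -> dict:
    """Index CSV rows by the value found in the `key` column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def diff_rows(old: dict, new: dict) -> dict:
    """Report rows that were added, removed, or changed between two tables."""
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    changed = [
        {"key": k, "before": old[k], "after": new[k]}
        for k in old
        if k in new and old[k] != new[k]
    ]
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample data: two snapshots of the same table.
old_csv = "id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
new_csv = "id,name,age\n1,Cleo,5\n3,Bailey,1\n"
result = diff_rows(load_rows(old_csv, "id"), load_rows(new_csv, "id"))
```

Keying rows on a stable identifier, rather than comparing line by line, means that reordered rows do not show up as changes.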
gh-action-data-scraping

Mentions: 1 | Stars: 212 | Activity: 0.0 | Language: JavaScript

This shows how to use GitHub Actions to do periodic data scraping

Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

I do this as a demo: https://github.com/swyxio/gh-action-data-scraping
Conveniently, it also serves as a way to track the downtime of GitHub Actions, which used to be bad but seems to have been fine for the last couple of months: https://github.com/swyxio/gh-action-data-scraping/assets/676...

california-coronavirus-scrapers

Mentions: 16 | Stars: 56 | Activity: 9.9 | Language: Jupyter Notebook

The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

Project mention: Am I wrong- to pull the plug on a European vacation 6 weeks before- my travel partner requested I “behave like in pandemic lockdown”. | /r/amiwrong | 2023-06-19

sf-tree-history

Mentions: 3 | Stars: 40 | Activity: 8.5

Tracking the history of trees in San Francisco

Project mention: Open Data Is Dead | news.ycombinator.com | 2023-11-01

I think this headline was poorly chosen.
When I see the term "Open Data" I instantly think of open data portals - mostly run by governments around the world. These things have never been healthier: ten years ago they hardly existed, today you can get civic data from local governments all over the place (last time I saw an attempt to count there were over 4,000 of these portals, and that was a few years ago).
My favourite example is still this CSV of all 190,000+ trees in San Francisco, which is updated most business days with details of the latest tree changes: https://data.sfgov.org/City-Infrastructure/Street-Tree-List/... - I track changes to it here: https://github.com/simonw/sf-tree-history/
This article is about something different: it's about what I guess you could call the "Open APIs" movement. Back in the days of Web 2.0 every service was launching an open API, hoping to harness developer attention to help make the platforms more sticky. Facebook and Twitter both did incredibly well out of this strategy, at least at first.
THOSE APIs are mostly on the way out now. Companies realized that giving away their data for free has a lot of disadvantages.
Open Data is doing great. Open APIs are not.

help-scraper

Mentions: 2 | Stars: 40 | Activity: 9.7 | Language: Python

Record a history of --help for various commands
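The idea of recording `--help` output so that changes show up in version history can be sketched as follows. This is an illustrative guess at the general technique, not the help-scraper project's code; `record_help` and the one-file-per-command layout are assumptions.

```python
import subprocess
from pathlib import Path


def record_help(command: list, out_dir: Path) -> Path:
    """Run `command --help` and save its output in a file named after the command."""
    result = subprocess.run(
        command + ["--help"], capture_output=True, text=True, check=False
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (Path(command[0]).name + ".txt")
    # Some tools print their help text to stderr, so fall back to it
    # when stdout is empty.
    target.write_text(result.stdout or result.stderr)
    return target
```

Committing the resulting text files after each run would turn every change to a command's options into a readable diff.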
scrape-hacker-news-by-domain

Mentions: 4 | Stars: 34 | Activity: 9.9 | Language: JavaScript

Scrape HN to track links from specific domains

Project mention: London Street Trees | news.ycombinator.com | 2023-09-07

Yeah I have a bunch of these using pretty-printed JSON - here's one that scrapes Hacker News for mentions of my site, for example: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...

randbats

Mentions: 2 | Stars: 28 | Activity: 9.8 | Language: JavaScript

Pokémon Showdown's Random Battle sets

Project mention: Inaccurate sets on Gen 6 Randbats | /r/stunfisk | 2023-11-10

india-isin-data

Mentions: 2 | Stars: 26 | Activity: 9.3 | Language: Shell

International Securities Identification Numbers for various Indian Securities
nepstonks

Mentions: 4 | Stars: 22 | Activity: 8.2 | Language: Python

An automated bot that scrapes the latest upcoming issues, news, and investment opportunities that are announced inside Nepal and sends them to a telegram channel.
quacs-data

Mentions: 15 | Stars: 15 | Activity: 2.1 | Language: Rust

A repository holding all the data used on QuACS.org

Project mention: Should I pick a random roommate or try out the roommate search(plus a few other questions)? | /r/RPI | 2023-05-05

data

Mentions: 4 | Stars: 11 | Activity: 1.6

Latest data on UK food banks from Give Food scraped from our API and republished in various formats. (by givefood)
mcbroken-archive

Mentions: 58 | Stars: 7 | Activity: 0.0

Archive for data from mcbroken.com.

Project mention: Anyone know where I can get a McFlurry? | /r/Columbus | 2023-12-09

Try mcbroken.com.

bchydro-outages

Mentions: 1 | Stars: 5 | Activity: 0.6

Track BCHydro Outages via Git history

Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.
A fun way to track how people are using this is with the git-scraping topic on GitHub:
https://github.com/topics/git-scraping?o=desc&s=updated
That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.
As I write this, just in the last minute repos that updated include:
https://github.com/drzax/queensland-traffic-conditions
https://github.com/jasoncartwright/bbcrss
https://github.com/jackharrhy/metrobus-timetrack-history
https://github.com/outages/bchydro-outages
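The same "most recently updated" view of the topic can probably be approximated programmatically through GitHub's repository search API. The sketch below only builds the URL and reads repository names out of a response; the helper names are hypothetical, and it is an assumption that the API ordering matches what the topic page shows.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def topic_search_url(topic: str, per_page: int = 10) -> str:
    """Build a search URL for repos tagged with `topic`, newest updates first."""
    params = {
        "q": f"topic:{topic}",
        "sort": "updated",
        "order": "desc",
        "per_page": per_page,
    }
    return SEARCH_URL + "?" + urlencode(params)


def repo_names(payload: dict) -> list:
    """Extract owner/name strings from a decoded search response."""
    return [item["full_name"] for item in payload.get("items", [])]
```

Fetching `topic_search_url("git-scraping")` and decoding the JSON body would give a payload that `repo_names` can read. Unauthenticated search requests are rate limited, so a scheduled job would likely want a token.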

carbon-intensity-forecast-tracking

Mentions: 1 | Stars: 2 | Activity: 8.7 | Language: Jupyter Notebook

The reliability of the National Grid's Carbon Intensity forecast

Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. Now have several months' data about the quality of the model and forecast published here: https://carbonintensity.org.uk/ . Thanks for the inspiration!
https://github.com/nmpowell/carbon-intensity-forecast-tracki...

mastodon-scraping

Mentions: 1 | Stars: 2 | Activity: 0.0

Repository for scraping public information from Mastodon

Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

Thanks for linking to the topic, that was interesting
As a heads up to anyone trying this stunt, please be mindful that git-diff is ultimately a line oriented action (yeah, yeah, "git stores snapshots")
For example https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all :fu: because git sees basically the "first line" changed
However, had the author normalized the instances.json with something like "jq -S" then one would end up with a more reasonable 1736 textual changes, which github would have almost certainly rendered
  diff -u \
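The normalization suggested in the comment above (sorting keys, as `jq -S` does) can be approximated in Python before a scraped file is committed. This is a minimal sketch; `normalize_json` is a hypothetical helper and the sample documents are made up for illustration.

```python
import difflib
import json


def normalize_json(text: str) -> str:
    """Re-serialize JSON with sorted keys and one value per line."""
    data = json.loads(text)
    return json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Two made-up snapshots: same structure, different key order, one changed value.
old = '{"b": 1, "a": {"y": 2, "x": 1}}'
new = '{"a": {"x": 1, "y": 3}, "b": 1}'

diff = list(
    difflib.unified_diff(
        normalize_json(old).splitlines(),
        normalize_json(new).splitlines(),
        lineterm="",
    )
)
# Keep only real additions and removals, skipping the ---/+++ header lines.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

Because both snapshots are written with a stable key order and one value per line, the diff contains only the value that actually changed instead of the whole file.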

lvms-events

Mentions: 1 | Stars: 2 | Activity: 8.6 | Language: HTML

LVMS Events iCal feed
scrape-san-mateo-fire-dispatch

Mentions: 1 | Stars: 1 | Activity: 0.0 | Language: HTML

Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.
I think it's fine to use the term "scraping" to refer to downloading a JSON file.
These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.
I do run Git scrapers that process HTML as well. A couple of examples:
scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.
scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
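The core loop described in the comment above (save the latest copy of a resource, then commit only when it changed) might look like the sketch below. The function names and repository layout are assumptions; the linked projects may do this with shell steps in a scheduled workflow instead.

```python
import subprocess
from pathlib import Path


def write_if_changed(path: Path, content: bytes) -> bool:
    """Write `content` to `path`; return True only if the stored bytes changed."""
    if path.exists() and path.read_bytes() == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return True


def commit_snapshot(repo: Path, relative_path: str, content: bytes, message: str) -> bool:
    """Save a snapshot inside `repo` and commit it, skipping unchanged content.

    Assumes `repo` is an existing Git repository with user.name and
    user.email configured.
    """
    if not write_if_changed(repo / relative_path, content):
        return False
    subprocess.run(["git", "-C", str(repo), "add", relative_path], check=True)
    subprocess.run(["git", "-C", str(repo), "commit", "-m", message], check=True)
    return True
```

Skipping the commit when nothing changed keeps the history limited to real changes in the source, which is what makes the commit log useful as a record over time.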

scrape-la-fires

Mentions: 1 | Stars: 1 | Activity: 9.7
metrobus-timetrack-history

Mentions: 1 | Stars: 1 | Activity: 2.5 | Language: Shell

Tracking Metrobus location data

Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.
A fun way to track how people are using this is with the git-scraping topic on GitHub:
https://github.com/topics/git-scraping?o=desc&s=updated
That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.
As I write this, just in the last minute repos that updated include:
https://github.com/drzax/queensland-traffic-conditions
https://github.com/jasoncartwright/bbcrss
https://github.com/jackharrhy/metrobus-timetrack-history
https://github.com/outages/bchydro-outages

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

git-scraping related posts

Anyone know where I can get a McFlurry?

1 project | /r/Columbus | 9 Dec 2023
Game Thread: St Louis Blues (8-6-1) at Los Angeles Kings (9-3-3) - 18 Nov 2023 - 07:30PM PST

1 project | /r/hockey | 20 Nov 2023
Inaccurate sets on Gen 6 Randbats

1 project | /r/stunfisk | 10 Nov 2023
London Street Trees

5 projects | news.ycombinator.com | 7 Sep 2023
Why are McDonald's ice cream machines always broken?

1 project | /r/technology | 3 Sep 2023
iFixit Petitions Government for Right to Hack McDonald's Ice Cream Machine

1 project | news.ycombinator.com | 29 Aug 2023
Git scraping: track changes over time by scraping to a Git repository

18 projects | news.ycombinator.com | 10 Aug 2023

Index

What are some of the best open-source git-scraping projects? This list will help you:

#	Project	Stars
1	github-stats	2,722
2	nyt-2020-election-scraper	1,761
3	factbook.json	965
4	spotify-playlist-archive	382
5	csv-diff	273
6	gh-action-data-scraping	212
7	california-coronavirus-scrapers	56
8	sf-tree-history	40
9	help-scraper	40
10	scrape-hacker-news-by-domain	34
11	randbats	28
12	india-isin-data	26
13	nepstonks	22
14	quacs-data	15
15	data	11
16	mcbroken-archive	7
17	bchydro-outages	5
18	carbon-intensity-forecast-tracking	2
19	mastodon-scraping	2
20	lvms-events	2
21	scrape-san-mateo-fire-dispatch	1
22	scrape-la-fires	1
23	metrobus-timetrack-history	1
