Top 23 git-scraping Open-Source Projects
-
factbook.json
World Factbook Country Profiles in JSON - Free Open Public Domain Data - No API Key Required ;-)
-
nepstonks
An automated bot that scrapes the latest upcoming issues, news, and investment opportunities that are announced inside Nepal and sends them to a telegram channel.
-
data
Latest data on UK food banks from Give Food scraped from our API and republished in various formats. (by givefood)
I have done similar work using the GitHub APIs before. I recommend using their GraphQL explorer to develop your queries interactively. You may need to fall back on the REST API instead of the GraphQL one for certain stats.
https://docs.github.com/en/graphql/overview/explorer
You can also refer to my code here, which may already collect some of the statistics you're interested in.
https://github.com/jstrieb/github-stats/blob/master/github_s...
I predict the most annoying part of this project will be dealing with authentication. There are a handful of ways to do it, and the permissions can be finicky depending on what data you are fetching.
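As a rough illustration of the token-based route, here is a minimal sketch that builds an authenticated request against GitHub's GraphQL endpoint using only the standard library. The query itself and the helper names `build_graphql_request` and `run_query` are assumptions made for this example, not taken from the linked code, and the token is assumed to be a personal access token with whatever scopes your data needs.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query (an assumption for this sketch): the authenticated
# user's login and total repository count.
QUERY = """
query {
  viewer {
    login
    repositories {
      totalCount
    }
  }
}
"""


def build_graphql_request(query: str, token: str) -> urllib.request.Request:
    """Build a POST request carrying the query and a bearer token."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(query: str, token: str) -> dict:
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_graphql_request(query, token)) as response:
        return json.load(response)
```

Calling `run_query(QUERY, token)` with a valid token should return a dictionary with a `data` key; permission problems typically show up as an `errors` key or an HTTP error, which is where the finicky part begins.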
Best of luck!
Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10
I do this as a demo: https://github.com/swyxio/gh-action-data-scraping
but conveniently it also serves as a way to track the downtime of github actions, which used to be bad but seems to be fine the last couple months: https://github.com/swyxio/gh-action-data-scraping/assets/676...
Project mention: Am I wrong- to pull the plug on a European vacation 6 weeks before- my travel partner requested I “behave like in pandemic lockdown”. | /r/amiwrong | 2023-06-19
I think this headline was poorly chosen.
When I see the term "Open Data" I instantly think of open data portals - mostly run by governments around the world. These things have never been healthier: ten years ago they hardly existed, today you can get civic data from local governments all over the place (last time I saw an attempt to count there were over 4,000 of these portals, and that was a few years ago).
My favourite example is still this CSV of all 190,000+ trees in San Francisco, which is updated most business days with details of the latest tree changes: https://data.sfgov.org/City-Infrastructure/Street-Tree-List/... - I track changes to it here: https://github.com/simonw/sf-tree-history/
This article is about something different: it's about what I guess you could call the "Open APIs" movement. Back in the days of Web 2.0 every service was launching an open API, hoping to harness developer attention to help make the platforms more sticky. Facebook and Twitter both did incredibly well out of this strategy, at least at first.
THOSE APIs are mostly on the way out now. Companies realized that giving away their data for free has a lot of disadvantages.
Open Data is doing great. Open APIs are not.
Yeah I have a bunch of these using pretty-printed JSON - here's one that scrapes Hacker News for mentions of my site, for example: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
Project mention: Should I pick a random roommate or try out the roommate search(plus a few other questions)? | /r/RPI | 2023-05-05
Try mcbroken.com.
Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10
I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.
A fun way to track how people are using this is with the git-scraping topic on GitHub:
https://github.com/topics/git-scraping?o=desc&s=updated
That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.
As I write this, just in the last minute repos that updated include:
https://github.com/drzax/queensland-traffic-conditions
https://github.com/jasoncartwright/bbcrss
https://github.com/jackharrhy/metrobus-timetrack-history
https://github.com/outages/bchydro-outages
Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10
I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. I now have several months of data about the quality of the model and forecast published here: https://carbonintensity.org.uk/ . Thanks for the inspiration!
https://github.com/nmpowell/carbon-intensity-forecast-tracki...
Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10
Thanks for linking to the topic, that was interesting.
As a heads up to anyone trying this stunt, please be mindful that git-diff is ultimately a line oriented action (yeah, yeah, "git stores snapshots")
For example https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all :fu: because git sees basically the "first line" changed
However, had the author normalized the instances.json with something like "jq -S" then one would end up with a more reasonable 1736 textual changes, which github would have almost certainly rendered
diff -u \
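To make the normalization point concrete, here is a small Python sketch in the spirit of `jq -S`: it re-serializes JSON with sorted keys and one value per line, so a line-oriented diff only reports what actually changed. The function names and the sample documents are made up for illustration, and the output is not guaranteed to be byte-identical to what `jq -S` would produce.

```python
import difflib
import json


def normalize_json(text: str) -> str:
    """Re-serialize JSON with sorted keys and 2-space indentation."""
    data = json.loads(text)
    return json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def changed_line_count(old: str, new: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# Hypothetical scraper output: same data, different key order, one value changed.
old_raw = '{"b": 1, "a": {"y": 2, "x": 1}}'
new_raw = '{"a": {"x": 1, "y": 3}, "b": 1}'

old_norm = normalize_json(old_raw)
new_norm = normalize_json(new_raw)

# After normalization only the "y" line differs; the raw single-line
# documents would show up in a diff as the entire file being replaced.
print(changed_line_count(old_norm, new_norm))
```

Running the normalizer before each commit also means that a server reordering its keys between requests produces no commit noise at all.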
Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10
Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.
I think it's fine to use the term "scraping" to refer to downloading a JSON file.
These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.
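A minimal sketch of what one such scraper step might look like in Python, assuming the script runs inside an existing Git checkout (for example from a scheduled CI job). The function names, the commit message, and the choice to skip the commit when nothing changed are all assumptions for illustration, not a description of any particular repo listed here.

```python
import json
import subprocess
import urllib.request
from pathlib import Path


def fetch_json(url: str):
    """Download and decode the JSON document behind a page."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def write_snapshot(data, path: Path) -> bool:
    """Write data as pretty-printed JSON; return True only if the file changed."""
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


def commit_snapshot(path: Path, message: str) -> None:
    """Stage and commit the snapshot so the change lands in the history."""
    subprocess.run(["git", "add", str(path)], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)


def scrape(url: str, path: Path) -> None:
    """Fetch, save, and commit only when the content actually differs."""
    if write_snapshot(fetch_json(url), path):
        commit_snapshot(path, f"Latest data from {url}")
```

Committing only on change keeps the history meaningful: each commit then corresponds to a real change in the source rather than to a scheduler tick.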
I do run Git scrapers that process HTML as well. A couple of examples:
scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.
scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
git-scraping related posts
-
Anyone know where I can get a McFlurry?
-
Game Thread: St Louis Blues (8-6-1) at Los Angeles Kings (9-3-3) - 18 Nov 2023 - 07:30PM PST
-
Inaccurate sets on Gen 6 Randbats
-
London Street Trees
-
Why are McDonald's ice cream machines always broken?
-
iFixit Petitions Government for Right to Hack McDonald's Ice Cream Machine
-
Git scraping: track changes over time by scraping to a Git repository
Index
What are some of the best open-source git-scraping projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | github-stats | 2,722 |
2 | nyt-2020-election-scraper | 1,761 |
3 | factbook.json | 965 |
4 | spotify-playlist-archive | 382 |
5 | csv-diff | 273 |
6 | gh-action-data-scraping | 212 |
7 | california-coronavirus-scrapers | 56 |
8 | sf-tree-history | 40 |
9 | help-scraper | 40 |
10 | scrape-hacker-news-by-domain | 34 |
11 | randbats | 28 |
12 | india-isin-data | 26 |
13 | nepstonks | 22 |
14 | quacs-data | 15 |
15 | data | 11 |
16 | mcbroken-archive | 7 |
17 | bchydro-outages | 5 |
18 | carbon-intensity-forecast-tracking | 2 |
19 | mastodon-scraping | 2 |
20 | lvms-events | 2 |
21 | scrape-san-mateo-fire-dispatch | 1 |
22 | scrape-la-fires | 1 |
23 | metrobus-timetrack-history | 1 |