git-scraping

Open-source projects categorized as git-scraping

Top 23 git-scraping Open-Source Projects

  • github-stats

    Better GitHub statistics images for your profile, with stats from private repos too

  • Project mention: Ask HN: How to Do a GitHub Wrapped? | news.ycombinator.com | 2023-12-19

    I have done similar work using the GitHub APIs before. I recommend using their GraphQL explorer to develop your queries interactively. You may need to fall back on the REST API instead of the GraphQL one for certain stats.

    https://docs.github.com/en/graphql/overview/explorer

    You can also refer to my code here, which may already collect some of the statistics you're interested in.

    https://github.com/jstrieb/github-stats/blob/master/github_s...

    I predict the most annoying part of this project will be dealing with authentication. There are a handful of ways to do it, and the permissions can be finicky depending on what data you are fetching.

    Best of luck!
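
    As a minimal, hedged Python sketch of the GraphQL approach described above (the query fields and the GITHUB_TOKEN variable name are illustrative assumptions, not taken from the linked repo):

      import os
      import requests

      # Develop the query interactively in the GraphQL Explorer first, then
      # run it against the API endpoint with a personal access token.
      QUERY = """
      {
        viewer {
          login
          repositories(first: 10, privacy: PUBLIC) {
            totalCount
          }
        }
      }
      """

      token = os.environ["GITHUB_TOKEN"]  # assumed env var holding a PAT
      response = requests.post(
          "https://api.github.com/graphql",
          json={"query": QUERY},
          headers={"Authorization": f"bearer {token}"},
      )
      response.raise_for_status()
      print(response.json()["data"]["viewer"])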

  • nyt-2020-election-scraper

    Scrapes The New York Times' 2020 election results data on a schedule and tracks the changes in Git

  • factbook.json

    World Factbook Country Profiles in JSON - Free Open Public Domain Data - No API Key Required ;-)

  • spotify-playlist-archive

    Daily snapshots of public Spotify playlists

  • Project mention: Git Scraping Spotify | news.ycombinator.com | 2023-08-11
  • csv-diff

    Python CLI tool and library for diffing CSV and JSON files

  • gh-action-data-scraping

    Shows how to use GitHub Actions to do periodic data scraping

  • Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

    i do this as a demo: https://github.com/swyxio/gh-action-data-scraping

    but conveniently it also serves as a way to track the downtime of github actions, which used to be bad but seems to be fine the last couple months: https://github.com/swyxio/gh-action-data-scraping/assets/676...
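
    The pattern that demo illustrates is small enough to sketch; assuming a hypothetical JSON endpoint, the script a scheduled workflow would run on each cron tick looks roughly like this (the endpoint and file name are placeholders):

      from pathlib import Path

      import requests

      URL = "https://example.com/data.json"  # hypothetical source to track

      def main() -> None:
          response = requests.get(URL, timeout=30)
          response.raise_for_status()
          # Overwrite the tracked file; the scheduled workflow then runs
          # `git add data.json && git commit` so history records each change.
          Path("data.json").write_text(response.text)

      if __name__ == "__main__":
          main()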

  • california-coronavirus-scrapers

    The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

  • Project mention: Am I wrong- to pull the plug on a European vacation 6 weeks before- my travel partner requested I “behave like in pandemic lockdown”. | /r/amiwrong | 2023-06-19
  • sf-tree-history

    Tracking the history of trees in San Francisco

  • Project mention: Open Data Is Dead | news.ycombinator.com | 2023-11-01

    I think this headline was poorly chosen.

    When I see the term "Open Data" I instantly think of open data portals - mostly run by governments around the world. These things have never been healthier: ten years ago they hardly existed, today you can get civic data from local governments all over the place (last time I saw an attempt to count there were over 4,000 of these portals, and that was a few years ago).

    My favourite example is still this CSV of all 190,000+ trees in San Francisco, which is updated most business days with details of the latest tree changes: https://data.sfgov.org/City-Infrastructure/Street-Tree-List/... - I track changes to it here: https://github.com/simonw/sf-tree-history/

    This article is about something different: it's about what I guess you could call the "Open APIs" movement. Back in the days of Web 2.0 every service was launching an open API, hoping to harness developer attention to help make the platforms more sticky. Facebook and Twitter both did incredibly well out of this strategy, at least at first.

    THOSE APIs are mostly on the way out now. Companies realized that giving away their data for free has a lot of disadvantages.

    Open Data is doing great. Open APIs are not.

  • help-scraper

    Record a history of --help for various commands

  • scrape-hacker-news-by-domain

    Scrape HN to track links from specific domains

  • Project mention: London Street Trees | news.ycombinator.com | 2023-09-07

    Yeah I have a bunch of these using pretty-printed JSON - here's one that scrapes Hacker News for mentions of my site, for example: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...

  • randbats

    Pokémon Showdown's Random Battle sets

  • Project mention: Inaccurate sets on Gen 6 Randbats | /r/stunfisk | 2023-11-10
  • india-isin-data

    International Securities Identification Numbers for various Indian Securities

  • nepstonks

    An automated bot that scrapes the latest upcoming issues, news, and investment opportunities announced in Nepal and sends them to a Telegram channel.

  • quacs-data

    A repository holding all the data used on QuACS.org

  • Project mention: Should I pick a random roommate or try out the roommate search(plus a few other questions)? | /r/RPI | 2023-05-05
  • data

    Latest data on UK food banks from Give Food scraped from our API and republished in various formats. (by givefood)

  • mcbroken-archive

    Archive for data from mcbroken.com.

  • Project mention: Anyone know where I can get a McFlurry? | /r/Columbus | 2023-12-09

    Try mcbroken.com.

  • bchydro-outages

    Track BCHydro Outages via Git history

  • Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

    I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

    A fun way to track how people are using this is with the git-scraping topic on GitHub:

    https://github.com/topics/git-scraping?o=desc&s=updated

    That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

    As I write this, just in the last minute repos that updated include:

    https://github.com/drzax/queensland-traffic-conditions

    https://github.com/jasoncartwright/bbcrss

    https://github.com/jackharrhy/metrobus-timetrack-history

    https://github.com/outages/bchydro-outages

  • carbon-intensity-forecast-tracking

    The reliability of the National Grid's Carbon Intensity forecast

  • Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

    I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. Now have several months' data about the quality of the model and forecast published here: https://carbonintensity.org.uk/ . Thanks for the inspiration!

    https://github.com/nmpowell/carbon-intensity-forecast-tracki...

  • mastodon-scraping

    Repository for scraping public information from Mastodon

  • Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

    Thanks for linking to the topic, that was interesting

    As a heads up to anyone trying this stunt, please be mindful that git-diff is ultimately a line oriented action (yeah, yeah, "git stores snapshots")

    For example https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all :fu: because git sees basically the "first line" changed

    However, had the author normalized the instances.json with something like "jq -S" then one would end up with a more reasonable 1736 textual changes, which github would have almost certainly rendered

      diff -u \
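
    That `jq -S` normalization can also be done in the scraper itself; a minimal Python sketch (file name assumed) that re-serializes the snapshot so git's line-oriented diff stays readable:

      import json
      from pathlib import Path

      # Pretty-print with sorted keys, one value per line - the Python
      # equivalent of `jq -S .` - so small changes show up as small diffs
      # instead of one enormous changed line.
      snapshot = json.loads(Path("instances.json").read_text())
      Path("instances.json").write_text(
          json.dumps(snapshot, indent=2, sort_keys=True) + "\n"
      )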

  • lvms-events

    LVMS Events iCal feed

  • Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

    Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.

    I think it's fine to use the term "scraping" to refer to downloading a JSON file.

    These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.

    I do run Git scrapers that process HTML as well. A couple of examples:

    scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.

    scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
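
    Since the commit history is where the value lives, the only other moving part is committing a snapshot when it actually changed. A hedged sketch of that step, assuming the scraper has already written data.json inside a git checkout:

      import subprocess

      def commit_if_changed(path: str = "data.json") -> None:
          # Stage the snapshot; `git diff --cached --quiet` exits non-zero
          # when the staged file differs from the last commit.
          subprocess.run(["git", "add", path], check=True)
          changed = subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0
          if changed:
              subprocess.run(["git", "commit", "-m", "Latest data"], check=True)

      if __name__ == "__main__":
          commit_if_changed()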

  • metrobus-timetrack-history

    Tracking Metrobus location data

  • Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10

    Same comment as quoted above under bchydro-outages; metrobus-timetrack-history appears among the recently updated repos listed there.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

git-scraping related posts

  • Anyone know where I can get a McFlurry?

    1 project | /r/Columbus | 9 Dec 2023
  • Game Thread: St Louis Blues (8-6-1) at Los Angeles Kings (9-3-3) - 18 Nov 2023 - 07:30PM PST

    1 project | /r/hockey | 20 Nov 2023
  • Inaccurate sets on Gen 6 Randbats

    1 project | /r/stunfisk | 10 Nov 2023
  • London Street Trees

    5 projects | news.ycombinator.com | 7 Sep 2023
  • Why are McDonald's ice cream machines always broken?

    1 project | /r/technology | 3 Sep 2023
  • iFixit Petitions Government for Right to Hack McDonald's Ice Cream Machine

    1 project | news.ycombinator.com | 29 Aug 2023
  • Git scraping: track changes over time by scraping to a Git repository

    18 projects | news.ycombinator.com | 10 Aug 2023