gh-action-data-scraping VS scrape-hacker-news-by-domain

Compare gh-action-data-scraping vs scrape-hacker-news-by-domain and see how they differ.

gh-action-data-scraping

Shows how to use GitHub Actions to do periodic data scraping (by swyxio)

scrape-hacker-news-by-domain

Scrape HN to track links from specific domains (by simonw)
               gh-action-data-scraping    scrape-hacker-news-by-domain
Mentions       1                          4
Stars          212                        34
Growth         -                          -
Activity       0.0                        9.9
Last commit    6 days ago                 9 days ago
Language       JavaScript                 JavaScript
License        MIT License                -
The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

gh-action-data-scraping

Posts with mentions or reviews of gh-action-data-scraping. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-08-10.

scrape-hacker-news-by-domain

Posts with mentions or reviews of scrape-hacker-news-by-domain. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-09-07.
  • London Street Trees
    5 projects | news.ycombinator.com | 7 Sep 2023
    Yeah I have a bunch of these using pretty-printed JSON - here's one that scrapes Hacker News for mentions of my site, for example: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
  • Git scraping: track changes over time by scraping to a Git repository
    18 projects | news.ycombinator.com | 10 Aug 2023
    Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.

    I think it's fine to use the term "scraping" to refer to downloading a JSON file.

    These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.

    I do run Git scrapers that process HTML as well. A couple of examples:

    scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.

    scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
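As a rough illustration of the Git scraping idea described in the comment above, the sketch below writes fetched data as pretty-printed JSON and reports whether the file changed, so that a scheduled job only commits when there is something new. This is not code from either repository. The function name `save_snapshot` and the choice of sorted keys with two-space indentation are assumptions made for the example.

```python
import json
from pathlib import Path


def save_snapshot(data, path):
    """Write data as pretty-printed JSON; return True if the file content changed.

    Hypothetical helper for illustration only, not taken from the repositories.
    """
    path = Path(path)
    # Sorted keys and fixed indentation keep the output deterministic, so the
    # Git diff only shows real changes in the data.
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False  # nothing new, so there is nothing to commit
    path.write_text(new_text, encoding="utf-8")
    return True
```

A scheduled job could call this after each fetch and run `git commit` only when it returns True, which is what turns a static source into a history of changes.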

  • Ask HN: Small scripts, hacks and automations you're proud of?
    49 projects | news.ycombinator.com | 12 Mar 2023
    I have a neat Hacker News scraping setup that I'm really pleased with.

    The problem: I want to know when content from one of my sites is submitted to Hacker News, and keep track of the points and comments over time. I also want to be alerted when it happens.

    Solution: https://github.com/simonw/scrape-hacker-news-by-domain/

    This repo does a LOT of things.

    It's an implementation of my Git scraping pattern - https://simonwillison.net/2020/Oct/9/git-scraping/ - in that it runs a script once an hour to check for more content.

    It scrapes https://news.ycombinator.com/from?site=simonwillison.net (scraping the HTML because this particular feature isn't supported by the Hacker News API) using shot-scraper - a tool I built for command-line browser automation: https://shot-scraper.datasette.io/

    The scraper works by running this JavaScript against the page and recording the resulting JSON to the Git repository: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...

    That solves the "monitor and record any changes" bit.

    But... I want alerts when my content shows up.

    I solve that using three more tools I built: https://datasette.io/ and https://datasette.io/plugins/datasette-atom and https://datasette.cloud/

    This script here runs to push the latest scraped JSON to my SQLite database hosted using my in-development SaaS platform, Datasette Cloud: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...

    I defined this SQL view https://simon.datasette.cloud/data/hacker_news_posts_atom which shows the latest data in the format required by the datasette-atom plugin.

    Which means I can subscribe to the resulting Atom feed (add .atom to that URL) in NetNewsWire and get alerted when my content shows up on Hacker News!

    I wrote a bit more about how this all works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
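The "track points and comments over time, and alert on new submissions" goal in the comment above can be pictured as a comparison between two scraped snapshots. The sketch below is only an assumption about how such a comparison might look; the comment itself relies on Git history and an Atom feed instead. The field names `id`, `points`, and `comments` are hypothetical and may not match the JSON the real scraper produces.

```python
def diff_snapshots(previous, current):
    """Return (new_items, changed_items) between two lists of post dicts.

    Each post is assumed to be a dict with hypothetical keys
    "id", "points" and "comments".
    """
    prev_by_id = {post["id"]: post for post in previous}
    new_items, changed_items = [], []
    for post in current:
        old = prev_by_id.get(post["id"])
        if old is None:
            # Not seen in the previous snapshot: a new submission.
            new_items.append(post)
        elif (old["points"], old["comments"]) != (post["points"], post["comments"]):
            # Seen before, but the score or comment count has moved.
            changed_items.append(post)
    return new_items, changed_items
```

Items in the first list are the ones that would trigger an alert; items in the second list show how an existing submission is doing.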

  • Datasette’s new JSON write API: The first alpha of Datasette 1.0
    3 projects | news.ycombinator.com | 2 Dec 2022
    I'm really pleased with the Hacker News scraping demo in this - it's an extension of the scraper I wrote back in March, using shot-scraper to execute JavaScript in headless Chrome and write the resulting JSON back to a Git repo: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

    My new demo also then pipes that data up to Datasette using curl -X POST - this script here: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
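The `curl -X POST` step mentioned above could look roughly like the following in Python. This is a sketch that assumes the Datasette 1.0 alpha write API shape (a POST to `/<database>/<table>/-/insert` with a JSON body containing `rows` and a bearer token); the linked script is truncated in the source, so its exact URL, table name, and options are not reproduced here. The function name and every argument value in the test are placeholders.

```python
import json
import urllib.request


def build_insert_request(base_url, database, table, rows, token):
    """Build (but do not send) a POST request that inserts rows into a table.

    Assumes the Datasette write API endpoint /<database>/<table>/-/insert.
    """
    url = f"{base_url.rstrip('/')}/{database}/{table}/-/insert"
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The API token is passed as a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately keeps the example inspectable without network access.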

What are some alternatives?

When comparing gh-action-data-scraping and scrape-hacker-news-by-domain you can also consider the following projects:

hun_law_py - Tools for parsing Hungarian legal documents

scrape-san-mateo-fire-dispatch

bchydro-outages - Track BCHydro Outages via Git history

shot-scraper - A command-line utility for taking automated screenshots of websites

hun_law_rs - Tool for parsing Hungarian laws (Rust version)

zettelkasten - Creating notes with the Zettelkasten note-taking method and storing all notes on GitHub

gesetze-im-internet - Archive of German legal acts (weekly archive of gesetze-im-internet.de)

metrobus-timetrack-history - Tracking Metrobus location data

sf-tree-history - Tracking the history of trees in San Francisco

queensland-traffic-conditions - A scraper that tracks changes to the published Queensland traffic incidents data