hun_law_rs
scrape-hacker-news-by-domain
| hun_law_rs | scrape-hacker-news-by-domain |
---|---|---|
Mentions | 1 | 4 |
Stars | 9 | 49 |
Growth | - | - |
Activity | 2.7 | 10.0 |
Last commit | over 1 year ago | 5 days ago |
Language | Rust | JavaScript |
License | GNU General Public License v3.0 only | - |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
hun_law_rs
scrape-hacker-news-by-domain
- London Street Trees
Yeah I have a bunch of these using pretty-printed JSON - here's one that scrapes Hacker News for mentions of my site, for example: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
- Git scraping: track changes over time by scraping to a Git repository
Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.
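As a rough illustration of why the commit history carries the value, here is a minimal Python sketch of the "snapshot, then commit only on change" loop. The function names, the commit message, and the file layout are hypothetical and not taken from any of the repositories mentioned; it also assumes `git` is installed and the target directory is already a Git repository.

```python
import json
import subprocess
from pathlib import Path


def write_snapshot(path: Path, data) -> bool:
    """Write data as pretty-printed JSON; return True if the file content changed."""
    # Stable formatting (sorted keys, fixed indent) keeps diffs limited to real changes.
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


def commit_snapshot(repo_dir: Path, filename: str, message: str) -> None:
    """Stage and commit one file; assumes repo_dir is an existing Git repository."""
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)


def record(repo_dir: Path, filename: str, data) -> bool:
    """Snapshot data into the repo, committing only when something changed."""
    changed = write_snapshot(repo_dir / filename, data)
    if changed:
        # Hypothetical commit message; any descriptive text would do.
        commit_snapshot(repo_dir, filename, "Latest data")
    return changed
```

Run on a schedule, each commit then marks a moment when the source actually changed, and `git log -p` on the snapshot file shows what changed.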
I think it's fine to use the term "scraping" to refer to downloading a JSON file.
These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.
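A small Python sketch of grabbing such a JSON endpoint directly, using only the standard library. The helper names are hypothetical, and it assumes the endpoint returns a JSON array of objects. Real undocumented endpoints vary and may need extra headers or parameters.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 30.0):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(items, keys):
    """Keep only the chosen keys from each record so the saved snapshot stays small."""
    return [{k: item.get(k) for k in keys} for item in items]
```

No HTML parsing step is involved: the parsed result is already a list of dictionaries that can be filtered and saved.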
I do run Git scrapers that process HTML as well. A couple of examples:
scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.
scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
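To make the HTML-to-JSON step concrete, here is a hedged Python sketch that turns a simple HTML table into JSON records with the standard library's `html.parser`. It is not how either repository above does it (the second one runs JavaScript in a headless browser via shot-scraper), and the sample markup is invented for illustration rather than copied from the dispatch page.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html: str):
    """Treat the first row as headers and return the remaining rows as dicts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample markup, not the real dispatch page.
SAMPLE = """
<table>
  <tr><th>time</th><th>type</th></tr>
  <tr><td>12:01</td><td>Medical</td></tr>
  <tr><td>12:07</td><td>Alarm</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
print(json.dumps(records, indent=2))
```

Saving both the raw HTML and the converted JSON, as the first example does, keeps the original available in case the conversion logic needs to change later.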
- Ask HN: Small scripts, hacks and automations you're proud of?
- Datasette’s new JSON write API: The first alpha of Datasette 1.0
I'm really pleased with the Hacker News scraping demo in this - it's an extension of the scraper I wrote back in March, using shot-scraper to execute JavaScript in headless Chrome and write the resulting JSON back to a Git repo: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
My new demo also then pipes that data up to Datasette using curl -X POST - this script here: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
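For readers who prefer Python to `curl`, the sketch below builds an equivalent POST request. It assumes the Datasette 1.0 alpha write API shape: an `/<database>/<table>/-/insert` endpoint, a `{"rows": [...]}` body, and a bearer token. Check this against the Datasette documentation for the version in use. The function names and the example host are hypothetical, and the linked script may differ in detail.

```python
import json
from urllib.request import Request, urlopen


def build_insert_request(base_url: str, database: str, table: str, rows, token: str) -> Request:
    """Build (but do not send) a POST that inserts rows into a Datasette table."""
    # Assumed endpoint shape: /<database>/<table>/-/insert
    url = f"{base_url.rstrip('/')}/{database}/{table}/-/insert"
    body = json.dumps({"rows": rows}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def send(request: Request, timeout: float = 30.0):
    """Send the request and return the parsed JSON response."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it easy to inspect the URL, headers, and body before anything touches the network.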
What are some alternatives?
carbon-intensity-forecast-tracking - The reliability of the National Grid's Carbon Intensity forecast
shot-scraper - A command-line utility for taking automated screenshots of websites
github-actions - Information and tips regarding GitHub Actions
scrape-san-mateo-fire-dispatch
metrobus-timetrack-history - Tracking Metrobus location data
semanticText - Copy paste tool that analyzes the semantic description of all text in the DOM
gesetze-im-internet - Archive of German legal acts (weekly archive of gesetze-im-internet.de)
sf-tree-history - Tracking the history of trees in San Francisco
bchydro-outages - Track BCHydro Outages via Git history
bbcrss - Scrapes the headlines from BBC News indexes every five minutes
queensland-traffic-conditions - A scraper that tracks changes to the published Queensland traffic incidents data