dude vs shot-scraper

Project	dude	shot-scraper
Mentions	28	16
Stars	412	1,541
Growth	-	-
Activity	9.0	7.1
Latest Commit	7 days ago	about 2 months ago
Language	Python	Python
License	GNU Affero General Public License v3.0	Apache License 2.0

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
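
The exact formula behind the activity number is not published here, so the following is only a sketch of one way such a score could work. It assumes an exponential decay for commit age (the `HALF_LIFE_DAYS` value is hypothetical) and a percentile rank scaled to 0-10, which is consistent with "9.0 means top 10%" but is not confirmed by the site.

```python
from datetime import date

# Hypothetical half-life: a commit this many days old counts half as much.
HALF_LIFE_DAYS = 30


def weighted_commits(commit_dates, today):
    """Sum commits, weighting recent ones more heavily than older ones."""
    return sum(0.5 ** ((today - d).days / HALF_LIFE_DAYS) for d in commit_dates)


def activity_scores(raw):
    """Map raw weights to a 0-10 score by percentile rank among all projects.

    `raw` is a dict of project name -> weighted commit total. A project that
    beats 90% of the tracked projects scores 9.0 under this assumption.
    """
    total = len(raw)
    scores = {}
    for name, value in raw.items():
        below = sum(1 for other in raw.values() if other < value)
        scores[name] = round(10 * below / total, 1)
    return scores
```

Because the score is a rank rather than an absolute count, it only has meaning relative to the other projects being tracked.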

dude

Posts with mentions or reviews of dude. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-03-25.

Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next
1 project | /r/webscraping | 8 May 2023

Please check https://github.com/roniemartinez/dude
Need help with downloading a section of multiple sites as pdf files.
2 projects | /r/webscraping | 25 Mar 2023

You can use my library which also uses Playwright. I have an example here: https://github.com/roniemartinez/dude/discussions/116
Why do you use python for web scraping?
1 project | /r/webscraping | 11 Oct 2022

I also built a framework so I can easily switch between these libraries with less code change (still on hiatus for a few months before going back to it): https://github.com/roniemartinez/dude
Thank GOD for Poetry!
3 projects | /r/Python | 28 Sep 2022

There are a lot of options, but I am quite happy with GitHub Actions workflows + Poetry, as it handles tests and publishing to PyPI. As an example, my workflows deploy to TestPyPI and PyPI here: https://github.com/roniemartinez/dude/tree/master/.github/workflows
What stack or tools are you using for ensuring code quality and best practices in medium and large codebases ?
2 projects | /r/Python | 15 Sep 2022

But for documentation, I use mkdocs-material, as it works with only minor customization and changes can be easily deployed on GitHub: https://roniemartinez.github.io/dude/
Is there any thing Beautifulsoup can do that Scrapy can not?
1 project | /r/webscraping | 8 Aug 2022
Screenshotting site, but remove all popups.
1 project | /r/webscraping | 21 Jul 2022

Add an adblocker. I implemented Dude/pydude with this, and page results are clean without ads and pop-ups. For the screenshot, here is an example: https://github.com/roniemartinez/dude/discussions/116
which Python Library is best for scraping?
2 projects | /r/webscraping | 26 Jun 2022

You can also use my library if you want things to be simpler:) https://github.com/roniemartinez/dude
For those of you using Python, what is your go to library to build your scraper?
1 project | /r/webscraping | 4 Jun 2022

I use my own library, Dude! https://github.com/roniemartinez/dude
Building a (relatively) easily adaptable, flexible web scraper (seeking conceptual advice)
1 project | /r/webscraping | 11 Apr 2022

I built a web scraper that is simple to use, though it is still a work in progress: https://github.com/roniemartinez/dude

shot-scraper

Posts with mentions or reviews of shot-scraper. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-04-15.

I want to create IMDB for Open source projects
6 projects | news.ycombinator.com | 15 Apr 2024

I had one of these recently! https://github.com/simonw/shot-scraper/pull/133/files
They're /incredibly/ rare though.
2024-03-01 listening in on the neighborhood
5 projects | news.ycombinator.com | 2 Mar 2024
If anyone wants the raw data, it's available in the `window._Flourish_data` variable on https://flo.uri.sh/visualisation/16818696/embed
Which means you can extract it with my https://shot-scraper.datasette.io/ tool like this:
```
    shot-scraper javascript \
```
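
The command above is cut off in the source, so this is not a reconstruction of it. As a rough sketch of the same idea, the Python below shells out to `shot-scraper javascript URL EXPRESSION`, which evaluates a JavaScript expression in the page and prints the result as JSON. The helper names `build_command` and `extract` are hypothetical, and the sketch assumes `shot-scraper` is installed and on the PATH.

```python
import json
import subprocess


def build_command(url, expression):
    """Build the argv list for `shot-scraper javascript URL EXPRESSION`."""
    return ["shot-scraper", "javascript", url, expression]


def extract(url, expression):
    """Run the expression against the page and parse the JSON it prints.

    Requires shot-scraper and its browser to be installed; raises
    CalledProcessError if the command exits with a non-zero status.
    """
    result = subprocess.run(
        build_command(url, expression),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Calling `extract("https://flo.uri.sh/visualisation/16818696/embed", "window._Flourish_data")` would then return the page variable as a Python object, provided the page still exposes it.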
Web Scraping in Python – The Complete Guide
11 projects | news.ycombinator.com | 20 Feb 2024

I strongly recommend adding Playwright to your set of tools for Python web scraping. It's by far the most powerful and best designed browser automation tool I've ever worked with.
I use it for my shot-scraper CLI tool: https://shot-scraper.datasette.io/ - which lets you scrape web pages directly from the command line by running JavaScript against pages to extract JSON data: https://shot-scraper.datasette.io/en/stable/javascript.html
A command-line utility for taking automated screenshots of websites
1 project | news.ycombinator.com | 15 Dec 2023
Don’t Build a General Purpose API to Power Your Own Front End (2021)
3 projects | news.ycombinator.com | 20 Aug 2023

This is exactly what the `Accept` HTTP header is for https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ac...
I think the author is generally correct that all JSON should be provided in a single request, but if you want to prove it, then you should be able to switch your `Accept` header between `application/json` and `text/html` and see nearly identical data.
In fact, this is what both GitLab and GitHub do. Try it out!
`curl -L https://github.com/simonw/shot-scraper` (text/html)
`curl --header "Accept: application/json" -L https://github.com/simonw/shot-scraper` (application/json)
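
The same comparison can be sketched in Python with the standard library. The helper names `build_request` and `fetch` are hypothetical; `urlopen` follows redirects by default, much like `curl -L`. Whether a given server honours the header is up to that server, so the returned content type should be checked rather than assumed.

```python
import urllib.request


def build_request(url, accept):
    """Create a request that asks for a specific media type via Accept."""
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch(url, accept):
    """Return (content type the server actually sent, raw body bytes)."""
    with urllib.request.urlopen(build_request(url, accept)) as response:
        return response.headers.get_content_type(), response.read()
```

Calling `fetch` twice on the same URL, once with `text/html` and once with `application/json`, and comparing the two content types shows whether the server negotiates on the header.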
Git scraping: track changes over time by scraping to a Git repository
18 projects | news.ycombinator.com | 10 Aug 2023

Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.
I think it's fine to use the term "scraping" to refer to downloading a JSON file.
These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.
I do run Git scrapers that process HTML as well. A couple of examples:
scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.
scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
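The core of the Git scraping pattern described above can be sketched as follows: write the freshly fetched data to a file in a repository and commit only when it differs from the last snapshot, so the commit history becomes the change log. This is a minimal illustration, not the code used in the linked repositories; the function names and the default commit message are hypothetical, and it assumes `git` is installed and `repo_dir` is already a Git repository.

```python
import subprocess
from pathlib import Path


def write_if_changed(path, content):
    """Write bytes to path; return False if the file already holds them."""
    path = Path(path)
    if path.exists() and path.read_bytes() == content:
        return False
    path.write_bytes(content)
    return True


def commit_snapshot(repo_dir, filename, content, message="Latest data"):
    """Commit the new snapshot only when the scraped content has changed."""
    if not write_if_changed(Path(repo_dir) / filename, content):
        return False
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
    return True
```

Run on a schedule, each commit then records exactly one observed change in the source.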
Web Scraping via JavaScript Runtime Heap Snapshots (2022)
1 project | news.ycombinator.com | 8 Aug 2023
Need help with downloading a section of multiple sites as pdf files.
2 projects | /r/webscraping | 25 Mar 2023

You can use shot-scraper: https://github.com/simonw/shot-scraper
Ask HN: Small scripts, hacks and automations you're proud of?
49 projects | news.ycombinator.com | 12 Mar 2023

I have a neat Hacker News scraping setup that I'm really pleased with.
The problem: I want to know when content from one of my sites is submitted to Hacker News, and keep track of the points and comments over time. I also want to be alerted when it happens.
Solution: https://github.com/simonw/scrape-hacker-news-by-domain/
This repo does a LOT of things.
It's an implementation of my Git scraping pattern - https://simonwillison.net/2020/Oct/9/git-scraping/ - in that it runs a script once an hour to check for more content.
It scrapes https://news.ycombinator.com/from?site=simonwillison.net (scraping the HTML because this particular feature isn't supported by the Hacker News API) using shot-scraper - a tool I built for command-line browser automation: https://shot-scraper.datasette.io/
The scraper works by running this JavaScript against the page and recording the resulting JSON to the Git repository: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
That solves the "monitor and record any changes" bit.
But... I want alerts when my content shows up.
I solve that using three more tools I built: https://datasette.io/ and https://datasette.io/plugins/datasette-atom and https://datasette.cloud/
This script here runs to push the latest scraped JSON to my SQLite database hosted using my in-development SaaS platform, Datasette Cloud: https://github.com/simonw/scrape-hacker-news-by-domain/blob/...
I defined this SQL view https://simon.datasette.cloud/data/hacker_news_posts_atom which shows the latest data in the format required by the datasette-atom plugin.
Which means I can subscribe to the resulting Atom feed (add .atom to that URL) in NetNewsWire and get alerted when my content shows up on Hacker News!
I wrote a bit more about how this all works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
Show HN: Plus – Self Updating Screenshots
3 projects | news.ycombinator.com | 17 Jan 2023

Sounds a lot like Simon Willison's open source project shot-scraper
https://github.com/simonw/shot-scraper

What are some alternatives?

When comparing dude and shot-scraper you can also consider the following projects:

Edu-Mail-Generator - Generate Free Edu Mail(s) within minutes

gmail-sidebar-drive - A simple gmail add on to display all the drive folders and files in sidebar.

python-web-scraping-primjeri - web scraping of the sites posta.hr, konzum.hr, index.hr, njuskalo.hr, neostar.com, DasWeltAuto.hr, ...

zettelkasten - Creating notes with the zettelkasten note taking method and storing all notes on github

scrapy-playwright - 🎭 Playwright integration for Scrapy

scrape-san-mateo-fire-dispatch

FastDepends - FastDepends - FastAPI Dependency Injection system extracted from FastAPI and cleared of all HTTP logic. Async and sync modes are both supported.

bbcrss - Scrapes the headlines from BBC News indexes every five minutes

HomeHarvest - Python package for real estate scraping of MLS listing data [Moved to: https://github.com/Bunsly/HomeHarvest]

scrape-hacker-news-by-domain - Scrape HN to track links from specific domains

dnd-roll-parser - Python project that will take the saved html chat log and calculate the average rolls per player.

SeleniumBase - 📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
