Anyone Experienced with Crawling Websites?

hn-search

1,618 524 2.9 TypeScript

Hacker News Search

1. The precedent (so far) is that scraping is legal if the scraped data is publicly available[A].
2. I guess the best approach depends on what data you're scraping. For some data, it's fine to first convert the page to plain text, then scrape that.
For structured data like tables and HTML, you're better off using the structure of the HTML itself.
I suppose you could design a framework that covers all the common tasks, then feed the framework parameters for each site.
It's not just handling different sites: the same site will change over time, and there will be oddities between pages/items on the same site.
[A]: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
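As a rough illustration of the "framework plus per-site parameters" idea, here is a minimal sketch using only the Python standard library. It reads tables through the HTML structure rather than flattened text, and takes the site-specific details from a small config object. `SiteConfig`, its fields, and the sample page are all hypothetical names made up for this example; a real site would likely need more parameters (selectors, pagination, retries).

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


@dataclass
class SiteConfig:
    """Hypothetical per-site parameters (an assumption for this sketch)."""
    table_index: int           # which table on the page holds the data
    columns: dict              # output field name -> column position
    skip_header_rows: int = 1  # leading rows to ignore


def extract(html, config):
    """Return one dict per usable row of the configured table."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    rows = parser.tables[config.table_index][config.skip_header_rows:]
    needed = max(config.columns.values())
    records = []
    for row in rows:
        # Skip short rows: pages on the same site are not always uniform.
        if len(row) <= needed:
            continue
        records.append({name: row[i] for name, i in config.columns.items()})
    return records


# Made-up sample page, including one malformed row.
SAMPLE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>4.50</td></tr>
  <tr><td>Broken row</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""

EXAMPLE_SITE = SiteConfig(table_index=0, columns={"name": 0, "price": 1})
records = extract(SAMPLE, EXAMPLE_SITE)
```

The parsing code stays the same across sites and only the `SiteConfig` values change. When a site changes its layout, the fix is ideally a config edit rather than new code, though in practice some sites will still need custom handling. The sketch does not handle nested tables.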

changedetection.io

196 14,870 9.5 Python

A free, open-source web page change detection, website watcher, restock monitor, and notification service. Designed for simplicity: monitor which websites had a text change. Also covers website defacement monitoring and price change notification.

You can use an open-source tool like this one: https://github.com/dgtlmoon/changedetection.io
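For a sense of what text change detection involves, here is a minimal sketch of the general idea: reduce a page to its visible text, fingerprint it, and compare against the previous fingerprint. This is not how changedetection.io is implemented; the function names `text_fingerprint` and `has_changed` are assumptions made up for this example, and fetching, scheduling, and notifications are left out.

```python
import hashlib
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def text_fingerprint(html):
    """Hash the page's visible text with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so formatting-only edits do not count as changes.
    text = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, html):
    """True if the page text differs from the stored fingerprint."""
    return text_fingerprint(html) != previous_fingerprint


# Made-up page snapshots for illustration.
baseline = text_fingerprint("<p>Price: 10</p>")
same_text = "<p>Price:   10</p>\n<script>var t = 1;</script>"
new_price = "<p>Price: 12</p>"
```

Here `has_changed(baseline, same_text)` is false because only whitespace and a script differ, while `has_changed(baseline, new_price)` is true. Storing a hash only tells you that something changed; to report what changed, you would keep the extracted text and diff it.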

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
