crux-top-lists
Downloadable snapshots of the Chrome Top Million Websites pulled from public CrUX data in Google BigQuery.
If you are interested in researching the technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.
It contains data about ~7 million top websites. For every website, it also contains:
- the full content of the main page;
- the verbose output of curl, containing various timing info; the HTTP headers, protocol info...
Using this dataset, you can build a service similar to https://builtwith.com/ for your research.
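As a rough illustration of what such a service does at its core, here is a minimal Python sketch that matches HTTP headers and page content against a few signatures. The rule tables and the `detect_technologies` function are hypothetical and not part of the dataset or of ClickHouse. The sketch assumes the headers and HTML for a site have already been extracted from the dataset into a dict and a string.

```python
import re

# Hypothetical signatures, chosen for illustration only; a real service
# would need a much larger, curated rule set.
HEADER_RULES = {
    "nginx": ("server", re.compile(r"nginx", re.I)),
    "cloudflare": ("server", re.compile(r"cloudflare", re.I)),
    "php": ("x-powered-by", re.compile(r"php", re.I)),
}
HTML_RULES = {
    "wordpress": re.compile(r"/wp-content/", re.I),
    "jquery": re.compile(r"jquery[.\-\w]*\.js", re.I),
}


def detect_technologies(headers, html):
    """Return a sorted list of technology names matched by the rules."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    found = set()
    for tech, (header, pattern) in HEADER_RULES.items():
        if pattern.search(normalized.get(header, "")):
            found.add(tech)
    for tech, pattern in HTML_RULES.items():
        if pattern.search(html):
            found.add(tech)
    return sorted(found)


if __name__ == "__main__":
    sample_headers = {"Server": "nginx/1.18.0", "X-Powered-By": "PHP/8.1"}
    sample_html = '<script src="/wp-content/themes/x/jquery.min.js"></script>'
    print(detect_technologies(sample_headers, sample_html))
```

Keeping the rules in plain tables makes it easy to add signatures without touching the matching logic. For the full dataset, the same patterns would probably be better expressed as SQL conditions so the matching runs inside ClickHouse.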
Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).
Description: https://github.com/ClickHouse/ClickHouse/issues/18842
You can easily try it with clickhouse-local without downloading:
$ curl https://clickhouse.com/ | sh
It's a tough thing to balance, but generally, bringing in someone's personal details as ammunition in an internet argument is not ok on HN (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...). I'm not saying those are never relevant, but
Yes, it's continuously updated.
The source code is here: https://github.com/ClickHouse/github-explorer
This shell script updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...
Related posts
- We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions
- 1 billion rows challenge in PostgreSQL and ClickHouse
- We Executed a Critical Supply Chain Attack on PyTorch
- Tell HN: Hacker News dataset on BigQuery hasn't been updated since Nov 2022
- Real-Time Data Enrichment and Analytics With RisingWave and ClickHouse