Cached Chrome Top Million Websites

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ClickHouse

    ClickHouse® is a free analytics DBMS for big data

  • If you are interested in the research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

    It contains data about ~7 million top websites. For every website, it also contains the full content of the main page and the verbose output of curl, which includes various timing info, the HTTP headers, protocol info...

    Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

    Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

    Description: https://github.com/ClickHouse/ClickHouse/issues/18842

    You can easily try it with clickhouse-local without downloading:

      $ curl https://clickhouse.com/ | sh
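
As a rough illustration of the builtwith-style idea above, here is a minimal Python sketch that pulls response headers out of `curl -v` style output and matches them against a small signature table. It relies only on the fact that curl's verbose mode prefixes response header lines with `< `. The `SIGNATURES` table, the function names, and the sample text are all hypothetical; nothing here assumes how the dataset itself stores the curl output.

```python
import re

# Hypothetical signature table: (header name, pattern, label).
# A real service would need a far larger, curated list.
SIGNATURES = [
    ("server", re.compile(r"nginx", re.I), "nginx"),
    ("server", re.compile(r"cloudflare", re.I), "Cloudflare"),
    ("x-powered-by", re.compile(r"php", re.I), "PHP"),
]


def response_headers(verbose_output: str) -> dict[str, str]:
    """Collect response headers from curl verbose output.

    Lines starting with "< " are response lines; the status line
    (e.g. "< HTTP/1.1 200 OK") has no colon and is skipped.
    """
    headers = {}
    for line in verbose_output.splitlines():
        if not line.startswith("< "):
            continue
        name, sep, value = line[2:].partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def detect_technologies(headers: dict[str, str]) -> list[str]:
    """Return labels whose pattern matches the corresponding header value."""
    found = []
    for name, pattern, label in SIGNATURES:
        if pattern.search(headers.get(name, "")) and label not in found:
            found.append(label)
    return found


# Made-up sample in the shape of `curl -v` output, for illustration only.
sample = (
    "* Connected to example.com\n"
    "> GET / HTTP/1.1\n"
    "> Host: example.com\n"
    "< HTTP/1.1 200 OK\n"
    "< Server: nginx/1.18.0\n"
    "< X-Powered-By: PHP/7.4.3\n"
    "< Content-Type: text/html\n"
)
print(detect_technologies(response_headers(sample)))
```

Header matching alone misses anything that only shows up in the page body, so a fuller version would probably also scan the stored main-page content.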

  • crux-top-lists

    Downloadable snapshots of the Chrome Top Million Websites pulled from public CrUX data in Google BigQuery.

  • It's a tough thing to balance, but generally, bringing in someone's personal details as ammunition in an internet argument is not ok on HN (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...). I'm not saying those are never relevant, but


  • github-explorer

    Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask)

  • Yes, it's continuously updated.

    The source code is here: https://github.com/ClickHouse/github-explorer

    This shell script updates it: https://github.com/ClickHouse/github-explorer/blob/main/upda...

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
