Public datasets to practice PySpark

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • tpch-dbgen

    TPC-H dbgen

  • I prefer the TPC-H dataset. It's a normalized eight-table data model (often loosely called a star schema) that the major players use for benchmarking. You can generate data at whatever size you need with tpch-dbgen (https://github.com/electrum/tpch-dbgen). The data model is described in the Snowflake docs (https://docs.snowflake.com/en/user-guide/sample-data-tpch.html).
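
As a starting point for practice, here is a minimal sketch of loading the generated LINEITEM table into PySpark. It assumes dbgen has written a pipe-delimited `lineitem.tbl` file in which every row ends with a trailing `|`, so the schema carries one extra throwaway column. The helper names (`lineitem_ddl`, `load_lineitem`), the `_trailing` column, and the file path in the usage comment are illustrative assumptions, not part of the tool or the original post.

```python
# Column names and Spark SQL types for the TPC-H LINEITEM table, in file order.
LINEITEM_COLUMNS = [
    ("l_orderkey", "BIGINT"),
    ("l_partkey", "BIGINT"),
    ("l_suppkey", "BIGINT"),
    ("l_linenumber", "INT"),
    ("l_quantity", "DECIMAL(15,2)"),
    ("l_extendedprice", "DECIMAL(15,2)"),
    ("l_discount", "DECIMAL(15,2)"),
    ("l_tax", "DECIMAL(15,2)"),
    ("l_returnflag", "STRING"),
    ("l_linestatus", "STRING"),
    ("l_shipdate", "DATE"),
    ("l_commitdate", "DATE"),
    ("l_receiptdate", "DATE"),
    ("l_shipinstruct", "STRING"),
    ("l_shipmode", "STRING"),
    ("l_comment", "STRING"),
]


def lineitem_ddl():
    """Build a DDL schema string, plus a dummy column for the trailing '|'."""
    cols = LINEITEM_COLUMNS + [("_trailing", "STRING")]
    return ", ".join(f"{name} {dtype}" for name, dtype in cols)


def load_lineitem(spark, path):
    """Read a dbgen-style lineitem file with an existing SparkSession."""
    df = spark.read.csv(path, sep="|", schema=lineitem_ddl())
    # Drop the empty column produced by the trailing delimiter.
    return df.drop("_trailing")


# Usage (hypothetical path; requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("tpch-practice").getOrCreate()
#   lineitem = load_lineitem(spark, "data/lineitem.tbl")
#   lineitem.groupBy("l_returnflag", "l_linestatus").count().show()
```

Passing an explicit schema avoids a second pass over the file for type inference, which matters once the generated data grows large. The other seven tables can be loaded the same way with their own column lists.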

