Build your own “data lake” for reporting purposes

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • sgr

    sgr (command line client for Splitgraph) and the splitgraph Python library

    It's really cool to see these techniques in the wild. We're doing something very similar to implement our "Data Delivery Network" at Splitgraph [0] [1]. Recently we've started calling Splitgraph a "Data Mesh" [2]. As long as we have a plugin [3] for a data source, users can connect external data sources to Splitgraph and make them addressable alongside all the other data on the platform, including versioned snapshots of data called data images. [4] So you can `SELECT FROM namespace/repo:tag` where `tag` can refer to an immutable version of the data, or e.g. `live` to route to route to a live external data source via FDW. So far we have plugins for Snowflake, CSV in S3 buckets, MongoDB, ElasticSearch, Postgres, and a few others, like Socrata data portals (which we use to index 40k open public datasets).

    Our goal with Splitgraph is to provide a single interface to query and discover data. Our product integrates the discovery layer (a data catalog) with the query layer (a Postgres compatible proxy to data sources, aka a "data mesh" or perhaps "data lake"). This way, we improve both the catalog and the access layer in ways that would be difficult or impossible as separate products. The catalog can index live data without "drift" problems. And since the query layer is a Postgres-compatible proxy, we can apply data governance rules at query time that the user defines in the web catalog (e.g. sharing data, access control, column masking, query whitelisting, rewriting, rate limiting, auditing, firewalling, etc.).

    We like to use GitLab's strategy as an analogy. GitLab may not have the best CI, the best source control, the best Kubernetes deploy orchestration, but by integrating them all together in one platform, they have a multiplicative effect on the platform itself. We think the same logic can apply to the data stack. In our vision of the world, a "data mesh" integrated with a "data catalog" can augment or eventually replace various complicated ETL and warehousing workflows.

    P.S. We're hiring immediately for all-remote Senior Software Engineer positions, frontend and backend [5]

    [0] https://www.splitgraph.com

    [1] We talked about all this in depth on a podcast: https://softwareengineeringdaily.com/2020/11/06/splitgraph-d...

    [2] https://martinfowler.com/articles/data-monolith-to-mesh.html

    [3] https://www.splitgraph.com/blog/foreign-data-wrappers

    [4] https://www.splitgraph.com/docs/concepts/images

    [5] Job posting: https://www.notion.so/splitgraph/Splitgraph-is-Hiring-25b421...
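The `SELECT FROM namespace/repo:tag` addressing described above boils down to a schema-qualified Postgres identifier. A minimal sketch of that idea in plain Python (the helper and the example names are hypothetical; Splitgraph itself resolves the tag server-side):

```python
# Hypothetical helper that builds a Splitgraph-style table reference.
# The schema part ("namespace/repo:tag") is double-quoted because it
# contains "/" and ":"; `tag` may name an immutable data image or a
# routing target like "live" (per the comment above).
def splitgraph_table(namespace: str, repo: str, table: str, tag: str = "latest") -> str:
    return f'"{namespace}/{repo}:{tag}"."{table}"'

# Example: a query string one could send through any Postgres client
# pointed at the Postgres-compatible proxy (names are illustrative).
query = f"SELECT * FROM {splitgraph_table('some-namespace', 'some-repo', 'rows', tag='live')} LIMIT 10"
```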

  • dremio-oss

    Dremio - the missing link in modern data

For my home projects I generate Parquet files (columnar and very well suited for DW-like queries) with pyarrow and use https://github.com/dremio/dremio-oss (https://www.dremio.com/on-prem/) to query them on a lake (MinIO, local disk, or S3), with Apache Superset for quick charts or dashboards.

  • mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

MinIO and NiFi require machines of their own. You're better off with pure Python, and if one wants something lightweight and visually pleasing, Mara [0] or Dagster with Dagit [1] will do the job.

    [0] https://github.com/mara/mara-pipelines

    [1] https://docs.dagster.io/tutorial/execute
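The "pure Python" option mentioned above can be as little as an explicit task graph run in dependency order; frameworks like Mara or Dagster layer scheduling, a UI, and logging on top of this basic idea. A minimal sketch (all task names are hypothetical):

```python
def run_pipeline(tasks, deps):
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # execute dependencies first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order  # actual execution order, for inspection

# A toy extract -> transform -> load pipeline sharing state via a dict.
results = {}
pipeline = {
    "extract": lambda: results.update(raw=[1, 2, 3]),
    "transform": lambda: results.update(clean=[x * 2 for x in results["raw"]]),
    "load": lambda: results.update(loaded=len(results["clean"])),
}
order = run_pipeline(pipeline, {"transform": ["extract"], "load": ["transform"]})
```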

NOTE: The number of mentions on this list counts mentions on common posts plus user-suggested alternatives, so a higher number means a more popular project.
