Analytics Stacks for Startups

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  • dbt-utils

    Utility functions for dbt projects.

  • Add tests: unit tests in SQL are still not really practical, but testing the data before users see it is possible. dbt ships with basic tests like not_null and unique out of the box, and dbt_utils adds more, e.g. comparing data across tables; if you need still more, there are Great Expectations and similar tools. dbt also supports writing SQL queries which output “bad” rows; use this to, for example, check a specific order against manually verified correct data (a minimal sketch follows below). Tests give you confidence that your pipelines produce correct results: nothing is worse than waking up to a Slack message from your boss that the graphs look wrong. They are especially useful when you have to refactor a data pipeline: basically every query you would run during the QA phase of a change request has high potential to become an automatic test.

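    Such “bad rows” queries are easy to automate. Below is a minimal sketch of the idea as a plain Python test against the warehouse; the connection URL, table, and column names are hypothetical (dbt itself would run the SQL natively as a singular test):

```python
# Minimal sketch of a "bad rows" data test, in the spirit of dbt's
# singular tests: the query returns rows that violate an expectation,
# and the test fails if any come back.
import sqlalchemy

# Hypothetical warehouse connection; in practice read this from config.
engine = sqlalchemy.create_engine("postgresql://user:pass@dwh-host/analytics")

BAD_ORDERS_SQL = """
    SELECT order_id, total_amount
    FROM analytics.fct_orders
    WHERE total_amount < 0        -- totals can never be negative
       OR customer_id IS NULL     -- every order belongs to a customer
"""

def test_no_bad_orders():
    with engine.connect() as conn:
        bad_rows = conn.execute(sqlalchemy.text(BAD_ORDERS_SQL)).fetchall()
    # Fail loudly before anyone sees a wrong dashboard.
    assert not bad_rows, f"Found {len(bad_rows)} bad orders, e.g. {bad_rows[:5]}"
```

    Run this under pytest (or as a scheduler step) right after the transformation job, so bad data blocks the pipeline instead of reaching the dashboards.
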
  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

  • dbt is the current de facto standard for data transformation. It comes with solid best practices and a lot of nice features out of the box: from reusing code fragments via macros (e.g. maintaining an ignore-list in a single place instead of all over the codebase) to data tests, data dictionaries, and data lineage. It also has a vast ecosystem of add-on packages, apps, and integrations if you need more: Fivetran, for example, provides generic dbt models for some of their data sources, and there is even a code formatter for the monstrosity that is SQL+Jinja :-). There is also a very active community which is worth checking out!

  • superset

    Apache Superset is a Data Visualization and Data Exploration Platform

  • (Dashboards in Metabase and in Apache Superset)

  • streamlit

    Streamlit — A faster way to build and share data apps.

  • Finally, if you have to offer some kind of interactive UI on top of some data or a machine learning model, Streamlit or Flask are good starting points (a minimal sketch follows below).

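    As an illustration of how little code such a UI needs, here is a minimal Streamlit sketch; the CSV file and column names are hypothetical placeholders:

```python
# Minimal Streamlit sketch: an interactive UI on top of a dataset.
# Run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Orders explorer")

@st.cache_data  # cache the load so interactive reruns stay snappy
def load_orders() -> pd.DataFrame:
    # Hypothetical input file; this could just as well query the DWH.
    return pd.read_csv("orders.csv", parse_dates=["order_date"])

orders = load_orders()

# Streamlit reruns the script whenever the user changes a widget.
country = st.selectbox("Country", sorted(orders["country"].unique()))
filtered = orders[orders["country"] == country]

st.metric("Orders", len(filtered))
st.line_chart(filtered.set_index("order_date")["total_amount"])
```
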
  • dagster

    An orchestration platform for the development, production, and observation of data assets.

  • But if you have more sophisticated use cases, you usually end up with one of the following orchestration tools for your data pipelines: Airflow, Prefect, or Dagster. They all support the above list of features and much more. Airflow is the established tool, but it can be a beast to maintain, especially for personal dev environments, so one of the other options might fit your needs better (a minimal Dagster sketch follows below). I recommend getting a hosted version instead of setting it up yourself: the time spent maintaining the tool is better spent developing analytic pipelines, or improving the developer experience for the team.

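    To make the orchestration idea concrete, here is a minimal sketch using Dagster's asset API; the asset names and toy data are hypothetical:

```python
# Minimal Dagster sketch: two data assets where one depends on the other.
# Dagster infers the dependency from the function parameter name.
import pandas as pd
from dagster import asset, Definitions

@asset
def raw_orders() -> pd.DataFrame:
    # In reality this would come from your ingestion layer; toy data here.
    return pd.DataFrame({"order_id": [1, 2], "total_amount": [10.0, 25.5]})

@asset
def daily_revenue(raw_orders: pd.DataFrame) -> float:
    # Downstream asset: Dagster materializes raw_orders first and
    # passes its output in.
    return float(raw_orders["total_amount"].sum())

defs = Definitions(assets=[raw_orders, daily_revenue])
```
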
  • nodejs-bigquery

    Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.

  • The main DWH offerings that meet the above expectations are Snowflake, Google BigQuery, and Amazon Redshift. Feature-wise, the three offer similar functionality, but there are differences, e.g. how long it takes to spin up new compute resources or how much maintenance work they need. Cost-wise, they seem to end up with similar numbers on your bill, depending on which blog post you read. (A query sketch follows below.)

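    The post links the Node.js BigQuery client; as a sketch of what querying such a DWH looks like in practice, here is the equivalent with the Python client (google-cloud-bigquery). Project, dataset, and table names are hypothetical, and credentials are assumed to be set up via GOOGLE_APPLICATION_CREDENTIALS:

```python
# Minimal sketch of querying BigQuery from Python.
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment

QUERY = """
    SELECT DATE(order_date) AS day, SUM(total_amount) AS revenue
    FROM `my-project.analytics.fct_orders`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""

# client.query() submits the job; .result() waits for and iterates rows.
for row in client.query(QUERY).result():
    print(f"{row.day}: {row.revenue:.2f}")
```
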
  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • It is usually easier to pay for managed ingestion services than to set up and maintain such ingestion pipelines yourself, especially if your team simply does not have the required expertise. Available services include Fivetran, Stitch, and Airbyte. If using such a service becomes a cost issue, you can still switch to something custom-built, but at that point you will probably already have a whole team of data engineers.

    Beyond getting all the features built into the service without building them yourself, managed services have further advantages: you will not get surprised by changing upstream APIs, you do not need to maintain your own infrastructure, and these services have the edge-case handling figured out. They also partly shield you from changes in upstream APIs or database schemas (you still have to deal with those in your transformations, though).

    Another benefit is that managed ingestion services make it easy to exclude Personally Identifiable Information (PII) from reaching your DWH. This is sometimes the easiest way to make sure that this data does not leak to the wrong persons or, even worse, the internet. Try to keep all PII out of the DWH by default and only include it if needed (and secured!). A minimal sketch follows below.

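    As a sketch of what column-level PII exclusion can look like if you do roll your own, here is a minimal Python example; the column names are hypothetical, and managed services typically offer this as point-and-click column selection instead:

```python
# Minimal sketch of keeping PII out of the DWH by default: drop (or hash)
# sensitive columns before loading anything into the warehouse.
import hashlib
import pandas as pd

# Hypothetical deny-list of PII columns.
PII_COLUMNS = {"email", "full_name", "phone"}

def strip_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Drop PII columns; keep a stable pseudonymous key for joins."""
    out = df.drop(columns=list(PII_COLUMNS & set(df.columns)))
    if "email" in df.columns:
        # One-way hash so rows stay joinable without exposing the address.
        out["customer_key"] = df["email"].map(
            lambda e: hashlib.sha256(e.encode()).hexdigest()
        )
    return out
```
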
Related posts

  • Visions – User defined data type systems

    3 projects | /r/Python | 4 Feb 2022
  • Visions – User defined data type systems

    3 projects | /r/datascience | 4 Feb 2022
  • Show HN: Visions – User defined data type systems

    3 projects | news.ycombinator.com | 1 Feb 2022
  • How to Build a Logistic Regression Model: A Spam-filter Tutorial

    1 project | dev.to | 5 May 2024