Analytics Stacks for Startups

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  • dbt-utils

    Utility functions for dbt projects.

  • Add tests: unit tests in SQL are still not really practical, but testing the data before users see it is possible. dbt ships with basic tests like not_null and unique out of the box, and dbt_utils adds more, e.g. comparing data across tables; if you need still more, there are Great Expectations and similar tools. dbt also supports writing SQL queries which output “bad” rows; use this to, for example, check a specific order against manually verified correct data (a minimal sketch follows below). Tests give you confidence that your pipelines produce correct results: nothing is worse than waking up to a Slack message from your boss that the graphs look wrong. They are especially useful when you have to refactor a data pipeline: basically every query you would run during the QA phase of a change request has high potential to become an automatic test.

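    Such “bad rows” queries are easy to automate. Below is a minimal sketch of the idea as a plain Python test against the warehouse; the connection URL, table, and column names are hypothetical (dbt itself would run the SQL natively as a singular test):

```python
# Minimal sketch of a "bad rows" data test, in the spirit of dbt's
# singular tests: the query returns rows that violate an expectation,
# and the test fails if any come back.
import sqlalchemy

# Hypothetical warehouse connection; in practice read this from config.
engine = sqlalchemy.create_engine("postgresql://user:pass@dwh-host/analytics")

BAD_ORDERS_SQL = """
    SELECT order_id, total_amount
    FROM analytics.fct_orders
    WHERE total_amount < 0        -- totals can never be negative
       OR customer_id IS NULL     -- every order belongs to a customer
"""

def test_no_bad_orders():
    with engine.connect() as conn:
        bad_rows = conn.execute(sqlalchemy.text(BAD_ORDERS_SQL)).fetchall()
    # Fail loudly before anyone sees a wrong dashboard.
    assert not bad_rows, f"Found {len(bad_rows)} bad orders, e.g. {bad_rows[:5]}"
```

    Run this under pytest (or as a scheduler step) right after the transformation job, so bad data blocks the pipeline instead of reaching the dashboards.
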
  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

  • dbt is the current de facto standard for data transformation. It comes with solid best practices and a lot of nice features out of the box: from reusing code fragments via macros (e.g. maintaining an ignore-list in a single place instead of all over the codebase) to data tests, data dictionaries, and data lineage. It also has a vast ecosystem of add-on packages, apps, and integrations if you need more: Fivetran, for example, provides generic dbt models for some of their data sources, and there is even a code formatter for the monstrosity that is SQL+Jinja :-). There is also a very active community which is worth checking out!

  • superset

    Apache Superset is a Data Visualization and Data Exploration Platform

  • (Dashboards in Metabase and in Apache Superset)

  • streamlit

    Streamlit — A faster way to build and share data apps.

  • Finally, if you have to offer some kind of interactive UI on top of some data or a machine learning model, Streamlit or Flask are good starting points (a minimal sketch follows below).

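    As an illustration of how little code such a UI needs, here is a minimal Streamlit sketch; the CSV file and column names are hypothetical placeholders:

```python
# Minimal Streamlit sketch: an interactive UI on top of a dataset.
# Run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Orders explorer")

@st.cache_data  # cache the load so interactive reruns stay snappy
def load_orders() -> pd.DataFrame:
    # Hypothetical input file; this could just as well query the DWH.
    return pd.read_csv("orders.csv", parse_dates=["order_date"])

orders = load_orders()

# Streamlit reruns the script whenever the user changes a widget.
country = st.selectbox("Country", sorted(orders["country"].unique()))
filtered = orders[orders["country"] == country]

st.metric("Orders", len(filtered))
st.line_chart(filtered.set_index("order_date")["total_amount"])
```
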
  • dagster

    An orchestration platform for the development, production, and observation of data assets.

  • But if you have more sophisticated use cases, you usually end up with one of the following orchestration tools for your data pipelines: Airflow, Prefect, or Dagster. They all support the above list of features and much more. Airflow is the established tool, but it can be a beast to maintain, especially for personal dev environments, so one of the other options might fit your needs better (a minimal Dagster sketch follows below). I recommend getting a hosted version instead of setting it up yourself: the time spent maintaining the tool is better spent developing analytic pipelines, or improving the developer experience for the team.

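    To make the orchestration idea concrete, here is a minimal sketch using Dagster's asset API; the asset names and toy data are hypothetical:

```python
# Minimal Dagster sketch: two data assets where one depends on the other.
# Dagster infers the dependency from the function parameter name.
import pandas as pd
from dagster import asset, Definitions

@asset
def raw_orders() -> pd.DataFrame:
    # In reality this would come from your ingestion layer; toy data here.
    return pd.DataFrame({"order_id": [1, 2], "total_amount": [10.0, 25.5]})

@asset
def daily_revenue(raw_orders: pd.DataFrame) -> float:
    # Downstream asset: Dagster materializes raw_orders first and
    # passes its output in.
    return float(raw_orders["total_amount"].sum())

defs = Definitions(assets=[raw_orders, daily_revenue])
```
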
  • nodejs-bigquery

    Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.

  • The main DWH offerings that meet the above expectations are Snowflake, Google BigQuery, and Amazon Redshift. Feature-wise, the three offer similar functionality, but there are differences, e.g. how long it takes to spin up new compute resources or how much maintenance work they need. Cost-wise, they seem to end up with similar numbers on your bill, depending on which blog post you read. (A query sketch follows below.)

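    The post links the Node.js BigQuery client; as a sketch of what querying such a DWH looks like in practice, here is the equivalent with the Python client (google-cloud-bigquery). Project, dataset, and table names are hypothetical, and credentials are assumed to be set up via GOOGLE_APPLICATION_CREDENTIALS:

```python
# Minimal sketch of querying BigQuery from Python.
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment

QUERY = """
    SELECT DATE(order_date) AS day, SUM(total_amount) AS revenue
    FROM `my-project.analytics.fct_orders`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""

# client.query() submits the job; .result() waits for and iterates rows.
for row in client.query(QUERY).result():
    print(f"{row.day}: {row.revenue:.2f}")
```
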
  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • It is usually easier to pay for managed ingestion services than to set up and maintain such ingestion pipelines yourself, especially if your team simply does not have the required expertise. Available services include Fivetran, Stitch, and Airbyte. If using such a service becomes a cost issue, you can still switch to something custom-built, but at that point you will probably already have a whole team of data engineers.

    Beyond getting all the features built into the service without building them yourself, managed services have further advantages: you will not get surprised by changing upstream APIs, you do not need to maintain your own infrastructure, and these services have the edge-case handling figured out. They also partly shield you from changes in upstream APIs or database schemas (you still have to deal with those in your transformations, though).

    Another benefit is that managed ingestion services make it easy to exclude Personally Identifiable Information (PII) from reaching your DWH. This is sometimes the easiest way to make sure that this data does not leak to the wrong persons or, even worse, the internet. Try to keep all PII out of the DWH by default and only include it if needed (and secured!). A minimal sketch follows below.

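    As a sketch of what column-level PII exclusion can look like if you do roll your own, here is a minimal Python example; the column names are hypothetical, and managed services typically offer this as point-and-click column selection instead:

```python
# Minimal sketch of keeping PII out of the DWH by default: drop (or hash)
# sensitive columns before loading anything into the warehouse.
import hashlib
import pandas as pd

# Hypothetical deny-list of PII columns.
PII_COLUMNS = {"email", "full_name", "phone"}

def strip_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Drop PII columns; keep a stable pseudonymous key for joins."""
    out = df.drop(columns=list(PII_COLUMNS & set(df.columns)))
    if "email" in df.columns:
        # One-way hash so rows stay joinable without exposing the address.
        out["customer_key"] = df["email"].map(
            lambda e: hashlib.sha256(e.encode()).hexdigest()
        )
    return out
```
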
Related posts

  • Visions – User defined data type systems

    3 projects | /r/Python | 4 Feb 2022
  • Visions – User defined data type systems

    3 projects | /r/datascience | 4 Feb 2022
  • Show HN: Visions – User defined data type systems

    3 projects | news.ycombinator.com | 1 Feb 2022
  • How to Build a Logistic Regression Model: A Spam-filter Tutorial

    1 project | dev.to | 5 May 2024