mara-pipelines
citus
Metric | mara-pipelines | citus
---|---|---
Mentions | 3 | 61
Stars | 2,054 | 9,840
Growth | 0.4% | 3.6%
Activity | 6.0 | 9.4
Last commit | 5 months ago | 6 days ago
Language | Python | C
License | MIT License | GNU Affero General Public License v3.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
mara-pipelines
-
How to keep track of the different Transformations done in an ETL pipeline?
The closest I've found is Mara but not what I'm after.
-
Using PostgreSQL as a Data Warehouse
The tooling behind the approach has been built as a set of Python packages named Mara. It is available on GitHub:
https://github.com/mara/mara-pipelines
And additional packages can be found at the Mara org:
https://github.com/mara
-
Build your own “data lake” for reporting purposes
MinIO and NiFi require machines by themselves. Better off with pure Python, and if one wants something lightweight and visually pleasing, Mara [0] or Dagster with Dagit [1] will do the job.
[0] https://github.com/mara/mara-pipelines
[1] https://docs.dagster.io/tutorial/execute
citus
- SPQR 1.3.0: a production-ready system for horizontal scaling of PostgreSQL
- Citus: PostgreSQL extension that transforms Postgres into a distributed database
-
Figma's Databases team lived to tell the scale
I see they don't mention Citus (https://github.com/citusdata/citus), which is already a fairly mature native Postgres extension. From the details given in the article, it sounds like they just reimplemented it.
I wonder if they were unaware of it or disregarded it for a reason. I am currently in a similar situation to the one described in the blog, trying to shard a massive Postgres DB.
-
PostgreSQL Is Enough
It is possible, if you pay for it. You can do Multi-AZ Clustered Instances in RDS, where you get the benefits of Multi-AZ failover with traffic sharing.
If you can run your own infra – at least on an EC2 level – you can do things like Citus [0] for Postgres, which is about as close to "just add database nodes" as you'll get.
[0]: https://www.citusdata.com/
-
Vitess 18
So while searching for something like this for Postgres I came across Citus. Does anyone know how that stacks up?
https://github.com/citusdata/citus
- In-Depth Guide: Citus Technical Readme
-
Revolutionizing Database Scaling with CitusDB
References: CitusDB
- Squeeze the hell out of the system you have
- Show HN: Hydra 1.0 – open-source column-oriented Postgres
- Schema-based sharding comes to PostgreSQL with Citus
What are some alternatives?
abcd-hcp-pipeline - BIDS application for processing functional MRI data, robust to scanner, acquisition and age variability.
Greenplum - Greenplum Database - Massively Parallel PostgreSQL for Analytics. An open-source massively parallel data platform for analytics, machine learning and AI.
kuwala - Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We have set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Points of Interest from OpenStreetMap c) Google Popular Times
yugabyte-db - YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
pybaseball - Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
vitess - Vitess is a database clustering system for horizontal scaling of MySQL.
dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
etl-markup-toolkit - ETL Markup Toolkit is a Spark-native tool for expressing ETL transformations as configuration
dremio-oss - Dremio - the missing link in modern data
stolon - PostgreSQL cloud native High Availability and more.