Launch HN: Artie (YC S23) – Real time data replication to data warehouses

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • transfer

    Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.

  • Hey, I’m Robin and I’m the founder/CTO at Artie (https://www.artie.so/). We solve the problem of stale production data in the data warehouse. We’re open source, and you can try it for free here: https://github.com/artie-labs/transfer.

    Specifically, we do real time data replication from databases to data warehouses. We leverage change data capture (CDC) and stream processing to perform data transfers efficiently, enabling sub-minute latency and significant cost savings. Here’s a quick demo (jumping straight to the important part): https://www.youtube.com/watch?v=uAi1tm4gd9U#t=81s.

    I encountered this problem when I was a heavy data warehouse user at prior jobs. The data in our data warehouse was super lagged and analytics were always stale. Imagine a fintech company performing anti-fraud/transaction monitoring with batched ETLs and finding out that fraudulent transactions occurred 24 hours ago. This was very frustrating to me! Since my background was in distributed systems and database architecture, I knew that there was a better way to perform data transfers.

    The more teams I spoke with, the more I realized this was a real pain point. People wanted real time data for analytics, fraud alerting, transaction monitoring, and training AI/ML models, but there wasn’t an easy out-of-the-box solution. Companies were constrained on latency, schema integrity/data usability, or data accuracy. Companies started telling me that if I built a tool that was robust and near real time, but also maintained schema integrity and data accuracy, they would very happily pay for it.

    So I built Artie, a real time, open-source data streaming solution to transfer data from databases to data warehouses and handle schemas automatically in-flight (DMLs and DDLs).

    Typical ETL solutions leverage batched processes that run on a schedule (DAGs, Airflow), which cannot achieve real time data syncs. This means that when companies aggregate production data into their data warehouse, the underlying data is always stale.

    Streaming change data capture (CDC) logs is a more efficient way to transfer data, and helps lower networking/data ingestion costs as well. However, building data pipelines with CDC streaming is complicated. I wanted Artie to be the tool that abstracts away that complexity so that any company can benefit from having real time data.

    A big challenge is implementing CDC streaming for stateful data (i.e. row updates/deletes) and schema changes (DDLs), which most streaming solutions just don’t address, meaning that complexity is passed down to the customer. A lot of in-house streaming solutions leverage some combination of Debezium + Kafka/Kinesis + Apache Flink and are able to achieve near real time syncs, but they only handle append-only events (inserts) and don't handle row updates/deletes (DMLs) or schema evolution (DDLs). Not handling these means the data at the destination doesn't look exactly like the production database, which obscures the source of truth. You end up having to do additional work to make the data warehouse tables match the source DB.
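
    For context, here's roughly what a single change event looks like - a simplified, Debezium-style envelope sketched in Go (the struct and helper below are illustrative, not Artie's actual types). An append-only pipeline would just insert the "after" image as a new row, whereas faithful replication has to apply "u" as an update/merge and "d" as a delete on the primary key:

      package cdc

      import "encoding/json"

      // Simplified Debezium-style change event envelope (field names follow the
      // common Debezium format; treat this as an illustrative sketch).
      type ChangeEvent struct {
          Before map[string]any `json:"before"` // row image before the change (nil for inserts)
          After  map[string]any `json:"after"`  // row image after the change (nil for deletes)
          Op     string         `json:"op"`     // "c" = create, "u" = update, "d" = delete, "r" = snapshot read
          TsMs   int64          `json:"ts_ms"`  // when the connector processed the event
      }

      // Decode parses one event off the stream.
      func Decode(payload []byte) (ChangeEvent, error) {
          var evt ChangeEvent
          err := json.Unmarshal(payload, &evt)
          return evt, err
      }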

    So how do we offer a robust CDC streaming solution? We grab CDC logs using Debezium and/or our custom connector (which solves for certain edge cases that Debezium doesn’t handle) and push them into Kafka (or Google Pub/Sub). Kafka helps ensure ordering and ease of recovery upon an outage - we use one topic per table, and the partition key is the primary key(s) to ensure no out-of-order writes. Artie then consumes these events from Kafka into an in-memory store, where our typing library infers schemas (DML and DDL), performs optimizations like deduplication, and then flushes the data to the data warehouse. When the data warehouse confirms a successful merge, we commit the offset within Kafka. This all happens with sub-minute data latency even at high volumes (several TBs or billions of rows).
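
    Here's a minimal sketch of that consume / dedupe / flush / commit loop, assuming the segmentio/kafka-go client; the broker address, topic/group names, and flushToWarehouse helper are placeholders, not Artie's actual code:

      package main

      import (
          "context"
          "time"

          "github.com/segmentio/kafka-go"
      )

      // flushToWarehouse stands in for staging the rows and running a MERGE on
      // the primary key in Snowflake/BigQuery/Redshift.
      func flushToWarehouse(rows map[string][]byte) error { return nil }

      func main() {
          ctx := context.Background()

          // One topic per table; the message key is the primary key, so every
          // event for a given row lands on the same partition and stays ordered.
          reader := kafka.NewReader(kafka.ReaderConfig{
              Brokers: []string{"localhost:9092"},
              GroupID: "transfer",
              Topic:   "dbserver1.public.orders",
          })
          defer reader.Close()

          buffer := map[string][]byte{} // latest event per primary key (deduplication)
          var pending []kafka.Message
          lastFlush := time.Now()

          for {
              fetchCtx, cancel := context.WithTimeout(ctx, time.Second)
              msg, err := reader.FetchMessage(fetchCtx)
              cancel()
              if err == nil {
                  // A later event for the same key overwrites the earlier one,
                  // so each flush merges at most one version per row.
                  buffer[string(msg.Key)] = msg.Value
                  pending = append(pending, msg)
              }

              if time.Since(lastFlush) < 30*time.Second || len(buffer) == 0 {
                  continue
              }
              if err := flushToWarehouse(buffer); err != nil {
                  continue // don't commit; these events get re-processed (head-of-line blocking)
              }
              // Commit offsets only after the warehouse confirms the merge.
              if err := reader.CommitMessages(ctx, pending...); err != nil {
                  continue
              }
              buffer, pending, lastFlush = map[string][]byte{}, nil, time.Now()
          }
      }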

    However the data looks in your database, it should look exactly the same in your data warehouse. We also strive to handle all data types with no exceptions - i.e. supporting TOAST columns, composite primary keys, arrays, large rows (>1 MB), etc. (I say “strive” because I’m sure we haven’t seen all possible data types yet!).
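
    One concrete example of the TOAST case: when a large TOASTed column is unchanged, Postgres doesn't include its value in the WAL entry, so Debezium emits a placeholder string (by default "__debezium_unavailable_value", as I recall) instead of the real value, and the merge has to keep the destination's existing value for that column rather than overwriting it. A minimal sketch of that check (the helper is illustrative, not Artie's code):

      package merge

      // Debezium's placeholder for TOAST column values it couldn't read.
      const toastPlaceholder = "__debezium_unavailable_value"

      // mergeValue decides, per column, whether to take the incoming CDC value
      // or keep what is already stored in the warehouse row.
      func mergeValue(existing, incoming any) any {
          if s, ok := incoming.(string); ok && s == toastPlaceholder {
              return existing // column wasn't re-sent; keep the current value
          }
          return incoming
      }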

    We’ve been live for a while now. Several companies use us to update all their analytic dashboards in real time (with dashboards built on top of their data warehouse). Fintech platforms use us to perform financial transaction monitoring. A utilities software platform uses us to grab video/photo data to perform risk/hazard assessment against ML models.

    Artie is live and supports PostgreSQL, MySQL, MongoDB sources and Snowflake, BigQuery, Redshift destinations. We make money from our hosted service. We charge based on usage (# of rows transferred per month, not including initial snapshots).

    We’d love for you to try it out! You can get started with the open source version here: https://github.com/artie-labs/transfer. We have a small OSS Slack community (www.artie.so/slack) – feel free to ping us for help or any other requests!

    For our hosted version, we need to ensure that we have enough storage and compute capacity provisioned, so we’re asking cloud users to hop on a quick call with us before we activate your account. Eventually we’ll have easy self-serve functionality but that’s not 100% built yet, so for now we set up a Slack channel to ensure smooth deployments. If you’re willing to work with us on that, we’ll be super excited to show you what we’ve got. Just email us at [email protected] or request access from https://www.artie.so.

    We’d love for you to try the OSS or hosted solution and give us feedback! We’re eager to improve the product and test it against various workloads and data types :)

  • universal-data

    Lightweight ELT/ETL that's perfect for API data extraction and transformation.

  • My goal was to leverage LLMs to auto-build connectors on the fly.

    I started by building a lightweight ETL, that I open-sourced (https://github.com/BenderV/universal-data).

    I left the space because I realized that I didn't want to work on this problem, even though I believe in the "AI" approach and think simplifying data transfer (or distributing compute) is one of the key factors to scaling data usage.

  • Thanks for the comment!

    Your comment regarding DDL is interesting.

    Today, this is what happens:

    1/ If a column doesn't exist in the destination, we'll create it based on our typing inference from the data type (important: not the data value).

    2/ Certain tools will handle automatic column data type conversion if a change like this is detected at the source. We do not do this. We simply hard fail and cause head-of-line blocking. The reasoning: this is an anti-pattern and should be rare, so it's okay to raise an error and require manual intervention for this breaking change.

    3/ If the column has been dropped from the source, you as the end user can decide whether this column should also be dropped in the destination or not. The default is not to drop it.

    ^ We hear from more customers that they explicitly don't want columns to be dropped, because it could cause downstream errors, such as other views/tables failing to compile due to referencing a non-existent column.

    We haven't heard much from folks who don't even want columns to be added. If there is a need, we can definitely add that as a config option to provide maximum configurability. (A rough sketch of this decision logic is below.)
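
    Roughly, rules 1/ through 3/ above boil down to something like this (an illustrative sketch - the types and the config flag are hypothetical, not Artie's actual configuration):

      package schema

      import "fmt"

      type Column struct{ Name, Type string }

      type Config struct {
          DropDeletedColumns bool // default false: keep columns dropped at the source
      }

      // reconcile compares source and destination schemas and returns the DDL to
      // run, or an error for a breaking change that needs manual intervention.
      func reconcile(src, dst map[string]Column, cfg Config) ([]string, error) {
          var ddl []string
          for name, col := range src {
              existing, ok := dst[name]
              switch {
              case !ok:
                  // 1/ Column missing in the destination: create it from the inferred type.
                  ddl = append(ddl, fmt.Sprintf("ALTER TABLE t ADD COLUMN %s %s", name, col.Type))
              case existing.Type != col.Type:
                  // 2/ Data type changed at the source: hard fail (head-of-line blocking).
                  return nil, fmt.Errorf("column %s changed type: %s -> %s", name, existing.Type, col.Type)
              }
          }
          // 3/ Column dropped at the source: dropping it downstream is opt-in.
          if cfg.DropDeletedColumns {
              for name := range dst {
                  if _, ok := src[name]; !ok {
                      ddl = append(ddl, fmt.Sprintf("ALTER TABLE t DROP COLUMN %s", name))
                  }
              }
          }
          return ddl, nil
      }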

    > Finally, the biggest issue with CDC always ends up being the seed loads, recoveries and the incremental snapshot strategies.

    Yep totally. On the recovery bit, this is exactly why we are leveraging Kafka. If there are any particular issues, we simply don't commit the offset and cause head-of-line blocking.

    On the incremental snapshot and recoveries bit, we primarily leverage Debezium's DDD-3 high watermark strategy [1] for MySQL and MongoDB. Postgres has a different issue in that replication slots can grow really fast, esp on AWS! [2]. We ended up writing our own custom snapshotter for Postgres that is Debezium compatible to onboard customers that have a massive dataset and cannot afford to have a read lock on their WAL.

    [1] https://github.com/debezium/debezium-design-documents/blob/m...
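
    For anyone unfamiliar with the watermark approach: each snapshot chunk is read between a low and a high watermark written into the change stream, and any chunk row whose primary key was also touched by a change event inside that window gets dropped, since the change stream already carries a newer version of it. A rough sketch of that reconciliation step (names are illustrative, not Debezium's or Artie's APIs):

      package snapshot

      type Row map[string]any

      // reconcileChunk merges one snapshot chunk with the change events that
      // arrived between the low and high watermark. Change events win: chunk
      // rows whose keys were modified in the window are skipped, because a
      // newer version has already been emitted via the change stream.
      func reconcileChunk(chunk map[string]Row, changedKeys map[string]struct{}) []Row {
          out := make([]Row, 0, len(chunk))
          for pk, row := range chunk {
              if _, modified := changedKeys[pk]; modified {
                  continue
              }
              out = append(out, row) // emit as a snapshot ("r") read event
          }
          return out
      }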


Related posts

  • Migrate mongodb Datawarehouse to snowflake

    1 project | /r/snowflake | 4 Dec 2023
  • Preventing replication slot overflow on Postgres DB (AWS RDS)

    1 project | news.ycombinator.com | 11 Sep 2023
  • Preventing WAL Growth on Postgres DB Running on AWS RDS

    1 project | news.ycombinator.com | 10 Sep 2023
  • Artie Transfer: Real time data replication from transactional databases to Snowflake

    1 project | /r/snowflake | 20 Jan 2023
  • Ask HN: Best way to mirror a Postgres database to parquet?

    1 project | news.ycombinator.com | 10 Apr 2024