getting-started vs orchest

getting-started

This repository is a getting started guide to Singer. (by singer-io)

singer.io

orchest

Build data pipelines, the easy way 🛠️ (by orchest)

Data Science, Machine Learning, Pipelines, IDE, Jupyter, Cloud, self-hosted, Jupyterlab, Notebooks, Docker, Python, data-pipelines, orchest, Deployment, Kubernetes, Airflow, Dag, ETL, etl-pipeline

orchest.readthedocs.io

Project          getting-started     orchest
Mentions         16                  44
Stars            1,220               4,022
Growth           0.0%                0.1%
Activity         0.0                 4.5
Latest Commit    about 1 year ago    11 months ago
Language         Makefile            TypeScript
License          -                   Apache License 2.0

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

getting-started

Posts with mentions or reviews of getting-started. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-05-04.

Why do companies still build data ingestion tooling instead of using a third-party tool like Airbyte?
1 project | /r/dataengineering | 6 Dec 2023

Coincidentally, I saw a presentation today on a nice halfway-house solution: using embeddable Python libraries like Sling and dlt, both open-source. See https://www.youtube.com/watch?v=gAqOLgG2iYY There is also singer.io, which is more of a protocol than a library but can also be installed, although it looks like a true community effort and is not so well maintained.
Data sources episode 2: AWS S3 to Postgres Data Sync using Singer
2 projects | dev.to | 4 May 2023

Singer is an open-source framework for data ingestion, which provides a standardized way to move data between various data sources and destinations (such as databases, APIs, and data warehouses). Singer offers a modular approach to data extraction and loading by leveraging two main components: Taps (data extractors) and Targets (data loaders). This design makes it an attractive option for data ingestion for several reasons:
Design pattern for Python ETL
2 projects | /r/dataengineering | 2 Dec 2022
Launch HN: Patterns (YC S21) – A much faster way to build and deploy data apps
6 projects | news.ycombinator.com | 30 Nov 2022

Thanks for chipping in.
I’ve been leaning in this direction. I think I/O is the biggest part that still needs fixing in the case of plain code steps: input being data/stream plus parameterization/config, and output being some sort of typed data/stream.
My “let’s not reinvent the wheel” alarm is going off when I write that though. Examples that come to mind are text-based (Unix / https://scale.com/blog/text-universal-interface) but also the Singer tap protocol (https://github.com/singer-io/getting-started/blob/master/doc...). And config obviously has many standard forms, like ini, yaml, json, environment key-value pairs and more.
At the same time, text feels horribly inefficient as encoding for some of the data objects being passed around in these flows. More specialized and optimized binary formats come to mind (Arrow, HDF5, Protobuf).
Plenty of directions to explore, each with their own advantages and disadvantages. I wonder which direction is favored by users of tools like ours. Will be good to poll (do they even care?).
PS Windmill looks equally impressive! Nice job
After Airflow. Where next for DE?
13 projects | /r/dataengineering | 15 Nov 2022

Mage uses the Singer Spec (https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md), the data engineering community standard for building data integrations. This was created by Stitch and is widely adopted.
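
For readers unfamiliar with the Singer Spec mentioned in the posts above, here is a minimal sketch of the idea: a tap writes JSON messages of type SCHEMA, RECORD, and STATE, one per line, and a target reads them. The stream name `users`, the field names, the `last_id` bookmark, and the helper functions `tap_messages` and `run_target` are hypothetical and only illustrate the message flow; this is not code from the Singer repository.

```python
import json
from typing import Iterable, Iterator


def tap_messages(stream: str, rows: Iterable[dict]) -> Iterator[dict]:
    """Yield Singer-style messages for one stream: SCHEMA, RECORDs, then STATE."""
    # SCHEMA describes the records that follow (a JSON Schema document).
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    }
    last_id = None
    for row in rows:
        # Each RECORD carries one row of data for the named stream.
        yield {"type": "RECORD", "stream": stream, "record": row}
        last_id = row["id"]
    # STATE is a free-form bookmark; the "last_id" layout is an assumption.
    yield {"type": "STATE", "value": {stream: {"last_id": last_id}}}


def run_target(lines: Iterable[str]):
    """Toy target: parse JSON lines, collect records per stream, keep last state."""
    records = {}
    state = None
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            records.setdefault(msg["stream"], []).append(msg["record"])
        elif msg["type"] == "STATE":
            state = msg["value"]
    return records, state


def demo():
    """Serialize tap output as JSON lines and feed it to the toy target."""
    rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    lines = [json.dumps(m) for m in tap_messages("users", rows)]
    return lines, run_target(lines)


if __name__ == "__main__":
    for line in demo()[0]:
        print(line)
```

In real use, a tap and a target are separate programs connected by a pipe, so each side can be swapped independently; the in-process `demo` function here only stands in for that pipe.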
Basic data engineering question.
2 projects | /r/dataengineering | 16 Oct 2022

I like the Singer Protocol, and the various tools that use it. These include meltano, airbyte, stitch, pipelinewise, and a few others
I have hundreds of API data endpoints with different schemas. How do I organize?
1 project | /r/dataengineering | 10 Oct 2022

Have you looked into using a dedicated data integration tool? Have you heard of Singer and the Singer Spec? https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md
CDC (Change Data Capture) with 3rd party APIs
1 project | /r/dataengineering | 23 Sep 2022

Or you could build your own such system and run it on Airflow, Prefect, Dagster, etc. Check out the Singer project for a suite of Python packages designed for such a task. Quality varies greatly, though.
Questions about Integration Singer Specification with AWS Glue
1 project | /r/dataengineering | 26 Aug 2022

Our team is building out a data platform on AWS Glue, and we pull from a variety of data sources including application databases and third-party SaaS APIs. I have been looking into ways to standardize pulling data from different sources. The other day I came across the [Singer Specification](https://github.com/singer-io/getting-started) and was interested in learning more about it. If anyone has experience working with Singer specifications, I would love to hear more about:
Anybody have experience creating singer taps and targets?
1 project | /r/dataengineering | 30 Jan 2022

I just read the readme of the Singer getting started repo and am excited to write my first tap! I’m thinking instead of writing a new Airflow DAG whenever I want to pipe API data into our data warehouse I could write a singer tap and use Stitch instead. Is that a stupid idea?

orchest

Posts with mentions or reviews of orchest. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-12-06.

Decent low code options for orchestration and building data flows?
1 project | /r/dataengineering | 23 Dec 2022

You can check out our OSS https://github.com/orchest/orchest
Build ML workflows with Jupyter notebooks
1 project | /r/programming | 23 Dec 2022
Building container images in Kubernetes, how would you approach it?
2 projects | /r/kubernetes | 6 Dec 2022

The code example is part of our ELT/data pipeline tool called Orchest: https://github.com/orchest/orchest/
Launch HN: Patterns (YC S21) – A much faster way to build and deploy data apps
6 projects | news.ycombinator.com | 30 Nov 2022

First want to say congrats to the Patterns team for creating a gorgeous looking tool. Very minimal and approachable. Massive kudos!
Disclaimer: we're building something very similar and I'm curious about a couple of things.
One of the questions our users have asked us often is how to minimize the dependence on "product specific" components/nodes/steps. For example, if you write CI for GitHub Actions you may use a bunch of GitHub Action references.
Looking at the `graph.yml` in some of the examples you shared you use a similar approach (e.g. patterns/openai-completion@v4). That means that whenever you depend on such components your automation/data pipeline becomes more tied to the specific tool (GitHub Actions/Patterns), effectively locking in users.
How are you helping users feel comfortable with that problem (I don't want to invest in something that's not portable)? It's something we've struggled with ourselves as we're expanding the "out of the box" capabilities you get.
Furthermore, would have loved to see this as an open source project. But I guess the second best thing to open source is some open source contributions and `dcp` and `common-model` look quite interesting!
For those who are curious, I'm one of the authors of https://github.com/orchest/orchest
Argo became a graduated CNCF project
3 projects | /r/kubernetes | 27 Nov 2022

Haven't tried it. In its favor, Argo is vendor neutral and is really easy to set up in a local k8s environment like docker for desktop or minikube. If you already use k8s for configuration, service discovery, secret management, etc, it's dead simple to set up and use (avoiding having to learn a whole new workflow configuration language in addition to k8s). The big downside is that it doesn't have a visual DAG editor (although that might be a positive for engineers having to fix workflows written by non-programmers), but the relatively bare-metal nature of Argo means that it's fairly easy to use it as an underlying engine for a more opinionated or lower-code framework (orchest is a notable one out now).
Ideas for infrastructure and tooling to use for frequent model retraining?
1 project | /r/mlops | 9 Sep 2022
Looking for a mentor in MLOps. I am a lead developer.
1 project | /r/mlops | 25 Aug 2022

If you’d like to try something for your data workflows that’s vendor agnostic (k8s based) and open source, you can check out our project: https://github.com/orchest/orchest
Is there a good way to trigger data pipelines by event instead of cron?
1 project | /r/dataengineering | 23 Aug 2022

You can find it here: https://github.com/orchest/orchest Convenience install script: https://github.com/orchest/orchest#installation
How do you deal with parallelising parts of an ML pipeline especially on Python?
5 projects | /r/mlops | 12 Aug 2022

We automatically provide container level parallelism in Orchest: https://github.com/orchest/orchest
Launch HN: Sematic (YC S22) – Open-source framework to build ML pipelines faster
1 project | news.ycombinator.com | 10 Aug 2022

For people in this thread interested in what this tool is an alternative to: Airflow, Luigi, Kubeflow, Kedro, Flyte, Metaflow, Sagemaker Pipelines, GCP Vertex Workbench, Azure Data Factory, Azure ML, Dagster, DVC, ClearML, Prefect, Pachyderm, and Orchest.
Disclaimer: author of Orchest https://github.com/orchest/orchest

What are some alternatives?

When comparing getting-started and orchest you can also consider the following projects:

airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

docker-airflow - Docker Apache Airflow

AWS Data Wrangler - pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

hookdeck-cli - Receive events (e.g. webhooks) in your development environment

meltano

ploomber - The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

tap-hubspot

n8n - Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.

Mage - 🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

label-studio - Label Studio is a multi-type data labeling and annotation tool with standardized output format

tap-spreadsheets-anywhere

Node RED - Low-code programming for event-driven applications
