getting-started
Mage
| | getting-started | Mage |
|---|---|---|
| Mentions | 16 | 77 |
| Stars | 1,220 | 7,001 |
| Growth | 0.1% | 5.6% |
| Activity | 0.0 | 9.9 |
| Latest commit | about 1 year ago | 5 days ago |
| Language | Makefile | Python |
| License | - | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
getting-started
-
Why do companies still build data ingestion tooling instead of using a third-party tool like Airbyte?
Coincidentally, I saw a presentation today on a nice half-way-house solution: using embeddable Python libraries like Sling and dlt, both open-source. See https://www.youtube.com/watch?v=gAqOLgG2iYY There is also singer.io, which is more of a protocol than a library but can also be installed; it looks like a true community effort and not so well maintained.
-
Data sources episode 2: AWS S3 to Postgres Data Sync using Singer
Singer is an open-source framework for data ingestion, which provides a standardized way to move data between various data sources and destinations (such as databases, APIs, and data warehouses). Singer offers a modular approach to data extraction and loading by leveraging two main components: Taps (data extractors) and Targets (data loaders). This design makes it an attractive option for several reasons:
- Design pattern for Python ETL
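For readers new to the Spec, the Tap/Target decoupling described above works roughly like this: a tap writes newline-delimited JSON messages (SCHEMA, RECORD, STATE) to stdout, and a target reads them from stdin, so any tap can be piped into any target. A minimal sketch follows; the `users` stream and its fields are invented for illustration.

```python
# Minimal sketch of the Singer message flow: a tap emits newline-delimited
# JSON messages on stdout, and any target consumes them on stdin.
# The "users" stream and its fields are made-up illustration data.
import json
import sys

def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")

# 1. Describe the stream once.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {"properties": {"id": {"type": "integer"}, "email": {"type": "string"}}},
    "key_properties": ["id"],
})

# 2. Emit one message per extracted row.
for row in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": row})

# 3. Checkpoint progress so the next run can resume incrementally.
emit({"type": "STATE", "value": {"users": {"last_id": 2}}})
```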
-
Launch HN: Patterns (YC S21) – A much faster way to build and deploy data apps
Thanks for chipping in.
I've been leaning towards this direction. I think I/O is the biggest part that still needs fixing in the case of plain code steps: input being data/stream plus parameterization/config, and output being some sort of typed data/stream.
My 'let's not reinvent the wheel' alarm is going off as I write that, though. Examples that come to mind are text based (Unix / https://scale.com/blog/text-universal-interface) but also the Singer tap protocol (https://github.com/singer-io/getting-started/blob/master/doc...). And config obviously has many standard forms: ini, yaml, json, environment key-value pairs, and more.
At the same time, text feels horribly inefficient as encoding for some of the data objects being passed around in these flows. More specialized and optimized binary formats come to mind (Arrow, HDF5, Protobuf).
Plenty of directions to explore, each with their own advantages and disadvantages. I wonder which direction is favored by users of tools like ours. Will be good to poll (do they even care?).
PS Windmill looks equally impressive! Nice job
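To make the text-versus-binary trade-off from the comment above concrete, here is a rough sketch that serializes the same invented rows as JSON lines (the Singer-style text encoding) and as an Arrow IPC stream. It assumes `pyarrow` is installed and is only meant to illustrate the idea, not to benchmark either format.

```python
# Rough sketch of text vs. binary interchange: the same rows serialized as
# JSON lines (easy to pipe and debug) and as a columnar Arrow IPC stream
# (cheaper to parse and usually smaller). The data is invented.
import json
import pyarrow as pa

rows = [{"id": i, "value": i * 0.5} for i in range(10_000)]

# Text encoding: one JSON object per line.
json_bytes = "\n".join(json.dumps(r) for r in rows).encode()

# Binary encoding: an Arrow record-batch stream.
table = pa.table({"id": [r["id"] for r in rows], "value": [r["value"] for r in rows]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
arrow_bytes = sink.getvalue()

print(f"JSON lines: {len(json_bytes):,} bytes, Arrow IPC: {arrow_bytes.size:,} bytes")
```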
-
After Airflow. Where next for DE?
Mage uses the Singer Spec (https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md), the data engineer community standard for building data integrations. This was created by Stitch and is widely adopted.
-
Basic data engineering question.
I like the Singer Protocol, and the various tools that use it. These include meltano, airbyte, stitch, pipelinewise, and a few others.
-
I have hundreds of API data endpoints with different schemas. How do I organize?
Have you looked into using a dedicated data integration tool? Have you heard of Singer and the Singer Spec? https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md
-
CDC (Change Data Capture) with 3rd party APIs
Or you could build your own such system and run it on Airflow, Prefect, Dagster, etc. Check out the Singer project for a suite of Python packages designed for such a task. Quality varies greatly, though.
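As a hedged sketch of the build-your-own route, here is what running a Singer tap-to-target pipe from an Airflow task could look like. The tap, target, config paths, and schedule are placeholder choices (the syntax assumes a recent Airflow 2.x), not something recommended in the thread itself.

```python
# Sketch: an Airflow DAG that shells out to a Singer tap piped into a target.
# Config paths and the DAG name are hypothetical; incremental runs driven by
# the tap's state bookmark are what approximate CDC for third-party APIs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="crm_cdc_sync",             # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",                # frequent incremental pulls
    catchup=False,
) as dag:
    sync = BashOperator(
        task_id="tap_to_warehouse",
        # --state lets the tap resume from the last bookmark on each run.
        bash_command=(
            "tap-exchangeratesapi --config /opt/singer/tap_config.json "
            "--state /opt/singer/state.json "
            "| target-postgres --config /opt/singer/target_config.json"
        ),
    )
```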
-
Questions about Integration Singer Specification with AWS Glue
Our team is building out a data platform on AWS Glue, and we pull from a variety of data sources, including application databases and third-party SaaS APIs. I have been looking into ways to standardize pulling data from different sources. The other day I came across the [Singer Specification](https://github.com/singer-io/getting-started) and was interested in learning more about it. If anyone has experience working with Singer specifications, I would love to hear more about:
-
Anybody have experience creating singer taps and targets?
I just read the README of the Singer getting-started repo and am excited to write my first tap! I'm thinking that instead of writing a new Airflow DAG whenever I want to pipe API data into our data warehouse, I could write a Singer tap and use Stitch instead. Is that a stupid idea?
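For anyone in the same position, a first tap can stay quite small when built on the `singer` helper package (singer-python), which wraps the raw message format shown earlier. The sketch below uses an invented `orders` stream and a fake API call, so treat it as a starting shape rather than a working integration.

```python
# Minimal sketch of a hand-rolled tap using the singer-python helpers.
# The "orders" stream, its schema, and fetch_orders() are invented examples.
import singer

SCHEMA = {
    "properties": {
        "order_id": {"type": "integer"},
        "total": {"type": "number"},
    }
}

def fetch_orders():
    # Placeholder for a real API client; yields plain dicts.
    yield {"order_id": 1, "total": 42.0}
    yield {"order_id": 2, "total": 13.5}

def main():
    singer.write_schema("orders", SCHEMA, key_properties=["order_id"])
    last_id = 0
    for order in fetch_orders():
        singer.write_record("orders", order)
        last_id = max(last_id, order["order_id"])
    # Bookmark so Stitch (or any runner) can resume incrementally next time.
    singer.write_state({"orders": {"last_order_id": last_id}})

if __name__ == "__main__":
    main()
```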
Mage
- FLaNK AI - April 22, 2024
-
A mage on the Hero's Journey: a fantasy epic on how a startup rose from the ashes
In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
-
Data sources episode 2: AWS S3 to Postgres Data Sync using Singer
Link to original blog: https://www.mage.ai/blog/data-sources-ep-2-aws-s3-to-postgres-data-sync-using-singer
-
What are some open-source ML pipeline managers that are easy to use?
I would recommend the following:
- https://www.mage.ai/
- https://dagster.io/
- https://www.prefect.io/
- https://metaflow.org/
- https://zenml.io/home
-
Mage Battlegrounds: Craft insights from real-time customer behavior analysis
You're invited to participate in the very first Mage Battlegrounds: Craft insights from real-time customer behavior analysis, a 24-hour virtual hackathon hosted by Shashank Mishra! This data engineering competition will take place on Saturday, April 15, 2023 beginning at 11am (PST). This will be a global event open to all participants who register.
-
Looking for an open-source project
Try this feature: https://github.com/mage-ai/mage-ai/issues/1166
-
Daskqueue: Dask-based distributed task queue
Seeing if we can use it in https://github.com/mage-ai/mage-ai
-
Data Pipeline on a Shoestring
That being said, there's a solid family of services just breaking ground that make local pipeline deployment easier. Check out https://www.mage.ai, which does have a clear path to cloud deployment of locally developed pipes; it just isn't well documented yet. There is also https://www.neuronsphere.io, which doesn't have a public solution YET (they're internally testing an alpha), but they built a cloud-deployable solution for their paying customers and are working to release one for freemium use.
-
Trending ML repos of the week
7️⃣ mage-ai/mage-ai
-
Delta without using Spark
Yes, check out how Mage does it: https://github.com/mage-ai/mage-ai/tree/master/mage_integrations/mage_integrations/destinations/delta_lake_s3
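For context on the no-Spark approach in general, a hedged sketch using the `deltalake` Python package (delta-rs bindings) is below. The bucket path and credentials are placeholders, and this illustrates the technique rather than Mage's exact destination code linked above.

```python
# Sketch: writing a Delta table without Spark via delta-rs.
# The table location and credentials are hypothetical placeholders.
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "new", "shipped"]})

write_deltalake(
    "s3://my-bucket/orders_delta",    # hypothetical table location
    df,
    mode="append",                    # "overwrite" is also supported
    storage_options={
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
        "AWS_REGION": "us-east-1",
    },
)
```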
What are some alternatives?
airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
dagster - An orchestration platform for the development, production, and observation of data assets.
AWS Data Wrangler - pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
vscode-dvc - Machine learning experiment tracking and data versioning with DVC extension for VS Code
meltano
sqlmesh - Efficient data transformation and modeling framework that is backwards compatible with dbt.
tap-hubspot
mito - The mitosheet package, trymito.io, and other public Mito code.
tap-spreadsheets-anywhere
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
singer-sdk
Data-Science-Roadmap - Data Science Roadmap from A to Z