amundsen
OpenLineage
Metric | amundsen | OpenLineage |
---|---|---|
Mentions | 7 | 5 |
Stars | 4,276 | 1,580 |
Growth | 1.5% | 3.0% |
Activity | 7.8 | 9.8 |
Latest commit | 16 days ago | 3 days ago |
Language | Python | Java |
License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
amundsen
- Quick Start Guide to Amundsen Demo 🚀
We'll be using WSL2 for this guide, and we'll start by cloning this repo and its submodules:
- Apache Atlas or OpenMetaData?
You can use the Amundsen data builder to send data to Apache Atlas: https://github.com/amundsen-io/amundsen/blob/main/databuilder/example/scripts/sample_atlas_search_extractor.py If you don't have to configure Apache Atlas, then why not. But the last time I used it, server-side validation was absent: you couldn't validate the JSON body sent to the REST API endpoints.
- Searching for Delta Lake Cataloging
Other than that, maybe you could try Amundsen (https://github.com/amundsen-io/amundsen/issues/608), which now has a connector to extract Delta Lake metadata via Spark.
- Help with Data Discoverability in a Data Lake
- Launch YC S21: Meet the Batch, Thread #6
How does it differ from something like Amundsen: https://github.com/amundsen-io/amundsen
- Metadata and how to capture it
Metadata Engine:
  - Datahub https://github.com/linkedin/datahub
  - Amundsen https://github.com/amundsen-io/amundsen/
  - Marquez https://marquezproject.github.io/
  - Egeria - Open Metadata and Governance https://egeria.odpi.org
- The State of Data Engineering in 2021
A final category worth highlighting is Discovery, where it seems every notable company developed an internal Data Catalogue tool that now is available as an open-source or paid service. Some examples are Amundsen (Lyft), Datahub (LinkedIn), Metacat (Netflix), Databook (Uber), and Dataportal (Airbnb).
OpenLineage
- What actually is master data management and what do MDM tools do?
There's OpenLineage, which I've never used, but it looks reasonably good and, according to its GitHub page, integrates with Spark, Airflow, Dagster, and dbt.
- Field Lineage
Column-level lineage in OpenLineage is in its early days. There's support in the spec for it, and the integration with Spark currently emits column-level metadata. You can see the facet definition here.
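To make the column-level facet mentioned above more concrete, here is a minimal sketch that assembles a `columnLineage` dataset facet as a plain Python dict, without using any OpenLineage client library. The layout (a `fields` map from output column to a list of `inputFields`, each with `namespace`, `name`, and `field`) follows the published facet definition as best I understand it. The helper name `column_lineage_facet`, the producer URI, the schema URL, and the dataset names are all placeholders assumed for illustration, so check them against the spec version you actually use.

```python
import json


def column_lineage_facet(mapping, producer, schema_url):
    """Build a columnLineage dataset facet as a plain dict.

    mapping: {output_column: [(namespace, dataset_name, input_column), ...]}
    producer / schema_url: caller-supplied values (placeholders in this sketch).
    """
    fields = {}
    for out_col, inputs in mapping.items():
        fields[out_col] = {
            "inputFields": [
                {"namespace": ns, "name": ds, "field": col}
                for ns, ds, col in inputs
            ]
        }
    return {"_producer": producer, "_schemaURL": schema_url, "fields": fields}


# Hypothetical example: the output column "total" is derived from two
# columns of an "orders" dataset. All names and URIs are made up.
example = column_lineage_facet(
    {"total": [("warehouse", "orders", "price"),
               ("warehouse", "orders", "quantity")]},
    producer="https://example.com/my-job",                    # placeholder
    schema_url="https://example.com/ColumnLineageFacet.json",  # placeholder
)
print(json.dumps(example, indent=2))
```

In a real event this dict would sit under an output dataset's `facets` key as `columnLineage`; integrations such as the Spark one emit it for you, so hand-building it is mainly useful for understanding or testing the shape.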
- Metadata and how to capture it
Data Lineage Specification: OpenLineage https://github.com/OpenLineage/OpenLineage
- Is Airflow a passé? What replaces it?
- OpenLineage
What are some alternatives?
datahub - The Metadata Platform for your Data Stack
marquez - Collect, aggregate, and visualize a data ecosystem's metadata
dagster - An orchestration platform for the development, production, and observation of data assets.
metacat
hamilton - A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
sickbeard_mp4_automator - Automatically convert video files to a standardized format with metadata tagging to create a beautiful and uniform media library
hamilton - Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.
Medusa - Building blocks for digital commerce
amundsendatabuilder - Data ingestion library for Amundsen to build graph and search index
ytmdl - A simple app to get songs from YouTube in mp3 format with artist name, album name etc from sources like iTunes, Spotify, LastFM, Deezer, Gaana etc.