datahub
OpenLineage
Metric | datahub | OpenLineage
---|---|---
Mentions | 34 | 5
Stars | 9,168 | 1,568
Growth | 1.9% | 2.3%
Activity | 9.9 | 9.7
Latest commit | 5 days ago | 6 days ago
Language | Java | Java
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
datahub
- ODD Platform - An open-source data discovery and observability service - v0.12 release
- What data governance tool are you folks using?
I’m a huge fan of DataHub, the open source data catalogue spun out of LinkedIn, but it’s best thought of as an observability layer for data assets that can be shared by data engineers and analyst types. For data users, it’s a stellar search/discovery interface (which datasets match this keyword, which are most broadly used across the organization, what downstream products are made with this data, what it’s usually joined to, whether its upstream pipelines are reliable). For data engineers, it’s a comprehensive asset cataloger, crawling your warehouse, orchestrator, modeling layers, features, and reports, and matching the lineage into a graph where it can.
- Our data catalog is difficult to manage and not built for the wider org - what can we do?
- Looking for an "offline" data discovery platform
What I am looking for is a solution (similar to Amundsen or [Datahub](https://datahubproject.io/)) that also allows adding tables and their metadata manually.
- Looking for an open-source data lineage app, where objects and connections can be manually defined (not just automatically ingested)
Hello everyone, I'm looking for an open-source data lineage app (e.g. tokern, datahubproject, openmetadata).
- Recommended Data Governance solution for smaller businesses?
Check out https://datahubproject.io/ or https://open-metadata.org. Both have a free version to try.
- Metadata Store - Which one to Choose? OpenMetadata vs Datahub?
We use Kubernetes as our deployment platform. Any feedback on any of these open source data catalogs? https://atlas.apache.org/#/, https://opendatadiscovery.org/, https://open-metadata.org/, https://marquezproject.github.io/marquez/, https://datahubproject.io/, https://www.amundsen.io/, https://ckan.org/, https://magda.io/
- What’s your process for deploying a data pipeline from a notebook, running it, and managing it in production?
Something like this? https://datahubproject.io/
- Field Lineage
There are specialized tools like DataHub (see this for column-level reporting: https://feature-requests.datahubproject.io/roadmap/541) that would help. But really, in a good data platform, the orchestration layer should be aggregating metadata and giving you everything you need to trace lineage. A tool like Dagster does this well if you make full use of the Software Defined Assets capability, but that is fairly new, so not many people have embraced it yet.
- LinkedDataHub: The Knowledge Graph Notebook
LinkedDataHub, an "RDF-native notebook", is not to be confused with LinkedIn DataHub, which is a metadata store/crawler/UI for your data systems: https://datahubproject.io/.
OpenLineage
- Field Lineage
Column-level lineage in OpenLineage is in its early days. There's support in the spec for it, and the integration with Spark currently emits column-level metadata. You can see the facet definition here.
- Metadata and how to capture it
Data Lineage Specification: OpenLineage https://github.com/OpenLineage/OpenLineage
- Is Airflow a passé? What replaces it?
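
To make the column-level lineage facet from the Field Lineage entry above more concrete, here is a minimal sketch in plain Python. It assumes the facet maps each output column to a list of input fields identified by `namespace`, `name` and `field`, which is my understanding of the OpenLineage column lineage facet; the real facet carries additional bookkeeping keys, so check the spec before relying on this shape. The helper names and the `warehouse`/`orders` datasets are hypothetical.

```python
import json


def column_lineage_facet(mapping):
    """Build a columnLineage-style facet.

    mapping: {output_column: [(namespace, dataset, column), ...]}
    The key names used here are an assumption about the facet layout.
    """
    return {
        "fields": {
            out_col: {
                "inputFields": [
                    {"namespace": ns, "name": ds, "field": col}
                    for ns, ds, col in inputs
                ]
            }
            for out_col, inputs in mapping.items()
        }
    }


def upstream_columns(facet, out_col):
    """Return 'dataset.column' strings feeding one output column."""
    entry = facet["fields"].get(out_col, {"inputFields": []})
    return [f'{f["name"]}.{f["field"]}' for f in entry["inputFields"]]


# Hypothetical example: two output columns derived from an "orders" table.
facet = column_lineage_facet({
    "total": [
        ("warehouse", "orders", "price"),
        ("warehouse", "orders", "quantity"),
    ],
    "customer_id": [("warehouse", "orders", "customer_id")],
})

print(json.dumps(facet, indent=2))
print(upstream_columns(facet, "total"))
```

A catalog such as DataHub could, in principle, consume a structure like this to draw column-to-column edges in its lineage graph; how each tool actually ingests it is outside what this sketch shows.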
What are some alternatives?
OpenMetadata - Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
amundsen - Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
atlas - A modern tool for managing database schemas
dagster - An orchestration platform for the development, production, and observation of data assets.
metacat
Atlas - 🚀 An open and lightweight modification to Windows, designed to optimize performance, privacy and security.
monosi - Open source data observability platform
dbt-synapse - dbt adapter for Azure Synapse Dedicated SQL Pools
CKAN - CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
SchemaCrawler - Free database schema discovery and comprehension tool
metadata-extractor - Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files