data-lake

Top 16 data-lake Open-Source Projects

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

  • Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

    # Download the LakeFS binary
    wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
    # Make the binary executable
    chmod +x lakefs
    # Initialize LakeFS with S3 as the storage backend
    ./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

  • kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

  • dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

  • Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | news.ycombinator.com | 2023-12-03

    SEEKING FREELANCER | REMOTE | GERMANY

    dltHub is looking for freelance help in the following repos:

    - https://github.com/dlt-hub/dlt

  • bitsail

    BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

  • Udacity-Data-Engineering-Projects

    A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.

  • Project mention: Pitanje za data engineering? (Question about data engineering?) | /r/programiranje | 2023-06-30
  • kylo

    Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

  • Project mention: GitHub – GSA/code-gov: An informative repo for all Code.gov repos | news.ycombinator.com | 2023-09-09

    https://github.com/simonw/datasette-lite :

    > You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.

    > [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you

    > To load a Parquet file, pass a URL to `?parquet=`

    > [...] https://lite.datasette.io/?parquet=https://github.com/Terada...*

    There are various *-to-sqlite utilities that load data into a SQLite database for use with e.g. datasette. E.g. Pandas with `dtype_backend='arrow'` saves to Parquet.

    datasette plugins are written in Python and/or JS w/ pluggy:
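
    The `?parquet=` parameter quoted above can be illustrated with a short sketch that builds a Datasette Lite link for a Parquet file. This is only an illustration under stated assumptions: the helper name `datasette_lite_parquet_url` and the example.com file location are hypothetical, and the sketch percent-encodes the file URL, whereas the quoted example passes it unencoded.

```python
from urllib.parse import urlencode

# Base address of Datasette Lite, as it appears in the quoted link above.
DATASETTE_LITE = "https://lite.datasette.io/"


def datasette_lite_parquet_url(parquet_url: str) -> str:
    """Return a Datasette Lite link that asks it to load one Parquet file.

    Hypothetical helper for illustration; it only assembles a query string.
    """
    # urlencode percent-encodes the value so it survives as a single parameter.
    return DATASETTE_LITE + "?" + urlencode({"parquet": parquet_url})


if __name__ == "__main__":
    # Hypothetical file location, used only as a placeholder.
    print(datasette_lite_parquet_url("https://example.com/data/trips.parquet"))
```

    As the quoted text notes, the file has to be served with a permissive CORS header for the browser to fetch it.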

  • Data-Engineering-Projects

    Personal Data Engineering Projects

  • Project mention: Pitanje za data engineering? (Question about data engineering?) | /r/programiranje | 2023-06-30
  • vulcan-sql

    Data API Framework for AI Agents and Data Apps

  • Project mention: Shout out to Appsmith developers to check out this new tool! | /r/lowcode | 2023-07-09

    I am one of the members of an open-source project VulcanSQL, a Data API Framework for data applications that helps data folks create and share data APIs faster.

  • cuelake

    Use SQL to build ELT pipelines on a data lakehouse.

  • amazon-s3-find-and-forget

    Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

  • btrblocks

    BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

  • Project mention: BtrBlocks: Efficient Columnar Compression for Data Lakes [pdf] | news.ycombinator.com | 2023-09-16
  • rtdl

    rtdl makes it easy to build and maintain a real-time data lake (by realtimedatalake)

  • Local-Data-LakeHouse

    Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.

  • cnfuzz

    Breaking Cloud Native Web APIs in their natural habitat.

  • vulcan-sql-examples

    Curated VulcanSQL showcases

  • Project mention: VulcanSQL and LLMs | news.ycombinator.com | 2023-08-25

    https://github.com/Canner/vulcan-sql-examples/tree/main/hugg...

  • udacity_bike_share_datalake_project

    Azure Data Lake

  • Project mention: Unveiling the Azure Data Lake for Bike Share Data Analytics | dev.to | 2023-10-11

    You can find the code related to this project in my GitHub repository.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source data-lake projects? This list will help you:

Rank  Project                              Stars
1     lakeFS                               4,058
2     kyuubi                               1,928
3     dlt                                  1,722
4     bitsail                              1,576
5     Udacity-Data-Engineering-Projects    1,295
6     kylo                                 1,091
7     Data-Engineering-Projects            637
8     vulcan-sql                           592
9     cuelake                              284
10    amazon-s3-find-and-forget            232
11    btrblocks                            178
12    rtdl                                 43
13    Local-Data-LakeHouse                 43
14    cnfuzz                               36
15    vulcan-sql-examples                  17
16    udacity_bike_share_datalake_project  0
