Top 16 data-lake Open-Source Projects
- kyuubi: Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
- bitsail: BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
- Udacity-Data-Engineering-Projects: A few projects related to data engineering, including data modeling, infrastructure setup on the cloud, data warehousing, and data lake development.
- kylo: Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
- amazon-s3-find-and-forget: Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR).
- Local-Data-LakeHouse: Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
# Download the LakeFS binary
wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
# Make the binary executable
chmod +x lakefs
# Initialize LakeFS with S3 as the storage backend
./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key
Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | news.ycombinator.com | 2023-12-03
SEEKING FREELANCER | REMOTE | GERMANY
dltHub is looking for freelance help with the following repos:
- https://github.com/dlt-hub/dlt
Project mention: GitHub – GSA/code-gov: An informative repo for all Code.gov repos | news.ycombinator.com | 2023-09-09
https://github.com/simonw/datasette-lite :
> You can use this tool to open any SQLite database file that is hosted online and served with a `access-control-allow-origin: *` CORS header. Files served by GitHub Pages automatically include this header, as do database files that have been published online using `datasette publish`.
> [...] You can paste in the "raw" URL to a file, but Datasette Lite also has a shortcut: if you paste in the URL to a page on GitHub or a Gist it will automatically convert it to the "raw" URL for you
> To load a Parquet file, pass a URL to `?parquet=`
> [...] https://lite.datasette.io/?parquet=https://github.com/Terada...
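As a rough illustration of the `?parquet=` parameter quoted above, the following sketch builds a Datasette Lite link for a Parquet file URL. The helper name `parquet_link` and the `example.com` address are hypothetical, and percent-encoding the nested URL is an assumption made here for safety; the quoted example passes the URL unencoded.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the quoted Datasette Lite example.
DATASETTE_LITE = "https://lite.datasette.io/"


def parquet_link(parquet_url):
    """Return a Datasette Lite link that asks it to load the given Parquet URL.

    Hypothetical helper: the nested URL is percent-encoded so that any
    query string it carries is not confused with Datasette Lite's own.
    """
    return DATASETTE_LITE + "?" + urlencode({"parquet": parquet_url})


# example.com is a placeholder, not a real dataset.
link = parquet_link("https://example.com/data.parquet")

# Round-trip check: decoding the query string recovers the original URL.
decoded = parse_qs(urlsplit(link).query)["parquet"][0]
print(link)
print(decoded)
```

As the quoted README notes, the remote file still has to be served with a permissive CORS header for the browser to fetch it.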
There are various *-to-sqlite utilities that load data into a SQLite database for use with tools such as Datasette. Separately, pandas can read data with `dtype_backend='pyarrow'` and save it to Parquet.
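To make the *-to-sqlite idea concrete, here is a minimal sketch that loads CSV text into a SQLite table using only the Python standard library. It is not the code of any particular *-to-sqlite utility; the function name `csv_to_sqlite`, the table name, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote a SQLite identifier so names with spaces or quotes are accepted."""
    return '"{}"'.format(name.replace('"', '""'))


def csv_to_sqlite(csv_text, conn, table):
    """Load CSV text (first row is the header) into a new table; return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(quote_ident(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Columns are created without declared types; values are stored as text.
    conn.execute("CREATE TABLE {} ({})".format(quote_ident(table), columns))
    rows = list(reader)
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_ident(table), placeholders), rows
    )
    conn.commit()
    return len(rows)


# Hypothetical sample data, for illustration only.
SAMPLE = "name,stars\nalpha,10\nbeta,20\n"

conn = sqlite3.connect(":memory:")
loaded = csv_to_sqlite(SAMPLE, conn, "projects")

# Values arrive as text, so cast before comparing numerically.
popular = conn.execute(
    "SELECT name FROM projects WHERE CAST(stars AS INTEGER) > 15"
).fetchall()
print(loaded, popular)
```

Writing to a file path instead of `:memory:` would produce a database file that a tool such as Datasette could then open.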
Datasette plugins are written in Python and/or JavaScript with pluggy:
Project mention: Shout out to Appsmith developers to check out this new tool! | /r/lowcode | 2023-07-09
I am one of the members of an open-source project VulcanSQL, a Data API Framework for data applications that helps data folks create and share data APIs faster.
Project mention: BtrBlocks: Efficient Columnar Compression for Data Lakes [pdf] | news.ycombinator.com | 2023-09-16
https://github.com/Canner/vulcan-sql-examples/tree/main/hugg...
You can find the code related to this project in my GitHub repository.
data-lake related posts
- Show HN: Data load tool(dlt)-Python library to automate the creation of datasets
- Data load tool (dlt) – open-source Python library that makes data loading easy
- BtrBlocks: Efficient Columnar Compression for Data Lakes [pdf]
- [Discussion] How to implement Data Contracts generically? Seeking advice from data contract users.
- A Step-by-Step Guide to Implementing Data Version Control
- Shout out to Appsmith developers to check out this new tool!
- Hey, data engineers! Are you building data applications with Snowflake/Bigquery?
Index
What are some of the best open-source data-lake projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | lakeFS | 4,058 |
2 | kyuubi | 1,928 |
3 | dlt | 1,722 |
4 | bitsail | 1,576 |
5 | Udacity-Data-Engineering-Projects | 1,295 |
6 | kylo | 1,091 |
7 | Data-Engineering-Projects | 637 |
8 | vulcan-sql | 592 |
9 | cuelake | 284 |
10 | amazon-s3-find-and-forget | 232 |
11 | btrblocks | 178 |
12 | rtdl | 43 |
13 | Local-Data-LakeHouse | 43 |
14 | cnfuzz | 36 |
15 | vulcan-sql-examples | 17 |
16 | udacity_bike_share_datalake_project | 0 |