Why Databricks Is Winning
5 projects | news.ycombinator.com | 14 Feb 2021
Snowflake and Databricks are different, sometimes complementary technologies. You can store data in Snowflake & query it with Databricks for example: https://github.com/snowflakedb/spark-snowflake
Snowflake predicate pushdown filtering seems quite promising: https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...
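Predicate pushdown means a DataFrame filter is translated into SQL and executed inside Snowflake itself, so only matching rows leave the warehouse. A minimal plain-Python sketch of the idea (this is illustrative, not the spark-snowflake connector's actual API; `pushdown_query` and its filter format are made up for the example):

```python
# Illustrative sketch of predicate pushdown: instead of pulling the whole
# table and filtering client-side, the filter is compiled into a WHERE
# clause that the source system (e.g. Snowflake) runs itself.

def pushdown_query(table, filters):
    """Compile simple (column, op, value) filters into one SELECT
    that can be executed at the source."""
    where = " AND ".join(
        f"{col} {op} {value!r}" for col, op, value in filters
    )
    return f"SELECT * FROM {table} WHERE {where}"

sql = pushdown_query("orders", [("status", "=", "SHIPPED"), ("amount", ">", 100)])
# Only rows satisfying the WHERE clause are transferred back to Spark.
```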
Think both these companies can win.
5 Reasons Your Data Lakehouse should Embrace Dremio Cloud
2 projects | dev.to | 9 Aug 2022
You can query data organized in many open table formats like Apache Iceberg and Delta Lake. (Here is a good article on what a table format is and how the different ones compare)
Delta 2.0 - The Foundation of your Data Lakehouse is Open
2 projects | reddit.com/r/apachespark | 5 Aug 2022
Note that the roadmap can be found at https://github.com/delta-io/delta/issues/1307 and we’re actively asking for feedback so we can prioritize the remaining items. Please chime in there so we can track and re-prioritize! Thanks!
Still not completely on par with the Databricks version, with features like GENERATED ALWAYS AS IDENTITY still missing, but this is getting good.
Databricks platform for small data, is it worth it?
3 projects | reddit.com/r/dataengineering | 29 Jun 2022
Currently our infrastructure is some custom-made pipelines that load the data onto S3, and I use Delta tables here and there for their conveniences: ACID, time travel, merges, CDC, etc...
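One of the conveniences mentioned above is MERGE (upsert), which Delta Lake exposes as `MERGE INTO`: matched keys get updated, unmatched keys get inserted. A toy plain-Python model of those semantics (the `merge` function here is hypothetical, not the Delta API):

```python
# Toy model of MERGE INTO semantics: rows in `updates` whose key matches
# a target row update it in place; rows with new keys are inserted.

def merge(target, updates, key="id"):
    """Upsert `updates` into `target`, keyed by `key`."""
    by_key = {row[key]: row for row in target}
    for row in updates:
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}]
updates = [{"id": 2, "qty": 7}, {"id": 3, "qty": 1}]
merged = merge(target, updates)
# id 1 untouched, id 2 updated, id 3 inserted
```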
Data point versioning infrastructure for time traveling to a precise point in time?
2 projects | reddit.com/r/dataengineering | 18 Jun 2022
I've been playing around a bit with Delta (Table/Lake) whatever you want to call it. It has time travel so you can look back and see what the data looked like at a particular point in time. https://delta.io/
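Time travel works because Delta records every change as a commit in an ordered transaction log (JSON files under `_delta_log/`); reading "as of" version N just replays commits 0..N. A minimal sketch of that mechanism in plain Python (a simplified model, not Delta's actual log format):

```python
# Toy transaction log: each commit appends one action, and a read "as of"
# version N rebuilds table state by replaying commits 0..N.

log = []  # ordered commit log

def commit(action):
    """Append one action; return the version number of this commit."""
    log.append(action)
    return len(log) - 1

def read_as_of(version):
    """Rebuild table state by replaying the log up to `version`."""
    state = {}
    for action in log[: version + 1]:
        if action["op"] == "put":
            state[action["key"]] = action["value"]
        elif action["op"] == "delete":
            state.pop(action["key"], None)
    return state

v0 = commit({"op": "put", "key": "a", "value": 1})
v1 = commit({"op": "put", "key": "a", "value": 2})
old = read_as_of(v0)   # the value as it was at version 0
new = read_as_of(v1)   # the current value
```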
How-to-Guide: Contributing to Open Source
19 projects | reddit.com/r/dataengineering | 11 Jun 2022
What companies/startups are using Scala (open source projects on github)?
13 projects | reddit.com/r/scala | 24 May 2022
There are so many of them in big data, e.g. Kafka, Spark, Flink, Delta, Snowplow, Finagle, Deequ, CMAK, OpenWhisk, Snowflake, TheHive, TVM-VTA, etc.
What is a Delta Table?
2 projects | reddit.com/r/dataengineering | 20 May 2022
Ah. I believe you are correct. As I look at the Python examples in the GitHub repo, these look almost identical to what I was seeing. https://github.com/delta-io/delta/blob/master/examples/python/quickstart.py
It is a specific table format. https://delta.io/ It's an open-source project; just read their website, it will have way more info than these comments.
Postgres and Parquet in the Data Lake
7 projects | news.ycombinator.com | 3 May 2022
I would suggest looking into Delta Lake (https://delta.io/) - it's built on top of Parquet, but has advantages over plain Parquet:
- transactions - you don't get garbage in your table if a write fails
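The reason a failed write leaves no garbage is ordering: data files are staged first, but readers only see files referenced by a committed log entry, and appending that entry is the last, atomic step. A plain-Python sketch of the idea (a simplified model, not Delta's implementation):

```python
# Sketch of atomic commits over an object store: staged data files are
# invisible to readers until a log entry naming them is appended, so a
# writer crash before commit leaves only harmless orphan files.

committed = []   # transaction log: one list of file names per commit
storage = {}     # object store: file name -> rows

def write(files, fail_before_commit=False):
    for name, rows in files.items():
        storage[name] = rows           # stage data files first
    if fail_before_commit:
        raise RuntimeError("writer crashed before commit")
    committed.append(list(files))      # the atomic step: log the commit

def read():
    """Readers see only files named in the log; orphans are ignored."""
    return [row for c in committed for f in c for row in storage[f]]

write({"part-0.parquet": [1, 2]})
try:
    write({"part-1.parquet": [3, 4]}, fail_before_commit=True)
except RuntimeError:
    pass
rows = read()   # the failed write's file is staged but never visible
```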
What are some alternatives?
hudi - Upserts, Deletes And Incremental Processing on Big Data.
delta-rs - A native Rust library for Delta Lake, with bindings into Python and Ruby.
lakeFS - Git-like capabilities for your object storage
iceberg - Apache Iceberg
dvc - 🦉Data Version Control | Git for Data & Models | ML Experiments Management
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
Jupyter Scala - A Scala kernel for Jupyter
connectors - This library allows Scala and Java-based projects (including Apache Flink, Apache Hive, Apache Beam, and PrestoDB) to read from and write to Delta Lake.
cobrix - A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
LakeSoul - A Table Structure Storage on Data Lakes to Unify Batch and Streaming Data Processing
databricks-nutter-repos-demo - Demo of using the Nutter for testing of Databricks notebooks in the CI/CD pipeline