Why Databricks Is Winning

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • spark-snowflake

    Snowflake Data Source for Apache Spark.

    Snowflake and Databricks are different, sometimes complementary technologies. For example, you can store data in Snowflake and query it with Databricks: https://github.com/snowflakedb/spark-snowflake

    Snowflake predicate pushdown filtering seems quite promising: https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...

    I think both of these companies can win.
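As a rough illustration of the spark-snowflake comment above, here is a minimal sketch of reading a Snowflake table from a Spark session, for instance in a Databricks notebook. The helper names `snowflake_options` and `read_table` are hypothetical and introduced only for this example, as are any account or table values you pass in. The option keys and the source name are the ones the connector documents, but check them against the connector version you use. The sketch assumes the connector is already available on the cluster.

```python
from typing import Dict, Optional

# Fully qualified data source name of the spark-snowflake connector.
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"


def snowflake_options(account_url: str, user: str, password: str,
                      database: str, schema: str, warehouse: str) -> Dict[str, str]:
    """Build the connection options the connector expects (hypothetical helper)."""
    return {
        "sfURL": account_url,
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


def read_table(spark, options: Dict[str, str], table: str,
               where: Optional[str] = None):
    """Load a Snowflake table as a Spark DataFrame (hypothetical helper).

    `spark` is an existing SparkSession. `where` is an optional SQL filter
    expression; the connector may push such filters down to Snowflake,
    depending on its pushdown settings.
    """
    df = (
        spark.read.format(SNOWFLAKE_SOURCE)
        .options(**options)
        .option("dbtable", table)
        .load()
    )
    return df.filter(where) if where else df
```

In a notebook this would be called with the session's `spark` object, for example `read_table(spark, opts, "ORDERS", where="AMOUNT > 100")`, where the table and column names are placeholders. In practice the password would come from a secret store, not a literal.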

  • dask-gateway

    A multi-tenant server for securely deploying and managing Dask clusters.

    I’ve had a lot of success with Dask lately. It’s comparable to Spark in some ways [0]. Because it is written in Python and built on top of pandas/NumPy, it allows much more flexibility. It also has great tools built on top of Kubernetes that make deployment quick and easy [1].

    [0] https://docs.dask.org/en/latest/spark.html

    [1] https://github.com/dask/dask-gateway

  • chispa

    PySpark test helper methods with beautiful error messages

    The last point was for teams that only rely on notebooks, sorry if I didn't make that clear.

    You're right that all those issues can be sidestepped if you build projects in version-controlled Git repos, test the code, and deploy JAR / Wheel files.

    Speaking of testing, can you let me know if this PySpark testing fix worked for you ;) https://github.com/MrPowers/chispa/issues/6

  • flintrock

    A command-line tool for launching Apache Spark clusters.

    > * AWS has a managed Spark offering called EMR

    There is also my rinky-dink open source project, Flintrock [0], that will launch open source Spark clusters on AWS for you.

    It's probably not the right tool for production use (and you would be right to wonder why Flintrock exists when we have EMR [1]), but I know of several companies that have used Flintrock at one point or another in production at large scale (like, 400+ node clusters).

    [0]: https://github.com/nchammas/flintrock

    [1]: https://github.com/nchammas/flintrock#why-build-flintrock-wh...

  • databricks-nutter-repos-demo

    Demo of using the Nutter for testing of Databricks notebooks in the CI/CD pipeline

    I’m sorry for the delay, will fix ASAP...

    My point is that you can do that even without JARs/wheels: you can version-control and test notebooks directly. For example, https://github.com/alexott/databricks-nutter-projects-demo
