Spark Scala vs. PySpark

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • frameless

    Expressive types for Spark.

    The preferred way to write Spark programs is to use the DataFrame API, which is untyped and essentially the same in Scala, C# and Python. It's a DSL for describing the AST of the computation, and the end result is the same regardless of language. There's a library called Frameless (https://github.com/typelevel/frameless) that implements a typed DataFrame API, but it is not in wide use: it looked dead for quite some time (though development now seems to have resumed) and didn't play nicely with IntelliJ IDEA last time I checked. Performance-wise there's no difference most of the time, since all the program does is build an AST. The exception is UDFs: Python UDFs are significantly slower, and you can't write "proper" UDFs in Python, meaning ones that generate Java code.
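
To make the "all the program does is build an AST" point concrete, here is a toy sketch in plain Python. It is not Spark's API: `Col`, `Lit`, `BinOp`, `OpaqueUdf`, `describe` and `evaluate` are hypothetical names invented for illustration. The idea it tries to show is that operator calls in the host language only assemble a tree, a separate "engine" does the actual work, and a UDF shows up in that tree as a box the engine cannot look inside. The toy does not model the cost of moving data between the JVM and a Python worker.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable


class Expr:
    """Base node of a toy expression tree (hypothetical, not a Spark class)."""

    def __add__(self, other):
        return BinOp("+", self, to_expr(other))

    def __mul__(self, other):
        return BinOp("*", self, to_expr(other))

    def __gt__(self, other):
        return BinOp(">", self, to_expr(other))


@dataclass(frozen=True)
class Col(Expr):
    name: str


@dataclass(frozen=True)
class Lit(Expr):
    value: Any


@dataclass(frozen=True)
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr


@dataclass(frozen=True)
class OpaqueUdf(Expr):
    """A user function the engine can call but cannot inspect or rewrite."""
    fn: Callable
    arg: Expr


def to_expr(value):
    # Wrap plain Python values so every tree node is an Expr.
    return value if isinstance(value, Expr) else Lit(value)


def describe(expr):
    """Render the tree as text: this is all the 'driver program' produced."""
    if isinstance(expr, Col):
        return expr.name
    if isinstance(expr, Lit):
        return repr(expr.value)
    if isinstance(expr, BinOp):
        return f"({describe(expr.left)} {expr.op} {describe(expr.right)})"
    if isinstance(expr, OpaqueUdf):
        return f"<opaque udf>({describe(expr.arg)})"
    raise TypeError(f"unknown node: {expr!r}")


OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def evaluate(expr, row):
    """Stand-in for the engine: walks the tree against one row (a dict)."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, BinOp):
        return OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))
    if isinstance(expr, OpaqueUdf):
        return expr.fn(evaluate(expr.arg, row))
    raise TypeError(f"unknown node: {expr!r}")


# Built-in operators: the whole computation is visible in the tree.
plan = (Col("price") * 2 + 1) > 10
print(describe(plan))

# Same arithmetic hidden in a UDF: the tree only records "call this function".
udf_plan = OpaqueUdf(lambda p: p * 2 + 1, Col("price"))
print(describe(udf_plan))
print(evaluate(plan, {"price": 5}), evaluate(udf_plan, {"price": 5}))
```

In the first plan every operation is a node the engine could in principle reorder, simplify or compile. In the second, the same arithmetic is hidden behind a single opaque node, which is roughly why the choice of front-end language matters little until UDFs are involved.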

