Data engineering and Clojure?

This page summarizes the projects mentioned and recommended in the original post on /r/Clojure.

  • geni

    A Clojure dataframe library that runs on Spark

    For large-scale work, I think wrappers like geni are pretty nice, and they are built on top of established tech. Several distributed computing platforms, such as Onyx and Storm, have also popped up in Clojure and may be interesting to look at. Clojure Toolbox has a good index of libraries to examine.

  • libpython-clj

    Python bindings for Clojure

    Recent developments like libpython-clj also open up the Python ecosystem if there's something you want to use from Clojure (the interop is bidirectional).

  • tech.ml.dataset

    A Clojure high performance data processing system

    For single-node ETL work, tech.ml.dataset is the emerging standard; it is very efficient and can interoperate with various storage formats (including Arrow, Parquet, etc.). It can also work with larger-than-memory data, although it cannot currently be used in a distributed fashion, so it is single-machine only. tablecloth is a Clojure API on top of tech.ml.dataset that will feel familiar to dplyr users.

  • tech.ml

    This library has been superseded by https://github.com/scicloj/scicloj.ml.

    For ML, a lot of work is going into integrating libraries from various ecosystems (Java, Scala, Clojure). tech.ml is the original entry in this space, and work is under way to merge it with some other efforts, mainly around ML pipelines akin to sklearn.

  • jackdaw

    A Clojure library for the Apache Kafka distributed streaming platform. (by FundingCircle)

    Depending on the scale, you may also find the Jackdaw wrappers for Kafka Streams a good option.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
