Java Data Science

Open-source Java projects categorized as Data Science

Top 8 Java Data Science Projects

  • GitHub repo OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

    Project mention: Data mapping process | 2021-11-05

    In terms of open source - is OpenRefine what you are after?

  • GitHub repo Smile

    Statistical Machine Intelligence & Learning Engine

    Project mention: What libraries do you use for machine learning and data visualizing in scala? | 2021-11-27

    I use Smile with Ammonite and it feels pretty easy/good to work with. Of course, for purely looking at data and exploration, you're not going to beat Python.

  • GitHub repo airbyte

    Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

    Project mention: How to improve ETL tech stack for robust API calls, historization, orchestration and versioning? | 2021-12-01

    Shout out - decent flexibility, self-hosting on Kubernetes, and a commitment to no premium connectors.

  • GitHub repo Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL

    Project mention: Learn SQL | 2021-08-03

    You might find interesting. It allows you to bolt on a MPP SQL execution engine on top of any data source including pre-built connectors for Druid and Kafka.

    It's all ANSI SQL and the best part is you can combine data from heterogeneous sources. For example, you can join data between a topic in Kafka and a table in Druid or even between Kafka, S3 and your RDBMS.

    Disclaimer: I'm a maintainer of the project.

  • GitHub repo Tablesaw

    Java dataframe and visualization library

    Project mention: Does Java has similar project like this one in C#? (ml, data) | 2021-05-23

    For data frames, Tablesaw or anything with Apache Arrow interop would be a good way to go.

  • GitHub repo DatumBox

    Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

  • GitHub repo zingg

    Scalable data mastering, deduplication and entity resolution.

    Project mention: GitHub Java Projects to Contribute | 2021-11-17

    Check Zingg out at and let me know if you would like to contribute

  • GitHub repo rumble

    ⛈️ RumbleDB 1.16.0 "Shagbark Hickory" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more (by RumbleDB)

    Project mention: RumbleDB: Query with ease a lot of different nested, heterogeneous data formats | 2021-12-01

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-12-01.


What are some of the best open-source Data Science projects in Java? This list will help you:

#  Project     Stars
1  OpenRefine  8,509
2  Smile       5,402
3  airbyte     4,652
4  Trino       4,460
5  Tablesaw    2,758
6  DatumBox    1,073
7  zingg         322
8  rumble         79