Java Data Science

Open-source Java projects categorized as Data Science

Top 8 Java Data Science Projects

  • OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

    Project mention: Cannot create table from CSV file in BigQuery. | reddit.com/r/learnSQL | 2022-06-05

    I'm not familiar with BigQuery, but could it be inconsistencies in the data? I mean missing commas or quotes, incorrect datetime formats, or something like that. You can use the CSV Lint plug-in in Notepad++ or install OpenRefine to check for those types of errors.

  • airbyte

    Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

    Project mention: Ask HN: How are you dealing with the M1/ARM migration? | news.ycombinator.com | 2022-06-10

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

    Project mention: How-to-Guide: Contributing to Open Source | reddit.com/r/dataengineering | 2022-06-11

    Although Trino (formerly Presto) is in the awesome-for-beginners list, it's also a really good DE project, as it is a distributed query engine that connects to most of the projects listed above. So depending on where you work in this project, you can gain a depth of knowledge on the query engine or breadth across all the connectors, or go hybrid.

  • Smile

    Statistical Machine Intelligence & Learning Engine

    Project mention: What libraries do you use for machine learning and data visualizing in scala? | reddit.com/r/scala | 2021-11-27

    I use Smile (https://github.com/haifengl/smile) with Ammonite, and it feels pretty easy to work with. Of course, for purely looking at and exploring data, you're not going to beat Python.

  • Tablesaw

    Java dataframe and visualization library

  • DatumBox

    Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

  • zingg

    Scalable entity resolution, data mastering and deduplication using ML

    Project mention: is it possible to "fuzzy match" or dedupe columns in Redshift? | reddit.com/r/aws | 2022-04-30

    If you are open to using a framework for this, check Zingg at https://github.com/zinggAI/zingg. It connects to Redshift, Snowflake and other warehouses and can handle multiple columns.

  • rumble

    ⛈️ RumbleDB 1.19.0 "Tipuana Tipu" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more (by RumbleDB)

    Project mention: RumbleDB: Query with ease a lot of different nested, heterogeneous data formats | news.ycombinator.com | 2021-12-01
NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-06-11.

Index

What are some of the best open-source Data Science projects in Java? This list will help you:

#  Project     Stars
1  OpenRefine  8,870
2  airbyte     7,123
3  Trino       5,622
4  Smile       5,531
5  Tablesaw    2,932
6  DatumBox    1,077
7  zingg         537
8  rumble        173