Java ETL

Open-source Java projects categorized as ETL

Top 9 Java ETL Projects

  • doris

    Apache Doris is an easy-to-use, high performance and unified analytics database.

    Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | 2024-03-27

    As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 takes a further step in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, parsing JSON at query time leads to high CPU and I/O consumption as well as high query latency, especially when the dataset is large and complex. Moreover, without a predefined schema, the query optimizer has little to work with.

  • kestra

    Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

    Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

    Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options.

  • zingg

    Scalable identity resolution, entity resolution, data mastering and deduplication using ML

  • Smooks

    Extensible data integration Java framework for building XML and non-XML fragment-based applications

  • ReplicaDB

    ReplicaDB is an open-source tool for database replication, designed for efficiently transferring bulk data between relational and non-relational databases

  • kafka-connect-file-pulse

    🔗 A multipurpose Kafka Connect connector that makes it easy to parse, transform and stream any file, in any format, into Apache Kafka

    Project mention: Kafka Connect Filepulse 2.13.0 is now available! This version includes support for SFTP and Alibaba OSS. It also contains many bug fixes and improvements. 🚀 | /r/apachekafka | 2023-09-15

  • neo4j-jdbc

    Official Neo4j JDBC Driver

  • contube

    ConTube: A scalable data connector framework that facilitates efficient data transfer between diverse systems.

    Project mention: Show HN: ConTube – A Scalable Data Connect Framework for Pulsar/Kafka Ecosystems | 2023-12-04

  • dcc-import

    Reference data importers

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-27.

What are some of the best open-source ETL projects in Java? This list will help you:

#  Project                   Stars
1  doris                     11,218
2  kestra                    6,188
3  zingg                     874
4  Smooks                    382
5  ReplicaDB                 351
6  kafka-connect-file-pulse  304
7  neo4j-jdbc                121
8  contube                   10
9  dcc-import                1