Java ETL

Open-source Java projects categorized as ETL

Top 14 Java ETL Projects

  1. doris

    Apache Doris is an easy-to-use, high-performance, unified analytics database.

    Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
  2. CodeRabbit

    CodeRabbit: AI Code Reviews for Developers. Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.

  3. hop

    Hop Orchestration Platform

    Project mention: Automating Enhanced Due Diligence in Regulated Applications | dev.to | 2025-02-13

    The initial data collection is usually done through a web-based application where compliance teams input client or vendor information. Once a record is created, you can use ETL tools like Apache Hop to extract data from multiple sources (like financial records, regulatory filings, and public databases) in real time and store them in scalable databases like PostgreSQL or MongoDB for easy access and management.

  4. zingg

    Scalable identity resolution, entity resolution, data mastering and deduplication using ML

  5. ReplicaDB

    ReplicaDB is an open-source tool for database replication, designed for efficiently transferring bulk data between relational and non-relational databases.

  6. Smooks

    An extensible Java framework for building event-driven applications that break up XML and non-XML data into chunks for data integration

  7. kafka-connect-file-pulse

    🔗 A multipurpose Kafka Connect connector that makes it easy to parse, transform and stream any file, in any format, into Apache Kafka

  8. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  9. dflib

    In-memory Java DataFrame library

    Project mention: Java DataFrame library 1.0 GA release | news.ycombinator.com | 2024-12-18
  10. fhir-data-pipes

    A collection of tools for extracting FHIR resources and analytics services on top of that data.

    Project mention: Launch HN: Metriport (YC S22) – Open-source API for healthcare data exchange | news.ycombinator.com | 2024-05-23

    Thank you - glad to see there are others that are aware of the mess of healthcare data!

    > Would it make sense to go one step further and bet on the future being the cloud - and start supporting existing cloud solution like Google Healthcare (FHIR) API (and others) as storage layers?

    Oh for sure - to clarify, we're open-source, but we definitely have a managed cloud solution. For our backend, we currently self-host the OSS version of HAPI FHIR on AWS: https://github.com/metriport/fhir-server. It works pretty well for our purposes, and we'd prefer to not use a managed solution like the Google FHIR storage for this. Mainly for customizability, control, and to keep things OSS.

    With that being said, people using Metriport can store the FHIR data and raw docs coming from our API in whatever solution they wish - including the Google FHIR storage! Everything is standardized to FHIR R4, so syncing to another backend is straightforward.

    In fact, a customer of ours recently used this OSS solution to sync Metriport data to their Google cloud: https://github.com/google/fhir-data-pipes

  11. neo4j-jdbc

    Official Neo4j JDBC Driver

  12. examples

    🌟 Examples of use cases that utilize Decodable, as well as demos for related open-source projects such as Apache Flink, Debezium, and Postgres. (by decodableco)

  13. langchain-beam

    Integrates LLMs as PTransform in Apache Beam pipelines using LangChain

    Project mention: Generating text embeddings in ETL pipeline | news.ycombinator.com | 2025-01-13

    Hello, I've been working on the langchain-beam library. It's a LangChain and Apache Beam integration that lets you use LangChain components, such as the LLM interface, in Apache Beam ETL pipelines, so you can leverage LLM capabilities for data processing and transformations and build RAG-based ETL pipelines.

    Recently I added a feature that integrates embedding models into a Beam pipeline and generates vector embeddings for text within the pipeline, so that embedding generation can be part of the data pipeline instead of a separate service.

    I’d love to hear your thoughts. Repo - https://github.com/Ganeshsivakumar/langchain-beam

    Example usage, to create embeddings in pipeline :

  14. contube

    ConTube: A scalable data connector framework that facilitates efficient data transfer between diverse systems.

  15. dcc-import

    Reference data importers

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
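
The Apache Hop mention in the list above describes a typical ETL flow: extract records from several sources, then store them in a database. The shape of that flow can be sketched in plain Java as below. This is only an illustration under stated assumptions: it does not use Apache Hop or any real database, the class name `EtlSketch`, the source names, and the normalization rule are all hypothetical, and an in-memory map stands in for the target table.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of extract -> transform -> load; not tied to any ETL tool.
public class EtlSketch {

    // Transform step: trim whitespace and upper-case the name so the same
    // client matches across sources (an assumed, deliberately naive rule).
    static String normalize(String rawName) {
        return rawName.trim().toUpperCase(Locale.ROOT);
    }

    // Runs the three steps over in-memory sources.
    // Input:  source name -> raw client names found in that source.
    // Output: normalized client name -> names of the sources that mention it.
    static Map<String, Set<String>> run(Map<String, List<String>> sources) {
        // Stand-in for a database table keyed by client name.
        Map<String, Set<String>> store = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> source : sources.entrySet()) {
            for (String rawName : source.getValue()) {       // extract
                String key = normalize(rawName);             // transform
                if (key.isEmpty()) {
                    continue;                                // drop blank rows
                }
                // load: insert the client if new, then record the source
                store.computeIfAbsent(key, k -> new TreeSet<>())
                     .add(source.getKey());
            }
        }
        return store;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new LinkedHashMap<>();
        sources.put("financial_records", List.of("Acme Corp ", "Globex"));
        sources.put("regulatory_filings", List.of("acme corp", ""));
        System.out.println(run(sources));
    }
}
```

Keying the store by the normalized name makes the load step idempotent: re-running the same sources adds no duplicate rows, which is roughly what an upsert into a real table would give you.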

Java ETL related posts

  • Generating text embeddings in ETL pipeline

    1 project | news.ycombinator.com | 13 Jan 2025
  • Java DataFrame library 1.0 GA release

    1 project | news.ycombinator.com | 18 Dec 2024
  • Kafka Connect Filepulse 2.13.0 is now available! This version includes support for SFTP and Alibaba OSS. It also contains many bug fixes and improvements. 🚀

    1 project | /r/apachekafka | 15 Sep 2023
  • Best ‘E’TL tools for extracting data from on-prem SQL databases

    2 projects | /r/snowflake | 28 Mar 2022
  • Maven unable to resolve a dependency given in pom.xml. I've instead tried manually downloading installing the jar, but now maven cannot find the package.

    1 project | /r/learnjava | 30 Aug 2021
  • Download json and csv file from github repository with apache kafka

    1 project | /r/apachekafka | 29 Jul 2021
  • Streaming data into Kafka S01/E04 — Loading Log files using Grok Expression

    5 projects | dev.to | 5 Jan 2021

Index

What are some of the best open-source ETL projects in Java? This list will help you:

#   Project                    Stars
1   doris                      13,362
2   flink-cdc                   6,003
3   hop                         1,097
4   zingg                       1,001
5   ReplicaDB                     432
6   Smooks                        400
7   kafka-connect-file-pulse      325
8   dflib                         251
9   fhir-data-pipes               170
10  neo4j-jdbc                    139
11  examples                       67
12  langchain-beam                 17
13  contube                        10
14  dcc-import                      1
