SaaSHub helps you find the best software and product alternatives Learn more →
Top 14 Java ETL Projects
-
Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
-
-
-
The initial data collection is usually done through a web-based application where compliance teams enter client or vendor information. Once a record is created, you can use ETL tools like Apache Hop to extract data from multiple sources (such as financial records, regulatory filings, and public databases) in real time and store it in scalable databases like PostgreSQL or MongoDB for easy access and management.
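The extract, transform, and load flow described above can be sketched in plain Java. This is a minimal illustration only, not Apache Hop's API: the `VendorRecord` type, the source names, the sample lines, and the in-memory map standing in for a PostgreSQL or MongoDB table are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ComplianceEtlSketch {

    // Hypothetical shape of a client or vendor entry.
    public static final class VendorRecord {
        public final String id;
        public final String name;
        public final String source;

        public VendorRecord(String id, String name, String source) {
            this.id = id;
            this.name = name;
            this.source = source;
        }
    }

    // Extract: parse "id,name" lines from one source, skipping malformed lines.
    public static List<VendorRecord> extract(String sourceName, List<String> lines) {
        List<VendorRecord> out = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",", -1);
            if (parts.length != 2 || parts[0].trim().isEmpty()) {
                continue;
            }
            out.add(new VendorRecord(parts[0].trim(), parts[1].trim(), sourceName));
        }
        return out;
    }

    // Transform: normalize names to upper case so all sources share one format.
    public static VendorRecord transform(VendorRecord r) {
        return new VendorRecord(r.id, r.name.toUpperCase(Locale.ROOT), r.source);
    }

    // Load: upsert by id into a map that stands in for a database table.
    public static void load(Map<String, VendorRecord> table, List<VendorRecord> records) {
        for (VendorRecord r : records) {
            table.put(r.id, transform(r));
        }
    }

    // Runs the three stages over two made-up sources.
    public static Map<String, VendorRecord> run() {
        Map<String, VendorRecord> table = new LinkedHashMap<>();
        load(table, extract("financial-records", List.of("v1,Acme Ltd", "bad line")));
        load(table, extract("regulatory-filings", List.of("v1,Acme Limited", "v2,Globex")));
        return table;
    }

    public static void main(String[] args) {
        for (VendorRecord r : run().values()) {
            System.out.println(r.id + " " + r.name + " (" + r.source + ")");
        }
    }
}
```

Later sources overwrite earlier ones for the same id here; a real pipeline would need an explicit rule for resolving conflicts between sources.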
-
-
ReplicaDB
ReplicaDB is an open-source tool for database replication, designed for efficiently transferring bulk data between relational and non-relational databases.
-
Smooks
An extensible Java framework for building event-driven applications that break up XML and non-XML data into chunks for data integration
-
kafka-connect-file-pulse
🔗 A multipurpose Kafka Connect connector that makes it easy to parse, transform and stream any file, in any format, into Apache Kafka
-
-
-
fhir-data-pipes
A collection of tools for extracting FHIR resources and analytics services on top of that data.
Project mention: Launch HN: Metriport (YC S22) – Open-source API for healthcare data exchange | news.ycombinator.com | 2024-05-23
Thank you - glad to see there are others that are aware of the mess of healthcare data!
> Would it make sense to go one step further and bet on the future being the cloud - and start supporting existing cloud solution like Google Healthcare (FHIR) API (and others) as storage layers?
Oh for sure - to clarify, we're open-source, but we definitely have a managed cloud solution. For our backend, we currently self-host the OSS version of HAPI FHIR on AWS: https://github.com/metriport/fhir-server. It works pretty well for our purposes, and we'd prefer to not use a managed solution like the Google FHIR storage for this. Mainly for customizability, control, and to keep things OSS.
With that being said, people using Metriport can store the FHIR data and raw docs coming from our API in whatever solution they wish - including the Google FHIR storage! Everything is standardized to FHIR R4, so syncing to another backend is straightforward.
In fact, a customer of ours recently used this OSS solution to sync Metriport data to their Google cloud: https://github.com/google/fhir-data-pipes
-
-
examples
🌟 Examples of use cases that utilize Decodable, as well as demos for related open-source projects such as Apache Flink, Debezium, and Postgres. (by decodableco)
-
Hello, I've been working on the langchain-beam library. It's a LangChain and Apache Beam integration that lets you use LangChain components, such as the LLM interface, in Apache Beam ETL pipelines, apply LLM capabilities to data processing and transformations, and build RAG-based ETL pipelines.
Recently I added a feature that integrates embedding models into a Beam pipeline and generates vector embeddings for text inside the pipeline, so that embedding generation can be part of the data pipeline instead of a separate service.
I'd love to hear your thoughts. Repo - https://github.com/Ganeshsivakumar/langchain-beam
Example usage, to create embeddings in a pipeline:
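The snippet from the original post is not reproduced here. As a rough illustration of the idea only, embedding generation as one step of a pipeline might look like the plain-Java sketch below. It does not use langchain-beam or Apache Beam, and none of its names come from that library: `toyEmbed` is a hypothetical stand-in for a real embedding model, and `embedAll` is a hypothetical pipeline step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class EmbeddingStepSketch {

    // Toy stand-in for an embedding model: hashes each word into one of
    // eight buckets and counts occurrences. A real model would be called here.
    public static float[] toyEmbed(String text) {
        float[] vector = new float[8];
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            vector[Math.floorMod(word.hashCode(), vector.length)] += 1f;
        }
        return vector;
    }

    // Pipeline step: map every text element to a vector using the given model.
    public static List<float[]> embedAll(List<String> texts, Function<String, float[]> model) {
        List<float[]> out = new ArrayList<>(texts.size());
        for (String text : texts) {
            out.add(model.apply(text));
        }
        return out;
    }

    public static void main(String[] args) {
        List<float[]> vectors = embedAll(
                List.of("extract transform load", "load data"),
                EmbeddingStepSketch::toyEmbed);
        System.out.println(vectors.size() + " vectors of length " + vectors.get(0).length);
    }
}
```

Passing the model in as a `Function` keeps the step independent of any particular embedding provider, which is the property the post describes: the embedding call sits inside the pipeline rather than in a separate service.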
-
contube
ConTube: A scalable data connector framework that facilitates efficient data transfer between diverse systems.
-
-
Java ETL discussion
Java ETL related posts
-
Generating text embeddings in ETL pipeline
-
Java DataFrame library 1.0 GA release
-
Kafka Connect Filepulse 2.13.0 is now available! This version includes support for SFTP and Alibaba OSS. It also contains many bug fixes and improvements. 🚀
-
Best ‘E’TL tools for extracting data from on-prem SQL databases
-
Maven unable to resolve a dependency given in pom.xml. I've instead tried manually downloading and installing the jar, but now Maven cannot find the package.
-
Download json and csv file from github repository with apache kafka
-
Streaming data into Kafka S01/E04 — Loading Log files using Grok Expression
-
A note from our sponsor - SaaSHub
www.saashub.com | 25 Mar 2025
Index
What are some of the best open-source ETL projects in Java? This list will help you:
# | Project | Stars |
---|---|---|
1 | doris | 13,362 |
2 | flink-cdc | 6,003 |
3 | hop | 1,097 |
4 | zingg | 1,001 |
5 | ReplicaDB | 432 |
6 | Smooks | 400 |
7 | kafka-connect-file-pulse | 325 |
8 | dflib | 251 |
9 | fhir-data-pipes | 170 |
10 | neo4j-jdbc | 139 |
11 | examples | 67 |
12 | langchain-beam | 17 |
13 | contube | 10 |
14 | dcc-import | 1 |