Top 23 Bigdata Open-Source Projects

  • TDengine

    An open-source time-series database with high performance, scalability, and SQL support. It can be widely used in IoT, Connected Vehicles, DevOps, Energy, Finance, and other fields.

    Project mention: TDengine: NEW Data - star count:18705.0 | reddit.com/r/algoprojects | 2022-08-06
  • shardingsphere

    Ecosystem to transform any database into a distributed database system, and enhance it with sharding, elastic scaling, encryption features & more

    Project mention: DistSQL Applications: Building a Dynamic Distributed Database | dev.to | 2022-07-26

  • awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

  • vaex

    Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

    Project mention: preprocessing millions of records - how to speed up the processing | reddit.com/r/datascience | 2022-06-03

    Try vaex; with its lazy evaluation and parallel calculations, you should be fine.

  • hudi

    Upserts, Deletes And Incremental Processing on Big Data.

    Project mention: Big Data file formats | reddit.com/r/apachespark | 2022-06-13
  • dpark

    Python clone of Spark, a MapReduce-like framework in Python

  • volcano

    A Cloud Native Batch System (Project under CNCF)

  • Apache Avro

    Apache Avro is a data serialization system.

    Project mention: Marshaling objects in modern Java | reddit.com/r/java | 2022-06-23

    If a binary format is OK, use Protocol Buffers or Avro. Note that with binary formats, you need a schema to serialize and deserialize your data. Therefore, you'd probably want a schema registry to store all past and present schemas for later use.

  • griddb

    GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.

    Project mention: griddb: NEW Data - star count:1807.0 | reddit.com/r/algoprojects | 2022-08-06
  • spark

    .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers. (by dotnet)

    Project mention: Does anyone actually use ML.NET? | reddit.com/r/dotnet | 2022-06-21

    Re: DataFrames, that's good to know. There is the DataFrame API which is part of the Microsoft.Data.Analysis NuGet package, and that's the API that the issue is tracking and that is shown in the sample notebook I shared. That API has no dependencies on other systems. The DataFrame you're referring to is part of the .NET for Apache Spark library, which depends on Apache Spark, which requires some initial setup.

  • tensorbase

    TensorBase is a new big data warehouse built with modern efforts.

  • Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • OpenMetadata

    Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.

    Project mention: OpenMetadata: Open Standard for Metadata | news.ycombinator.com | 2022-05-24
  • kube-batch

    A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC

  • Mobius: C# API for Spark

    C# and F# language binding and extensions to Apache Spark (by microsoft)

  • cds

    Data syncing in golang for ClickHouse. (by zeromicro)

    Project mention: ClickHouse, Inc | news.ycombinator.com | 2021-09-20
  • Gearpump

    Lightweight real-time big data streaming engine over Akka

  • cortx

    CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.

  • visualpython

    GUI-based Python code generator for data science.

  • sidekick

    High Performance HTTP Sidecar Load Balancer (by minio)

  • docker-spark-cluster

    A simple Spark standalone cluster for your testing environment purposes

  • feedirss-api

    RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.

  • kotlin-spark-api

    This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
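
The Apache Avro mention in the list above notes that binary formats need a schema on both the writing and the reading side, and that a schema registry helps keep old data readable. The sketch below illustrates that idea only. It uses Python's standard `struct` module rather than Avro itself, and the `SCHEMAS` registry, the one-byte schema id prefix, and the field names are all hypothetical choices made for the example.

```python
import struct

# Hypothetical in-memory "schema registry": schema id -> (field names, struct format).
# Real Avro schemas are JSON documents; this mapping is an assumption for illustration.
SCHEMAS = {
    1: (("user_id", "score"), "<id"),            # version 1: int32, float64
    2: (("user_id", "score", "level"), "<idh"),  # version 2 adds an int16 field
}


def encode(schema_id, record):
    """Serialize a record as: 1 byte of schema id, then the packed field values."""
    fields, fmt = SCHEMAS[schema_id]
    body = struct.pack(fmt, *(record[name] for name in fields))
    return struct.pack("<B", schema_id) + body


def decode(payload):
    """Look up the writer's schema by the leading id byte, then unpack the fields."""
    schema_id = payload[0]
    fields, fmt = SCHEMAS[schema_id]
    values = struct.unpack(fmt, payload[1:])
    return dict(zip(fields, values))


old = encode(1, {"user_id": 7, "score": 2.5})
new = encode(2, {"user_id": 7, "score": 2.5, "level": 3})
print(decode(old))
print(decode(new))
```

Without the registry entry for schema 1, the bytes in `old` could not be interpreted once writers moved to schema 2, which is the reason the comment suggests storing all past and present schemas.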

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-08-06.

Index

What are some of the best open-source Bigdata projects? This list will help you:

 #  Project                    Stars
 1  TDengine                  18,768
 2  shardingsphere            16,867
 3  awesome-bigdata           11,088
 4  vaex                       7,221
 5  hudi                       3,422
 6  dpark                      2,679
 7  volcano                    2,544
 8  Apache Avro                2,202
 9  griddb                     1,810
10  spark                      1,797
11  tensorbase                 1,254
12  Optimus                    1,241
13  OpenMetadata               1,212
14  kube-batch                   985
15  Mobius: C# API for Spark     932
16  cds                          840
17  Gearpump                     761
18  cortx                        580
19  visualpython                 494
20  sidekick                     435
21  docker-spark-cluster         373
22  feedirss-api                 341
23  kotlin-spark-api             327