Top 20 C++ Data Science Projects

cudf

23 7,311 9.9 C++

cuDF - GPU DataFrame Library

Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation process and its “lightning fast” performance.

matplotplusplus

26 3,949 5.8 C++

Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

Project mention: Creating k-NN with C++ (from Scratch) | dev.to | 2024-01-11

cmake_minimum_required(VERSION 3.5)
project(knn_cpp CXX)

# Set up C++ version and properties
include(CheckIncludeFileCXX)
check_include_file_cxx(any HAS_ANY)
check_include_file_cxx(string_view HAS_STRING_VIEW)
check_include_file_cxx(coroutine HAS_COROUTINE)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_BUILD_TYPE Debug)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

# Copy data file to build directory
file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

# Download library using FetchContent
include(FetchContent)
FetchContent_Declare(matplotplusplus
    GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus
    GIT_TAG origin/master)
FetchContent_GetProperties(matplotplusplus)
if(NOT matplotplusplus_POPULATED)
    FetchContent_Populate(matplotplusplus)
    add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()

FetchContent_Declare(
    fmt
    GIT_REPOSITORY https://github.com/fmtlib/fmt.git
    GIT_TAG 7.1.3 # Adjust the version as needed
)
FetchContent_MakeAvailable(fmt)

# Add executable and link project libraries and folders
add_executable(${PROJECT_NAME} main.cc)
target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt)
aux_source_directory(lib LIB_SRC)
target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC})
add_subdirectory(tests)
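The build file above targets a k-NN classifier, but the excerpt does not show the classification step itself. As a rough illustration only (this is not the code from the linked post), a k-nearest-neighbours prediction could be sketched as follows. The names `Sample`, `squared_distance`, and `knn_predict` are hypothetical, and the sketch assumes Euclidean distance with a simple majority vote.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical training record: a feature vector plus an integer class label.
struct Sample {
    std::vector<double> features;
    int label;
};

// Squared Euclidean distance; the square root is not needed for ranking.
double squared_distance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Predict the label of `query` by majority vote among its k nearest samples.
// Ties are resolved in favour of the smallest label value.
int knn_predict(const std::vector<Sample>& train,
                const std::vector<double>& query,
                std::size_t k) {
    assert(!train.empty() && k > 0);
    k = std::min(k, train.size());

    // Pair each training sample's distance to the query with its label.
    std::vector<std::pair<double, int>> dist;
    dist.reserve(train.size());
    for (const Sample& s : train) {
        dist.emplace_back(squared_distance(s.features, query), s.label);
    }

    // Only the k smallest distances need to be in sorted order.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end());

    // Count votes per label among the k nearest neighbours.
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) {
        ++votes[dist[i].second];
    }

    int best_label = votes.begin()->first;
    int best_count = 0;
    for (const auto& kv : votes) {
        if (kv.second > best_count) {
            best_label = kv.first;
            best_count = kv.second;
        }
    }
    return best_label;
}
```

With a dataset such as `iris.data`, each row would first need to be parsed into a `Sample`; that parsing step is left out here.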

GraphScope

10 3,109 9.7 C++

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba | 一站式图计算系统

Project mention: Show HN: Graphlearn-for-PyTorch, distributed graph learning on PyTorch | news.ycombinator.com | 2023-05-15

Optimizing distributed sampling and feature lookup looks really attractive. It's really challenging to deploy GNN training at industrial scale for a large graph.
Will GLT be part of GraphScope[1] and replace the current graphscope-for-learning implementation?
[1]: https://github.com/alibaba/GraphScope

SHOGUN

1 3,008 4.8 C++

Shōgun

DataFrame

109 2,280 9.4 C++

C++ DataFrame for statistical, financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13

TileDB

14 1,771 9.7 C++

The Universal Storage Engine

Project mention: Ask HN: Who is hiring? (May 2024) | news.ycombinator.com | 2024-05-01

TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | https://tiledb.com
TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.
TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more (github.com/TileDB-Inc/TileDB).
With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.
Website: https://tiledb.com | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog
We are actively hiring for several roles including:
- Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)

chdb

18 1,726 9.5 C++

chDB is an embedded OLAP SQL Engine 🚀 powered by ClickHouse

Project mention: FLaNK Stack Weekly 06 Nov 2023 | dev.to | 2023-11-06

ArcticDB

4 1,123 9.8 C++

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

Project mention: Speed Test - ArcticDB, HDF, Feather, Parquet | /r/algotrading | 2023-11-21

ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it would compare on speed versus the other file format storage options available in pandas: HDF, Feather, and Parquet. I could not find much online about how Arctic compares to the other options in terms of speed, so I ran some tests myself.

MLPP

5 1,054 3.2 C++

A library created to revitalize C++ as a machine learning front end. Per aspera ad astra.

turbodbc

2 604 8.0 C++

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

oneDAL

1 593 9.3 C++

oneAPI Data Analytics Library (oneDAL)

GPBoost

3 513 9.4 C++

Combining tree-boosting with Gaussian process and mixed effects models

desbordante-core

2 354 9.5 C++

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data-cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.

Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11

Graphia

8 222 9.7 C++

A visualisation tool for the creation and analysis of graphs

Project mention: NetworkX – Network Analysis in Python | news.ycombinator.com | 2023-12-08

Export the graph to GML or to GraphML or to GraphViz DOT or to some other Graph format. BTW I recommend 3D graph visualization over 2D when possible, that is when you're exploring interactively as opposed to printing figures. The Graphia tool is the only FOSS tool for this purpose that I know of:
https://graphia.app
https://github.com/graphia-app/graphia
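The export step mentioned in the comment above can be illustrated with a small sketch that writes an edge list in GraphViz DOT syntax. This is not the API of NetworkX or Graphia. The helper names `dot_quote` and `to_dot` and the graph name `G` are hypothetical, and only plain node names and edges are handled (no attributes or subgraphs).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: wrap a node name in double quotes so it is a valid
// DOT ID, escaping any embedded double quote with a backslash.
std::string dot_quote(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') {
            out += '\\';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Serialise an edge list as DOT text. Directed graphs use the "digraph"
// keyword with "->" edges; undirected graphs use "graph" with "--" edges.
std::string to_dot(const std::vector<std::pair<std::string, std::string>>& edges,
                   bool directed) {
    const std::string arrow = directed ? " -> " : " -- ";
    std::string out = directed ? "digraph G {\n" : "graph G {\n";
    for (const auto& e : edges) {
        out += "  " + dot_quote(e.first) + arrow + dot_quote(e.second) + ";\n";
    }
    out += "}\n";
    return out;
}
```

The returned string can be written to a `.dot` file and opened in any tool that reads DOT.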

Tiger

4 107 5.0 C++

C++ Matrix -- High performance and accurate (e.g. edge cases) matrix math library with expression template arithmetic operators (by hosseinmoein)

secure-xgboost

1 101 0.0 C++

Secure collaborative training and inference for XGBoost.

nelson

5 86 9.5 C++

The Nelson Programming Language (by nelson-lang)

TileDB-VCF

4 80 8.6 C++

Efficient variant-call data storage and retrieval library using the TileDB storage library.

twinning

1 24 0.0 C++

Data Twinning

MachineLearning

6 17 6.9 C++

From linear regression towards neural networks... (by aromanro)

Project mention: Get gradient of Softmax activation | /r/learnmachinelearning | 2023-07-12

Softmax is at the end of this source file: https://github.com/aromanro/MachineLearning/blob/master/MachineLearning/MachineLearning/ActivationFunctions.h
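The linked header is not reproduced here. As an independent sketch of the standard result, for softmax outputs s = softmax(z) the Jacobian is ds_i/dz_j = s_i (δ_ij − s_j), so an upstream gradient g = dL/ds maps to dL/dz_j = s_j (g_j − Σ_i g_i s_i). The function names below are hypothetical and do not come from the referenced repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtract the largest logit before exponentiating.
std::vector<double> softmax(const std::vector<double>& z) {
    assert(!z.empty());
    const double zmax = *std::max_element(z.begin(), z.end());
    std::vector<double> s(z.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        s[i] = std::exp(z[i] - zmax);
        sum += s[i];
    }
    for (double& v : s) {
        v /= sum;
    }
    return s;
}

// Full Jacobian: J[i][j] = d s_i / d z_j = s_i * (delta_ij - s_j).
std::vector<std::vector<double>> softmax_jacobian(const std::vector<double>& z) {
    const std::vector<double> s = softmax(z);
    const std::size_t n = s.size();
    std::vector<std::vector<double>> J(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double delta = (i == j) ? 1.0 : 0.0;
            J[i][j] = s[i] * (delta - s[j]);
        }
    }
    return J;
}

// Backward pass without forming the Jacobian:
// dL/dz_j = s_j * (g_j - sum_i g_i * s_i), where g = dL/ds.
std::vector<double> softmax_backward(const std::vector<double>& z,
                                     const std::vector<double>& g) {
    assert(z.size() == g.size());
    const std::vector<double> s = softmax(z);
    double dot = 0.0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        dot += g[i] * s[i];
    }
    std::vector<double> dz(s.size());
    for (std::size_t j = 0; j < s.size(); ++j) {
        dz[j] = s[j] * (g[j] - dot);
    }
    return dz;
}
```

The vector form in `softmax_backward` avoids the O(n²) Jacobian, which is usually preferable inside a training loop.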

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

C++ Data Science related posts

DB Pilot: Query Postgres, files, S3 and more – all at once, from your laptop

1 project | news.ycombinator.com | 27 Oct 2023

Is ClickHouse Moving Away from Open Source?

6 projects | news.ycombinator.com | 22 Sep 2023

ChDB: An Embedded OLAP SQL Engine Powered by ClickHouse

1 project | news.ycombinator.com | 5 Sep 2023

ChDB: Embedded OLAP SQL Engine Powered by ClickHouse

1 project | news.ycombinator.com | 13 Aug 2023

PRQL, Pipelined Relational Query Language

16 projects | news.ycombinator.com | 25 Jul 2023

Get gradient of Softmax activation

1 project | /r/learnmachinelearning | 12 Jul 2023

How to learn Linear Regression

2 projects | /r/learnmachinelearning | 10 May 2023

Index

What are some of the best open-source Data Science projects in C++? This list will help you:

#	Project	Stars
1	cudf	7,311
2	matplotplusplus	3,949
3	GraphScope	3,109
4	SHOGUN	3,008
5	DataFrame	2,280
6	TileDB	1,771
7	chdb	1,726
8	ArcticDB	1,123
9	MLPP	1,054
10	turbodbc	604
11	oneDAL	593
12	GPBoost	513
13	desbordante-core	354
14	Graphia	222
15	Tiger	107
16	secure-xgboost	101
17	nelson	86
18	TileDB-VCF	80
19	twinning	24
20	MachineLearning	17
