C++ Data Science

Open-source C++ projects categorized as Data Science

Top 20 C++ Data Science Projects

Data Science
  • cudf

    cuDF - GPU DataFrame Library

    Project mention: Unleashing GPU Power: Supercharge Your Data Processing with cuDF | dev.to | 2024-06-21

    cuDF Documentation

  • matplotplusplus

    Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

    Project mention: Creating k-NN with C++ (from Scratch) | dev.to | 2024-01-11

    cmake_minimum_required(VERSION 3.5)
    project(knn_cpp CXX)

    # Set up C++ version and properties
    include(CheckIncludeFileCXX)
    check_include_file_cxx(any HAS_ANY)
    check_include_file_cxx(string_view HAS_STRING_VIEW)
    check_include_file_cxx(coroutine HAS_COROUTINE)
    set(CMAKE_CXX_STANDARD 20)
    set(CMAKE_BUILD_TYPE Debug)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    set(CMAKE_CXX_EXTENSIONS OFF)

    # Copy data file to build directory
    file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

    # Download library using FetchContent
    include(FetchContent)
    FetchContent_Declare(matplotplusplus
        GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus
        GIT_TAG origin/master)
    FetchContent_GetProperties(matplotplusplus)
    if(NOT matplotplusplus_POPULATED)
        FetchContent_Populate(matplotplusplus)
        add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL)
    endif()

    FetchContent_Declare(
        fmt
        GIT_REPOSITORY https://github.com/fmtlib/fmt.git
        GIT_TAG 7.1.3 # Adjust the version as needed
    )
    FetchContent_MakeAvailable(fmt)

    # Add executable and link project libraries and folders
    add_executable(${PROJECT_NAME} main.cc)
    target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt)
    aux_source_directory(lib LIB_SRC)
    target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
    target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC})
    add_subdirectory(tests)

  • GraphScope

    🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba



  • DataFrame

    C++ DataFrame for statistical, financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

    Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
  • chdb

    chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse

    Project mention: Declarative Multi-Engine Data Stack with Ibis | dev.to | 2024-07-17

    Offloading part of a SQL query to serverless engines with DuckDB, polars, DataFusion, chdb etc.

  • TileDB

    The Universal Storage Engine

    Project mention: Ask HN: Who is hiring? (May 2024) | news.ycombinator.com | 2024-05-01

    TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | [https://tiledb.com](https://tiledb.com/)

    TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.

    TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more ([github.com/TileDB-Inc/TileDB](http://github.com/TileDB-Inc/TileDB))

    With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.

    Website: [https://tiledb.com](https://tiledb.com/) | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog

    We are actively hiring for several roles including:

    - Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)

  • ArcticDB

    ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

    Project mention: Speed Test - ArcticDB, HDF, Feather, Parquet | /r/algotrading | 2023-11-21

    ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it would compare on speed versus the other file format storage options available in Pandas: HDF, Feather, and Parquet. I could not find much on-line about how Arctic compares to the other options in terms of speed, so I ran some tests myself.

  • MLPP

    A library created to revitalize C++ as a machine learning front end. Per aspera ad astra.

  • turbodbc

    Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

  • oneDAL

    oneAPI Data Analytics Library (oneDAL)

  • GPBoost

    Combining tree-boosting with Gaussian process and mixed effects models

  • desbordante-core

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data-cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

    Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11
  • Graphia

    A visualisation tool for the creation and analysis of graphs

    Project mention: NetworkX – Network Analysis in Python | news.ycombinator.com | 2023-12-08

    Export the graph to GML or to GraphML or to GraphViz DOT or to some other Graph format. BTW I recommend 3D graph visualization over 2D when possible, that is when you're exploring interactively as opposed to printing figures. The Graphia tool is the only FOSS tool for this purpose that I know of:



  • Tiger

    C++ Matrix -- High performance and accurate (e.g. edge cases) matrix math library with expression template arithmetic operators (by hosseinmoein)

  • secure-xgboost

    Secure collaborative training and inference for XGBoost.

  • nelson

    The Nelson Programming Language (by nelson-lang)

  • TileDB-VCF

    Efficient variant-call data storage and retrieval library using the TileDB storage library.

  • twinning

    Data Twinning

  • MachineLearning

    From linear regression towards neural networks... (by aromanro)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).



What are some of the best open-source Data Science projects in C++? This list will help you:

Rank Project Stars
1 cudf 8,042
2 matplotplusplus 4,116
3 GraphScope 3,194
4 SHOGUN 3,023
5 DataFrame 2,373
6 chdb 1,886
7 TileDB 1,814
8 ArcticDB 1,247
9 MLPP 1,070
10 turbodbc 607
11 oneDAL 606
12 GPBoost 535
13 desbordante-core 367
14 Graphia 231
15 Tiger 111
16 secure-xgboost 102
17 nelson 90
18 TileDB-VCF 83
19 twinning 24
20 MachineLearning 18


Did you know that C++ is the 6th most popular programming language based on number of mentions?