C++ Data Science

Open-source C++ projects categorized as Data Science

Top 23 C++ Data Science Projects

  1. cudf

    cuDF - GPU DataFrame Library

  3. catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

    Project mention: 🚀 Why Your ML Service Needs Rust + CatBoost: A Setup Guide That Actually Works | dev.to | 2025-01-19

    [package]
    name = "MLApp"
    version = "0.1.0"
    edition = "2021"

    [dependencies]
    catboost = { git = "https://github.com/catboost/catboost", rev = "0bfdc35" }

  4. matplotplusplus

    Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

  5. GraphScope

    🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba

  6. SHOGUN

    Shōgun

  7. DataFrame

    C++ DataFrame for statistical, financial, and ML analysis in modern C++, using native types and contiguous memory storage

  8. chdb

    chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse

    Project mention: ClickHouse gets lazier (and faster): Introducing lazy materialization | news.ycombinator.com | 2025-04-22

    https://github.com/chdb-io/chdb/issues/101#issuecomment-2824...

    Ps. I work for ClickHouse

  10. ArcticDB

    ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

    Project mention: ArcticDB: High performance, serverless DataFrame database | news.ycombinator.com | 2024-09-06
  11. TileDB

    The Universal Storage Engine

    Project mention: Ask HN: Who is hiring? (February 2025) | news.ycombinator.com | 2025-02-03

    TileDB, Inc. | Full-time | REMOTE | USA, Greece | https://tiledb.com/

    TileDB is the database designed for discovery, built to organize, structure, and analyze any data. Our solutions for single-cell and population genomics are used by major pharmaceutical companies and research institutes, and power large public data collections such as the Cellxgene Discover Census. We are actively hiring for several roles building our unified data catalog, scalable computation, and interactive analysis platform.

    - Infrastructure Engineer: Kubernetes, Terraform, Argo, Grafana, Prometheus, CloudWatch, GitOps; Golang, Python, C++, or Rust (GMT -8/+4).

    - Frontend/UI developer: TypeScript, React; experience with high-performance/high-volume data and visualization applications (GMT -8/+1).

    We are fully remote, with optional co-working hubs in Cambridge, MA, New York, NY, and Athens, Greece. Apply today at https://ats.rippling.com/tiledb-careers/jobs or reach out directly (email in profile).

  12. MLPP

    A library created to revitalize C++ as a machine learning front end. Per aspera ad astra.

  13. vectordb

    Epsilla is a high performance Vector Database Management System

  14. turbodbc

    Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

  15. oneDAL

    oneAPI Data Analytics Library (oneDAL)

  16. GPBoost

    Combining tree-boosting with Gaussian process and mixed effects models

  17. desbordante-core

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

    Project mention: Show HN: Desbordante 2.3.0 is out, now supports macOS | news.ycombinator.com | 2025-02-04

    Desbordante, an open-source, high-performance data profiler that discovers and validates complex patterns in data, has released version 2.3.0. This update introduces two new patterns and adds support for macOS. Users can now install the Desbordante-core pip package on macOS via PyPI, compatible with CPython versions 3.8 through 3.13 and PyPy versions 3.7 through 3.10.

    Release notes are here: https://github.com/Desbordante/desbordante-core/releases/tag...

  18. Graphia

    A visualisation tool for the creation and analysis of graphs

  19. Tiger

    C++ Matrix: a high-performance, accurate (including edge cases) matrix math library with expression-template arithmetic operators (by hosseinmoein)

  20. nelson

    The Nelson Programming Language (by nelson-lang)

  21. secure-xgboost

    Secure collaborative training and inference for XGBoost.

  22. TileDB-VCF

    Efficient variant-call data storage and retrieval library using the TileDB storage library.

  23. MachineLearning

    From linear regression towards neural networks... (by aromanro)

  24. twinning

    Data Twinning

  25. lesser_pandas

    Data Analysis library in C++

    Project mention: Show HN: Lesser Pandas – Data Analysis Library in C++ | news.ycombinator.com | 2025-05-22
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Data Science projects in C++? This list will help you:

#   Project           Stars
1   cudf              9,033
2   catboost          8,464
3   matplotplusplus   4,622
4   GraphScope        3,457
5   SHOGUN            3,045
6   DataFrame         2,742
7   chdb              2,409
8   ArcticDB          1,977
9   TileDB            1,961
10  MLPP              1,097
11  vectordb            861
12  turbodbc            637
13  oneDAL              636
14  GPBoost             619
15  desbordante-core    407
16  Graphia             251
17  Tiger               121
18  nelson              107
19  secure-xgboost      105
20  TileDB-VCF           95
21  MachineLearning      25
22  twinning             24
23  lesser_pandas         8
