C++ Data Science

Open-source C++ projects categorized as Data Science

Top 20 C++ Data Science Projects

  • cudf

    cuDF - GPU DataFrame Library

  • Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

    The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

  • matplotplusplus

    Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

  • Project mention: Creating k-NN with C++ (from Scratch) | dev.to | 2024-01-11

    cmake_minimum_required(VERSION 3.5)
    project(knn_cpp CXX)

    # Set up C++ version and properties
    include(CheckIncludeFileCXX)
    check_include_file_cxx(any HAS_ANY)
    check_include_file_cxx(string_view HAS_STRING_VIEW)
    check_include_file_cxx(coroutine HAS_COROUTINE)
    set(CMAKE_CXX_STANDARD 20)
    set(CMAKE_BUILD_TYPE Debug)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    set(CMAKE_CXX_EXTENSIONS OFF)

    # Copy data file to build directory
    file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

    # Download library using FetchContent
    include(FetchContent)
    FetchContent_Declare(matplotplusplus
        GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus
        GIT_TAG origin/master)
    FetchContent_GetProperties(matplotplusplus)
    if(NOT matplotplusplus_POPULATED)
        FetchContent_Populate(matplotplusplus)
        add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL)
    endif()

    FetchContent_Declare(
        fmt
        GIT_REPOSITORY https://github.com/fmtlib/fmt.git
        GIT_TAG 7.1.3 # Adjust the version as needed
    )
    FetchContent_MakeAvailable(fmt)

    # Add executable and link project libraries and folders
    add_executable(${PROJECT_NAME} main.cc)
    target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt)
    aux_source_directory(lib LIB_SRC)
    target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
    target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC})
    add_subdirectory(tests)

  • GraphScope

    🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba | 一站式图计算系统

  • Project mention: Show HN: Graphlearn-for-PyTorch, distributed graph learning on PyTorch | news.ycombinator.com | 2023-05-15

    Optimizing distributed sampling and feature lookup looks really attractive. It's really challenging to deploy GNN training at an industrial-scale for a large graph.

    Will GLT be part of graphscope[1] and replacing the current graphscope-for-learning implementation?

    [1]: https://github.com/alibaba/GraphScope

  • SHOGUN

    Shōgun

  • DataFrame

    C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

  • Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
  • TileDB

    The Universal Storage Engine

  • Project mention: Ask HN: Who is hiring? (May 2024) | news.ycombinator.com | 2024-05-01

    TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | https://tiledb.com

    TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.

    TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more (github.com/TileDB-Inc/TileDB).

    With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.

    Website: https://tiledb.com | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog

    We are actively hiring for several roles including:

    - Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)

  • chdb

    chDB is an embedded OLAP SQL Engine 🚀 powered by ClickHouse

  • Project mention: FLaNK Stack Weekly 06 Nov 2023 | dev.to | 2023-11-06
  • ArcticDB

    ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

  • Project mention: Speed Test - ArcticDB, HDF, Feather, Parquet | /r/algotrading | 2023-11-21

    ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it would compare on speed versus the other file format storage options available in Pandas: HDF, Feather, and Parquet. I could not find much on-line about how Arctic compares to the other options in terms of speed, so I ran some tests myself.

  • MLPP

    A library created to revitalize C++ as a machine learning front end. Per aspera ad astra.

  • turbodbc

    Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

  • oneDAL

    oneAPI Data Analytics Library (oneDAL)

  • GPBoost

    Combining tree-boosting with Gaussian process and mixed effects models

  • desbordante-core

    Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It can also run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

  • Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11
  • Graphia

    A visualisation tool for the creation and analysis of graphs

  • Project mention: NetworkX – Network Analysis in Python | news.ycombinator.com | 2023-12-08

    Export the graph to GML or to GraphML or to GraphViz DOT or to some other Graph format. BTW I recommend 3D graph visualization over 2D when possible, that is when you're exploring interactively as opposed to printing figures. The Graphia tool is the only FOSS tool for this purpose that I know of:

    https://graphia.app

    https://github.com/graphia-app/graphia

  • Tiger

    C++ Matrix -- High performance and accurate (e.g. edge cases) matrix math library with expression template arithmetic operators (by hosseinmoein)

  • secure-xgboost

    Secure collaborative training and inference for XGBoost.

  • nelson

    The Nelson Programming Language (by nelson-lang)

  • TileDB-VCF

    Efficient variant-call data storage and retrieval library using the TileDB storage library.

  • twinning

    Data Twinning

  • MachineLearning

    From linear regression towards neural networks... (by aromanro)

  • Project mention: Get gradient of Softmax activation | /r/learnmachinelearning | 2023-07-12

    Softmax is at the end of this source file: https://github.com/aromanro/MachineLearning/blob/master/MachineLearning/MachineLearning/ActivationFunctions.h
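
    For readers looking for the gradient itself, here is a minimal standalone sketch of softmax and its Jacobian in C++. It is not taken from the aromanro/MachineLearning repository; the function names `softmax` and `softmax_jacobian` are hypothetical and chosen only for illustration. It uses the standard identity that the derivative of output i with respect to input j is s_i * (delta_ij - s_j).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the linked repository).
// Numerically stable softmax: subtract the maximum before exponentiating.
std::vector<double> softmax(const std::vector<double>& z) {
    assert(!z.empty());
    const double zmax = *std::max_element(z.begin(), z.end());
    std::vector<double> s(z.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        s[i] = std::exp(z[i] - zmax);
        sum += s[i];
    }
    for (double& v : s) {
        v /= sum;
    }
    return s;
}

// Hypothetical helper (not from the linked repository).
// Jacobian of softmax: J[i][j] = d s_i / d z_j = s_i * (delta_ij - s_j).
std::vector<std::vector<double>> softmax_jacobian(const std::vector<double>& z) {
    const std::vector<double> s = softmax(z);
    const std::size_t n = s.size();
    std::vector<std::vector<double>> J(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double delta = (i == j) ? 1.0 : 0.0;
            J[i][j] = s[i] * (delta - s[j]);
        }
    }
    return J;
}
```

    Calling `softmax_jacobian({1.0, 2.0, 3.0})` returns a 3x3 matrix whose rows each sum to zero, because the softmax outputs always sum to one. When softmax is paired with a cross-entropy loss, implementations often skip the full Jacobian and use the simplified gradient (prediction minus target) instead.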

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Data Science projects in C++? This list will help you:

Rank  Project           Stars
1     cudf              7,311
2     matplotplusplus   3,949
3     GraphScope        3,109
4     SHOGUN            3,008
5     DataFrame         2,280
6     TileDB            1,771
7     chdb              1,726
8     ArcticDB          1,123
9     MLPP              1,054
10    turbodbc          604
11    oneDAL            593
12    GPBoost           513
13    desbordante-core  354
14    Graphia           222
15    Tiger             107
16    secure-xgboost    101
17    nelson            86
18    TileDB-VCF        80
19    twinning          24
20    MachineLearning   17
