C++ Data Science

Open-source C++ projects categorized as Data Science

Top 17 C++ Data Science Projects

  • cudf

    cuDF - GPU DataFrame Library

    Project mention: Introducing TeaScript C++ Library | reddit.com/r/cpp | 2023-02-16

    Yes, sure, that is how OpenMP does it; but on the other hand, you seem to already do some basic type inference and build an AST, no? Then you also know the size and type of your vectors, and can execute actions in parallel if there is enough data to be worth parallelizing. Is there anyone who doesn't want their code to execute faster if it is possible? Those who work in the big data domain use threads and vectorized instructions without the user having to type in any directive; they just import a different library. Examples are numpy, numpy with a CUDA backend, or similar GPU-accelerated libraries like cudf.

  • matplotplusplus

    Matplot++: A C++ Graphics Library for Data Visualization 📊🗾

    Project mention: Best Library to Visualize Mathematical Concepts | reddit.com/r/cpp_questions | 2023-03-02

    The best way to visualize most mathematical concepts is by plotting a 2D graph. To do that you can use e.g. Matplot++

  • SHOGUN

    Shōgun

    Project mention: Changing std::sort at Google’s Scale and Beyond | news.ycombinator.com | 2022-04-20

    The function is trying to get the median, which is not defined for an empty set. With this particular implementation, there is an assert for that:

    https://github.com/shogun-toolbox/shogun/blob/9b8d85/src/sho...

    Unrelatedly, but from the same section:

    > Fixes are trivial, access the nth element only after the call being made. Be careful.

    Wouldn't the proper fix be to do the nth_element for the larger element first (for those cases that don't do that already) and then adjust the end to be the begin + larger_n for the second nth_element call? Otherwise the second call will check [begin + larger_n, end) again for no reason at all.
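
    The ordering suggested in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from Shogun or from the linked article: the helper name `two_order_statistics`, its signature, and the use of `double` values are hypothetical. It partitions around the larger index first, then restricts the second `std::nth_element` call to the prefix before that index, and it guards against the empty-input case with an assert.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of any library mentioned above).
// Returns the k1-th and k2-th smallest values of v (0-based indices),
// assuming k1 <= k2 < v.size(). The vector is taken by value so the
// caller's data is left untouched.
std::pair<double, double> two_order_statistics(std::vector<double> v,
                                               std::size_t k1,
                                               std::size_t k2) {
    // Order statistics are undefined for an empty range.
    assert(!v.empty() && k1 <= k2 && k2 < v.size());

    // Step 1: partition around the larger index. Afterwards v[k2] holds the
    // k2-th smallest value and every element before it is <= v[k2].
    std::nth_element(v.begin(), v.begin() + k2, v.end());
    const double upper = v[k2];

    // Step 2: the k1-th smallest value must lie in [begin, begin + k2),
    // so the second call never revisits [begin + k2, end).
    if (k1 < k2) {
        std::nth_element(v.begin(), v.begin() + k1, v.begin() + k2);
    }
    return {v[k1], upper};
}
```

    Both values are read only after the corresponding `std::nth_element` call has returned, which is the point of the quoted advice. For an even-sized median, one would call this with `k1 = n / 2 - 1` and `k2 = n / 2` and average the two results.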

  • DataFrame

    C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

    Project mention: DataFrame: NEW Data - star count:1772.0 | reddit.com/r/algoprojects | 2023-03-11

  • TileDB

    The Universal Storage Engine

    Project mention: Ask HN: Who is hiring? (December 2022) | news.ycombinator.com | 2022-12-01

    TileDB, Inc. | Full-Time | REMOTE | USA | Greece | https://tiledb.com

    TileDB transforms the lives of analytics professionals and data scientists with a universal database, allowing them to access, analyze, and share any data with any tool at global scale. TileDB unifies the way we think about data, delivering superior performance and foundational data management capabilities. All data — tables, genomics, images, videos, location, time-series — across multiple domains is captured as multi-dimensional arrays. TileDB offers extreme interoperability via numerous APIs and tool integrations across the data science ecosystem, eliminating the hassles and inefficiencies of data conversion. TileDB Cloud implements a totally serverless infrastructure and delivers access control, easier data and code sharing and distributed computing at global scale, eliminating cluster management, minimizing TCO and promoting scientific collaboration and reproducibility.

    TileDB, Inc. was spun out of MIT and Intel Labs in May 2017 and is backed by Two Bear Capital, Nexus Venture Partners, Uncorrelated Ventures, Intel Capital and Big Pi.

    Recent HN article: https://news.ycombinator.com/item?id=23896131

    Website: https://tiledb.com

    GitHub: https://github.com/TileDB-Inc/TileDB

    Docs: https://docs.tiledb.com

    Blog: https://tiledb.com/blog

    Our headquarters are located in Cambridge, MA and we have a subsidiary in Athens, Greece. We offer the ability to work remotely. If you are located outside of the USA and Greece we have options to accommodate this, don't hesitate to apply!

    We have several open positions aimed at increasing TileDB’s feature set, growth and adoption. You will have the opportunity to work on innovative technology that creates impact on challenging and exciting problems in Genomics, Geospatial, Time Series, and more. Immediate features on the roadmap for TileDB Cloud include advanced distributed computations, advanced computation pushdown, improved multi-cloud deployments and more.

    We are actively seeking:

    - Senior Golang Engineer

    - Senior Python Engineer

    - Site Reliability Engineer

    - React Frontend Engineer

    Apply today at https://tiledb.workable.com !

  • MLPP

    A library created to revitalize C++ as a machine learning front end. Per aspera ad astra.

  • turbodbc

    Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

    Project mention: Arrowdantic 0.1.0 released | reddit.com/r/Python | 2022-04-16

    It supports reading from and writing to ODBC compliant databases at likely similar performance as turbodbc and it does not require conda to install.

  • oneDAL

    oneAPI Data Analytics Library (oneDAL)

    Project mention: Is there a no-compromise (presumably C/C++) platform similar to Apache Spark? | reddit.com/r/dataengineering | 2022-07-27

  • GPBoost

    Combining tree-boosting with Gaussian process and mixed effects models

  • Graphia

    A visualisation tool for the creation and analysis of graphs

  • secure-xgboost

    Secure collaborative training and inference for XGBoost.

  • Matrix

    C++ Matrix -- High performance and accurate (e.g. edge cases) matrix math library with expression template arithmetic operators (by hosseinmoein)

    Project mention: Update on C++ Algo Trading/ Data Analysis tool | reddit.com/r/algotrading | 2023-02-05

    Yes, I have. As a matter of fact, I have another open-source project (https://github.com/hosseinmoein/Matrix) that uses this technique.

  • TileDB-VCF

    Efficient variant-call data storage and retrieval library using the TileDB storage library.

    Project mention: Has anyone stored/queried VCFs and their variant records in a relational database? | reddit.com/r/bioinformatics | 2022-11-12

    Perhaps of interest https://github.com/TileDB-Inc/TileDB-VCF

  • nelson

    Nelson numerical interpreter

    Project mention: Nelson Numerical Software | news.ycombinator.com | 2023-02-15

  • Desbordante

    Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

    Project mention: Desbordante – an open-source data profiling tool | news.ycombinator.com | 2023-02-20

  • twinning

    Data Twinning

  • MachineLearning

    From linear regression towards neural networks... (by aromanro)

    Project mention: Learn how ChatGPT and neural networks work | reddit.com/r/programare | 2023-02-06

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-03-11.

Index

What are some of the best open-source Data Science projects in C++? This list will help you:

#   Project           Stars
1   cudf              5,386
2   matplotplusplus   3,165
3   SHOGUN            2,921
4   DataFrame         1,786
5   TileDB            1,475
6   MLPP              1,032
7   turbodbc            563
8   oneDAL              545
9   GPBoost             391
10  Graphia             168
11  secure-xgboost       93
12  Matrix               77
13  TileDB-VCF           62
14  nelson               60
15  Desbordante          39
16  twinning             23
17  MachineLearning       4