Top 20 C++ Data Science Projects
-
GraphScope
🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba
-
DataFrame
C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
-
ArcticDB
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
-
turbodbc
Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
-
desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data-cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.
-
Tiger
C++ Matrix -- High performance and accurate (e.g. edge cases) matrix math library with expression template arithmetic operators (by hosseinmoein)
-
TileDB-VCF
Efficient variant-call data storage and retrieval library using the TileDB storage library.
The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, as Dask, cuDF, and Modin do; instead it has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation process and its “lightning fast” performance.

    # FetchContent_MakeAvailable requires CMake 3.14 or newer
    cmake_minimum_required(VERSION 3.14)
    project(knn_cpp CXX)

    # Set up C++ version and properties
    include(CheckIncludeFileCXX)
    check_include_file_cxx(any HAS_ANY)
    check_include_file_cxx(string_view HAS_STRING_VIEW)
    check_include_file_cxx(coroutine HAS_COROUTINE)
    set(CMAKE_CXX_STANDARD 20)
    set(CMAKE_BUILD_TYPE Debug)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    set(CMAKE_CXX_EXTENSIONS OFF)

    # Copy data file to build directory
    file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/iris.data DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

    # Download libraries using FetchContent
    include(FetchContent)
    FetchContent_Declare(matplotplusplus
        GIT_REPOSITORY https://github.com/alandefreitas/matplotplusplus
        GIT_TAG origin/master)
    FetchContent_GetProperties(matplotplusplus)
    if(NOT matplotplusplus_POPULATED)
        FetchContent_Populate(matplotplusplus)
        add_subdirectory(${matplotplusplus_SOURCE_DIR} ${matplotplusplus_BINARY_DIR} EXCLUDE_FROM_ALL)
    endif()

    FetchContent_Declare(fmt
        GIT_REPOSITORY https://github.com/fmtlib/fmt.git
        GIT_TAG 7.1.3 # Adjust the version as needed
    )
    FetchContent_MakeAvailable(fmt)

    # Add executable and link project libraries and folders
    add_executable(${PROJECT_NAME} main.cc)
    target_link_libraries(${PROJECT_NAME} PUBLIC matplot fmt::fmt)
    aux_source_directory(lib LIB_SRC)
    target_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
    target_sources(${PROJECT_NAME} PRIVATE ${LIB_SRC})
    add_subdirectory(tests)

Project mention: Show HN: Graphlearn-for-PyTorch, distributed graph learning on PyTorch | news.ycombinator.com | 2023-05-15
Optimizing distributed sampling and feature lookup looks really attractive. It's really challenging to deploy GNN training at industrial scale for a large graph.
Will GLT become part of graphscope[1] and replace the current graphscope-for-learning implementation?
[1]: https://github.com/alibaba/GraphScope
Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
TileDB, Inc. | Full-Time | REMOTE | USA, Greece/EU | [https://tiledb.com](https://tiledb.com/)
TileDB has recently announced a $34 million Series B fund-raise and is actively hiring for engineers across a range of roles (SRE, backend/distributed systems, database internals, and more). You will have the opportunity to work on innovative technology that creates impact for challenging problems in genomics, geospatial, machine learning, distributed systems, and many other areas.
TileDB Cloud is the modern database, allowing developers and scientists to capture, analyze, and share any data with any tool. We build on a broad foundation of open source, maintaining the TileDB storage engine, libraries for genomics (single-cell and population), geospatial (raster, point clouds, and more), a TileDB visualization engine extending Babylon.js, and much more ([github.com/TileDB-Inc/TileDB](http://github.com/TileDB-Inc/TileDB))
With TileDB, all data — tables, genomics, images, videos, location, time-series — is captured as multi-dimensional arrays. To supercharge this data, TileDB Cloud implements a serverless infrastructure delivering query execution, access control, data and code sharing, and distributed computing at global scale — eliminating cluster management, minimizing TCO, and promoting scientific collaboration and reproducibility.
Website: [https://tiledb.com](https://tiledb.com/) | GitHub: https://github.com/TileDB-Inc/TileDB | Blog: https://tiledb.com/blog
We are actively hiring for several roles including:
- Site Reliability Engineer (k8s, Terraform, automation, Prometheus, CloudWatch, GitOps; Golang, Python)
ArcticDB is a new data store for pandas DataFrames (https://arcticdb.io/). I have no affiliation with the project but wanted to see how it compares on speed with the other file-format storage options available in pandas: HDF, Feather, and Parquet. I could not find much online about how ArcticDB compares to the other options in terms of speed, so I ran some tests myself.
Export the graph to GML, GraphML, GraphViz DOT, or some other graph format. BTW, I recommend 3D graph visualization over 2D when possible, that is, when you're exploring interactively as opposed to printing figures. Graphia is the only FOSS tool for this purpose that I know of:
https://graphia.app
https://github.com/graphia-app/graphia
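As a rough illustration of the export step mentioned above, here is a minimal C++17 sketch that serializes an undirected edge list to GraphViz DOT text. The names `Edge`, `quote_dot_id`, and `to_dot` are hypothetical helpers invented for this example; they are not part of Graphia or any other project listed here, and a real exporter would likely also handle node and edge attributes.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical edge type for this sketch: a pair of node names.
using Edge = std::pair<std::string, std::string>;

// Wrap an identifier in double quotes, escaping embedded double quotes,
// which is the one escape a DOT quoted string needs.
std::string quote_dot_id(const std::string& id) {
    std::string out = "\"";
    for (char c : id) {
        if (c == '"') out += '\\';
        out += c;
    }
    out += '"';
    return out;
}

// Serialize an undirected edge list as a DOT "graph" block.
std::string to_dot(const std::vector<Edge>& edges,
                   const std::string& graph_name = "G") {
    std::ostringstream dot;
    dot << "graph " << quote_dot_id(graph_name) << " {\n";
    for (const auto& [from, to] : edges) {
        dot << "  " << quote_dot_id(from) << " -- " << quote_dot_id(to) << ";\n";
    }
    dot << "}\n";
    return dot.str();
}
```

Writing the returned string to a `.dot` file gives something a DOT-aware viewer should be able to open; for directed graphs the keyword would be `digraph` and the edge operator `->`.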
Softmax is at the end of this source file: https://github.com/aromanro/MachineLearning/blob/master/MachineLearning/MachineLearning/ActivationFunctions.h
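For readers who want the idea without opening the linked file, the following is a minimal, independent C++ sketch of softmax and its Jacobian. It is not the code from the MachineLearning repository; the function names `softmax` and `softmax_jacobian` are assumptions made for this example. It uses the standard identity that, for s = softmax(z), the derivative of s_i with respect to z_j is s_i * (delta_ij - s_j).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtracting the largest logit before
// exponentiating avoids overflow and does not change the result.
std::vector<double> softmax(const std::vector<double>& logits) {
    assert(!logits.empty() && "softmax needs at least one logit");
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> out(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - max_logit);
        sum += out[i];
    }
    for (double& v : out) v /= sum;
    return out;
}

// Jacobian of softmax with respect to its logits, computed from the
// softmax output s: J[i][j] = s[i] * (delta_ij - s[j]).
std::vector<std::vector<double>> softmax_jacobian(const std::vector<double>& s) {
    const std::size_t n = s.size();
    std::vector<std::vector<double>> jac(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            jac[i][j] = s[i] * ((i == j ? 1.0 : 0.0) - s[j]);
        }
    }
    return jac;
}
```

To backpropagate through softmax, multiply the upstream gradient by this Jacobian. When softmax is paired with a cross-entropy loss, the combined gradient with respect to the logits simplifies to `s - target`, so the full Jacobian is often never formed in practice.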
C++ Data Science related posts
-
DB Pilot: Query Postgres, files, S3 and more – all at once, from your laptop
-
Is ClickHouse Moving Away from Open Source?
-
ChDB: An Embedded OLAP SQL Engine Powered by ClickHouse
-
ChDB: Embedded OLAP SQL Engine Powered by ClickHouse
-
PRQL, Pipelined Relational Query Language
-
Get gradient of Softmax activation
-
How to learn Linear Regression
Index
What are some of the best open-source Data Science projects in C++? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | cudf | 7,311 |
2 | matplotplusplus | 3,949 |
3 | GraphScope | 3,109 |
4 | SHOGUN | 3,008 |
5 | DataFrame | 2,280 |
6 | TileDB | 1,771 |
7 | chdb | 1,726 |
8 | ArcticDB | 1,123 |
9 | MLPP | 1,054 |
10 | turbodbc | 604 |
11 | oneDAL | 593 |
12 | GPBoost | 513 |
13 | desbordante-core | 354 |
14 | Graphia | 222 |
15 | Tiger | 107 |
16 | secure-xgboost | 101 |
17 | nelson | 86 |
18 | TileDB-VCF | 80 |
19 | twinning | 24 |
20 | MachineLearning | 17 |