Top 23 Python Science and Data analysis Projects
-
Pandas
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Libraries for data science and deep learning that are always changing
-
Project mention: How to Get Started with Scikit-Learn: A Beginner-Friendly Guide to Machine Learning in Python | dev.to | 2025-04-24
Like most Python libraries, it is open source and free to use, which makes it accessible to anyone who wants to learn machine learning. It is built on other open-source Python libraries: SciPy for advanced scientific operations, NumPy for efficient numerical computations, Matplotlib for data visualization, and Cython for speed closer to that of C/C++.
-
If you are interested in the subject, also take a look at NetworkDisk [1], which enables users of NetworkX [2] to map graphs to databases.
[1] https://networkdisk.inria.fr/
[2] https://networkx.org/
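To make "mapping a graph to a database" concrete, here is a minimal sketch using only Python's standard-library `sqlite3` module. It is not NetworkDisk's API or storage layout; the `edges` table and the `add_edge` and `neighbors` helpers are hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory database for the sketch; a file path would persist the graph.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")


def add_edge(u, v):
    """Store an undirected edge as two directed rows (hypothetical helper)."""
    conn.executemany("INSERT INTO edges VALUES (?, ?)", [(u, v), (v, u)])


def neighbors(u):
    """Return the neighbors of node u, sorted by name (hypothetical helper)."""
    rows = conn.execute(
        "SELECT target FROM edges WHERE source = ? ORDER BY target", (u,)
    )
    return [row[0] for row in rows]


add_edge("a", "b")
add_edge("a", "c")
print(neighbors("a"))  # ['b', 'c']
```

The point of the design is that the graph lives in the database rather than in process memory, so queries such as "neighbors of a" become SQL lookups.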
-
[2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
-
Project mention: Mathics 7.0 – Open-source alternative to Mathematica | news.ycombinator.com | 2024-12-07
It's an interesting exercise to think about why the performance of Sum[i, {i, 1, 100000}] differs between Mathics and MMA: Mathics just calls down to sympy, which I think just does the sum in Python [1]; Mathematica (likely) pattern-matches and computes the 100000th triangular number directly, since I know Mathematica relies heavily on standard tables of summations/integrals/etc.
[1] https://github.com/sympy/sympy/blob/master/sympy/concrete/su....
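The two strategies the comment speculates about can be compared in plain Python. This is only a sketch of the idea, not what SymPy or Mathematica actually do internally; the function names are made up for illustration.

```python
def sum_by_iteration(n: int) -> int:
    """Add 1 + 2 + ... + n term by term, which takes n additions."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_by_closed_form(n: int) -> int:
    """Return the n-th triangular number n(n+1)/2 with one multiplication."""
    return n * (n + 1) // 2


n = 100_000
assert sum_by_iteration(n) == sum_by_closed_form(n)
print(sum_by_closed_form(n))  # 5000050000
```

Both give the same answer, but the loop does work proportional to n, while the closed form costs the same for any n.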
-
From what I've seen, there are sort of two paths. I'll provide a well known example from each.
1. Language-specific distributed task library
For example, in Python, Celery is a popular task system. If you (the dev) are writing all the code and running the workflows yourself, it might work well for you. You build the core code and functions, and it handles the processing and resource management with a little config.
* https://github.com/celery/celery
Or lower level:
* https://github.com/dask/dask
2. DAG Workflow systems
There are also whole systems for what you're describing. They've become especially popular in the MLOps and data engineering world. A common one is Airflow:
* https://github.com/apache/airflow
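The core idea behind a DAG workflow system, running each task only after the tasks it depends on, can be sketched with the standard-library `graphlib` module (Python 3.9+). This does not use Celery, Dask, or Airflow APIs; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}


def run_pipeline(deps, tasks):
    """Run each task once, only after all of its dependencies have run."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        order.append(name)
    return order


results = []
# Each stand-in task just records its own name when called.
tasks = {name: (lambda n=name: results.append(n)) for name in dependencies}
order = run_pipeline(dependencies, tasks)
print(order)  # ['extract', 'clean', 'train', 'report']
```

Real workflow systems add scheduling, retries, and distributed execution on top of this ordering step.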
-
statsmodels is the closest thing in Python to R. statsmodels has mixed model support, but mgcv apparently requires more. It is well above my pay grade, but this seems relevant: https://github.com/statsmodels/statsmodels/issues/8029 (i.e. no out-of-the-box support, though you might be able to build an approximation on your own).
-
Have you heard of JIT libraries like Numba (https://github.com/numba/numba)? It doesn't work for all Python code, but it can help with the type of function you gave as an example. There's no need to rewrite anything; just add a decorator to the function. I don't know how its performance compares to C, for example.
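As a rough sketch of the "just add a decorator" point, the example below applies Numba's `njit` decorator to a simple numeric loop. The function `sum_of_squares` is a made-up example, and the fallback decorator is an assumption added so the snippet still runs (without any speed-up) when Numba is not installed.

```python
try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    # Fallback so the sketch still runs without Numba (no speed-up).
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Numeric loop over integers, the kind of code a JIT can compile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(1_000))  # 332833500
```

The function body is unchanged from what you would write in plain Python; only the decorator line differs.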
-
BigDL
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.
Project mention: FlashMoE: DeepSeek-R1 671B and Qwen3MoE 235B with 1~2 Intel B580 GPU in IPEX-LLM | news.ycombinator.com | 2025-05-12
-
One could be a project for accuracy. By integrating physical models and drawing inspiration from established projects like Skyfield or Astropy, this project could focus on providing the most accurate and performant results possible in Ruby. Contributors could help optimise the code, run benchmarks, and cover as many use cases as possible.
-
I also like contributing specifically to my field. As a PhD student and possibly future scientist, I have a vested interest in the quality of the software in my field, specifically structural bioinformatics. I use several tools in this field and often find areas that can be improved, both for myself and for others. As an example, consider this minor documentation change I added to the Biopython documentation.
-
statsforecast – Forecasting with statistical and econometric models
-
fugue
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
-
bcbio-nextgen
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
Python Science and Data analysis discussion
Python Science and Data analysis related posts
-
FlashMoE: DeepSeek-R1 671B and Qwen3MoE 235B with 1~2 Intel B580 GPU in IPEX-LLM
-
Why Momentum Works (2017)
-
How to Get Started with Scikit-Learn: A Beginner-Friendly Guide to Machine Learning in Python
-
How to import sample data into a Python notebook on watsonx.ai and other questions…
-
Statsforecast: Fast Python forecasting with statistical and econometric models
-
DeepSeek R1 671B Q4_K_M with 1~2 Arc A770 on Xeon
-
MacBook Air M4
-
A note from our sponsor - InfluxDB
www.influxdata.com | 19 May 2025
Index
What are some of the best open-source Science and Data analysis projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Pandas | 45,442 |
2 | NumPy | 29,483 |
3 | NetworkX | 15,736 |
4 | pygwalker | 14,796 |
5 | SciPy | 13,679 |
6 | SymPy | 13,627 |
7 | Dask | 13,203 |
8 | statsmodels | 10,663 |
9 | Numba | 10,421 |
10 | PyMC | 9,014 |
11 | BigDL | 7,877 |
12 | orange | 5,185 |
13 | astropy | 4,686 |
14 | Biopython | 4,588 |
15 | statsforecast | 4,356 |
16 | blaze | 3,197 |
17 | fugue | 2,081 |
18 | Cubes | 1,484 |
19 | bcbio-nextgen | 1,005 |
20 | NIPY | 781 |
21 | Neupy | 737 |
22 | bccb | 622 |
23 | Bubbles | 452 |