Top 23 Science and Data analysis Open-Source Projects

Pandas

393 41,923 10.0 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Project mention: Deploying a Serverless Dash App with AWS SAM and Lambda | dev.to | 2024-03-04

Dash is a Python framework that enables you to build interactive frontend applications without writing a single line of Javascript. Internally and in projects we like to use it in order to build a quick proof of concept for data driven applications because of the nice integration with Plotly and pandas. For this post, I'm going to assume that you're already familiar with Dash and won't explain that part in detail. Instead, we'll focus on what's necessary to make it run serverless.

NumPy

272 26,360 10.0 Python

The fundamental package for scientific computing with Python.

Project mention: Dot vs Matrix vs Element-wise multiplication in PyTorch | dev.to | 2024-03-20

In NumPy with @, dot() or matmul():
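The excerpt above breaks off before its code. As a minimal illustrative sketch (not the snippet from the quoted post), the three spellings named there can be compared on small arrays. The array values and variable names below are made up for the example.

```python
import numpy as np

# Illustrative 2x2 matrices (values chosen arbitrarily for this sketch).
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Matrix multiplication: for 2-D arrays all three spellings agree.
matrix_product = a @ b
same_via_matmul = np.matmul(a, b)
same_via_dot = np.dot(a, b)

# Element-wise multiplication uses * (or np.multiply), not @.
elementwise = a * b

# For 1-D arrays, @ and np.dot() both give the scalar dot product.
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
dot_product = v @ w

print(matrix_product)
print(elementwise)
print(dot_product)
```

For arrays with more than two dimensions, np.dot() and np.matmul() follow different rules, so the equivalence shown here should only be assumed for the 1-D and 2-D cases.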

NetworkX

61 14,178 9.6 Python

Network Analysis in Python

Project mention: Routes to LANL from 186 sites on the Internet | news.ycombinator.com | 2024-03-04

Built from this data... https://github.com/networkx/networkx/blob/main/examples/grap...

SciPy

50 12,431 9.9 Python

SciPy library main repository

Project mention: What Is a Schur Decomposition? | news.ycombinator.com | 2024-03-04

I guess it is a rite of passage to rewrite it. I'm doing it for SciPy too together with Propack in [1]. Somebody already mentioned your repo. Thank you for your efforts.
[1]: https://github.com/scipy/scipy/issues/18566

SymPy

34 12,384 10.0 Python

A computer algebra system written in pure Python

Project mention: AutoCodeRover resolves 22% of real-world GitHub in SWE-bench lite | news.ycombinator.com | 2024-04-09

Thank you for your interest. There are some interesting examples in the SWE-bench-lite benchmark which are resolved by AutoCodeRover:
- From sympy: https://github.com/sympy/sympy/issues/13643. AutoCodeRover's patch for it: https://github.com/nus-apr/auto-code-rover/blob/main/results...
- Another one from scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/13070. AutoCodeRover's patch (https://github.com/nus-apr/auto-code-rover/blob/main/results...) modified a few lines below (compared to the developer patch) and wrote a different comment.
There are more examples in the results directory (https://github.com/nus-apr/auto-code-rover/tree/main/results).

Dask

32 11,999 9.6 Python

Parallel computing with task scheduling

Project mention: The Distributed Tensor Algebra Compiler (2022) | news.ycombinator.com | 2023-06-15

pygwalker

22 9,759 9.6 Python

PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15

statsmodels

8 9,534 9.4 Python

Statsmodels: statistical modeling and econometrics in Python

Numba

124 9,432 9.9 Python

NumPy aware dynamic Python compiler using LLVM

Project mention: Mojo🔥: Head-to-Head with Python and Numba | dev.to | 2023-09-27

Around the same time, I discovered Numba and was fascinated by how easily it could bring huge performance improvements to Python code.

PyMC

3 8,155 9.5 Python

Bayesian Modeling and Probabilistic Programming in Python

gonum

24 7,260 8.3 Go

Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more

Project mention: How to set up interface to accept multi-dimension array? | /r/golang | 2023-07-13

But if you want to see what can be done for numeric stuff, check out gonum. Personally, I still wouldn't use Go, and I rather suspect it's still pretty easy to reach for something like what you're trying to do and not find it because Go just can't write that type sensibly, but you can at least see what is available, written by people who disagree with me about Go not being a great language for this.

Zeppelin

8 6,263 8.7 Java

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

BigDL

5 5,910 9.9 Python

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, ModelScope, etc.

Project mention: LLaMA Now Goes Faster on CPUs | news.ycombinator.com | 2024-03-31

Any performance benchmark against intel's 'IPEX-LLM'[0] or others?
[0] - https://github.com/intel-analytics/ipex-llm

orange

27 4,604 9.6 Python

🍊 📊 💡 Orange: Interactive data analysis

Project mention: Hierarchical Clustering | news.ycombinator.com | 2024-04-20

I know I've tooted its horn before, but Orange3 is a pretty neat Python-based GUI platform that makes this and a metric buttload of other statistical/ML techniques available to non-programmer types.
Just watch out for the null character `\x00` in the corpus. That always seems to kill it stone dead.
https://orangedatamining.com/
https://orange3.readthedocs.io/projects/orange-visual-progra...

astropy

26 4,210 9.9 Python

Astronomy and astrophysics core library

Project mention: Julia 1.10 Released | news.ycombinator.com | 2023-12-27

Astropy [0] lives at the heart of most work. It has a Python interface, often backed by Fortran and C++ extension modules. If you use Astropy, you're indirectly using libraries like ERFA [6] and cfitsio [7] which are in C/Fortran.
I personally end up doing a lot of work that uses the HEALPix sky tessellation, so I use healpy [2] as well.
Openorb is perhaps a good example of a pure-Fortran package that I use quite frequently for orbit propagation [3].
In C, there's Rebound [4] (for N-body simulations) and ASSIST [5] (which extends Rebound to use JPL's pre-calculated positions of major perturbers, and expands the force model to account for general relativity).
There are many more, these are just ones that come to mind from frequent usage in the last few months.
[0] https://www.astropy.org/

Biopython

31 4,167 9.6 Python

Official git repository for Biopython (originally converted from CVS)

Project mention: Invitation to a project - Biopython in Spanish | /r/devsarg | 2023-07-23

Breeze

3 3,437 5.1 Scala

Breeze is a numerical processing library for Scala.

blaze

1 3,182 0.0 Python

NumPy and Pandas interface to Big Data

Project mention: Blaze: Fast query execution engine for Apache Spark | news.ycombinator.com | 2023-10-19

Unfortunate name overlap with an under-loved PyData project: https://blaze.pydata.org

Spark Notebook

0 3,147 0.0 JavaScript

Interactive and Reactive Data Science using Scala and Spark.

Stats

0 2,881 2.0 Go

A well tested and comprehensive Golang statistics library package with no dependencies. (by montanaflynn)

gonum/plot

8 2,631 5.4 Go

A repository for plotting and visualizing data (by gonum)

Project mention: The Golang Saga: A Coder’s Journey There and Back Again. Part 3: The Graphing Conundrum | dev.to | 2023-08-16

And with this map now we are ready to create a group bar chart for each station to find out which station is the best for each type of value. I found a helpful tutorial on gonum/plot, so I’m going to use plotter.NewBarChart for my purposes.

Interactive Parallel Computing with IPython

0 2,551 8.2 Jupyter Notebook

IPython Parallel: Interactive Parallel Computing in Python

RDKit

3 2,413 9.5 HTML

The official sources for the RDKit library

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Science and Data analysis related posts

Hierarchical Clustering
1 project | news.ycombinator.com | 20 Apr 2024
Orange Data Mining
1 project | news.ycombinator.com | 15 Apr 2024
AutoCodeRover resolves 22% of real-world GitHub in SWE-bench lite
8 projects | news.ycombinator.com | 9 Apr 2024
LLaMA Now Goes Faster on CPUs
16 projects | news.ycombinator.com | 31 Mar 2024
PyTorch Library for Running LLM on Intel CPU and GPU
1 project | news.ycombinator.com | 3 Apr 2024
The Graph of Wikipedia [video]
1 project | news.ycombinator.com | 1 Apr 2024
Dot vs Matrix vs Element-wise multiplication in PyTorch
2 projects | dev.to | 20 Mar 2024

Index

What are some of the best open-source Science and Data analysis projects? This list will help you:

	Project	Stars
1	Pandas	41,923
2	NumPy	26,360
3	NetworkX	14,178
4	SciPy	12,431
5	SymPy	12,384
6	Dask	11,999
7	pygwalker	9,759
8	statsmodels	9,534
9	Numba	9,432
10	PyMC	8,155
11	gonum	7,260
12	Zeppelin	6,263
13	BigDL	5,910
14	orange	4,604
15	astropy	4,210
16	Biopython	4,167
17	Breeze	3,437
18	blaze	3,182
19	Spark Notebook	3,147
20	Stats	2,881
21	gonum/plot	2,631
22	Interactive Parallel Computing with IPython	2,551
23	RDKit	2,413