#Science and Data analysis

Open-source projects categorized as Science and Data analysis

Top 23 Science and Data analysis Open-Source Projects

  • GitHub repo Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: VBA vs. Power BI | reddit.com/r/FPandA | 2021-03-01

    VBA is used for writing up scripts that will automate some process in Excel. VBA performance is incredibly slow and honestly, terrible. You're better off learning some programming (Python) and libraries that will allow you to manipulate/clean/data wrangle. Look into pandas.

  • GitHub repo NumPy

    The fundamental package for scientific computing with Python.

    Project mention: Making A Synthesizer Using Python | reddit.com/r/Python | 2021-03-02

    What do you mean by uploads? If you mean additional libraries besides Python then, for control input you need the midi module from pygame and for audio output pyaudio. Other than that numpy, you can install these using pip.

  • GitHub repo PredictionIO

    PredictionIO, a machine learning server for developers and ML engineers.

  • GitHub repo NetworkX

    Network Analysis in Python

    Project mention: [P] I made Communities: a library of clustering algorithms for network graphs (link in comments) | reddit.com/r/MachineLearning | 2021-02-22

    It would be nice that communities natively supports both networkx and igraph data structures.

  • GitHub repo Dask

    Parallel computing with task scheduling

    Project mention: Too much data to preprocess to work with pandas — is pyspark.sql a feasible alternative? | reddit.com/r/PySpark | 2021-02-25

    I haven't used it myself I have to admit, but I think dask could fit your workflow. Spark might add a little bit too much overhead if you're not used to it and you're not using a distributed system but of course it would also work.

  • GitHub repo SciPy

    Scipy library main repository

    Project mention: I’m never updating Scipy | reddit.com/r/physicsmemes | 2021-01-26

    Link: https://github.com/scipy/scipy/releases

  • GitHub repo SymPy

    A computer algebra system written in pure Python

    Project mention: Python Math Library made in 3 Days as a 14 year-old - libmaths | reddit.com/r/Python | 2021-02-23

    Now compare that to SymPy: https://github.com/sympy/sympy/blob/9e8f62e059d83178c1d8a1e19acac5473bdbf1c1/sympy/ntheory/primetest.py#L472-L634

  • GitHub repo Numba

    NumPy aware dynamic Python compiler using LLVM

    Project mention: I need help to speed up my program! | reddit.com/r/learnpython | 2021-03-02

    The first thing I would do is write the code in a non-vectorized fashion to see where I could get rid of any unnecessary copying/allocating. Then you could rewrite the code using a more efficient sequence of vectorized operations, or you could JIT it using a library like numba

  • GitHub repo statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: [C] I have an MS in Statistics - how can I get better at coding? | reddit.com/r/statistics | 2021-01-04

  • GitHub repo PyMC

    Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

  • GitHub repo Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • GitHub repo gonum

    Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more

    Project mention: Loop elimination | reddit.com/r/golang | 2021-01-04

    I'm not sure what exactly you are trying to accomplish, but there are already numeric packages https://github.com/gonum/gonum that has asm loops for the common stuff. And there's https://github.com/mmcloughlin/avo that makes working with assembly less painful.

  • GitHub repo BigDL

    BigDL: Distributed Deep Learning Framework for Apache Spark

  • GitHub repo Breeze

    Breeze is a numerical processing library for Scala.

  • GitHub repo Spark Notebook

    Interactive and Reactive Data Science using Scala and Spark.

  • GitHub repo blaze

    NumPy and Pandas interface to Big Data

  • GitHub repo astropy

    Repository for the Astropy core package

  • GitHub repo orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: Informatica per la SCIENZA, per un ignorante in materia. | reddit.com/r/ItalyInformatica | 2021-02-28

  • GitHub repo Biopython

    Official git repository for Biopython (originally converted from CVS)

    Project mention: How is computer science used in biotechnology? | reddit.com/r/biotech | 2021-02-21

    You probably mean genetic engineering, which also uses a lot of software tools. The latest iteration, called synthetic biology, also relies heavily on computer-assisted DNA design, cloning and modelling of gene expression networks. You may check out Biopython, the Synthetic Biology Open Language (SBOL), the GBA software, or CUBA for examples of software used in synbio.

  • GitHub repo Algebird

    Abstract Algebra for Scala

  • GitHub repo Stats

    A well tested and comprehensive Golang statistics library package with no dependencies.

  • GitHub repo Interactive Parallel Computing with IPython

    Interactive Parallel Computing in Python

  • GitHub repo gonum/plot

    A repository for plotting and visualizing data

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-03-02.


What are some of the best open-source Science and Data analysis projects? This list will help you:

Rank Project Stars
1 Pandas 28,657
2 NumPy 16,428
3 PredictionIO 12,500
4 NetworkX 8,731
5 Dask 7,965
6 SciPy 7,955
7 SymPy 7,866
8 Numba 6,153
9 statsmodels 6,056
10 PyMC 5,594
11 Zeppelin 5,150
12 gonum 4,617
13 BigDL 3,702
14 Breeze 3,221
15 Spark Notebook 3,015
16 blaze 2,928
17 astropy 2,642
18 orange 2,637
19 Biopython 2,619
20 Algebird 2,038
21 Stats 1,922
22 Interactive Parallel Computing with IPython 1,899
23 gonum/plot 1,828