#Science and Data analysis

Open-source projects categorized as Science and Data analysis

Top 23 Science and Data analysis Open-Source Projects

  • GitHub repo Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Capture tabular data from docho.com | reddit.com/r/learnpython | 2021-04-18
  • GitHub repo NumPy

    The fundamental package for scientific computing with Python.

    Project mention: How to replace an integer with a letter in a dictionary | reddit.com/r/learnpython | 2021-04-20
  • GitHub repo NetworkX

    Network Analysis in Python

    Project mention: Is there another way to find all the cliques in a graph (dictionary)? | reddit.com/r/learnpython | 2021-04-06


  • GitHub repo Dask

    Parallel computing with task scheduling

    Project mention: Why is Python popular despite being accused of being slow? | reddit.com/r/programming | 2021-04-16

    Not everyone has the same "parallelism" needs. I have used mpi4py to distribute scientific computations using numpy over thousands of cores on hundreds of servers with much less effort than doing the same thing in C / C++ and almost no performance penalty (I could batch my data in big enough chunks). Today there are higher level distributed computing packages like dask that are even easier to use.
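
The batching idea in the comment above can be illustrated with a small sketch. This is not mpi4py or Dask code: it uses a standard-library thread pool purely as a stand-in so the example runs anywhere, and the helper names (`split_into_chunks`, `sum_of_squares`, `chunked_sum_of_squares`) are hypothetical. The point is only that handing each worker a large chunk keeps the per-task scheduling overhead small relative to the work done.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Return consecutive slices of `values`, each at most `chunk_size` long."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def sum_of_squares(chunk):
    """Work performed on one chunk (hypothetical stand-in for a real computation)."""
    return sum(v * v for v in chunk)


def chunked_sum_of_squares(values, chunk_size, workers=4):
    """Process each chunk on a worker, then combine the partial results."""
    chunks = split_into_chunks(values, chunk_size)
    # A thread pool is an assumption made for portability; CPU-bound work
    # would normally go to processes, MPI ranks, or a Dask cluster instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


data = list(range(1000))
total = chunked_sum_of_squares(data, chunk_size=250)
```

With 1,000 values and a chunk size of 250, only four tasks are scheduled; a chunk size of 1 would schedule a thousand tasks for the same arithmetic.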

  • GitHub repo SciPy

    SciPy library main repository

    Project mention: That took a wild turn | reddit.com/r/ProgrammerHumor | 2021-04-15
  • GitHub repo SymPy

    A computer algebra system written in pure Python

    Project mention: Is the capitalization of sp.symbols vs sp.Symbol intentional in sympy? | reddit.com/r/learnpython | 2021-04-01

    symbols is a function
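
The distinction in the reply above can be sketched as follows, assuming SymPy is installed and imported as `sp` (as in the thread title). `Symbol` is the class, and `symbols` is a lowercase helper function that parses a string of names and returns `Symbol` instances. The variable names are illustrative only.

```python
import sympy as sp

# Symbol is a class: calling it constructs exactly one symbol.
x = sp.Symbol("x")

# symbols is a function: it parses the string and returns one Symbol per name.
a, b = sp.symbols("a b")

# Both routes produce instances of the same class.
same_type = isinstance(a, sp.Symbol) and isinstance(x, sp.Symbol)

# Symbols with the same name and assumptions compare equal.
same_symbol = sp.symbols("x") == x

# The resulting symbols can be used in expressions right away.
expr = sp.expand((a + b) ** 2)
```

The capitalization therefore follows the usual Python convention: class names are capitalized, function names are lowercase.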

  • GitHub repo Numba

    NumPy-aware dynamic Python compiler using LLVM

    Project mention: The best description of JS I've ever seen. | reddit.com/r/ProgrammerHumor | 2021-04-21

    That's probably because it was using a JIT compiler, probably V8, which is fantastic, don't get me wrong, but it's apples to oranges. Give Python a hand with something like numba and it'll probably come out that Python is more even.

  • GitHub repo statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: [C] I have an MS in Statistics - how can I get better at coding? | reddit.com/r/statistics | 2021-01-04
  • GitHub repo PyMC

    Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Aesara

  • GitHub repo Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: Is there a way to collaborate in real-time for Jupyter Notebooks? | reddit.com/r/learnpython | 2021-03-21

    Check out Zeppelin. It's similar to Jupyter and allows real-time editing by multiple users. https://zeppelin.apache.org/

  • GitHub repo gonum

    Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more

    Project mention: Go+: Go designed for data science | news.ycombinator.com | 2021-03-27

    Apart from the Gonum[1] numerical libraries, I haven't found many data-science-specific Go libraries, compared to the Python ecosystem, when searching for some hobby projects.

    Interestingly, Prose[2], a Go library for text processing, yielded better results for named-entity extraction than NLTK in my tests, in terms of accuracy and, obviously, performance.

    Perhaps Go is not being applied enough in Data Science/ML, and for the fields where it is applied (networking), the math in the standard library seems to be sufficient.

    [1] https://github.com/gonum/gonum

    [2] https://github.com/jdkato/prose

  • GitHub repo BigDL

    BigDL: Distributed Deep Learning Framework for Apache Spark

    Project mention: Machine learning on JVM | reddit.com/r/scala | 2021-04-05

    Intel BigDL for Spark which again is for Spark.

  • GitHub repo Breeze

    Breeze is a numerical processing library for Scala.

    Project mention: Machine learning on JVM | reddit.com/r/scala | 2021-04-05

    I haven't checked in on this project in a long time, but Breeze is something akin to NumPy/SciPy.

  • GitHub repo Spark Notebook

    Interactive and Reactive Data Science using Scala and Spark.

  • GitHub repo blaze

    NumPy and Pandas interface to Big Data

  • GitHub repo orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: No-code vs Visual Programming | reddit.com/r/nocode | 2021-03-12

    I am using visual programming tools that overlap with the no-code concept such as: KNIME and Orange. To visualize the results, I use connectors with platforms like DataStudio or Google AppSheet.

  • GitHub repo Biopython

    Official git repository for Biopython (originally converted from CVS)

    Project mention: Need help with Biopython examples | reddit.com/r/bioinformatics | 2021-04-16

    You can then copy the contents of the file directly ( press Ctrl + A on the webpage, and then Ctrl + C on the text editor you are using ). You can also download the file using a command line tool, such as wget or curl if you are familiar with those. For example, if I wanted to download the ls_orchid.gbk file, I would find the raw version as above and simply open a terminal and type:

  • GitHub repo astropy

    Repository for the Astropy core package

    Project mention: Q&A: Month of April | reddit.com/r/Andromeda321 | 2021-04-06

    Cool, that sounds like a great place to start in terms of specialties! :) When in doubt for astronomy and coding, I advise people to know Python and the more the better, because that's really become the default in astronomical software in recent years. Poke around astropy a bit too while you're at it.

  • GitHub repo Algebird

    Abstract Algebra for Scala

    Project mention: Symbolics.jl: A Modern Computer Algebra System for a Modern Language | news.ycombinator.com | 2021-03-05

    Hey, I have... I'm a co-author of Algebird[0], which has many ideas that I'd pull over.

    I'm hoping to introduce Clojure's "spec" or "schema" libraries so that the types at play can at least be inspectable inside the system. In a fully typed language, I'd implement the extensible generics as typeclasses.

    I suspect it would make it quite a bit tougher (at least in the approach I'm imagining) for folks to write new generic functions, due to many type constructors...

    On the other hand, the complexity is there, even if you don't write it down!

    It would be a big project, and a worthy effort, to write down types for everything in SICM.

    [0] https://github.com/twitter/algebird

  • GitHub repo Stats

    A well tested and comprehensive Golang statistics library package with no dependencies. (by montanaflynn)

  • GitHub repo Interactive Parallel Computing with IPython

    Interactive Parallel Computing in Python

  • GitHub repo gonum/plot

    A repository for plotting and visualizing data

    Project mention: Go matplotlib libary? | reddit.com/r/golang | 2021-04-01

    Gonum Plot is alright but definitely not as mature.

  • GitHub repo Spire

    Powerful new number types and numeric abstractions for Scala.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-04-21.


What are some of the best open-source Science and Data analysis projects? This list will help you:

Rank Project Stars
1 Pandas 29,374
2 NumPy 16,910
3 NetworkX 8,955
4 Dask 8,215
5 SciPy 8,126
6 SymPy 8,016
7 Numba 6,414
8 statsmodels 6,234
9 PyMC 5,699
10 Zeppelin 5,208
11 gonum 4,793
12 BigDL 3,718
13 Breeze 3,228
14 Spark Notebook 3,027
15 blaze 2,942
16 orange 2,699
17 Biopython 2,679
18 astropy 2,678
19 Algebird 2,049
20 Stats 1,952
21 Interactive Parallel Computing with IPython 1,939
22 gonum/plot 1,867
23 Spire 1,622