Python Science and Data analysis

Open-source Python projects categorized as Science and Data analysis

Top 23 Python Science and Data analysis Projects

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide | dev.to | 2023-08-20

    AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.

  • NumPy

    The fundamental package for scientific computing with Python.

    Project mention: Calculating weighted averages with numpy and Python! | dev.to | 2023-08-22

    numpy

  • NetworkX

    Network Analysis in Python

    Project mention: org-roam-pygraph: Build a graph of your org-roam collection for use in Python | /r/orgmode | 2023-05-07

    org-roam-ui is a great interactive visualization tool, but its main use is visualization. The hope of this library is that it could be part of a larger graph analysis pipeline. The demo provides an example graph visualization, but what you choose to do with the resulting graph certainly isn't limited to that. See for example networkx.

  • SciPy

    SciPy library main repository

    Project mention: Fortran codes are causing problems | /r/rstats | 2023-09-13

    Fortran codes have caused many problems for the Python package Scipy, and some of them are now being rewritten in C: e.g., https://github.com/scipy/scipy/pull/19121. Not only does R have many Fortran codes, there are also many R packages using Fortran codes: https://github.com/r-devel/r-svn, https://github.com/cran?q=&type=&language=fortran&sort=. Modern Fortran is a fine language but most legacy Fortran codes use the F77 style. When I update the R package quantreg, which uses many Fortran codes, I get a lot of warning messages. Not sure how the Fortran codes in the R ecosystem will be dealt with in the future, but they recently caused an issue in R due to the lack of compiler support for Fortran: https://blog.r-project.org/2023/08/23/will-r-work-on-64-bit-arm-windows/index.html. Some renowned packages like glmnet already have their Fortran codes rewritten in C/C++: https://cran.r-project.org/web/packages/glmnet/news/news.html

  • Dask

    Parallel computing with task scheduling

    Project mention: The Distributed Tensor Algebra Compiler (2022) | news.ycombinator.com | 2023-06-15

  • SymPy

    A computer algebra system written in pure Python

    Project mention: Solving a simple puzzle using SymPy | news.ycombinator.com | 2023-08-14

    bug report opened https://github.com/sympy/sympy/issues/25507

  • Numba

    NumPy aware dynamic Python compiler using LLVM

    Project mention: Is anyone using PyPy for real work? | news.ycombinator.com | 2023-07-31

    Simulations are, at least in my experience, numba’s [0] wheelhouse.

    [0]: https://numba.pydata.org/

  • statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: statsmodels Release Candidate 0.14.0rc0 tagged | /r/Python | 2023-04-26

  • PyMC

    Bayesian Modeling in Python

    Project mention: PYMC Release: v5.0.0 | news.ycombinator.com | 2022-12-12

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: What exactly is AutoGPT? | /r/AutoGPT | 2023-06-12

    Both tools are ripoffs of a data mining framework named Orange 3

  • astropy

    Astronomy and astrophysics core library

    Project mention: [R] Astronomia ex machina: a history, primer and outlook on neural networks in astronomy | /r/MachineLearning | 2023-05-31

  • Biopython

    Official git repository for Biopython (originally converted from CVS)

    Project mention: Invitación a proyecto - Biopython en Español | /r/devsarg | 2023-07-23

  • blaze

    NumPy and Pandas interface to Big Data

  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

    Project mention: Daft: A High-Performance Distributed Dataframe Library for Multimodal Data | news.ycombinator.com | 2023-06-07

    Please integrate it with Fugue.

    https://github.com/fugue-project/fugue

  • Cubes

    [NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis

  • bcbio-nextgen

    Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

    Project mention: Deep Sleep May Be the Best Defense Against Alzheimer’s | news.ycombinator.com | 2023-05-22

    Re WGS there are a lot of well established tool chains that are FLOSS (eg https://github.com/bcbio/bcbio-nextgen). You could run alignment and variant calling on a beefy workstation. A laptop would potentially work. Easy to test this with publicly available raw data. Another option: The sequencing provider often will run alignment and some default variant calling for you. Annotating and analysing these variants can be done on pretty much any computer, all with open source software. A SNP chip is even easier to deal with as the computational requirements are less.

    Interpreting the results is a more manual process. Really depends on what you are interested in.

  • Neupy

    NeuPy is a Tensorflow based python library for prototyping and building neural networks

  • NIPY

    Workflows and interfaces for neuroimaging packages

  • bccb

    Incubator for useful bioinformatics code, primarily in Python and R

  • Bubbles

    [NOT MAINTAINED] Bubbles – Python ETL framework (by Stiivi)

  • PyDy

    Multibody dynamics tool kit.

  • harold

    An open-source systems and controls toolbox for Python3

  • signac

    Manage large and heterogeneous data spaces on the file system.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-13.

Index

What are some of the best open-source Science and Data analysis projects in Python? This list will help you:

Rank  Project        Stars
1     Pandas         39,797
2     NumPy          24,540
3     NetworkX       13,177
4     SciPy          11,726
5     Dask           11,398
6     SymPy          11,317
7     Numba          8,913
8     statsmodels    8,866
9     PyMC           7,783
10    orange         4,259
11    astropy        3,921
12    Biopython      3,754
13    blaze          3,165
14    fugue          1,723
15    Cubes          1,490
16    bcbio-nextgen  949
17    Neupy          741
18    NIPY           701
19    bccb           575
20    Bubbles        450
21    PyDy           333
22    harold         163
23    signac         124