Python Data

Open-source Python projects categorized as Data

Top 23 Python Data Projects

  • knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How does everyone share their models etc. across teams for re-use effectively? | reddit.com/r/datascience | 2021-05-22

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库 (by akfamily)

  • Mimesis

    Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.

  • CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

    Project mention: Software and tools for (non-human) genomics data platform | reddit.com/r/bioinformatics | 2022-01-18

    Our first instinct is to use [CKAN](https://ckan.org) for cataloging (and storage, with modifications), especially since we know it and know that it has been used successfully elsewhere. However, we suspect that more specialized/better tools exist for this, thus why I kindly ask for your insights.

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Best quantitative tools/repos/apis for Sentiment & Social Media analysis of individual Stock/Crypto tickers | reddit.com/r/algotrading | 2021-07-03

    Also Yahoo continually takes steps to discourage programmatic access (the most recent attempt is happening right now: https://github.com/pydata/pandas-datareader/issues/868).

  • TextRecognitionDataGenerator

    A synthetic data generator for text recognition

    Project mention: [D] How to generate syntactically correct text examples for CRNN-CTC | reddit.com/r/MachineLearning | 2021-12-17

    [1]: https://github.com/Belval/TextRecognitionDataGenerator

  • flyte

    Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.

    Project mention: Flyte 1.0 | news.ycombinator.com | 2022-04-26

  • PyFunctional

    Python library for creating data pipelines with chain functional programming
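
To illustrate what "chain functional programming" means for a data pipeline, here is a minimal standard-library sketch. The `Pipeline` class and its method names are hypothetical and written only for this example; they are not PyFunctional's API.

```python
from typing import Any, Callable, Iterable, List


class Pipeline:
    """Hypothetical toy wrapper that lets transformations be chained left to right."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._items = items

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        # Each step returns a new Pipeline, so calls can be chained.
        return Pipeline(map(fn, self._items))

    def filter(self, predicate: Callable[[Any], bool]) -> "Pipeline":
        return Pipeline(filter(predicate, self._items))

    def to_list(self) -> List[Any]:
        # Nothing is computed until the lazy map/filter chain is consumed here.
        return list(self._items)


# Keep the even numbers from 1..10, then square them.
result = (
    Pipeline(range(1, 11))
    .filter(lambda n: n % 2 == 0)
    .map(lambda n: n * n)
    .to_list()
)
print(result)  # [4, 16, 36, 64, 100]
```

Each stage reads in the order the data flows, which is the main appeal of the chained style over nested `map(filter(...))` calls.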

  • mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

    Project mention: How to keep track of the different Transformations done in an ETL pipeline? | reddit.com/r/dataengineering | 2021-08-22

    The closest I've found is Mara but not what I'm after.

  • PyPika

    PyPika is a python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

    Project mention: Write an SQL query builder in 150 lines of Python | news.ycombinator.com | 2021-08-27

    https://github.com/kayak/pypika

    Have used in multiple projects and have found it's the right balance between ORMs and writing raw SQL. It's also easily extensible and takes care of the many edge cases and nuances of rolling your own SQL generator.
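
The "balance between ORMs and writing raw SQL" mentioned above can be sketched with a toy builder. This is an illustration of the general query-builder idea only: the `SelectQuery` class and its methods are hypothetical and are not PyPika's API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class SelectQuery:
    """Hypothetical toy SELECT builder (names invented for this sketch)."""

    table: str
    columns: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    params: List[Any] = field(default_factory=list)

    def select(self, *columns: str) -> "SelectQuery":
        self.columns.extend(columns)
        return self  # returning self is what allows method chaining

    def where(self, column: str, op: str, value: Any) -> "SelectQuery":
        # Values are collected as parameters instead of being pasted into the SQL text.
        # Column names and operators are NOT validated in this sketch.
        self.conditions.append(f"{column} {op} ?")
        self.params.append(value)
        return self

    def build(self) -> Tuple[str, List[Any]]:
        cols = ", ".join(self.columns) or "*"
        sql = f"SELECT {cols} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql, list(self.params)


sql, params = (
    SelectQuery("customers")
    .select("id", "name")
    .where("age", ">=", 18)
    .where("country", "=", "CA")
    .build()
)
print(sql)     # SELECT id, name FROM customers WHERE age >= ? AND country = ?
print(params)  # [18, 'CA']
```

The builder code mirrors the shape of the resulting query, while the edge cases a real library handles (quoting, dialects, joins, subqueries) are exactly what this sketch leaves out.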

  • glom

    ☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️ (by mahmoud)

    Project mention: Is there a quicker way to check if a attribute within an attribute exists? | reddit.com/r/learnpython | 2021-09-12

    If your project requires writing this sort of code a lot, there are third-party libraries that can make it a bit easier. One example that comes to mind is glom (take a look at their tutorial).
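
For context, the kind of nested lookup being discussed can be sketched with the standard library alone. The `get_path` helper below is hypothetical and written for this example; it is not part of glom, which covers far more cases.

```python
from functools import reduce
from typing import Any

_MISSING = object()  # sentinel, so a stored None still counts as "present"


def get_path(obj: Any, path: str, default: Any = None) -> Any:
    """Follow a dotted path through attributes and dict keys (hypothetical helper)."""

    def step(current: Any, name: str) -> Any:
        if current is _MISSING:
            return _MISSING  # an earlier link was already missing
        if isinstance(current, dict):
            return current.get(name, _MISSING)
        return getattr(current, name, _MISSING)

    result = reduce(step, path.split("."), obj)
    return default if result is _MISSING else result


class Engine:
    horsepower = 150


class Car:
    engine = Engine()


car = Car()
print(get_path(car, "engine.horsepower"))             # 150
print(get_path(car, "engine.turbo.pressure", "n/a"))  # n/a
```

This replaces a chain of `hasattr` checks with a single call and a default value.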

  • Colour

    Colour Science for Python

    Project mention: The Color of Infinite Temperature | news.ycombinator.com | 2022-01-16

    I haven’t seen the math for the conversion, but the conversions from CCT to xy/uv are each given for a particular domain. One of the conversions with the largest domain, i.e. Ohno m, covers the domain [1000K, 100000K]: https://github.com/colour-science/colour/blob/develop/colour...

    Infinity is very much in extrapolation territory.

  • Cubes

    Light-weight Python OLAP framework for multi-dimensional data analysis

  • pycm

    Multi-class confusion matrix library in Python

    Project mention: PyCM 3.5 released: Multi-class confusion matrix library in Python | reddit.com/r/coolgithubprojects | 2022-04-27

  • pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

    Project mention: how important are learning the data manipulation libraries? | reddit.com/r/learndatascience | 2022-03-25

  • pdpipe

    Easy pipelines for pandas DataFrames.

  • pybaseball

    Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)

    Project mention: Data for every pitch (velo, location, movement) | reddit.com/r/Sabermetrics | 2022-05-11

    pybaseball's statcast function or baseballr has similar methods if you're in R.

  • metricflow

    MetricFlow allows you to define, build, and maintain metrics in code.

    Project mention: Show HN: MetricFlow – open-source metric framework | news.ycombinator.com | 2022-04-06

    Three things:

    First, MetricFlow does not currently support MySQL. We launched with support for BigQuery, Redshift, and Snowflake. I have opened an issue to add support for MySQL (and similar issues for other SQL engines are coming): https://github.com/transform-data/metricflow/issues/27

    Second, what we call a data source is more similar to a table in a database, rather than the underlying database service itself. Metricflow itself is useful when you're using a single SQL engine - indeed, that's all we support today - but it is most useful when you're in a world where joins are a thing. That said, if you have one big data table you might still find it useful to have declarative metric definitions defined in Metricflow. Suppose, for example, you had a big NoSQL style table filled with JSON objects. You might define a few data sources that normalize those JSON objects into top level elements (identifiers, dimensions, aggregated measures) using the sql_query data source config attribute, and then that'd allow you to support structured queries on the data consumption end while pushing unstructured blobs from your application layer. This will be slow at query time, and only as reliable as the level of discipline exerted in your application development workflow, but it's possible.

    Third, if we did support MySQL you'd basically connect to it via standard connection parameters - we have a config file where you can store the required information and then we'll manage the connections for you. However, I'm not familiar with uxwizz, and a quick perusal of their documentation did not turn up how one goes about connecting to the underlying DB. It's likely I just missed this, but at any rate I don't know how it is done. If they don't support standard MySQL client connections you'd need to write an adapter of some kind against whatever DB connection APIs they provide, in which case you'd likely need to roll a custom implementation of MetricFlow's SqlClient interface and initialize the MetricFlowEngine with that.

  • isp-data-pollution

    ISP Data Pollution to Protect Private Browsing History with Obfuscation

    Project mention: A browser extension that sends periodical random search requests, from a large distributed list, to confuse the data broker industry. | reddit.com/r/CrazyIdeas | 2022-04-12

    I’ve seen multiple implementations of this idea. Here’s the first that came up in a quick search: https://github.com/essandess/isp-data-pollution/

  • opal

    Policy and data administration, distribution, and real-time updates on top of Open Policy Agent (by permitio)

    Project mention: OPAL + OPA VS XACML | dev.to | 2022-05-19

    One such XACML alternative is OPA + OPAL. Open Policy Agent (OPA) is an open-source project created as a general-purpose policy engine to serve any policy enforcement requirements that unifies policy enforcement across the stack without being dependent on implementation details. It can be used with any language and network protocol, supports any data type, and evaluates and returns answers quickly. OPA’s policy rules are written in Rego - a high-level declarative (Datalog-like) language. You can find a more detailed introduction to OPA here. It’s important to note that OPA itself only provides an alternative to XACML’s PDP (More on that further). OPA is enhanced by OPAL (Open Policy Administration Layer) - another open-source solution that allows you to easily keep your authorization layer up-to-date in real-time. More information about the project is available here. The combination of OPA and OPAL provides a solid alternative for XACML.

  • monorepo

    The mitosheet package, trymito.io, and other public Mito code.

    Project mention: Mito – Excel-like interface for Pandas dataframes in Jupyter notebook | news.ycombinator.com | 2022-05-20

    Mito is open source, but using Pro features does actually require a Pro or enterprise license. You can check out this callout in the license [1], as well as the restrictions on Mito Pro features here [2]. We're in the process of fixing up the upgrade to Pro process a bit... as you can tell... :)

    You can of course fork Mito and turn off telemetry as long as you open source your changes! Go for it - happy to hop on a call and help you get set up with the codebase, if you want. Yay open source!

    [1] https://github.com/mito-ds/monorepo/blob/974091b455950c6c50e...

  • kuwala

    Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demograp

    Project mention: The Paradigm Shift of Business Models in the Data Space is Real | dev.to | 2022-03-16

    In summary, we are currently experiencing a renaissance of open-source software. Traditional SaaS business models in the data space are difficult to implement because data projects are too complex to solve in a proprietary tool for customers. I drew an example around no-code data platforms that rely on SaaS business models in particular but build on open-source without openly admitting it. This can lead to a problem in terms of customer satisfaction. That this is problematic is shown by the Log4j case as well as the Airbyte/Fivetran case study. Closed SaaS tools are not only expensive to build, but also do not address the complexity of customer problems. Because closed tools rely heavily on open-source libraries, non-transparent communication also creates major security vulnerabilities. There is a certain frustration and tension on the side of open-source contributors. This holds opportunities to change something and to create completely new tools, business models, and software categories. And this is what Kuwala has now turned into: an open-source no-code data platform. Give us a try on GitHub: https://github.com/kuwala-io/kuwala. Do you have a different perspective, or would you like to discuss it in more depth? You can join our discussion on Slack: https://kuwala-community.slack.com/ssb/redirect. Or just comment below.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-05-20.

Index

What are some of the best open-source Data projects in Python? This list will help you:

 #  Project                       Stars
 1  knowledge-repo                5,092
 2  akshare                       4,932
 3  Mimesis                       3,601
 4  CKAN                          3,387
 5  datasets                      3,255
 6  pandas-datareader             2,321
 7  TextRecognitionDataGenerator  2,267
 8  flyte                         2,259
 9  PyFunctional                  2,008
10  mara-pipelines                1,897
11  PyPika                        1,645
12  glom                          1,523
13  Colour                        1,455
14  Cubes                         1,446
15  pycm                          1,240
16  pyjanitor                       917
17  pdpipe                          672
18  pybaseball                      665
19  metricflow                      583
20  isp-data-pollution              516
21  opal                            515
22  monorepo                        488
23  kuwala                          474