Sonar helps you commit clean code every time. With over 225 unique rules to find Python bugs, code smells & vulnerabilities, Sonar finds the issues while you focus on the work.
Top 23 Python Data Projects
The easiest way to build, run, and monitor data pipelines at scale.
Project mention: self hosted Alternative to easycron.com? | /r/selfhosted | 2022-12-30
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Project mention: airbyte VS cloudquery - a user suggested alternative | libhunt.com/r/airbyte | 2023-06-02
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Project mention: How do you document a ML research? | /r/mlops | 2022-08-30
While it's a start, I feel that just being a markdown editor is not enough; GitHub and GitLab already have this sort of wiki. I feel something like https://github.com/airbnb/knowledge-repo provides a better experience, since it gives Data Scientists an incentive to keep their source notebook well documented and to treat it as the SSoT. With a wiki-like tool, if you change something in the original project, you need to remind yourself to update your reports. If your notebook is itself your report, that's not necessary. Plus, it would benefit from the Semantic Diffs that DagsHub has already implemented.
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Project mention: Data sources episode 2: AWS S3 to Postgres Data Sync using Singer | dev.to | 2023-05-04
Link to original blog: https://www.mage.ai/blog/data-sources-ep-2-aws-s3-to-postgres-data-sync-using-singer
Mimesis is a robust data generator for Python, capable of rapidly producing large volumes of synthetic data for various use cases.
Project mention: Mimesis allows you to easily generate detailed dummy datasets. | /r/datascience | 2023-04-12
Mimesis has well-structured and comprehensive documentation: https://mimesis.name
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)
Project mention: TensorFlow Datasets (TFDS): a collection of ready-to-use datasets | news.ycombinator.com | 2022-12-21
I tried Librispeech, a very common dataset for speech recognition, in both HF and TFDS.
TFDS performed extremely badly.
First it failed because the official hosting server only allows 5 simultaneous connections; TFDS totally ignored that and made up to 50 simultaneous downloads, which broke the download. I wonder if anyone actually tested this?
Then you need to have some computer with 30GB to do the preparation, which might fail on your computer. This is where I stopped. https://github.com/tensorflow/datasets/issues/3887. It might be fixed now but it took them 8 months to respond to my issue.
On HF, it just worked. There was a smaller issue in how the dataset was split up but that is fixed now, and their response was very fast and great.
CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
Project mention: Metadata Store - Which one to Choose ? OpenMetadata vs Datahub ? | /r/dataengineering | 2022-11-17
We use Kubernetes as our deployment platform. Any feedback on one of these open source data catalogs?
- https://atlas.apache.org/#/
- https://opendatadiscovery.org/
- https://open-metadata.org/
- https://marquezproject.github.io/marquez/
- https://datahubproject.io/
- https://www.amundsen.io/
- https://ckan.org/
- https://magda.io/
Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01
Anyone tried Flyte?
A synthetic data generator for text recognition
Extract data from a wide range of Internet sources into a pandas DataFrame.
Project mention: Seeking recommendations for forex economic data API | /r/algotrading | 2023-05-03
I've looked at https://github.com/pydata/pandas-datareader and it looks good, does anyone have experience?
I often end up using [pyfunctional](https://github.com/EntilZha/PyFunctional), which lets you use chained functional operators, but I am still kind of sad this is not a builtin solution in Python (please note I am not involved in the pyfunctional project and I do not know the author)
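To make "chained functional operators" concrete, here is a minimal standard-library sketch of the idea. The `Chain` class is hypothetical and written only for illustration; it is not PyFunctional's actual API, which should be checked in the project's own documentation.

```python
class Chain:
    """Hypothetical minimal wrapper that lets sequence operations be chained."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn):
        # Apply fn to every element and return a new chainable wrapper
        return Chain(fn(x) for x in self._items)

    def filter(self, pred):
        # Keep only the elements for which pred returns a truthy value
        return Chain(x for x in self._items if pred(x))

    def reduce(self, fn, initial):
        # Fold the elements left to right into a single value
        acc = initial
        for x in self._items:
            acc = fn(acc, x)
        return acc

    def to_list(self):
        return list(self._items)


evens_squared = Chain(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = evens_squared.reduce(lambda a, b: a + b, 0)
```

The same pipeline written with the builtins would nest as `map(..., filter(..., range(10)))`, which reads inside-out; the chained form reads left to right in the order the steps run.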
PyPika is a python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
AI code-writing assistant that understands data content
Project mention: Pandas AI – The Future of Data Analysis | news.ycombinator.com | 2023-05-17
This morning I added a "Related Projects" section to the Buckaroo docs. If Buckaroo doesn't solve your problem, look at one of the other linked projects (like Mito).
The mitosheet package, trymito.io, and other public Mito code.
Project mention: Pandas AI – The Future of Data Analysis | news.ycombinator.com | 2023-05-17
I think the biggest area for growth for LLM based tools for data analysis is around helping users _understand what edits they actually made_.
I'm a co-founder of a non-AI data code-gen tool for data analysis -- but we also have a basic version of an LLM integration. The problem we see with tooling like Pandas AI (in practice! with real users at enterprises!) is that users make an edit like "remove NaN values" and then get a new dataframe -- but they have no way of checking if the edited dataframe is actually what they want. Maybe the LLM removed NaN values. Maybe it just deleted some random rows!
The key here: how can users build an understanding of how their data changed, and confirm that the changes made by the LLM are the changes they wanted. In other words, recon!
We've been experimenting more with this recon step in the AI flow (you can see the final PR here: https://github.com/mito-ds/monorepo/pull/751). It takes a similar approach to the top comment (passing a subset of the data to the LLM), and then really focuses in the UI around "what changes were made." There's a lot of opportunity for growth here, I think!
Any/all feedback appreciated :)
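The "recon" step described in the comment above can be sketched in a few lines. This is an illustrative assumption about what such a check might look like, not Mito's implementation: the helper names `recon` and `has_nan` are hypothetical, and rows are plain dicts with a unique `id` column instead of a real dataframe.

```python
def recon(before, after, key):
    """Summarize how rows changed between two versions of a table.

    Rows are dicts; `key` names a column that uniquely identifies a row.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}
    removed = [before_by_key[k] for k in before_by_key if k not in after_by_key]
    added = [after_by_key[k] for k in after_by_key if k not in before_by_key]
    modified = [
        (before_by_key[k], after_by_key[k])
        for k in before_by_key
        if k in after_by_key and before_by_key[k] != after_by_key[k]
    ]
    return {"removed": removed, "added": added, "modified": modified}


def has_nan(row):
    # NaN is the only float value that is not equal to itself
    return any(isinstance(v, float) and v != v for v in row.values())


before = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": float("nan")},
    {"id": 3, "price": 7.5},
]
# Pretend this is what an LLM-generated "remove NaN values" edit returned
after = [{"id": 1, "price": 10.0}, {"id": 3, "price": 7.5}]

report = recon(before, after, key="id")
# Rows that were removed even though they contained no NaN are suspicious
unexpected = [row for row in report["removed"] if not has_nan(row)]
```

An empty `unexpected` list means every removed row really did contain a NaN; a non-empty one is the "it just deleted some random rows" case the comment warns about.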
Colour Science for Python
Project mention: Colorists work in Resolve? | /r/colorists | 2023-04-30
People who work on Resolve also work on https://github.com/colour-science/colour, which is the most complete reference code there is.
Integrate Human Supervision into your Platform. For all Training Data Types, Image, Video, 3D, Text, Geo, Audio, Compound, Grid, LLM, GPT, Conversational, and more.
Project mention: Open source tool scrapes git commit email addresses to send spam to. | /r/programming | 2022-07-29
☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️ (by mahmoud)
Light-weight Python OLAP framework for multi-dimensional data analysis
Multi-class confusion matrix library in Python
Project mention: PyCM 3.9 Released: Log-loss Support | /r/coolgithubprojects | 2023-05-04
Clean APIs for data cleaning. Python implementation of R package Janitor
Project mention: Sub library with useful code | /r/learnpython | 2023-05-19
Python Data related posts
Postgres to Postgres incremental (timestamp) sync etl tool
1 project | /r/dataengineer | 30 May 2023
How useful is Airbytes in production pipelines?
1 project | /r/dataengineering | 29 May 2023
Show HN: Loofi – Our AI-Powered SQL Query Builder
1 project | news.ycombinator.com | 21 May 2023
Sub library with useful code
3 projects | /r/learnpython | 19 May 2023
[OC] World Gold Reserves by Institution
1 project | /r/dataisbeautiful | 18 May 2023
Pandas AI – The Future of Data Analysis
7 projects | news.ycombinator.com | 17 May 2023
Do we need Spark if we're just loading data from source to target? Trying to move away from Informatica
1 project | /r/dataengineering | 16 May 2023
What are some of the best open-source Data projects in Python? This list will help you: