Top 23 Python Data Projects
-
airbyte
Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Project mention: airbyte VS cloudquery - a user suggested alternative | libhunt.com/r/airbyte | 2023-06-02
-
knowledge-repo
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
While it's a start, just being a markdown editor is not enough; GitHub and GitLab already have this sort of wiki. I feel something like https://github.com/airbnb/knowledge-repo provides a better experience, since it gives Data Scientists an incentive to keep their source notebooks well documented and to treat them as the SSoT. With a wiki, if you change something in the original project, you need to remind yourself to update your reports. If your notebook is itself your report, that's not necessary. Plus, it would benefit from the Semantic Diffs that DagsHub has already implemented.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Project mention: Data sources episode 2: AWS S3 to Postgres Data Sync using Singer | dev.to | 2023-05-04
Link to original blog: https://www.mage.ai/blog/data-sources-ep-2-aws-s3-to-postgres-data-sync-using-singer
-
Mimesis
Mimesis is a robust data generator for Python, capable of rapidly producing large volumes of synthetic data for various use cases.
Project mention: Mimesis allows you to easily generate detailed dummy datasets. | /r/datascience | 2023-04-12
Mimesis has well-structured and comprehensive documentation: https://mimesis.name
-
datasets
Project mention: TensorFlow Datasets (TFDS): a collection of ready-to-use datasets | news.ycombinator.com | 2022-12-21
I tried Librispeech, a very common dataset for speech recognition, in both HF and TFDS.
TFDS performed extremely badly.
First it failed because the official hosting server only allows 5 simultaneous connections, and TFDS totally ignores that and makes up to 50 simultaneous downloads, which breaks. I wonder if anyone actually tested this?
Then you need to have some computer with 30GB to do the preparation, which might fail on your computer. This is where I stopped. https://github.com/tensorflow/datasets/issues/3887. It might be fixed now but it took them 8 months to respond to my issue.
On HF, it just worked. There was a smaller issue in how the dataset was split up but that is fixed now, and their response was very fast and great.
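The connection-limit failure described in this comment is typically avoided by capping concurrency on the client side. Below is a minimal sketch of that idea, assuming the limit of 5 quoted in the comment. The `download_all` helper and the `fake_fetch` stand-in are hypothetical names written for illustration; this is not how TFDS or HF datasets implement downloading.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 5  # limit quoted in the comment above


def download_all(urls, fetch, max_connections=MAX_CONNECTIONS):
    """Run fetch(url) for every URL with at most max_connections in flight."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        # pool.map preserves input order and never runs more than
        # max_workers calls at the same time.
        return list(pool.map(fetch, urls))


# Hypothetical stand-in for a real download; it records peak concurrency.
_lock = threading.Lock()
_active = 0
peak = 0


def fake_fetch(url):
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)  # simulate network latency
    with _lock:
        _active -= 1
    return f"downloaded {url}"


# Placeholder URLs; no network access happens in this sketch.
urls = [f"https://example.invalid/part-{i}.tar.gz" for i in range(50)]
results = download_all(urls, fake_fetch)
```

Bounding the worker pool is the simplest way to respect a server-side connection cap; a semaphore around the request would work equally well when the downloads are not driven by a pool.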
-
CKAN
CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
Project mention: Metadata Store - Which one to Choose ? OpenMetadata vs Datahub ? | /r/dataengineering | 2022-11-17
We use Kubernetes as our deployment platform. Any feedback on one of these open source data catalogs?
- https://atlas.apache.org/#/
- https://opendatadiscovery.org/
- https://open-metadata.org/
- https://marquezproject.github.io/marquez/
- https://datahubproject.io/
- https://www.amundsen.io/
- https://ckan.org/
- https://magda.io/
-
flyte
Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01
Anyone tried Flyte?
-
pandas-datareader
I've looked at https://github.com/pydata/pandas-datareader and it looks good. Does anyone have experience with it?
-
PyFunctional
Project mention: Kotlin/Java/Javascript/Scala users do you miss the ability to chain functional operators in Python? | /r/learnpython | 2022-06-05
I often end up using [pyfunctional](https://github.com/EntilZha/PyFunctional), which gives the ability to use chained functional operators, but I am still kind of sad this is not a builtin solution in Python (please note I am not involved in the pyfunctional project and I do not know the author).
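For readers unfamiliar with the style being discussed, here is a minimal sketch of chained functional operators in plain Python. The `Chain` class is hypothetical and written only for illustration; it is not PyFunctional's API.

```python
from functools import reduce


class Chain:
    """Hypothetical wrapper that lets map/filter/reduce be chained left to right."""

    def __init__(self, items):
        self._items = items

    def map(self, fn):
        # Lazily apply fn to every element and keep chaining.
        return Chain(map(fn, self._items))

    def filter(self, pred):
        # Lazily keep only the elements for which pred is true.
        return Chain(filter(pred, self._items))

    def reduce(self, fn, initial):
        # Terminal operation: fold the elements into a single value.
        return reduce(fn, self._items, initial)

    def to_list(self):
        # Terminal operation: materialize the elements.
        return list(self._items)


total = (
    Chain(range(1, 6))
    .map(lambda x: x * 2)
    .filter(lambda x: x > 4)
    .reduce(lambda acc, x: acc + x, 0)
)
```

The same pipeline with builtins reads inside out, `reduce(..., filter(..., map(...)))`, which is the ergonomic gap the comment is pointing at.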
-
PyPika
PyPika is a python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.
-
mara-pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
-
sketch
This morning I added a "Related Projects" [3] Section to the Buckaroo docs. If Buckaroo doesn't solve your problem, look at one of the other linked projects (like Mito).
[1] https://github.com/approximatelabs/sketch
-
monorepo
I think the biggest area for growth for LLM based tools for data analysis is around helping users _understand what edits they actually made_.
I'm a co-founder of a non-AI data code-gen tool for data analysis -- but we also have a basic version of an LLM integration. The problem we see with tooling like Pandas AI (in practice! with real users at enterprises!) is that users make an edit like "remove NaN values" and then get a new dataframe -- but they have no way of checking if the edited dataframe is actually what they want. Maybe the LLM removed NaN values. Maybe it just deleted some random rows!
The key here: how can users build an understanding of how their data changed, and confirm that the changes made by the LLM are the changes they wanted. In other words, recon!
We've been experimenting more with this recon step in the AI flow (you can see the final PR here: https://github.com/mito-ds/monorepo/pull/751). It takes a similar approach to the top comment (passing a subset of the data to the LLM), and then really focuses in the UI around "what changes were made." There's a lot of opportunity for growth here, I think!
Any/all feedback appreciated :)
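The "recon" idea in this comment, checking what an edit actually did to the data, can be sketched in a few lines of plain Python. The `recon` function, the sample rows, and the use of `None` for a missing value are all assumptions made for illustration; this is not Mito's implementation and does not reflect the linked PR.

```python
def recon(before, after, key):
    """Summarize how rows changed between two versions of a table.

    Rows are dicts; `key` names a column that uniquely identifies a row.
    """
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"removed": removed, "added": added, "modified": modified}


before = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": None},  # None stands in for a missing value (NaN)
    {"id": 3, "price": 7.5},
]
# Pretend a tool was asked to "remove NaN values" and returned this:
after = [
    {"id": 1, "price": 10.0},
]

report = recon(before, after, key="id")

# Rows we expected to lose (those with a missing price) versus
# rows that disappeared for no stated reason.
expected_removed = [row["id"] for row in before if row["price"] is None]
unexpected = [k for k in report["removed"] if k not in expected_removed]
```

Here `unexpected` flags row 3, which had no missing value but was dropped anyway: the "maybe it just deleted some random rows" case the comment warns about.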
-
Colour
People who work on Resolve also work on https://github.com/colour-science/colour, which is the most REFERENCE code with the most complete support there is.
-
diffgram
Integrate Human Supervision into your Platform. For all Training Data Types, Image, Video, 3D, Text, Geo, Audio, Compound, Grid, LLM, GPT, Conversational, and more.
Project mention: Open source tool scrapes git commit email addresses to send spam to. | /r/programming | 2022-07-29
glom
☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️ (by mahmoud)
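To illustrate what declarative restructuring of nested data means, here is a plain-Python sketch. The helpers `get_path` and `restructure` are hypothetical names and are not glom's API or implementation; glom itself supports far more than dotted paths.

```python
def get_path(target, path):
    """Follow a dotted path such as "a.b.c" through nested dicts."""
    current = target
    for part in path.split("."):
        current = current[part]
    return current


def restructure(target, spec):
    """Build a new flat dict; spec maps output keys to dotted input paths."""
    return {out_key: get_path(target, path) for out_key, path in spec.items()}


record = {"user": {"name": "Ada", "address": {"city": "London"}}}
flat = restructure(record, {"name": "user.name", "city": "user.address.city"})
```

The spec describes the shape of the output rather than the steps to compute it, which is the sense of "declarative" in the tagline.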
Python Data related posts
- Postgres to Postgres incremental (timestamp) sync etl tool
- How useful is Airbytes in production pipelines?
- Show HN: Loofi – Our AI-Powered SQL Query Builder
- Sub library with useful code
- [OC] World Gold Reserves by Institution
- Pandas AI – The Future of Data Analysis
- Do we need Spark if we're just loading data from source to target? Trying to move away from Informatica
-
A note from our sponsor - Sonar
www.sonarsource.com | 3 Jun 2023
Index
What are some of the best open-source Data projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Prefect | 12,036 |
2 | airbyte | 10,796 |
3 | chinese-xinhua | 10,079 |
4 | akshare | 6,578 |
5 | knowledge-repo | 5,328 |
6 | Mage | 4,715 |
7 | Mimesis | 3,972 |
8 | datasets | 3,885 |
9 | CKAN | 3,823 |
10 | flyte | 3,420 |
11 | TextRecognitionDataGenerator | 2,682 |
12 | pandas-datareader | 2,656 |
13 | PyFunctional | 2,176 |
14 | PyPika | 2,023 |
15 | mara-pipelines | 2,003 |
16 | sketch | 1,785 |
17 | monorepo | 1,730 |
18 | Colour | 1,708 |
19 | diffgram | 1,706 |
20 | glom | 1,697 |
21 | Cubes | 1,489 |
22 | pycm | 1,382 |
23 | pyjanitor | 1,147 |