Top 23 Python Data Analysis Projects
scikit-learn: machine learning in Python
Project mention: Data Science toolset summary from 2021 | dev.to | 2021-11-13
Scikit-learn - It is one of the most widely used frameworks for Python-based data science tasks. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Link - https://scikit-learn.org/
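As a minimal sketch of the NumPy interoperability described above, the snippet below clusters a small NumPy array with k-means. The data points and the choice of two clusters are made up for illustration; `KMeans` and `fit_predict` are part of scikit-learn's public API.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated, made-up groups of 2-D points in a NumPy array.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Fit k-means with two clusters; random_state fixes the initialisation.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

# labels is a NumPy array: the first three points share one label and
# the last three share the other (which group gets 0 or 1 is arbitrary).
```

The estimator accepts the array directly and returns its result as an array, so no conversion step is needed on either side.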
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Project mention: Why df.drop_duplicates() doesn't work for me? | reddit.com/r/learnpython | 2021-12-02
There is discussion of the inplace argument becoming deprecated across the entire pandas API.
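The question in the thread title usually comes down to `drop_duplicates()` returning a new DataFrame instead of modifying the original. A minimal sketch of the pattern that works without `inplace`, using a made-up DataFrame whose column names are purely illustrative:

```python
import pandas as pd

# Hypothetical example data: the second and third rows are identical.
df = pd.DataFrame({"name": ["ann", "bob", "bob"], "score": [1, 2, 2]})

# Calling drop_duplicates() without using the result leaves df unchanged.
df.drop_duplicates()
rows_before = len(df)

# Assign the returned DataFrame back to keep the de-duplicated rows.
df = df.drop_duplicates().reset_index(drop=True)
rows_after = len(df)
```

Reassigning the result keeps the code independent of whether `inplace` stays in the API.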
Streamlit: the fastest way to build data apps in Python
Project mention: Suggestions for GUI framework for an app to browse tables of data, with buttons and dropdown menus in cells? And some related PySimpleGui questions | reddit.com/r/learnpython | 2021-11-07
I've never used it, but someone suggested it in another thread, and it looked interesting to me, so I have it bookmarked to try out: https://streamlit.io/
Statsmodels: statistical modeling and econometrics in Python
Project mention: Advice required to choose appropriate software for an assignment | reddit.com/r/econometrics | 2021-04-26
Can't you get a student discount for Stata? R would definitely be able to handle everything. For Python, have a look through the statsmodels package: https://github.com/statsmodels/statsmodels
A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)
Project mention: [D] Unsupervised Outlier Detection - Advise Requested | reddit.com/r/MachineLearning | 2021-12-03
The source code and documentation of PyOD are the best survey of OOD. Besides, normalizing flows and VQ-VAE are also feasible.
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Project mention: How does everyone share their models etc. across teams for re-use effectively? | reddit.com/r/datascience | 2021-05-22
Create UIs for your machine learning model in Python in 3 minutes
Project mention: AnimeGan V2 | news.ycombinator.com | 2021-11-07
Missing data visualization module for Python.
Project mention: For all the python/pandas users out there I just released a bunch of UI updates to the free visualizer, D-Tale | reddit.com/r/algotrading | 2021-04-12
Analysis of "Missing" data using the missingno package is now available in a sliding side panel. Enlarge or download PNG files for matrix/bar/heatmap/dendrogram charts generated using missingno.
A delightful machine learning tool that allows you to train, test, and use models without writing code
Project mention: Train/fit, test, and use models without writing code | reddit.com/r/ArtificialInteligence | 2021-06-29
Link to the repo: https://github.com/nidhaloff/igel
A grammar of graphics for Python
Project mention: Book Recommendations Matlab->Python? | reddit.com/r/engineering | 2021-11-02
plotnine is a Python port of ggplot.
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Project mention: Redshift API vs. other ways to connect? | reddit.com/r/datascience | 2021-10-21
awslabs has developed its own package for this and, given it's for their product, seems likely to maintain it. https://github.com/awslabs/aws-data-wrangler
Extract data from a wide range of Internet sources into a pandas DataFrame.
Project mention: Best quantitative tools/repos/apis for Sentiment & Social Media analysis of individual Stock/Crypto tickers | reddit.com/r/algotrading | 2021-07-03
Also Yahoo continually takes steps to discourage programmatic access (the most recent attempt is happening right now: https://github.com/pydata/pandas-datareader/issues/868).
Visualize and compare datasets, target values and associations, with one line of code.
Project mention: Automated Data Profiling and Attribute Clustering using unsupervised ML techniques | reddit.com/r/datascience | 2021-07-03
Take a look at this package, which computes associations between variables and other viz and can infer some types: https://github.com/fbdesignpro/sweetviz
Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.
Project mention: Hacktoberfest: Flytesnacks Project "update tuple output examples" | dev.to | 2021-11-01
I chose the flytekit project, which is one of the component repos of Flyte and contains the Python SDK and tools of the Flyte project.
Light-weight Python OLAP framework for multi-dimensional data analysis
Project mention: Building data analysis apps | reddit.com/r/Python | 2021-04-16
I'm looking for materials and tools to learn. I'm reading up on OLAP and cubes. I found the cubes Python package, but it hasn't been updated in years. Could you give me some tips on what to learn in 2021?
Next-gen framework for backtesting, algorithmic trading, and research. Blazingly fast. Super accurate. Pandas friendly.
Project mention: Looking for active python backtesting framework | reddit.com/r/algotrading | 2021-02-09
However, it's not the fastest framework. If you need speed, and are good with the data science tool chain in python and the concept of flattening loops into vectorized operations, check out vector-bt. I haven't gotten a chance to play with it yet, but I'm definitely going to as soon as I find some spare time. It seems like a great option with a nicely modernized approach.
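To illustrate what "flattening loops into vectorized operations" means in practice, here is a minimal sketch in plain NumPy. It is not the API of the framework mentioned above, and the price series is made up. Both versions compute the same period-over-period returns.

```python
import numpy as np

# Hypothetical closing prices for four consecutive periods.
prices = np.array([100.0, 102.0, 101.0, 105.0])

# Loop version: one return per consecutive pair of prices.
loop_returns = []
for i in range(1, len(prices)):
    loop_returns.append(prices[i] / prices[i - 1] - 1.0)

# Vectorized version: the same calculation on whole array slices.
vec_returns = prices[1:] / prices[:-1] - 1.0
```

The vectorized form hands the element-wise division to NumPy's compiled code, which is where the speed advantage on long series comes from.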
Multi-class confusion matrix library in Python
Project mention: [P] PyCM 3.3 released: Comparison of Classifiers Based on Confusion Matrix | reddit.com/r/MachineLearning | 2021-10-27
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
NFStream: a Flexible Network Data Analysis Framework.
Project mention: Open Source Deep Packet Inspection Using Python | news.ycombinator.com | 2021-07-02
GitHub project: https://github.com/nfstream/nfstream
Community feedback and contributions are welcome!
Python library for using dplyr like syntax with pandas and SQL
Project mention: Going from R to Pandas: dplython vs dfply vs plydata | reddit.com/r/datascience | 2021-09-30
You should follow /u/the75th's advice. However, if you decide to buck that take, I'd look into siuba. I've never heard of those packages you've listed, and have doubts they'd be maintained.
What's in your data? Extract schema, statistics and entities from datasets
Project mention: [P] DataProfiler - Scaleable Sensitive Data Detection & Analysis on Structured & Unstructured Files | reddit.com/r/MachineLearning | 2021-11-22
re_data - fix data issues before your users & CEO would discover them 😊
Project mention: great_expectations VS redata - a user suggested alternative | libhunt.com/r/great_expectations | 2021-09-24
It's more convenient when you are already using dbt and don't want to set up a separate workflow for testing data when it can be done with dbt inside the data warehouse. Also, the thing re_data does well is letting you create time-based metrics about your data quality instead of just tests (a lot of the tests can be rewritten as metrics). That allows you to do a couple of things more than GE: you can, for example, easily visualize those metrics or look for anomalies in them. You can also compute tests much more efficiently. Research on computing metrics as a good way of doing data quality was actually done by the team behind deequ: http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf I'm the author, so obviously I'm a bit biased :)
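The idea of time-based data quality metrics can be sketched in a few lines of standard-library Python. This is only an illustration of the concept, not re_data's or dbt's actual interface: the rows, the null-rate metric, the `is_anomaly` helper, and the threshold are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up rows: (day, value), where None marks a missing value.
rows = [
    ("2021-09-20", 1.0), ("2021-09-20", 2.0),
    ("2021-09-21", 3.0), ("2021-09-21", 4.0),
    ("2021-09-22", 5.0), ("2021-09-22", 6.0),
    ("2021-09-23", None), ("2021-09-23", None),
]

# Metric: fraction of missing values per day.
by_day = defaultdict(list)
for day, value in rows:
    by_day[day].append(value is None)
null_rate = {day: sum(flags) / len(flags) for day, flags in sorted(by_day.items())}

def is_anomaly(history, latest, threshold=3.0):
    """Flag `latest` if it is far from the mean of earlier values (hypothetical helper)."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        # No variation in the history: any different value counts as anomalous.
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Compare the most recent day's metric against the earlier days.
days = list(null_rate)
history = [null_rate[d] for d in days[:-1]]
latest_is_anomalous = is_anomaly(history, null_rate[days[-1]])
```

Because the metric is stored per day, the same series can be plotted over time or scanned for anomalies, which a single pass/fail test does not allow.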
Python Data Analysis related posts
Why df.drop_duplicates() doesn't work for me?
2 projects | reddit.com/r/learnpython | 2 Dec 2021
Trying to create a loop and print the values on a table. (Beginner)
1 project | reddit.com/r/learnpython | 2 Dec 2021
How to automate financial data collection and storage in CrateDB with Python and pandas
1 project | dev.to | 25 Nov 2021
It annoys me how people blame students for majoring in the wrong majors
1 project | reddit.com/r/lostgeneration | 22 Nov 2021
Should I do a CompSci course or just keep practicing my Python?
1 project | reddit.com/r/learnpython | 21 Nov 2021
[Pandas] Struggling to see what these lines achieve, any help appreciated.
1 project | reddit.com/r/Cython | 18 Nov 2021
Launch HN: Metaplane (YC W20) – Datadog for Data
6 projects | news.ycombinator.com | 15 Nov 2021
What are some of the best open-source Data Analysis projects in Python? This list will help you:
| 12 | AWS Data Wrangler | 2,339 |