Pandas
Airflow
|  | Pandas | Airflow |
|---|---|---|
| Mentions | 307 | 143 |
| Stars | 36,692 | 29,004 |
| Growth | 1.0% | 1.5% |
| Activity | 10.0 | 10.0 |
| Latest commit | 5 days ago | 1 day ago |
| Language | Python | Python |
| License | BSD 3-clause "New" or "Revised" License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Pandas
- How to query pandas DataFrames with SQL
Pandas is a go-to tool for tabular data management, processing, and analysis in Python, but sometimes you may want to go from pandas to SQL.
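One dependency-free way to make that jump is to round-trip the DataFrame through an in-memory SQLite database. The DataFrame and query below are invented for illustration:

```python
# Sketch: run SQL against a pandas DataFrame via in-memory SQLite.
# Table name, columns, and data are illustrative assumptions.
import sqlite3
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [3, 5, 2]})

con = sqlite3.connect(":memory:")
df.to_sql("sales", con, index=False)  # copy the DataFrame into SQLite

result = pd.read_sql(
    "SELECT city, SUM(sales) AS total FROM sales GROUP BY city ORDER BY city",
    con,
)
# result has one row per city with the summed sales
```

Libraries such as pandasql or DuckDB offer the same idea without the explicit copy step.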
- What are the best Python libraries to learn for beginners?
pandas
- Replacing Pandas with Polars. A Practical Guide
> The big thing pandas has going for it is that it's already been through this field testing. All the bugs have been ironed out by the hundreds of thousands of users.

At this very moment the pandas GitHub repo has 1,563 open issues labeled with the bug tag [0]. So much for "all the bugs have been ironed out".
[0] https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3...
- Joining the Open Source Development Course
Python is the main programming language I use nowadays. In particular, numpy and pandas are extremely useful. I also use the biopython package, a collection of software tools for biological computation written in Python by an international group of researchers and developers.
- Pandas VS Rath - a user suggested alternative
2 projects | 12 Jan 2023
- Twitter Data Pipeline with Apache Airflow + MinIO (S3 compatible Object Storage)
Below is the Python task that transforms the list of tweets into a pandas DataFrame, then dumps it into our MinIO object storage as a CSV file:
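A minimal sketch of that transform step, without the Airflow or MinIO wiring; the tweet fields here are invented placeholders:

```python
# Sketch: list of tweet dicts -> pandas DataFrame -> CSV bytes.
# Field names and sample tweets are assumptions, not from the article.
import io
import pandas as pd

def tweets_to_csv_bytes(tweets: list[dict]) -> bytes:
    """Turn a list of tweet dicts into CSV bytes ready for object storage."""
    df = pd.DataFrame(tweets)
    buf = io.BytesIO()
    df.to_csv(buf, index=False)
    return buf.getvalue()

tweets = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "world"},
]
payload = tweets_to_csv_bytes(tweets)
# A MinIO client (minio.Minio) would then upload `payload` with put_object;
# that part is omitted here to keep the sketch dependency-free.
```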
- Hanukkah of Data 2022 - Puzzle 2
It was rewarding to dig into SQLite a bit while solving this puzzle, so I figured this would be a good opportunity to learn a bit more about pandas too! So how would I adapt this working SQL solution to pandas?
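The adaptation usually maps `WHERE` to boolean indexing and `ORDER BY` to `sort_values`. A toy example, with a table and query invented for illustration (not the actual puzzle data):

```python
# Sketch: adapting a working SQL query to pandas operations.
# The customers table and the query itself are assumptions.
import pandas as pd

customers = pd.DataFrame({
    "name": ["Sam", "Alex", "Kim"],
    "phone": ["826-5512", "185-0565", "826-9999"],
})

# SQL: SELECT name FROM customers WHERE phone LIKE '826%' ORDER BY name;
adapted = (
    customers[customers["phone"].str.startswith("826")]  # WHERE ... LIKE '826%'
    .sort_values("name")["name"]                         # ORDER BY name
    .tolist()
)
```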
- ETL using pandas
- Tor Network Statistics & Performance [OC]
All the data was extracted from the official Tor Metrics website and cleaned in Python with the pandas library; the visualizations were then made with Tableau.
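A typical pandas cleaning pass of that kind might look like this; the column names and sample values are invented, not actual Tor Metrics fields:

```python
# Sketch: load raw CSV, parse dates, drop incomplete rows, fix dtypes.
# Sample data is an assumption for illustration.
import io
import pandas as pd

raw = "date,users\n2022-01-01,2100000\n2022-01-02,\n2022-01-03,2150000\n"
df = pd.read_csv(io.StringIO(raw), parse_dates=["date"])

clean = df.dropna(subset=["users"]).astype({"users": int})
# `clean` now holds only complete rows, ready to export for a tool like Tableau
```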
- How to take inputs from an ASCII file in Python
If you did that, you could use a built-in library like csv to read and parse the file, or you could use a third-party library like Pandas. Alternatively, you could store your file as JSON:
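Side by side, the three options mentioned might look like this (the sample data is invented):

```python
# Sketch: reading the same tabular data with stdlib csv, with pandas,
# and as JSON. File contents are illustrative assumptions.
import csv
import io
import json
import pandas as pd

text = "name,score\nada,90\nbob,85\n"

# 1) built-in csv module: each row becomes a dict of strings
rows = list(csv.DictReader(io.StringIO(text)))

# 2) pandas: typed columns, ready for analysis
df = pd.read_csv(io.StringIO(text))

# 3) the same data stored as JSON instead of CSV
payload = json.loads('[{"name": "ada", "score": 90}, {"name": "bob", "score": 85}]')
```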
Airflow
- Building a Data Lakehouse for Analyzing Elon Musk Tweets using MinIO, Apache Airflow, Apache Drill and Apache Superset
💡 You can read more here.
- How do you manage scheduled tasks?
It's a bit overkill, but I use Airflow with the LocalExecutor.
- Twitter Data Pipeline with Apache Airflow + MinIO (S3 compatible Object Storage)
To learn more about it, I built a data pipeline that uses Apache Airflow to pull Elon Musk tweets via the Twitter API and store the result as a CSV in a MinIO (OSS alternative to AWS S3) object storage bucket.
- Data Analytics at Potloc I: Making data integrity your priority with Elementary & Meltano
Airflow
- Self-hosted alternative to easycron.com?
- Azure OAuth CSRF State Not Equal Error
I am currently trying to enable Azure OAuth for authentication into our Airflow instance. I have posted in countless other places without answers, so this is my next stop. Here is the discussion I opened in the airflow repo: https://github.com/apache/airflow/discussions/28098, but I will also post it here. If anybody has any knowledge or can help, I would greatly appreciate it; I have been dealing with this for over a month.
- ETL tool
Airflow is really popular; it started at Airbnb. Pros: huge community, super mature. Cons: generic workflow orchestration, not the best fit for data-only workloads, and hard to scale and maintain.
- How to do distributed cronjobs with worker queues?
Airflow might also be a good option for you. Essentially DAGs of cronjobs. We like it a lot.
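The "DAGs of cronjobs" idea can be sketched as a minimal DAG file, assuming Airflow 2.4+ for the `schedule` argument; the IDs, cron expression, and commands are placeholders:

```python
# Sketch of a minimal Airflow DAG: two tasks on a cron schedule,
# where the second runs only after the first succeeds.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_jobs",
    schedule="15 2 * * *",            # cron expression: 02:15 every night
    start_date=datetime(2023, 1, 1),
    catchup=False,                    # do not backfill missed runs
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    extract >> load                   # dependency edge in the DAG
```

The scheduler then distributes these tasks to workers (e.g. with the CeleryExecutor), which is what makes the cronjobs distributed.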
- Airflow :: Deploy Apache Airflow on Rancher K3s
```
$ helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace
Release "airflow" does not exist. Installing it now.
NAME: airflow
LAST DEPLOYED: Sun Nov 6 02:06:55 2022
NAMESPACE: airflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing Apache Airflow 2.4.1!

Your release is named airflow.

You can now access your dashboard(s) by executing the following command(s) and visiting
the corresponding port at localhost in your browser:

Airflow Webserver:     kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow

Default Webserver (Airflow UI) Login credentials:
    username: admin
    password: admin
Default Postgres connection credentials:
    username: postgres
    password: postgres
    port: 5432

You can get Fernet Key value by running the following:

    echo Fernet Key: $(kubectl get secret --namespace airflow airflow-fernet-key -o jsonpath="{.data.fernet-key}" | base64 --decode)

###########################################################
#  WARNING: You should set a static webserver secret key  #
###########################################################
You are using a dynamically generated webserver secret key, which can lead to
unnecessary restarts of your Airflow components.

Information on how to set a static webserver secret key can be found here:
https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key
```
- Duct Size vs. Airflow (2012)
I gotta admit, my first thought was "Duct Size" is a weird name for a distributed work-flow tool[1].
What are some alternatives?
dagster - An orchestration platform for the development, production, and observation of data assets.
Kedro - A Python framework for creating reproducible, maintainable and modular data science code.
Cubes - Light-weight Python OLAP framework for multi-dimensional data analysis
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
orange - 🍊 Orange: Interactive data analysis
n8n - Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
Dask - Parallel computing with task scheduling
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
tensorflow - An Open Source Machine Learning Framework for Everyone
airbyte - Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
Apache Camel - Apache Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
argo - Workflow engine for Kubernetes