kaggle-environments vs Airflow

| | kaggle-environments | Airflow |
|---|---|---|
| Mentions | 55 | 180 |
| Stars | 289 | 36,634 |
| Growth | 0.3% | 1.6% |
| Activity | 7.9 | 10.0 |
| Last commit | 4 days ago | 3 days ago |
| Language | Jupyter Notebook | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
kaggle-environments
- Data Science Roadmap with Free Study Material
- Help needed! My first hackathon
If you are interested in Data Science, you may want to look at Kaggle competitions. https://www.kaggle.com/competitions
- What's a statistical / research methodology, that's not usually taught in grad programs, that you think more IO's should be aware about?
- Freaking out about how I’m inexperienced to land an internship and eventually a job
Secondly, if you feel like you do not have enough skills or lack practice answering problem statements, there are a lot of good websites where you can find interesting projects. I would recommend participating in some Kaggle competitions or downloading some free Google datasets and playing with them.
- Capitalism provides half-assed solutions to extinction-level problems caused by capitalism
For reference: Kaggle is a Google product. You can see the list of current competitions here.
- Where can neural networks take me? - Semi-existential crisis
- What Can I Do With My Time as a Substitute for Strategy Computer Games?
You could try Kaggle competitions, or participating in forecasting markets (as you stated) is another option. You don't need any specific skill set to be a forecaster: the rules of the bet are stipulated, and from there it's just based on your ability to predict the outcome. You could also try your hand at investing in the stock market, or try to make money betting on sports games. If you're very good at this stuff, I'm sure you can make a lot of money doing it. The thing to keep in mind is that, generally, video games are much, much easier than real life.
- What is the best advanced professional certification for Data Science/ML/DL/MLOps?
As to the specifics of your projects, that's up to you. Try browsing Kaggle; check out some of the work we have on The Pudding; check out some journalism examples to see what you can try to build on or improve.
- Suggestions for projects on kaggle for cv?
- Hi! I'm doing research on AI innovation. Does anybody know any specific platform where I can learn/understand and get case studies or ongoing projects that companies are implementing? Thanks for your help!
You might want to look at kaggle competitions.
Airflow
- Enabling Apache Airflow to copy large S3 objects
This approach means the API doesn't change, i.e., you can just replace the S3CopyObjectOperator instances with S3CopyOperator instances. Additionally, we only perform the extra work of the multipart upload when the simpler method is insufficient. The trade-off is that we're inefficient if almost every object is larger than 5 GB, because we make a "useless" API call first. As usual, it depends. A similar approach has been discussed in this GitHub issue in the Airflow repository.
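A minimal sketch of that fallback idea (my own illustration, not the article's code; the attribute names mirror the stock S3CopyObjectOperator, and the InvalidRequest error code is an assumption): try the plain CopyObject call first, and only fall back to boto3's managed copy, which performs a multipart copy under the hood, when S3 rejects the source as too large.

```python
from botocore.exceptions import ClientError
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.operators.s3 import S3CopyObjectOperator


class S3CopyOperator(S3CopyObjectOperator):
    """Drop-in replacement: same constructor arguments as S3CopyObjectOperator."""

    def execute(self, context):
        try:
            # Plain CopyObject call; works for sources up to 5 GB.
            return super().execute(context)
        except ClientError as err:
            # Assumed error code for "copy source larger than 5 GB".
            if err.response["Error"]["Code"] != "InvalidRequest":
                raise
            # Fall back to boto3's managed copy, which transparently does a
            # multipart copy for large objects.
            client = S3Hook(aws_conn_id=self.aws_conn_id).get_conn()
            client.copy(
                CopySource={"Bucket": self.source_bucket_name, "Key": self.source_bucket_key},
                Bucket=self.dest_bucket_name,
                Key=self.dest_bucket_key,
            )
```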
- Deploy Apache Airflow on AWS Elastic Kubernetes Service (EKS)
helm repo add apache-airflow https://airflow.apache.org
- New Apache Airflow Operators for Google Generative AI
We only use KubernetesOperators, but this has many downsides, and it's very clearly an afterthought in the Airflow project. It creates confusion because Airflow users expect features A, B, and C, and with KubernetesOperators those aren't functional, since your business logic has to live separately. There are a number of blog posts echoing a similar critique[1]. Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and makes Airflow as a whole a pretty overkill system just to monitor external tasks. At that point, you should have just kept your orchestration in client code to begin with, and many other frameworks get this client/server division right. That would also make it easier to support multiple languages. (A rough sketch of such a task follows after this item.)
According to their README: https://github.com/apache/airflow#approach-to-dependencies-o...
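To make the critique above concrete, here is a minimal, hypothetical KubernetesPodOperator task (not from the comment or any linked post): the business logic is baked into a container image that Airflow only launches and monitors, which is exactly the separation being complained about. The DAG name, image, entrypoint, and namespace are made up, and the import path varies with the cncf.kubernetes provider version.

```python
import pendulum
from airflow import DAG
# In older provider versions the import path is ...operators.kubernetes_pod
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="score_customers",  # hypothetical DAG
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    # Airflow only starts the pod and watches it; the actual logic lives in
    # the container image, outside the DAG file.
    KubernetesPodOperator(
        task_id="score_customers",
        name="score-customers",
        namespace="airflow",
        image="registry.example.com/score-customers:latest",  # hypothetical image
        cmds=["python", "-m", "scoring.run"],                  # hypothetical entrypoint
        get_logs=True,
    )
```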
- Anyone Can Access Deleted and Private Repository Data on GitHub
> Nope, me too. The whole Repo network thing is not User facing at all.
There are some user-facing parts: You can find the fork network and some related bits under repo insights. (The UX is not great.)
https://github.com/apache/airflow/forks?include=active&page=...
- Data on Kubernetes: Part 3 - Managing Workflows with Job Schedulers and Batch-Oriented Workflow Orchestrators
There are several tools available that can help manage these workflows. Apache Airflow is a platform designed to programmatically author, schedule, and monitor workflows.
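As a minimal, made-up illustration of what "programmatically author, schedule, and monitor" means in practice (the names, schedule, and tasks are arbitrary, not from the post, and this assumes a recent Airflow 2.x): the workflow is ordinary Python, and the schedule is part of the DAG definition, so the scheduler can run and track it.

```python
import pendulum
from airflow.decorators import dag, task


@dag(
    dag_id="daily_report",                    # hypothetical DAG name
    schedule="@daily",                        # Airflow runs this once a day
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def daily_report():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]                      # stand-in for a real data source

    @task
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")     # stand-in for a real sink

    load(extract())                           # declares the extract -> load dependency


daily_report()
```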
- Ask HN: What's the right tool for this job?
From what I've seen, there are sort of two paths. I'll provide a well known example from each.
1. Language-specific distributed task library
For example, in Python, Celery is a pretty popular task system. If you (the dev) are the one writing all the code and running the workflows, it might work well for you. You build the core code and functions, and it handles the processing and resource management with a little config (see the sketch after these links).
* https://github.com/celery/celery
Or lower level:
* https://github.com/dask/dask
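A tiny Celery sketch (my illustration; the broker URL and task are made up) of that division of labor: you write the functions, Celery handles queuing and running them on workers.

```python
from celery import Celery

# Assumes a local Redis broker; any broker URL Celery supports works here.
app = Celery("tasks", broker="redis://localhost:6379/0")


@app.task
def resize_image(path: str) -> str:
    # Your business logic goes here; Celery takes care of queuing, retries,
    # and executing it on whichever workers you start.
    return f"resized {path}"


# From client code, enqueue the work instead of calling the function directly:
# resize_image.delay("/data/cat.jpg")
```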
2. DAG Workflow systems
There are also whole systems for what you're describing. They've gotten especially popular in the MLOps and data engineering world. A common one is Airflow:
* https://github.com/apache/airflow
- Apache Doris Job Scheduler for Task Automation
Job scheduling is an important part of data management, as it enables regular data updates and cleanups. In a data platform, it is often handled by workflow orchestration tools like Apache Airflow and Apache DolphinScheduler. However, adding another component to the data architecture also means investing extra resources in management and maintenance. That's why Apache Doris 2.1.0 introduces a built-in Job Scheduler. It is more closely tailored to Apache Doris and brings higher scheduling flexibility and architectural simplicity.
- How I've implemented the Medallion architecture using Apache Spark and Apache Hadoop
The custom orchestrator I used should be replaced by a proper orchestration tool like Apache Airflow, Dagster, etc.
- 10 Open Source Tools for Building MLOps Pipelines
An integral part of an ML project is data acquisition and the transformation of data into the required format. This involves creating ETL (extract, transform, load) pipelines and running them periodically. Airflow is an open source platform that helps engineers create and manage complex data pipelines. Furthermore, support for the Python programming language makes it easy for ML teams to adopt Airflow.
- AI Strategy Guide: How to Scale AI Across Your Business
Level 1 of MLOps is when you've put each lifecycle stage and its interfaces into an automated pipeline. The pipeline could be a Python or Bash script, or it could be a directed acyclic graph run by an orchestration framework like Airflow, Dagster, or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML, and DVC also offer pipeline capabilities.
What are some alternatives?
CKAN - CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
Kedro - Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
stable-baselines - A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
dagster - An orchestration platform for the development, production, and observation of data assets.
docarray - Represent, send, store and search multimodal data
n8n - Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
stable-baselines3 - PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
awesome-katas - A curated list of code katas
Apache Spark - A unified analytics engine for large-scale data processing
datasci-ctf - A capture-the-flag exercise based on data analysis challenges
Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more