| | wimsey | data-engineer-handbook |
|---|---|---|
| Mentions | 4 | 3 |
| Stars | 128 | 27,545 |
| Growth | 3.9% | 2.8% |
| Activity | 7.3 | 9.1 |
| | 15 days ago | 14 days ago |
| Language | Python | Jupyter Notebook |
| License | MIT License | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
wimsey
- Classic Data science pipelines built with LLMs
I'm definitely biased because my day job is writing ETL pipelines and supporting software, and my current side project is a data contracts library for helping with the above [0]. Still, I'm not sure I see this happening.
80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (e.g. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc.).
I think an LLM would be great for "take this JSON and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".
For areas that are reliability-focused, LLMs still need a lot more improvements to be useful.
[0] https://github.com/benrutter/wimsey
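As a rough illustration of the "dead-letter queuing unknown fields" edge case mentioned in the comment above, here is a minimal sketch in plain Python. It does not use wimsey or any real billing API; the field names, `EXPECTED_FIELDS`, and `route_records` are all hypothetical and exist only for this example.

```python
# Hypothetical schema: the only fields this imaginary pipeline knows how to handle.
EXPECTED_FIELDS = {"invoice_id", "amount", "currency"}


def route_records(records):
    """Split records into (clean, dead_letter) lists.

    A record is dead-lettered if it carries fields outside the expected
    schema or is missing an expected field, so it never reaches the
    downstream tables silently.
    """
    clean, dead_letter = [], []
    for record in records:
        unknown = set(record) - EXPECTED_FIELDS
        missing = EXPECTED_FIELDS - set(record)
        if unknown or missing:
            # Keep the original record plus the reason, so it can be audited later.
            dead_letter.append(
                {
                    "record": record,
                    "unknown": sorted(unknown),
                    "missing": sorted(missing),
                }
            )
        else:
            clean.append(record)
    return clean, dead_letter


# Made-up sample data: one valid record, one with an unknown field,
# one with a missing field.
records = [
    {"invoice_id": 1, "amount": 10.0, "currency": "GBP"},
    {"invoice_id": 2, "amount": 5.0, "currency": "GBP", "discount_code": "X"},
    {"invoice_id": 3, "amount": 7.5},
]
clean, dead_letter = route_records(records)
```

Only the first sample record ends up in `clean`; the other two are set aside together with the reason they were rejected, which is the kind of auditable behaviour the comment argues is hard to delegate.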
- The Data Engineering Handbook
Nice list! Although as somebody who works on open source tools for data engineering, it kills me a little to see "companies" as the list header rather than, say, "projects".
(Also, shameless plug for my latest project Wimsey, which is non-company affiliated but does let you test data in a nice, lightweight way: https://github.com/benrutter/wimsey)
- Wimsey: A flexible, lightweight data contracts library
- This Week In Python
wimsey – Easy and flexible data testing and documentation
data-engineer-handbook
- Data-engineer-handbook: everything to learn about data engineering
This thing points to some sort of GitHub metrics dashboard.
The actual handbook is at: https://github.com/DataExpert-io/data-engineer-handbook
- FLAIV-KING Weekly 25 Nov 2024
❄️ Snowflake Cortex AI + Slack 🌐 Mistral Multi Modal ❄️ Journey to Snowflake Monitoring Mastery ❄️ Best Practices for Using QueryTag in Snowflake 💻 CFP The Way 🦾 Multi Agent Framework AWS 📊 Cool Emojis 📈 bRAG LangChain 📝 Tool: What's in your stack? 📎 Awesome Event Driven Architecture Articles 📝 10 AI Open Source Tools 🫶 Documind 💻 Zerox AI 📈 Garak Open Source LLM Scanner 🦾 Magic Quill Image Project 🏃 vLLM to serve LLMs 🤖 OASIS Simulator 🙋🏻♂️ Creating Projects with UV 🛠️ RagFormation ✅ AutoKitteh Tool ✅ WebVM in the browser ✅ UV Pytorch ✅ Redis SQL Trino ✅ Bluesky Websocket Firehose in browser ✅ Data Engineer Handbook ✅ Automatic Speech Recognition (ASR) on Edge Devices 🦾 Automatic Researcher with Ollama 🛠️ Podcastify Open Source 🙋🏻♂️ NVIDIA Jetpack 6.1 Upgrade
- The Data Engineering Handbook
What are some alternatives?
Scrapling - 🕷️ An undetectable, powerful, flexible, high-performance Python library that makes Web Scraping simple and easy again!
LLaVA-o1
finstruments - Financial instrument definitions built with Python and Pydantic
awesome-sqlite - A curated list of awesome things related to SQLite
abacus-minimal - A minimal event-based ledger in Python that follows accounting rules
awesome-selfhosted-data - machine-readable data for https://awesome-selfhosted.net