Metric | hamilton | talk-transcripts
---|---|---
Mentions | 26 | 35
Stars | 878 | 2,854
Growth | - | -
Activity | 8.1 | 4.7
Last commit | about 1 year ago | 11 months ago
Language | Python | -
License | BSD 3-clause Clear License | GNU General Public License v3.0 or later
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
hamilton
-
Write production grade pandas (and other libraries!) with Hamilton
And find the repository here: https://github.com/dagworks-inc/hamilton/
-
Useful libraries for data engineering in various programming languages
Python - https://github.com/stitchfix/hamilton (author here). It's great if you want your code to be always unit testable and documentation friendly, and you want to be able to visualize execution. Blog post on using it with Pandas https://link.medium.com/XhyYD9BAntb.
-
Cognitive Loads in Programming
Yes! As one of the creators of https://github.com/stitchfix/hamilton this was one of the aims. Simplifying the cognitive burden for those developing and managing data transforms over the course of years, and in particular for ones they didn't write!
For example in Hamilton -- we force people to write "declarative functions" which then are stitched together to create a dataflow.
E.g. example function -- my guess is that you can read and understand/guess what it does very easily.
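The function the comment refers to did not survive on this page, so here is a minimal sketch of the idea instead. It is plain Python, not Hamilton's actual implementation or API: the function names, the input values, and the `resolve` helper are all hypothetical. Each function's name is the output it produces, and its parameter names are the inputs it depends on, which is enough to stitch a dataflow together.

```python
import inspect


# Declarative functions: the function name is the output it defines,
# and the parameter names are the upstream values it depends on.
def spend_per_signup(spend: float, signups: int) -> float:
    """Marketing spend per signup."""
    return spend / signups


def spend_per_signup_in_cents(spend_per_signup: float) -> float:
    """Spend per signup, expressed in cents."""
    return spend_per_signup * 100


def resolve(name, functions, inputs, cache=None):
    """Hypothetical helper: compute `name` by recursively resolving its parameters."""
    cache = {} if cache is None else cache
    if name in inputs:  # raw input supplied by the caller
        return inputs[name]
    if name not in cache:  # compute each node at most once
        fn = functions[name]
        params = inspect.signature(fn).parameters
        args = {p: resolve(p, functions, inputs, cache) for p in params}
        cache[name] = fn(**args)
    return cache[name]


functions = {f.__name__: f for f in (spend_per_signup, spend_per_signup_in_cents)}
result = resolve(
    "spend_per_signup_in_cents", functions, {"spend": 100.0, "signups": 4}
)
print(result)  # 2500.0
```

Because every step is an ordinary function with named inputs, each one can be unit tested on its own, and only the outputs that are requested get computed.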
-
Prefect vs other things question
For (1) there are quite a few options - prefect is one, metaflow is another, airflow, dagster, even https://github.com/stitchfix/hamilton (core contributor here), etc.
-
Field Lineage
If you want to do more in Python, https://github.com/stitchfix/hamilton allows you to model dependencies at a columnar (field) level.
-
[D] Is anyone working on interesting ML libraries and looking for contributors?
Take a look at https://github.com/stitchfix/hamilton - we're after contributors who can help us grow the project, e.g. by making the documentation great, dogfooding features, and suggesting/contributing usability improvements.
-
Useful Python decorators for Data Scientists
For a real-world example of their power, we built an entire framework (https://github.com/stitchfix/hamilton) at Stitch Fix, where a lot of cool magic is provided via decorators - see https://hamilton-docs.gitbook.io/docs/reference/api-reference/available-decorators and these two source files (https://github.com/stitchfix/hamilton/blob/main/hamilton/function_modifiers_base.py, https://github.com/stitchfix/hamilton/blob/main/hamilton/function_modifiers.py). Note we do some non-trivial stuff via them.
-
unit tests
For data processing/transform code, I would recommend looking at https://github.com/stitchfix/hamilton, especially if you're trying to test pandas code. Short getting started here - https://towardsdatascience.com/how-to-use-hamilton-with-pandas-in-5-minutes-89f63e5af8f5 (disclaimer: I'm one of the authors).
-
Dealing with hundreds of customer/computed columns
The Python package, hamilton, from Stitch Fix (https://hamilton-docs.gitbook.io/docs/) can help manage transformations on pandas dataframes. This DAG of transformations is managed separately in a file - so it can be versioned, in case the transformations change. The memory required is reduced, because only the API call tables and mapping parameter table have to be in memory. The calculated columns can be produced as needed. Just like dbt, transformations are separate from the source tables - but hamilton can be used on any Python object - not just dataframes. dbt is SQL based.
talk-transcripts
-
In praise of idleness – Bertrand Russell
Reminds me a little of hammock-driven development [1]
> the background mind is good at synthesizing things. It's good about strategy
[1] https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
-
Teach Yourself Programming in Ten Years (1998)
Thank you for this recommendation. I've never heard of it before and now I'm reading: https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
It's giving me energy this Monday holiday (USA)!
-
Can't Be Fucked: Underrated Cause of Tech Debt
race?
> [Audience reply: Sprinter]
> Right, only somebody who runs really short races, okay?
> [Audience laughter]
> But of course, we are programmers, and we are smarter than runners, apparently, because we know how to fix that problem, right? We just fire the starting pistol every hundred yards and call it a new sprint.
https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
-
Strong typing, a hill I'm willing to die on
> So this is 10x, a full order of magnitude reduction in (?) severity before we get to the set of problems I think are more in the domain of what programming languages can help with, right? And because you can read these they're all going to come up in a second as I go through each one on some slide so I'm not going to read them all out right now. But importantly there's another break where we get to trivialisms of problems in programming. Like typos and just being inconsistent, like, you thought you're going to have a list of strings and you put a number in there. That happens, you know, people make those kinds of mistakes, they're pretty inexpensive.
[0] Video: https://www.youtube.com/watch?v=2V1FtfBDsLU
[1] Slides and transcript: https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
[2] Video https://www.youtube.com/watch?v=YR5WdGrpoug
[3] Slides and transcript https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
-
Puzzle Languages
This is tangentially related to Puzzles-vs-Problems in Rich Hickey's Effective Programs
> Eventually I got back to scheduling and again wrote a new kind of scheduling system in Common Lisp, which again they did not want to run in production. And then I rewrote it in C++. Now at this point I was an expert C++ user and really loved C++, for some value of love. But as we'll see later I love the puzzle of C++. So I had to rewrite it in C++ and it took, you know, four times as long to rewrite it as it took to write it in the first place, it yielded five times as much code and it was no faster. And that's when I knew I was doing it wrong.
[...]
> So I mean for young programmers, if everybody's tired and old, this doesn't matter any more. But when I was young, when I was young, I really, you know, when you're young you've got lots of free space. I used to say "an empty head", but that's not right. You've got a lot of free space available and you can fill it with whatever you like. And these type systems they're quite fun, because from an endorphin standpoint solving puzzles and solving problems is the same, it gives you the same rush. Puzzle solving is really cool. But that's not what it should be about.
Talk: https://www.youtube.com/watch?v=2V1FtfBDsLU
Slides and transcript: https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
-
All the ways to capture changes in Postgres
Using triggers + history tables (aka audit tables) is the right answer 98% of the time. Just do it. If you're not already doing it, start today. It is a proven technique, in use for _over 30 years_.
Here's a quick rundown of how to do it generically https://gist.github.com/slotrans/353952c4f383596e6fe8777db5d... (trades off space efficiency for "being easy").
It's great if you can store immutable data. Really, really great. But you _probably_ have a ton of mutable data in your database and you are _probably_ forgetting a ton of it every day. Stop forgetting things! Use history tables.
cf. https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
Do not use Papertrail or similar application-space history tracking libraries/techniques. They are slow, error-prone, and incapable of capturing any DB changes that bypass your app stack (which you probably have, and should). Worth remembering that _any_ attempt to capture an "updated" timestamp from your app is fundamentally incorrect, because each of your webheads has its own clock. Use the database clock! It's the only one that's correct!
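As a rough illustration of the trigger-plus-history-table pattern described above, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in so that it runs anywhere; it is not Postgres syntax and not the linked gist's approach. The `account` and `account_history` tables, their columns, and the trigger names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT NOT NULL);

-- History table: one row per change, timestamped by the database clock
-- rather than by the application.
CREATE TABLE account_history (
    account_id INTEGER NOT NULL,
    old_email  TEXT,
    new_email  TEXT,
    operation  TEXT NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER account_insert AFTER INSERT ON account
BEGIN
    INSERT INTO account_history (account_id, old_email, new_email, operation)
    VALUES (NEW.id, NULL, NEW.email, 'INSERT');
END;

CREATE TRIGGER account_update AFTER UPDATE ON account
BEGIN
    INSERT INTO account_history (account_id, old_email, new_email, operation)
    VALUES (OLD.id, OLD.email, NEW.email, 'UPDATE');
END;

CREATE TRIGGER account_delete AFTER DELETE ON account
BEGIN
    INSERT INTO account_history (account_id, old_email, new_email, operation)
    VALUES (OLD.id, OLD.email, NULL, 'DELETE');
END;
""")

# Ordinary writes; the triggers record every change without app involvement.
conn.execute("INSERT INTO account (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE account SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM account WHERE id = 1")

history = conn.execute(
    "SELECT operation, old_email, new_email FROM account_history ORDER BY rowid"
).fetchall()
for row in history:
    print(row)
```

The row is gone from `account` after the delete, but all three changes remain in `account_history`. Because the triggers live in the database, writes that bypass the application stack are captured too.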
-
G. Polya, How to Solve It
Rich Hickey (creator of Clojure) references Polya several times in his classic talk "Hammock Driven Development". Here's a transcript:
https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
I've long been impressed by Hickey's problem solving skills, so I took much of this talk to heart, and even bought a copy of HTSI. Can't say it really helped me any more than Rich's talk (as a programmer) but I'm thinking I'll give it another look.
-
Interfaces All the Way Down
>Great product designs require no manual, and similarly, great interfaces need no documentation. Imagine having to read a manual on how to use a coffee mug.
This could not be more wrong.
Not everything is easy. If a library is addressing a complicated domain, solving by definition a complicated problem, it is fine if it requires some learning.
When did expertise and learning become bad things? If software is an engineering discipline, why would people in it ever promulgate the idea that any random cog can step into any “engineer's” shoes?
Rich Hickey analogizes this mentality to the world of music, where it is taken for granted that learning an instrument requires a lot of study:
“We start with the cello. Should we make cellos that auto tune? Like, no matter where you put your finger, it's just going to play something good, play a good note.
“[Audience laughter]
“Like, you're good. We'll just fix that.
“Should we have cellos with, like, red and green lights? Like, if you're playing the wrong note, you know, it's red. You slide around, and it's green. You're like, great! I'm good. I'm playing the right song. Right?
“Or maybe we should have cellos that don't make any sound at all. Until you get it right, there's nothing.
“[Audience laughter]”
https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
-
Slightly off-topic: Whose lectures do you recommend listening to, similar to Rich Hickey?
You might find adjacent talks and speakers here ... https://github.com/matthiasn/talk-transcripts
-
Functions vs. Procedures: Keep them separate.
Many languages merge the two concepts, and implement procedures as functions that return void. This may muddle/complect their distinction, causing programmers to call procedures from within functions, thereby making those functions into impure functions (meaning that they affect the world outside of themselves, through side-effects like I/O or mutating state). This should be avoided, especially if you care about debug-ability and Functional Core, Imperative Shell architectures (see Gary Bernhardt's Boundaries talk at 31:56) (which make testing your system easier, without mocking).
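To make the distinction concrete, here is a small sketch in Python (the function names and the pricing example are hypothetical). The function is pure and returns a value; the procedure returns `None`, performs the I/O, and delegates all computation to the function, in the spirit of a functional core with an imperative shell.

```python
from typing import Callable


def total_with_tax(prices: list[float], tax_rate: float) -> float:
    """Function: pure, so the result depends only on the arguments."""
    return round(sum(prices) * (1 + tax_rate), 2)


def report_total(
    prices: list[float],
    tax_rate: float,
    write: Callable[[str], None] = print,
) -> None:
    """Procedure: performs the side effect and returns nothing."""
    write(f"Total: {total_with_tax(prices, tax_rate):.2f}")


# The pure core can be tested directly, with no mocking.
assert total_with_tax([10.0, 5.5], 0.2) == 18.6

# The thin shell takes its output channel as a parameter.
lines: list[str] = []
report_total([10.0, 5.5], 0.2, write=lines.append)
print(lines)  # ['Total: 18.60']
```

Keeping the call direction one-way (procedures call functions, never the reverse) is what keeps `total_with_tax` pure.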
What are some alternatives?
prosto - Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
rich4clojure - Practice Clojure using Interactive Programming in your editor
versatile-data-kit - One framework to develop, deploy and operate data workflows with Python and SQL.
etaoin - Pure Clojure Webdriver protocol implementation
plumbing - Prismatic's Clojure(Script) utility belt
clj-chrome-devtools - Clojure API for controlling a Chrome DevTools remote
OpenLineage - An Open Standard for lineage metadata collection
codetour - VS Code extension that allows you to record and play back guided tours of codebases, directly within the editor.
composer - Supercharge Your Model Training
base - Unison base libraries
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
lumo - Fast, cross-platform, standalone ClojureScript environment