| | splink | codebase-visualizer-action |
|---|---|---|
| Mentions | 16 | 11 |
| Stars | 1,104 | 61 |
| Growth | 4.0% | - |
| Activity | 9.9 | 0.0 |
| Last commit | 6 days ago | over 1 year ago |
| Language | Python | |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
splink
- Splink: Fast, accurate, scalable probabilistic data linkage
-
Ask HN: What projects are you working on?
https://github.com/moj-analytical-services/splink
-
Record linkage/Entity linkage
Record linkage has been a big part of a project I've been working on for 6 months now. I personally think a great and free solution is the splink package in Python, which can handle 10+ million rows. It implements the Fellegi-Sunter model (equivalent to a naive Bayes model), which is the classical model in record linkage. It can be trained in an unsupervised manner using some initial parameter estimates (these are quite intuitive) and then expectation maximisation. The features in the model are pairwise string comparisons on your fields of interest. These can include exact equality; string similarity comparisons like Levenshtein distance and Jaro-Winkler; and phonetic comparisons like Soundex and Double Metaphone. The splink package will handle training the model and then all the graph theory at the end to connect all your links into clusters. All the details you'll need are in the links. https://www.robinlinacre.com/probabilistic_linkage/ https://moj-analytical-services.github.io/splink/
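To make the scoring idea in the comment above concrete, here is a minimal pure-Python sketch of Fellegi-Sunter style match weights. It is not Splink's API or implementation. The `LEVELS` table, its m and u probabilities, the "within 2 edits" rule and the `prior` are all hypothetical values chosen for illustration; in practice the m and u values would be estimated, for example with expectation maximisation.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical (m, u) probabilities per comparison level:
# m = P(level | records match), u = P(level | records do not match).
LEVELS = {
    "exact": (0.80, 0.01),
    "close": (0.15, 0.04),
    "else": (0.05, 0.95),
}


def comparison_level(a: str, b: str) -> str:
    """Bucket a pair of field values into a comparison level."""
    if a == b:
        return "exact"
    if levenshtein(a, b) <= 2:  # assumed threshold, for illustration only
        return "close"
    return "else"


def match_probability(rec_a: dict, rec_b: dict, fields: list, prior: float = 0.001) -> float:
    """Sum log2(m/u) match weights over fields, starting from the prior odds."""
    log_odds = math.log2(prior / (1 - prior))
    for field in fields:
        m, u = LEVELS[comparison_level(rec_a[field], rec_b[field])]
        log_odds += math.log2(m / u)
    odds = 2 ** log_odds
    return odds / (1 + odds)


a = {"first_name": "john", "surname": "smith", "city": "leeds"}
b = {"first_name": "jon", "surname": "smith", "city": "leeds"}
c = {"first_name": "mary", "surname": "jones", "city": "york"}
p_similar = match_probability(a, b, ["first_name", "surname", "city"])
p_different = match_probability(a, c, ["first_name", "surname", "city"])
```

Each field adds a positive weight when its comparison level is more likely among matches than non-matches, and a negative weight otherwise, which is why a near-identical pair ends up with a much higher probability than an unrelated pair.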
-
What is the best approach to removing duplicate person records if the only identifier is the person's first name, middle name and last name? These names are entered in varying ways to the DB, thus they are free-formatted.
https://moj-analytical-services.github.io/splink/ is a FOSS Python package (but it runs against your db using SQL).
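Deduplicating person records ultimately means grouping pairwise matches into one cluster per person. The following is a minimal sketch of that grouping step using union-find (connected components). It is an illustration only, not how Splink does it internally; the function name `cluster_links`, the `(id_a, id_b, probability)` input shape and the `threshold` default are assumptions made for this example.

```python
def cluster_links(record_ids, scored_pairs, threshold=0.9):
    """Group records into clusters by connecting pairs scored at or above threshold.

    record_ids: iterable of hashable record identifiers.
    scored_pairs: iterable of (id_a, id_b, match_probability) tuples.
    """
    record_ids = list(record_ids)
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, prob in scored_pairs:
        if prob < threshold:
            continue  # too uncertain to treat as the same person
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_links(
    [1, 2, 3, 4, 5],
    [(1, 2, 0.95), (2, 3, 0.97), (4, 5, 0.40)],
)
```

Records 1 and 3 land in the same cluster even though they were never compared directly, because both link to record 2. The pair (4, 5) stays split because its score is below the threshold.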
-
DuckDB – in-process SQL OLAP database management system
If you're curious, I've written a FOSS record linkage library that executes everything as SQL. It supports multiple SQL backends including DuckDB and Spark for scale, and runs faster than most competitors because it's able to leverage the speed of these backends: https://github.com/moj-analytical-services/splink
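As a rough illustration of what "executes everything as SQL" can look like, the sketch below generates candidate record pairs with a self-join and computes simple field comparisons in the database. It uses Python's built-in `sqlite3` module as a stand-in backend so that it runs without extra installs. The `people` table, its columns and the blocking rule on `surname` are made up for this example, and the SQL is not what Splink actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT, dob TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Smith", "1980-01-01"),
        (2, "Jon", "Smith", "1980-01-01"),
        (3, "Mary", "Jones", "1975-06-30"),
        (4, "Mary", "Jones", "1975-06-30"),
        (5, "Alan", "Smith", "1990-12-12"),
    ],
)

# Self-join restricted by a blocking rule (same surname) so that only
# plausible pairs are compared, instead of every possible pair of rows.
PAIRS_SQL = """
SELECT l.id AS id_l,
       r.id AS id_r,
       l.first_name = r.first_name AS first_name_match,
       l.dob = r.dob AS dob_match
FROM people AS l
JOIN people AS r
  ON l.surname = r.surname  -- hypothetical blocking rule
 AND l.id < r.id            -- keep each unordered pair once
ORDER BY id_l, id_r
"""

pairs = conn.execute(PAIRS_SQL).fetchall()
for row in pairs:
    print(row)
```

Because the pair generation and comparisons are plain SQL, the heavy lifting is done by the database engine, which is the property the comment above credits for the speed on backends such as DuckDB and Spark.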
-
Ask HN: What have you created that deserves a second chance on HN?
Splink - a Python library for probabilistic record linkage (fuzzy matching/entity resolution).
Splink is dramatically faster and works on much larger datasets than other open source libraries. I'm particularly proud of the fact we support multiple execution backends (at the moment DuckDB, Spark, Athena and SQLite, but additional adaptors are relatively straightforward to write).
We've had >4 million PyPI downloads and it's used in government, academia and the private sector, often replacing extremely expensive proprietary solutions.
https://github.com/moj-analytical-services/splink
More info in blog posts here:
-
Conformed Dimensions problem that keeps recurring on every project
Splink is a SQL tool that can do this https://github.com/moj-analytical-services/splink
-
How do you join two sources with attributes that aren't identical?
Use a probabilistic record matching model such as Fellegi-Sunter. Check out the splink package in Python.
-
Splink 3: Fast, accurate and scalable record linkage (entity resolution) in Python
Main docs here: https://moj-analytical-services.github.io/splink
-
Splink 3: Fast, accurate and scalable fuzzy record linkage in Python with support for multiple backends (FOSS)
It'd be great to see Splink add value in this area! Do give us a shout if you have any questions. The best place to post is on the Github discussions: https://github.com/moj-analytical-services/splink/discussions
codebase-visualizer-action
-
Treemaps Are Awesome!
Nice post - treemaps are great!
My friend and I made a codebase visualisation tool (https://www.codeatlas.dev/gallery) that's based on Voronoi treemaps, maybe of interest as an illustration of the aesthetics with a non-rectangular layout!
We've opted for zooming through double-clicks as the main method of navigating the map, because in deep codebases, the individual cells quickly get too small to accurately target with the cursor as shown in the key-path label approach!
If anyone's interested, this is also available as a Github Action to generate the treemap during CI: https://github.com/codeatlasHQ/codebase-visualizer-action
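For readers unfamiliar with treemaps, here is a minimal sketch of the general idea: each file gets an area proportional to its size, nested inside its directory. Note that this uses a simple rectangular "slice-and-dice" layout, not the Voronoi layout that codeatlas uses, and it is unrelated to the tool's actual code. The nested-dict tree format and the function names are assumptions for illustration.

```python
def total_size(node):
    """Size of a file (a number) or the summed size of a directory (a dict)."""
    if isinstance(node, dict):
        return sum(total_size(child) for child in node.values())
    return node


def layout(node, x, y, w, h, horizontal=True, path=""):
    """Return {file_path: (x, y, width, height)} for every file under node.

    Directories are split along one axis in proportion to child size,
    and the split axis alternates at each level of nesting.
    """
    if not isinstance(node, dict):
        return {path: (x, y, w, h)}
    rects = {}
    total = total_size(node)
    offset = 0.0  # fraction of the parent already allocated
    for name, child in node.items():
        share = total_size(child) / total if total else 0.0
        child_path = f"{path}/{name}" if path else name
        if horizontal:
            rects.update(layout(child, x + offset * w, y, share * w, h, False, child_path))
        else:
            rects.update(layout(child, x, y + offset * h, w, share * h, True, child_path))
        offset += share
    return rects


# Hypothetical repository: file sizes in lines of code.
repo = {"src": {"a.py": 300, "b.py": 100}, "README.md": 100}
rects = layout(repo, 0, 0, 100, 100)
for file_path, rect in rects.items():
    print(file_path, rect)
```

Slice-and-dice tends to produce thin slivers in deep trees, which is one reason alternative layouts such as squarified or Voronoi treemaps exist.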
-
Gource – Animate your Git history
If you find this type of codebase visualisation useful, you might want to check out codeatlas.dev and its GitHub Action (https://github.com/codeatlasHQ/codebase-visualizer-action). It doesn't animate the repo over time like gource (yet), but instead aims to give a beautiful interactive visual snapshot of a repo at a particular point in time. It also lets you zoom in on specific aspects like recent commit activity, programming language and, hopefully in the future, test coverage.
E.g. see here for a visualisation of the pytorch codebase we did a while ago: https://codeatlas.dev/gallery/pytorch/pytorch
(disclaimer: I'm the author)
-
Show HN: Git Heat Map – a tool for visualising Git repo activity for each file
If you think this is useful, you might also like codeatlas.dev and its Github Action (https://github.com/codeatlasHQ/codebase-visualizer-action). It currently does not support per-contributor activity, but we put a lot of effort into making the diagrams beautiful to look at and the basic approach of using treemaps for visualisation seems very similar. In fact, could be cool to collaborate on this, DM me if interested!
https://codeatlas.dev
-
Ask HN: Those making $0/month or less on side projects – Show and tell
https://codeatlas.dev - codebase visualisation tool
Takes your git repo and generates a beautiful visual representation of the code. Sort of an alternative navigation tool (in addition to IDEs) for large codebases. Can also run it as part of CI with our Github Action (https://github.com/codeatlasHQ/codebase-visualizer-action).
We made this because grokking complex software projects is really difficult and we've found that a visual overview of what's in a codebase can be quite helpful to get started.
E.g. check out https://codeatlas.dev/gallery/kubernetes/kubernetes for the generated visualisation of the Kubernetes GitHub repo!
Currently making -$10/year to pay for the domain :D We slowed down active development after our initial attempts at dissemination didn't really go anywhere (bragging about side projects on the internet, ugh), but I'm still really keen on getting some feedback on whether this is actually useful to anyone else!
Note: The site works somewhat on mobile, but is much better on desktop!
Also, funny there's a post like this again, just like https://news.ycombinator.com/item?id=34531989 yesterday.
-
Ask HN: What have you created that deserves a second chance on HN?
https://codeatlas.dev - codebase visualisation tool
It takes your git repo and generates a beautiful visual representation of the actual code that's in it. Sort of an alternative navigation tool (in addition to IDEs) for large codebases. You can run codeatlas as part of your CI with our Github Action (https://github.com/codeatlasHQ/codebase-visualizer-action).
We made this because grokking complex software projects is really difficult and we've found that a visual overview of what's in a codebase can be quite helpful to get started.
E.g. check out https://codeatlas.dev/gallery/kubernetes/kubernetes for the generated visualisation of the Kubernetes GitHub repo!
We slowed down active development after our initial attempts at dissemination didn't really go anywhere (bragging about side projects on the internet, ugh), but would still love feedback on whether this is possibly useful to anyone else!
Note: The site works somewhat on mobile, but is much better on desktop!
- Show HN: Codeatlas – Visualize your codebases during CI
-
Ask HN: Why aren't code diagram generating tools more common?
I've already mentioned this on the other thread (https://news.ycombinator.com/item?id=31569646), but my friend and I have been working on https://www.codeatlas.dev as a side project - it's a tool for creating pretty (2D!) visualisations of codebases, while providing additional insights via overlays (e.g. commit density, programming language or other results from static analysis like dead code/test coverage/etc.). For example here's the Kubernetes codebase visualised using codeatlas: [https://www.codeatlas.dev/repo/kubernetes/kubernetes](https:....
At the moment, codeatlas is just the static gallery, but we're only a few weekends away from releasing a Github action that deploys this diagram on github pages for your own repos - if you're interested, feel free to watch this repo: https://github.com/codeatlasHQ/codebase-visualizer-action
OP, how close is this to what you had in mind in your question?
-
Ask HN: Visualizing software designs, especially of large systems (if at all)?
My friend and I have been working on https://www.codeatlas.dev in our spare time, which is a tool that creates pretty (2D!) visualisations of codebases, while providing additional insights via overlays (e.g. commit density, programming language). For example here's the Kubernetes codebase visualised using codeatlas: https://www.codeatlas.dev/repo/kubernetes/kubernetes.
At the moment, codeatlas is only a static gallery, but we're currently about 1-2 weekends away from releasing a Github action that deploys this diagram on github pages for your own repos - if you're interested, feel free to watch this repo: https://github.com/codeatlasHQ/codebase-visualizer-action
What are some alternatives?
zingg - Scalable identity resolution, entity resolution, data mastering and deduplication using ML
spekt8 - Visualize your Kubernetes cluster in real time
dedupe - A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
TypeScript-Call-Graph - CLI to generate an interactive graph of functions and calls from your TypeScript files
libpostal - A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
jtree - Build your own language using Tree Notation.
sqlglot - Python SQL Parser and Transpiler
scipipe - Robust, flexible and resource-efficient pipelines using Go and the commandline
entity-embed - PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
dbcview - Quickly visualize senders and receivers in a DBC
dblink - Distributed Bayesian Entity Resolution in Apache Spark
atomic - Chat with and teach your calendar to solve your scheduling & time problems