data-making-guidelines
dedupe
| | data-making-guidelines | dedupe |
|---|---|---|
| Mentions | 1 | 9 |
| Stars | 286 | 3,984 |
| Growth | 0.3% | 0.6% |
| Activity | 1.8 | 7.1 |
| Last commit | over 3 years ago | about 2 months ago |
| Language | HTML | Python |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
data-making-guidelines
- Ask HN: Freelancer? Seeking freelancer? (May 2021)
SEEKING FREELANCER | Greater NY Area | Remote | Volunteer Stipend | Covid-19 Vaccine Related
Join a group of MDs, data junkies, and political gurus working on data-driven methods for managing equitable distribution of covid vaccine into some of the most vulnerable communities in the country.
YOU = AN ETL GOD :)
Looking specifically for journeyman-level expertise in building simple yet powerful DQ and ingestion pipelines. Data is a grab bag from sources like cdc, census, state immunization registries, community vulnerability indices, etc...
Approach we have in mind is heavily inspired by this:
https://github.com/datamade/data-making-guidelines/blob/mast...
We are also considering making it a bit easier to manage by using Airflow to keep track of the ongoing DAGs.
AND IF ETL ISN'T YOUR THING...
All of that said, if you're interested in working in data analysis tied to the distribution and management of vaccination opportunities but have different skill sets, I would love to hear from you.
Our work in this area continues to get visibility from leaders in the field, and we have several publication-ready findings that could be directly applicable to any future mass vaccination challenges the world faces.
Contact info:
dedupe
- Using deep learning for Fuzzy Matching
- String distance based network for fuzzy matching?
I think this problem is known as data deduplication, in particular, entity deduplication. I googled a bit and it seems approaches vary from manual deduplication to some sort of active learning (if I am not mistaken). I am also curious whether pre-trained transformer-based cross encoders can provide any good results (they are trained on sentences, I think, but may be worth a try). Another problem here is how to measure progress (that is, how to compare different approaches).
- What's the toughest DE problem you faced in your work career?
I've had a good experience in the past with the dedupe package for these types of activities. Unsure if it works for out-of-core situations though, as my data set fit easily into memory.
- Model detects duplicate records
Data deduplication is a super common problem, so it's useful experience to work on it. It's generally useful for companies, but I don't think it could be sold as a product unless it is solving a very complicated, domain-specific de-duping problem. Otherwise, there are generic, open source de-duping tools such as dedupe. It sounds like your model is similar to that.
- [D] Suggestions for large-scale company name standardization?
- Entity Resolution with Magniv
- How to do fuzzy matching in Redshift? A Python UDF, for example?
- [OC] Media bias? US Sunday news shows book Republicans more than Democrats: Three of the five top Sunday news shows, altogether watched by almost 8 million people weekly, featured Republican partisans more often than Democrats in episodes aired this year through Oct. 31.
Tools used: Python to scrape guest lists, dedupeio to better identify guests, Google Sheets to store and analyze the data, and Datawrapper to make the charts.
- Does there exist a python package that clears the dataset/columns in terms of exact and similar duplicates?
Try https://github.com/dedupeio/dedupe
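Several of the posts above ask how to find exact and similar duplicates, and how to compare one approach against another. The sketch below is one minimal way to illustrate both ideas using only the Python standard library. It is not how the dedupe package works; it swaps in plain string similarity from `difflib` and compares every pair of records, so it only suits small inputs. The function names (`normalize`, `cluster_duplicates`, `pairwise_scores`), the 0.85 threshold, and the sample records are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(value.lower().split())


def cluster_duplicates(records, threshold=0.85):
    """Group record indices whose normalized strings are >= threshold similar.

    Compares all pairs (O(n^2)) and merges matches with union-find.
    The threshold is an assumed value, not a recommended default.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    cleaned = [normalize(r) for r in records]
    for i, j in combinations(range(len(records)), 2):
        if SequenceMatcher(None, cleaned[i], cleaned[j]).ratio() >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def pairwise_scores(predicted, truth):
    """Pairwise precision and recall of predicted clusters vs. labeled ones."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}

    pred, gold = pairs(predicted), pairs(truth)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall


if __name__ == "__main__":
    # Made-up sample data: the first three rows refer to the same company.
    names = ["Acme Corp", "acme  corp", "Acme Corp.", "Globex Inc", "Initech"]
    found = cluster_duplicates(names)
    labeled = [[0, 1, 2], [3], [4]]  # hand-labeled ground truth (hypothetical)
    print(found)
    print(pairwise_scores(found, labeled))
```

Scoring each approach with pairwise precision and recall against the same hand-labeled sample gives a common yardstick, which is one possible answer to the "how to measure progress" question raised above. For larger data sets, a purpose-built tool such as dedupe or splink is likely a better fit than all-pairs comparison.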
What are some alternatives?
awesome-cto - A curated and opinionated list of resources for Chief Technology Officers, with the emphasis on startups
splink - Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
design-patterns-for-humans - An ultra-simplified explanation to design patterns
imgdupes - Identifying and removing near-duplicate images using perceptual hashing.
hacker-laws - 💻📖 Laws, Theories, Principles and Patterns that developers will find useful. #hackerlaws
orange - 🍊 :bar_chart: :bulb: Orange: Interactive data analysis
svd-client-sample
bees - Best-Effort Extent-Same, a btrfs dedupe agent
Design Patterns - Design patterns implemented in Java
pyDenStream - Implementation of the DenStream algorithm in Python.
hazelcast-python-client - Hazelcast Python Client
relevanceai - Home of the AI workforce - Multi-agent system, AI agents & tools