| | zingg | coreutils |
|---|---|---|
| Mentions | 23 | 113 |
| Stars | 890 | 4,050 |
| Growth | 2.5% | 1.7% |
| Activity | 9.2 | 9.3 |
| Latest commit | 4 days ago | 7 days ago |
| Language | Java | C |
| License | GNU Affero General Public License v3.0 | GNU General Public License v3.0 only |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
zingg
-
Ask HN: What is the most impactful thing you've ever built?
As part of my data consulting, I struggled with identity resolution and started working on scalable, no-code identity resolution: https://github.com/zinggAI/zingg/. It has pushed my limits as a software engineer and product builder, and I had to do a lot of learning to build it. It's cool to see people use Zingg in their workflows and save months of working on custom solutions. A big highlight has been North Carolina Open Campaign Data https://crossroads-cx.medium.com/building-open-access-to-nc-...
-
How to find open source data science python projects to contribute to?
Check https://github.com/zinggAI/zingg/. We recently added Python to our stack and are looking for help with building dbt-zingg Python models, databricks-zingg Python notebooks, the Python API, a Python-based front end, etc.
- Merging datasets
-
is it possible to "fuzzy match" or dedupe columns in Redshift?
If you are open to using a framework for this, check Zingg at https://github.com/zinggAI/zingg. It connects to Redshift, Snowflake and other warehouses and can handle multiple columns.
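To make "fuzzy matching across multiple columns" concrete, here is a minimal standard-library sketch. It is not Zingg's API and says nothing about how Zingg works internally; the function names, the sample records, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two case-folded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def record_score(r1, r2, columns):
    """Average the per-column similarity over the compared columns."""
    return sum(similarity(r1[c], r2[c]) for c in columns) / len(columns)


def find_duplicates(records, columns, threshold=0.85):
    """Return index pairs whose average similarity meets the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_score(records[i], records[j], columns) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample rows; the first two differ by one letter in the name.
records = [
    {"name": "Jonathan Smith", "city": "Raleigh"},
    {"name": "Jonathon Smith", "city": "Raleigh"},
    {"name": "Maria Garcia", "city": "Durham"},
]
duplicate_pairs = find_duplicates(records, ["name", "city"])
print(duplicate_pairs)
```

The all-pairs loop is quadratic in the number of rows, so a sketch like this only suits small tables; that cost is one reason to reach for a dedicated framework on warehouse-sized data.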
-
Show HN: Zingg – open-source entity resolution for single source of truth
Thanks for your support. Yes, we do ship with some examples and their models, which can be run out of the box. We have 3 customer demographic datasets and an e-commerce item-matching dataset across Google and Amazon. You can check them here: https://github.com/zinggAI/zingg/tree/main/examples
-
Question about Github Referring Sites
I have an open source project hosted at https://github.com/zinggAI/zingg/.
- How do I promote the project appropriately?
-
GitHub Java Projects to Contribute
Check Zingg out at https://github.com/zinggAI/zingg and let me know if you would like to contribute
-
Match over 1 GB of data with inconsistent names
This is interesting. I would love to get your feedback on Zingg (https://github.com/zinggAI/zingg) if you are up to it. Thanks!
-
Open source entity resolution - need your feedback!
I have released an open source entity resolution tool, Zingg (https://github.com/zinggAI/zingg). Zingg uses Spark and ML to build a single source of truth directly in the warehouse or the data lake. Would love to hear from the Reddit folks here what they think about it. Do you find it useful? What can I do to make it better? Any advice on the problem or the solution?
coreutils
-
GNU Coreutils 9.5 Can Yield 10~20% Throughput Boost For cp, mv and cat Commands
https://github.com/coreutils/coreutils/commit/fcfba90d0d27a1...
Other changes just released in GNU coreutils 9.5 include:
* mv accepts --exchange to swap files
-
How the GNU coreutils are tested
> some are simple like yes(1)
Not that simple: https://github.com/coreutils/coreutils/blob/master/src/yes.c
-
Show HN: Usr/bin/env Docker run
The -S / --split-string option[1] of /usr/bin/env is a relatively recent addition to GNU Coreutils. It's available starting from GNU Coreutils 8.30[2], released on 2018-07-01.
Beware of portability: it relies on non-standard behavior from some operating systems. It only works on OSes that treat all the text after the first space as argument(s) to the shebanged executable, rather than treating the whole string as an executable path (which can happen to contain spaces).
Fortunately this non-standard behavior is more the norm than the exception: it works at least on modern GNU/Linux, BSDs, and macOS.
[1] https://www.gnu.org/software/coreutils/manual/html_node/env-...
[2] https://github.com/coreutils/coreutils/blob/b09dc6306e7affaf...
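The splitting problem described in this comment can be modelled in a few lines. The sketch below assumes the Linux-style behavior, where everything after the interpreter path arrives as one single argument, and it approximates the `-S` splitting with Python's `shlex.split`; the real `env` has its own quoting and escape rules, so this is an illustration rather than a reimplementation. Both function names and the example shebang line are hypothetical.

```python
import shlex


def kernel_shebang_argv(shebang_line, script_path):
    """Model a kernel that passes the text after the interpreter as ONE argument."""
    body = shebang_line[2:].strip()  # drop the leading "#!"
    interpreter, _, rest = body.partition(" ")
    argv = [interpreter]
    if rest:
        argv.append(rest.strip())  # a single argument, spaces included
    argv.append(script_path)
    return argv


def env_split_string(argv):
    """Approximate `env -S`: split the single argument into separate words.

    shlex.split is an assumption standing in for env's own parsing rules.
    """
    if len(argv) > 1 and argv[1].startswith("-S"):
        return shlex.split(argv[1][2:]) + argv[2:]
    return argv[1:]


argv = kernel_shebang_argv("#!/usr/bin/env -S python3 -u", "./script.py")
command = env_split_string(argv)
print(argv)     # what env receives
print(command)  # what env would go on to execute
```

Without `-S`, `env` would receive `python3 -u` as one word and look for an executable with that literal name, which is why the option is needed for shebang lines that carry more than one argument.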
-
From Nand to Tetris: Building a Modern Computer from First Principles
> building a cat from scratch
> That would be an interesting project.
Here is the source code of the OpenBSD implementation of cat:
> https://github.com/openbsd/src/blob/master/bin/cat/cat.c
and here of the GNU coreutils implementation:
> https://github.com/coreutils/coreutils/blob/master/src/cat.c
Thus: I don't think building a cat from scratch or creating a tutorial about that topic is particularly hard (even though the HN audience would likely be interested in it). :-)
-
The Linux Scheduler: A Decade of Wasted Cores (2016) [pdf]
The yes command, writing to /dev/null, is making I/O calls, which interfere with predictable scheduling.
If you look at the source code for yes, https://github.com/coreutils/coreutils/blob/master/src/yes.c
it builds a buffer of output and then writes that in a loop:
while (full_write (STDOUT_FILENO, buf, bufused) == bufused)
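The same pattern (fill a buffer once, then write the whole buffer repeatedly until a write comes up short) can be sketched in Python. This is an illustration of the technique, not a port of yes.c: the function names, the 8192-byte default, and the `max_writes` cap (added so the example terminates) are assumptions.

```python
import io


def build_buffer(line, size=8192):
    """Repeat `line` as many whole times as fit in `size` bytes (at least once)."""
    copies = max(1, size // len(line))
    return line * copies


def write_repeatedly(out, buf, max_writes):
    """Write the pre-built buffer until a write is short or the cap is reached.

    The cap is hypothetical; the C loop quoted above has no such limit.
    """
    writes = 0
    while writes < max_writes and out.write(buf) == len(buf):
        writes += 1
    return writes


buf = build_buffer(b"y\n", size=16)
sink = io.BytesIO()  # in-memory stand-in for stdout
count = write_repeatedly(sink, buf, max_writes=3)
print(count, len(sink.getvalue()))
```

Batching many lines into one buffer means one write call per buffer instead of one per line, which is where the throughput comes from.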
-
nohup not working?
Looking at the source of nohup, if the execvp() of the child happens then it _must_ have already done the signal (SIGHUP, SIG_IGN) so - WTF?
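The ordering this comment relies on (ignore SIGHUP first, then exec the child) can be checked from Python on a POSIX system. In the sketch below, `preexec_fn` runs in the child between fork and exec, standing in for the `signal()` call that precedes `execvp()`; the helper names and the probe command are assumptions made for the example, and the output redirection that nohup also performs is not modelled.

```python
import signal
import subprocess
import sys


def ignore_sighup():
    """Runs in the child after fork and before exec: ignore SIGHUP."""
    signal.signal(signal.SIGHUP, signal.SIG_IGN)


def run_immune_to_hangup(argv):
    """Run argv with SIGHUP ignored, capturing its output as text."""
    return subprocess.run(argv, preexec_fn=ignore_sighup,
                          capture_output=True, text=True)


# Hypothetical probe: the child reports whether it inherited the ignored signal.
probe = [
    sys.executable, "-c",
    "import signal; print(signal.getsignal(signal.SIGHUP) == signal.SIG_IGN)",
]
result = run_immune_to_hangup(probe)
child_ignores_sighup = result.stdout.strip() == "True"
print(child_ignores_sighup)
```

An ignored signal stays ignored across exec, so if the child still dies on hangup, the likelier explanations are that the program re-installs its own SIGHUP handler or that something other than SIGHUP is terminating it.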
-
Is it fair to say "ls" is dead? No commits in 15 years
This got me wondering, so I went and looked, and lo and behold, there was actually a commit to the GNU ls source just 2 weeks ago.
https://github.com/coreutils/coreutils/blob/master/src/ls.c
"maint: prefer char32_t to wchar_t"
- The Tao of Programming
-
Decoded: GNU Coreutils
Even an empty file? Yes. So now it was a file with a copyright disclaimer and nothing else. And the koan-like question that comes to mind is "Can you copyright nothing?" Well, AT&T sure tried.
Then somebody said our programs should be well defined and not depend on a fluke of Unix, which at this point was probably a good idea. So it became "exit 0".
Then somebody said we should write our system utilities in C instead of shell so they run faster. OpenBSD still has a good example of how this would look.
http://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/usr....
At some point GNU bureaucracy got involved and said all programs must support the '-h' flag, so that got added. Then they said all programs must support locale, so that got added. Nowadays GNU true is an astonishing 80 lines long.
https://github.com/coreutils/coreutils/blob/master/src/true....
http://trillian.mit.edu/~jc/humor/ATT_Copyright_true.html
-
Exa Is Deprecated
> Yes, ls is maintained. Although, maintained is a very strong word. It exists.
Why would it be a strong word? Here it is, in src/ls.c: https://github.com/coreutils/coreutils
It is then packaged by tens of operating system distributions, who themselves maintain extra patchsets, some of which are then upstreamed.
It has been installed and used on millions (billions?) of devices for 3 decades.
It's a very reliable and trusty "sharp stick of metal" :)
What are some alternatives?
splink - Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
util-linux
clrs
madaidans-insecurities
rumble - ⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
busybox - BusyBox mirror
CLRS - Algorithms implementation in C++ and solutions of questions (both code and math proof) from “Introduction to Algorithms” (3e) (CLRS) in LaTeX.
src - Read-only git conversion of OpenBSD's official CVS src repository. Pull requests not accepted - send diffs to the tech@ mailing list.
skipledger - Differential privacy solution for maintaining and exposing information from evolving, append-only journals / ledgers.
linux - Linux kernel source tree
yt-channels-DS-AI-ML-CS - A comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep learning, Computer Science, programming, software engineering, etc.
gnulib - upstream mirror