np-sims vs fast_vector_similarity
 | np-sims | fast_vector_similarity |
---|---|---|
Mentions | 2 | 7 |
Stars | 14 | 325 |
Growth | - | - |
Activity | 8.6 | 7.2 |
Last commit | 6 months ago | 10 months ago |
Language | Python | Rust |
License | MIT License | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
np-sims
-
Approximate Nearest Neighbors Oh Yeah
I implemented this recently in C as a numpy extension[1], for fun. Even had a vectorized solution going.
You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this - the Johnson-Lindenstrauss lemma[2], if you're interested.
As I mention in a talk I gave[3], it can work if you're going to rerank anyway and the vector search isn't the main ranking signal. It's also easy to update, as the hashes are non-parametric (they don't depend on the data).
The lack of data-dependency, however, is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth - postal codes and area codes vary from small to huge - random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska.
Annoy actually builds on these hashes: it builds a tree of such hashes, finding a split into left and right at each node, and then creates a forest of such trees. So it's essentially a forest of random hash trees with data dependency.
Hope that helps.
1 - https://github.com/softwaredoug/np-sims
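To make the comment above concrete, here is a minimal sketch of random-hyperplane hashing for cosine similarity, written in plain NumPy. It is not the np-sims implementation; the function names, bit counts, and test vectors are assumptions chosen for illustration. Each hash bit records which side of a random hyperplane a vector falls on, and the fraction of differing bits between two hashes estimates the angle between the vectors divided by pi.

```python
import numpy as np

def random_hyperplanes(dim, n_bits, seed=0):
    # Hyperplanes are drawn independently of the data (non-parametric).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def hash_vectors(vectors, planes):
    # One boolean per hyperplane: which side each vector falls on.
    return (vectors @ planes.T) >= 0

def estimated_cosine(hash_a, hash_b):
    # Fraction of differing bits approximates angle / pi.
    hamming_fraction = np.mean(hash_a != hash_b)
    return float(np.cos(np.pi * hamming_fraction))

def exact_cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical data: a vector and a noisy copy of it.
rng = np.random.default_rng(42)
a = rng.standard_normal(64)
b = a + 0.5 * rng.standard_normal(64)

for n_bits in (16, 64, 256, 1024, 4096):
    planes = random_hyperplanes(64, n_bits)
    ha, hb = hash_vectors(np.stack([a, b]), planes)
    print(n_bits, round(estimated_cosine(ha, hb), 3), round(exact_cosine(a, b), 3))
```

The estimate tightens as bits are added, but only roughly with the square root of the bit count, which is one way to see the diminishing returns mentioned above. Because the hyperplanes never look at the data, they cannot adapt to dense and sparse regions of the vector space.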
-
Show HN: Fast Vector Similarity Using Rust and Python
Nice!
I recently implemented a C-based numpy solution of LSH to compress / recover cosine similarity[1]. It was my first time writing Numpy C, and it was a lot of fun to massively improve the performance over pure Python[2].
1 - https://github.com/softwaredoug/np-sims
2 - https://softwaredoug.com/blog/2023/08/22/rand-projections-in...
fast_vector_similarity
-
SentenceTransformers: Python framework for sentence, text and image embeddings
Yes, check out my library for vector similarity that has various other measures which are more discriminative:
https://github.com/Dicklesworthstone/fast_vector_similarity
pip install fast_vector_similarity
-
Show HN: Neum AI - Open-source large-scale RAG framework
Got it. I'd encourage you to expose more of that functionality at the level of your application if possible. I think there is a lot of potential in using more than just cosine similarity, especially when there are lots of candidates and you really want to sharpen up the top few recommendations to the best ones. You might find this open-source library I made recently useful for that:
https://github.com/Dicklesworthstone/fast_vector_similarity
I've had good results from starting with cosine similarity (using FAISS) and then "enriching" the top results from that with more sophisticated measures of similarity from my library to get the final ranking.
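The two-stage approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: brute-force NumPy cosine scoring stands in for FAISS, and a hand-rolled Spearman rank correlation stands in for the "more sophisticated measures". It does not use or reproduce the fast_vector_similarity API, and the function names, shortlist size, and random data are all hypothetical.

```python
import numpy as np

def cosine_scores(query, candidates):
    # Stage 1 score: cosine similarity of the query against every candidate row.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

def ranks(x):
    # Ordinal ranks (ties are not averaged, which is fine for continuous values).
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    # Spearman's rho: Pearson correlation of the rank vectors.
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def two_stage_search(query, candidates, shortlist_size=20, top_k=5):
    # Stage 1: cheap cosine pass over everything, keep a shortlist.
    coarse = cosine_scores(query, candidates)
    shortlist = np.argsort(-coarse)[:shortlist_size]
    # Stage 2: rerank only the shortlist with the more expensive measure.
    fine = np.array([spearman(query, candidates[i]) for i in shortlist])
    order = np.argsort(-fine)[:top_k]
    return shortlist[order], fine[order]

# Hypothetical data: 1000 random embeddings and a query near row 123.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((1000, 128))
query = candidates[123] + 0.1 * rng.standard_normal(128)

ids, scores = two_stage_search(query, candidates)
print(ids, scores)
```

The expensive measure only runs on the shortlist, so its cost stays bounded by `shortlist_size` no matter how large the candidate set grows.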
-
Some Reasons to Avoid Cython
You can see how I did something similar in my library here:
https://github.com/Dicklesworthstone/fast_vector_similarity/...
Basically you use ndarray instead of numpy, try to vectorize anything you can, and for the for loops that can't be vectorized, you can use rayon to do them in parallel.
- FLaNK Stack Weekly 28 August 2023
- Fast Vector Similarity Library, Useful for Working With Llama2 Embedding Vectors
-
Show HN: Fast Vector Similarity Using Rust and Python
Yeah, like the other commenter said, everything is in this file here:
https://github.com/Dicklesworthstone/fast_vector_similarity/...
If you also make your project using Rust and Maturin, you can literally just copy and paste that into your project because it's totally generic, and if the repo is public, GitHub will just run it all for you for free.
The only thing is you need to create an account on PyPI (pip) and add 2-Factor Auth so you can generate an API key. Then you go into the repo settings and go to secrets, and create a GitHub Actions secret with the name PYPI_API_TOKEN and make the value your PyPI token. That's it! It will not only compile all the wheels for you but even upload the project to PyPI for you using the settings found in your pyproject.toml file, like this:
https://github.com/Dicklesworthstone/fast_vector_similarity/...
What are some alternatives?
swiss_army_llama - A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.
simsimd
qdrant - Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
QTVR - Tools for QTVR 1 files
DoctorGPT - DoctorGPT provides advanced LLM prompting for PDFs and webpages.
llama_embeddings_fastap