examples VS ann-benchmarks

Compare examples and ann-benchmarks to see how they differ.

examples

Analyze the unstructured data with Towhee, such as reverse image search, reverse video search, audio classification, question and answer systems, molecular search, etc. (by towhee-io)

ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python (by erikbern)
                examples            ann-benchmarks
Mentions        5                   51
Stars           376                 4,588
Growth          10.9%               -
Activity        6.8                 8.1
Last commit     3 months ago        4 days ago
Language        Jupyter Notebook    Python
License         Apache License 2.0  MIT License
The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

examples

Posts with mentions or reviews of examples. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-08-07.
  • FLaNK Stack Weekly for 07August2023
    27 projects | dev.to | 7 Aug 2023
  • Vector database built for scalable similarity search
    19 projects | news.ycombinator.com | 25 Mar 2023
    As another commenter noted, Milvus is overkill and a "bit much" if you're learning/playing.

    A good intro to the field with progression towards a full Milvus implementation could be starting with towhee[0] (which is also supported by Milvus).

    towhee has an example to do exactly what you want with CLIP[1].

    [0] - https://towhee.io/

    [1] - https://github.com/towhee-io/examples/tree/main/image/text_i...

  • Ask HN: Any good self-hosted image recognition software?
    6 projects | news.ycombinator.com | 22 Sep 2022
    Usually this is done in three steps. The first step is using a neural network to create a bounding box around the object, then generating vector embeddings of the object, and then using similarity search on vector embeddings.

    The first step is accomplished by training a detection model to generate the bounding box around your object; this can usually be done by fine-tuning an already trained detection model. For this step, the data you need is all the images of the object you have, each with a bounding box drawn around it; the version of the object doesn't matter here.

    The second step involves using a generalized image classification model that's been pretrained on generalized data (VGG, etc.) and a vector search engine/vector database. You would start by using the image classification model to generate vector embeddings (https://frankzliu.com/blog/understanding-neural-network-embe...) of all the different versions of the object. The more ground truth images you have, the better, but it doesn't require the same amount as training a classifier model. Once you have your versions of the object as embeddings, you would store them in a vector database (for example Milvus: https://github.com/milvus-io/milvus).

    Now whenever you want to detect the object in an image you can run the image through the detection model to find the object in the image, then run the sliced out image of the object through the vector embedding model. With this vector embedding you can then perform a search in the vector database, and the closest results will most likely be the version of the object.

    Hopefully this helps with the general rundown of what it would look like. Here is an example using Milvus and Towhee https://github.com/towhee-io/examples/tree/3a2207d67b10a246f....

    Disclaimer: I am a part of those two open source projects.

  • Deep Dive into Real-World Image Search Engine with Python
    2 projects | /r/Python | 17 May 2022
    I have shown how to Build an Image Search Engine in Minutes in the previous tutorial. Here is another one for how to optimize the algorithm, feed it with large-scale image datasets, and deploy it as a micro-service.
  • Build an Image Search Engine in Minutes
    3 projects | /r/Python | 15 May 2022
    The full tutorial is at https://github.com/towhee-io/examples/blob/main/image/reverse_image_search/build_image_search_engine.ipynb
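
The embed-then-search workflow described in the image recognition post above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Towhee or Milvus API: the three-dimensional vectors and the labels (`mug_v1`, etc.) are made-up stand-ins for real model embeddings, and the in-memory list stands in for a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index, query, top_k=1):
    """Brute-force search: score every stored vector, return the best top_k.

    `index` is a list of (label, vector) pairs, a stand-in for a vector database.
    """
    scored = [(cosine_similarity(vec, query), label) for label, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(label, score) for score, label in scored[:top_k]]


# Hypothetical embeddings of known versions of an object (assumed values).
index = [
    ("mug_v1", [0.9, 0.1, 0.0]),
    ("mug_v2", [0.1, 0.9, 0.0]),
    ("mug_v3", [0.0, 0.2, 0.9]),
]

# Hypothetical embedding of the cropped object from a new image.
query = [0.8, 0.2, 0.1]

results = search(index, query, top_k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

With these assumed vectors the closest match is `mug_v1`. A real deployment would replace the list scan with an approximate index, since scoring every vector stops scaling as the collection grows.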

ann-benchmarks

Posts with mentions or reviews of ann-benchmarks. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-10-30.
  • Using Your Vector Database as a JSON (Or Relational) Datastore
    1 project | news.ycombinator.com | 23 Apr 2024
    Off the top of my head, pgvector only supports 2 indexes, and those run in memory only. It doesn't support GPU indexing or disk-based indexing, and it doesn't separate queries from insertions.

    Also, the different people I've talked to struggle with scale past 100K-1M vectors.

    You can also have a look yourself from a performance perspective: https://ann-benchmarks.com/

  • ANN Benchmarks
    1 project | news.ycombinator.com | 25 Jan 2024
  • Approximate Nearest Neighbors Oh Yeah
    5 projects | news.ycombinator.com | 30 Oct 2023
    https://ann-benchmarks.com/ is a good resource covering those libraries and much more.
  • pgvector vs Pinecone: cost and performance
    1 project | dev.to | 23 Oct 2023
    We utilized the ANN Benchmarks methodology, a standard for benchmarking vector databases. Our tests used the dbpedia dataset of 1,000,000 OpenAI embeddings (1536 dimensions) and inner product distance metric for both Pinecone and pgvector.
  • Vector database is not a separate database category
    3 projects | news.ycombinator.com | 2 Oct 2023
    Data warehouses are columnar stores. They are very different from row-oriented databases like Postgres and MySQL. Operations on columns, e.g., aggregations (mean of a column), are very efficient.

    Most vector databases use one of a few different vector indexing libraries - FAISS, hnswlib, and scann (google only) are popular. The newer vector dbs, like weaviate, have introduced their own indexes, but I haven't seen any performance difference.

    Reference: https://ann-benchmarks.com/

  • How We Made PostgreSQL a Better Vector Database
    2 projects | news.ycombinator.com | 25 Sep 2023
    (Blog author here). Thanks for the question. In this case the index for both DiskANN and pgvector HNSW is small enough to fit in memory on the machine (8GB RAM), so there's no need to touch the SSD. We plan to test on a config where the index size is larger than memory (we couldn't this time due to limitations in ANN benchmarks [0], the tool we use).

    To your question about RAM usage, we provide a graph of index size. When enabling PQ, our new index is 10x smaller than pgvector HNSW. We don't have numbers for HNSWPQ in FAISS yet.

    [0]: https://github.com/erikbern/ann-benchmarks/

  • Do we think about vector dbs wrong?
    7 projects | news.ycombinator.com | 5 Sep 2023
  • Vector Search with OpenAI Embeddings: Lucene Is All You Need
    2 projects | news.ycombinator.com | 3 Sep 2023
    In terms of "All You Need" for Vector Search, ANN Benchmarks (https://ann-benchmarks.com/) is a good site to review when deciding what you need. As with anything complex, there often isn't a universal solution.

    txtai (https://github.com/neuml/txtai) can build indexes with Faiss, Hnswlib and Annoy. All 3 libraries have been around at least 4 years and are mature. txtai also supports storing metadata in SQLite, DuckDB and the next release will support any JSON-capable database supported by SQLAlchemy (Postgres, MariaDB/MySQL, etc).

  • Vector databases: analyzing the trade-offs
    5 projects | news.ycombinator.com | 20 Aug 2023
    pg_vector doesn't perform well compared to other methods, at least according to ANN-Benchmarks (https://ann-benchmarks.com/).

    txtai is more than just a vector database. It also has a built-in graph component for topic modeling that utilizes the vector index to autogenerate relationships. It can store metadata in SQLite/DuckDB with support for other databases coming. It has support for running LLM prompts right with the data, similar to a stored procedure, through workflows. And it has built-in support for vectorizing data into vectors.

    For vector databases that simply store vectors, I agree that it's nothing more than just a different index type.

  • Vector Dataset benchmark with 1536/768 dim data
    3 projects | news.ycombinator.com | 14 Aug 2023
    The reason https://ann-benchmarks.com is so good is that we can see a plot of recall vs latency. I can see you have some latency numbers in the leaderboard at the bottom, but it's very difficult to make a decision based on them.

    As a practitioner who works with vector databases every day, latency alone is meaningless to me, because I need to know if it's fast AND accurate, and what the tradeoff is! You can't have it both ways. So it would be helpful if you showed plots of this tradeoff, similar to ann-benchmarks.
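
The recall vs latency tradeoff raised in the last post can be measured with a short script. The sketch below is an illustration under stated assumptions, not the ann-benchmarks harness: the dataset is random, and the "approximate" search is a crude stand-in that scans only a random fraction of the data, whereas real libraries use index structures such as HNSW or IVF. Recall@k here means the share of the true k nearest neighbors that the approximate search returns.

```python
import random
import time


def squared_l2(a, b):
    """Squared Euclidean distance (ranking by it matches ranking by L2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(data, candidate_ids, query, k):
    """Return the ids of the k candidates closest to the query."""
    ranked = sorted(candidate_ids, key=lambda i: squared_l2(data[i], query))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dim, n, k = 16, 2000, 10
data = [[rng.random() for _ in range(dim)] for _ in range(n)]
query = [rng.random() for _ in range(dim)]

all_ids = list(range(n))
exact = top_k(data, all_ids, query, k)  # ground truth from a full scan

curve = []
for fraction in (0.1, 0.25, 0.5, 1.0):
    # Stand-in for an ANN index: only look at a random subset of the data.
    candidates = rng.sample(all_ids, int(n * fraction))
    start = time.perf_counter()
    approx = top_k(data, candidates, query, k)
    elapsed_ms = (time.perf_counter() - start) * 1000
    curve.append((fraction, recall_at_k(approx, exact), elapsed_ms))

for fraction, recall, ms in curve:
    print(f"scanned={fraction:.0%}  recall@{k}={recall:.2f}  latency={ms:.2f} ms")
```

Each row is one point on a recall vs latency curve: scanning less data is faster but misses more true neighbors, which is why a latency figure without its recall says little.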

What are some alternatives?

When comparing examples and ann-benchmarks you can also consider the following projects:

towhee - Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

pgvector - Open-source vector similarity search for Postgres

milvus-lite - A lightweight version of Milvus wrapped with Python.

faiss - A library for efficient similarity search and clustering of dense vectors.

gorilla-cli - LLMs for your CLI

Milvus - A cloud-native vector database, storage for next generation AI applications

anomalib - An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.

tlsh

EverythingApacheNiFi - EverythingApacheNiFi

vald - Vald. A Highly Scalable Distributed Vector Search Engine

harlequin - The SQL IDE for Your Terminal.

pgANN - Fast Approximate Nearest Neighbor (ANN) searches with a PostgreSQL database.