-
As someone who has scraped millions of items myself, I've had success with Geziyor (https://github.com/geziyor/geziyor), a scraping framework written in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
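To make the "same product data formatting" point concrete, here is a minimal sketch in plain Go (standard library only, not Geziyor's API). It assumes the store serves a JSON product listing, as many Shopify storefronts do at `/products.json`. The field names and the sample payload are assumptions for illustration, so check them against what the target store actually returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Variant holds the assumed per-variant fields of a Shopify-style listing.
type Variant struct {
	Title string `json:"title"`
	Price string `json:"price"` // assumed to arrive as a decimal string
}

// Product mirrors a small, assumed subset of the product fields.
type Product struct {
	Title    string    `json:"title"`
	Handle   string    `json:"handle"`
	Variants []Variant `json:"variants"`
}

// productPage is the assumed top-level wrapper of one listing page.
type productPage struct {
	Products []Product `json:"products"`
}

// parseProducts decodes one page of a product listing into Go structs.
func parseProducts(body []byte) ([]Product, error) {
	var page productPage
	if err := json.Unmarshal(body, &page); err != nil {
		return nil, fmt.Errorf("decode product page: %w", err)
	}
	return page.Products, nil
}

// samplePayload is made-up data standing in for a real HTTP response body.
const samplePayload = `{"products":[{"title":"Example Mug","handle":"example-mug","variants":[{"title":"Default","price":"12.50"}]}]}`

func main() {
	products, err := parseProducts([]byte(samplePayload))
	if err != nil {
		panic(err)
	}
	// Print one line per variant: product title, variant title, price.
	for _, p := range products {
		for _, v := range p.Variants {
			fmt.Printf("%s (%s): %s\n", p.Title, v.Title, v.Price)
		}
	}
}
```

Because the structure is the same from store to store, one decoder like this can be reused across sites. In a real crawler the payload would come from the HTTP response body rather than a constant.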
-
showcase-songs-search
A site to instantly search 32M songs from the MusicBrainz songs database, using Typesense Search (an open source alternative to Algolia / ElasticSearch)
I'm biased, but I'd recommend exploring Typesense for search.
It's an open source alternative to Algolia + Pinecone and e-commerce is a very common use-case.
Here's a live demo with 32M songs: https://songs-search.typesense.org/
Disclaimer: I work on Typesense.
-
usearch
Fast Open-Source Search & Clustering engine × for Vectors & Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram
As you scale, you may benefit from these two projects I maintain, which Big Tech also uses :)
https://github.com/unum-cloud/usearch - for faster search
https://github.com/unum-cloud/uform - for cheaper multi-lingual multi-modal embeddings
-
uform
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and video, up to 5x faster than OpenAI CLIP and LLaVA