Creating an advanced search engine with PostgreSQL

nextjs-openai-doc-search

Mentions: 8 | Stars: 1,487 | Activity: 5.9 | Language: TypeScript

Template for building your own custom ChatGPT style doc search powered by Next.js, OpenAI, and Supabase.

https://supabase.com/blog/openai-embeddings-postgres-vector
https://supabase.com/blog/chatgpt-supabase-docs

Typesense

Mentions: 129 | Stars: 17,876 | Activity: 9.8 | Language: C++

Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences

For something small with a minimal footprint, I'd recommend Typesense. https://github.com/typesense/typesense

fts-benchmark

Mentions: 1 | Stars: 20 | Activity: 10.0 | Language: JavaScript

FTS benchmark comparing Postgres, Typesense, Meilisearch, OpenSearch and SQLite

It depends on the user's requirements. In my experience, FTS works pretty well with both Postgres and SQLite.
Here's a Git repo that anyone interested can modify to run a cross-comparison on a specific dataset. Its results don't seem to indicate that the RDBMSs are outclassed in a small-scale FTS implementation.
https://github.com/VADOSWARE/fts-benchmark
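
To make the small-scale case concrete, here is a minimal sketch of full-text search in SQLite using only Python's standard library. It assumes the local SQLite build includes the FTS5 extension (common, but not guaranteed), and the table name `docs`, the helper names, and the sample rows are all made up for illustration; none of it comes from the benchmark repo.

```python
import sqlite3

# Hypothetical sample data, for illustration only.
SAMPLE_DOCS = [
    ("Postgres full-text search", "tsvector and tsquery power text search in Postgres"),
    ("SQLite FTS5", "the fts5 extension adds text search to SQLite"),
    ("Web scraping", "spiders crawl pages and store the records in a database"),
]


def build_index(docs):
    """Create an in-memory FTS5 index from (title, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS5.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn, query, limit=5):
    """Return titles of matching rows, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [title for (title,) in rows]


if __name__ == "__main__":
    index = build_index(SAMPLE_DOCS)
    print(search(index, "sqlite"))
```

The query string is passed through as FTS5 query syntax, so user input containing characters such as hyphens or quotes would need escaping before it reaches `search`.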

Scrapy

Mentions: 180 | Stars: 50,896 | Activity: 9.6 | Language: Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

If you're looking for a turn-key solution, I'd have to dig a little. I generally write a scraper in Python that dumps into a database or flat file (depending on the number of records I'm hunting).
Scraping is a separate subject, but once you've written one scraper you can generally reuse the relevant portions for many others. If you get adept at a scraping framework like Scrapy [1], you can do it fairly quickly, but there aren't many tools that work out of the box for every site you'll encounter.
Once you've written the spider, it can generally be rerun for updates unless the site's code is dramatically altered. It really comes down to how brittle the spider is: one that hunts for specific heading sizes or fonts breaks easily, while one that grabs the underlying JSON/XHR, which doesn't usually change frequently, holds up better.
[1] https://scrapy.org
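
As a rough illustration of the "grab the JSON and dump it into a database" approach, the sketch below ingests a JSON payload into SQLite and upserts by URL, so rerunning it on fresh data updates existing rows instead of duplicating them. It deliberately leaves out the fetching step and does not use Scrapy. The payload shape (`items`, `url`, `title`, `body`), the `pages` table, and the function names are assumptions made for this example, and the upsert syntax needs SQLite 3.24 or newer.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the SQLite store; `pages` is a hypothetical table name."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY, title TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def ingest(conn, payload):
    """Upsert every record in a JSON payload and return how many were processed.

    Assumes the payload looks like {"items": [{"url": ..., "title": ..., "body": ...}]},
    which is an invented shape, not any particular site's API.
    """
    records = json.loads(payload)["items"]
    conn.executemany(
        "INSERT INTO pages (url, title, body) VALUES (:url, :title, :body) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title, body = excluded.body",
        records,
    )
    conn.commit()
    return len(records)


if __name__ == "__main__":
    store = open_store()
    sample = json.dumps(
        {"items": [{"url": "https://example.com/a", "title": "A", "body": "hello"}]}
    )
    print(ingest(store, sample))
```

Keying on the URL is one possible choice; a site that exposes stable record IDs in its JSON would make a better primary key.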

zombodb

Mentions: 23 | Stars: 4,608 | Activity: 8.3 | Language: PLpgSQL

Making Postgres and Elasticsearch work together like it's 2023

Curious, did you try zombodb? [https://www.zombodb.com/]

Crate

Mentions: 6 | Stars: 3,957 | Activity: 9.9 | Language: Java

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

I'm wondering if CrateDB [https://github.com/crate/crate] could fit your use case.
It's a relational SQL database that aims for compatibility with PostgreSQL. Internally it uses Lucene for storage, and as such it can offer full-text functionality, which is exposed via MATCH.

readability

Mentions: 51 | Stars: 8,056 | Activity: 6.3 | Language: JavaScript

A standalone version of the readability lib

Depending on the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.
https://github.com/mozilla/readability
Btw, Readability is also available in a few other languages, like Kotlin:
https://github.com/dankito/Readability4J

Readability4J

Mentions: 3 | Stars: 135 | Activity: 4.3 | Language: HTML

A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.

knowledge

Mentions: 1 | Stars: 1 | Activity: 3.0 | Language: Shell

A knowledge daemon to collect ideas and auto organize them, with SQLite (by daitangio)

https://github.com/daitangio/knowledge
Give it a try, it is very powerful

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
