Creating an advanced search engine with PostgreSQL

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com


    https://supabase.com/blog/openai-embeddings-postgres-vector

    https://supabase.com/blog/chatgpt-supabase-docs

  • Typesense

    Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences

  • For something small with a minimal footprint, I'd recommend Typesense. https://github.com/typesense/typesense

  • fts-benchmark

    FTS benchmark comparing Postgres, Typesense, Meilisearch, OpenSearch and SQLite

  • It depends on what the user requirements are. FTS works pretty well with both Postgres and SQLite, in my experience.

    Here's a Git repo anyone interested can modify to run a cross-comparison on a specific dataset. It doesn't suggest the RDBMSs are outclassed in a small-scale FTS implementation.

    https://github.com/VADOSWARE/fts-benchmark
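To make the "FTS works well in both" point concrete, here is a minimal, runnable sketch using SQLite's FTS5 module (compiled into most standard Python `sqlite3` builds; the table, columns, and sample rows are invented for illustration):

```python
import sqlite3

# In-memory database; FTS5 ships with most standard sqlite3 builds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Postgres FTS", "full-text search with tsvector and tsquery"),
        ("SQLite FTS5", "built-in full-text search virtual table"),
        ("Typesense", "typo-tolerant in-memory search engine"),
    ],
)

# MATCH runs the full-text query; bm25() scores results (lower = better match).
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH 'tsvector' ORDER BY bm25(docs)"
).fetchall()
print([r[0] for r in rows])  # ['Postgres FTS']
```

The Postgres equivalent has the same shape: a `tsvector` column (usually with a GIN index) queried via the `@@` operator and `to_tsquery`; the benchmark repo above exercises both engines.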

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

  • If you're looking for a turnkey solution, I'd have to dig a little. I generally write a scraper in Python that dumps into a database or a flat file (depending on the number of records I'm hunting).

    Scraping is a separate subject, but once you write one you can generally reuse relevant portions for many others. If you can get adept at a scraping framework like Scrapy you can do it fairly quickly, but there aren't many tools that work out of the box for every site you'll encounter.

    Once you've written the spider, it can generally be rerun for updates unless the site's code is dramatically altered. It really comes down to whether the spider is coded brittlely (e.g., hunting for specific heading sizes or fonts) instead of grabbing the underlying JSON/XHR, which doesn't usually change often.

    1. https://scrapy.org
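To illustrate the grab-the-underlying-JSON approach over brittle layout scraping, here is a small self-contained sketch; the HTML snippet and the `__DATA__` script id are invented for illustration, and in practice you'd fetch the page first with Scrapy or similar:

```python
import json
import re

# Hypothetical page: many sites embed their data as a JSON blob in a
# <script> tag, which changes far less often than the HTML layout around it.
html = """
<html><body>
<h2 class="title-v3">Widgets</h2>
<script id="__DATA__" type="application/json">
{"products": [{"name": "widget", "price": 9.99}, {"name": "gadget", "price": 19.99}]}
</script>
</body></html>
"""

# Instead of hunting for the styled heading (brittle), pull the JSON blob.
match = re.search(
    r'<script id="__DATA__" type="application/json">\s*(\{.*?\})\s*</script>',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1))
names = [p["name"] for p in data["products"]]
print(names)  # ['widget', 'gadget']
```

A spider written this way keeps working across visual redesigns, since it only breaks when the site changes its data payload rather than its styling.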

  • zombodb

    Making Postgres and Elasticsearch work together like it's 2023

  • Curious, did you try zombodb? [https://www.zombodb.com/]
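For reference, a ZomboDB setup looks roughly like this per its README; I haven't run this myself, and the table name, index name, and Elasticsearch URL are placeholders, so treat the exact options as an assumption:

```sql
-- Sketch based on the ZomboDB README; assumes Elasticsearch at localhost:9200.
CREATE EXTENSION zombodb;

CREATE INDEX idx_products
          ON products
       USING zombodb ((products.*))
        WITH (url='http://localhost:9200/');

-- Full-text query via ZomboDB's ==> operator, which takes an
-- Elasticsearch query string:
SELECT * FROM products WHERE products ==> 'sports box';
```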

  • Crate

    CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

  • I'm wondering if CrateDB [https://github.com/crate/crate] could fit your use case.

    It's a relational SQL database that aims for compatibility with PostgreSQL. Internally it uses Lucene for storage and as such can offer full-text functionality, exposed via MATCH.
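A sketch of what that looks like in CrateDB; the table and index names here are invented, and the analyzer choice is just an example:

```sql
-- Hypothetical table with a Lucene-backed fulltext index on the body column.
CREATE TABLE documents (
    id   INTEGER,
    body TEXT,
    INDEX body_ft USING FULLTEXT (body) WITH (analyzer = 'english')
);

-- MATCH is CrateDB's full-text predicate; _score ranks by relevance.
SELECT id FROM documents
 WHERE MATCH(body_ft, 'search engine')
 ORDER BY _score DESC;
```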

  • readability

    A standalone version of the readability lib

  • Depending upon the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.

    https://github.com/mozilla/readability

    Btw, Readability is also available in a few other languages, like Kotlin:

    https://github.com/dankito/Readability4J

  • Readability4J

    A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.


  • knowledge

    A knowledge daemon to collect ideas and auto organize them, with SQLite (by daitangio)

  • https://github.com/daitangio/knowledge

    Give it a try; it is very powerful.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
