Show HN: Search Engine for Blogs

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • refined.blog

    curated list of personal blogs

    That's an awesome implementation, and it works really neatly. I have been thinking of adding this capability to https://refined.blog/. Also, if you need tagged blog sites, you can use our blog list. I previously posted it on HN, so there are some good blogs in that thread (https://news.ycombinator.com/item?id=27973836).

  • Typesense

    Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences

    > This is definitely very cool, as I've been looking for something like this since Technorati (which was originally a blog search engine).

    Technorati was one of the inspirations here so that's great to hear.

    > Would love to hear details about how you created the database, the infrastructure, etc if it's not a trade secret. Kudos on the launch!

    Sure, it's actually fairly simple! The search backend itself is running on Typesense [0], which was very quick and easy to set up.

    Due to the way ranking is calculated, I can actually avoid doing any real web crawling (though, I may add that in soon to help increase the index size). Ranking is based on submission to online communities, so all I really need is those submissions.

    Using the Reddit, HN and Twitter APIs, I search for any submissions related to any blogs in the database, then those submissions give me the post URLs.

    Once I have the post URLs, I just need to request those specific URLs to get the post data.

    Then there are scripts for things like content extraction, inflation calculation, currency conversion, etc.

    All of those scripts are in Python.

    The frontend is a simple React app built with Next.js. All pages are statically generated.

    Let me know if there are any more questions!

    0. https://typesense.org/
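The reply above says ranking is driven by community submissions rather than by crawling. A minimal sketch of that idea follows. It assumes submissions have already been fetched and reduced to plain dictionaries. The field names (`source`, `url`, `points`), the URL normalization, and the "sum the points" scoring rule are all illustrative assumptions, not the author's actual schema or formula.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so the same post shared on several sites matches."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return f"{host}{parts.path.rstrip('/')}"


def rank_posts(submissions, known_blogs):
    """Sum submission points per post, keeping only posts hosted on known blogs.

    The scoring rule (a plain sum) is a placeholder assumption.
    """
    scores = defaultdict(int)
    for sub in submissions:
        key = normalize_url(sub["url"])
        host = key.split("/", 1)[0]
        if host in known_blogs:
            scores[key] += sub["points"]
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data standing in for Reddit, HN and Twitter API results.
submissions = [
    {"source": "hn", "url": "https://www.example.com/posts/a/", "points": 120},
    {"source": "reddit", "url": "http://example.com/posts/a", "points": 45},
    {"source": "hn", "url": "https://blog.example.org/b", "points": 80},
    {"source": "twitter", "url": "https://unknown.example.net/c", "points": 999},
]
known_blogs = {"example.com", "blog.example.org"}

ranking = rank_posts(submissions, known_blogs)
```

Here the two submissions of the same post collapse into one entry, and the submission from a host outside the blog database is ignored. The resulting post URLs are what would then be fetched for content extraction.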

  • opensearch

    OpenSearch is a collection of simple formats for the sharing of search results. (by dewitt)

    I still quite like the idea of having a number of independent search engines each indexing their own specialist subjects, and one or more federated search front-ends which can pull these together.

    Doing it with APIs is a little tricky to make work in a usable way though. There have been various attempts at standardised APIs, e.g. OpenSearch[0], and metasearch engines like searX[1] have what are essentially pluggable scrapers, but there are still fundamental issues like getting different results at different times and having different ranking mechanisms.

    Integrating at the index level could make a more usable search, but there are lots of other issues with this approach, e.g. those with Apache Solr's Cross Data Centre Replication[2]. And yes, the volumes of data may also be an issue, given a search index will typically be slightly larger than the compressed data size, e.g. the 16M Wikipedia docs are approx 32GB compressed and approx 40.75GB in a search index.

    [0] https://github.com/dewitt/opensearch , unrelated to Amazon's Elasticsearch fork

    [1] https://github.com/searx/searx

    [2] https://solr.apache.org/guide/8_11/cross-data-center-replica...
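One of the issues raised above is that independent engines use different ranking mechanisms, so their scores cannot be compared directly. A common workaround for a federated front-end is to merge on rank position instead of score. The sketch below uses reciprocal rank fusion, a technique the comment does not mention. The engine result lists and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked URL lists from several engines using only rank positions.

    Each appearance of a URL at 1-based position `rank` adds 1 / (k + rank)
    to its fused score, so no engine-specific score is ever compared.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by URL for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from two specialist engines for the same query.
blog_engine = ["https://a.example/post", "https://b.example/post", "https://c.example/post"]
docs_engine = ["https://b.example/post", "https://d.example/post", "https://a.example/post"]

merged = reciprocal_rank_fusion([blog_engine, docs_engine])
```

URLs returned by both engines rise to the top, while results seen by only one engine keep their relative order. This does not address the other problem mentioned, that an engine may return different results at different times.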

  • Searx

    Privacy-respecting metasearch engine (discontinued)
