A search engine in 80 lines of Python

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • lofi-dx

    A small, fast, local-first, searchable index for client-side apps, written in TypeScript. Supports required, negated, and phrase queries.

  • Hey, I tackled phrase matching in my toy project here: https://github.com/vasilionjea/lofi-dx/blob/main/test/search...

    I think I tested it thoroughly but any feedback would be appreciated!
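
For readers unfamiliar with phrase matching, here is a minimal sketch of one common approach, a positional inverted index. It is written in Python to match the original post and is not lofi-dx's actual implementation; the function names `build_positional_index` and `phrase_match` and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using naive whitespace tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return ids of documents that contain the terms of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()

    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    hits = set()
    for doc_id in candidates:
        following = [set(index[t][doc_id]) for t in terms[1:]]
        # A start position matches if term k appears at start + k for every k.
        for start in index[terms[0]][doc_id]:
            if all(start + k + 1 in positions
                   for k, positions in enumerate(following)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "the quick brown fox",
    2: "brown quick fox",
    3: "a quick brown dog",
}
index = build_positional_index(docs)
print(phrase_match(index, "quick brown"))
```

Document 2 contains both words but in the wrong order, so it is not returned. A real index would also need proper tokenization and some handling of punctuation and stop words.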

  • searcharray

    Full text search in your Pandas dataframe

  • This is really cool. I have a pretty fast BM25 search engine in Pandas I've been working on for local testing.

    https://github.com/softwaredoug/searcharray

    Why Pandas? Because BM25 is one thing, but you also want to combine it with other factors (recency, popularity, etc.) that are easily computed in pandas / numpy...
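
To make the idea of combining BM25 with other factors concrete, here is a rough pure-Python sketch. It does not use searcharray or its API; the IDF formula is the common Lucene-style variant, and the `blend` function, its `weight` parameter, and the recency values are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    counts = [Counter(tokens) for tokens in tokenized]
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    scores = [0.0] * n_docs
    for term in query.lower().split():
        df = sum(1 for c in counts if term in c)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, c in enumerate(counts):
            tf = c[term]
            if tf == 0:
                continue
            # Length normalization: longer documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(tokenized[i]) / avg_len)
            scores[i] += idf * tf * (k1 + 1) / norm
    return scores


def blend(bm25, recency, weight=0.3):
    """Hypothetical blend: boost BM25 by a recency signal in the range [0, 1]."""
    return [s * (1 + weight * r) for s, r in zip(bm25, recency)]


docs = ["python search engine", "cooking pasta recipes", "search search tips"]
scores = bm25_scores("search", docs)
recency = [1.0, 0.0, 0.0]  # assumed values: only the first document is recent
blended = blend(scores, recency, weight=1.0)
```

With pandas or numpy the same blend becomes a single vectorized expression over columns, which is presumably the convenience the comment is pointing at.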

  • I have dabbled a little in this subject myself. Some of my notes:

    - some RSS feeds are protected by Cloudflare. For the majority of blogs this is not an issue, though. If you want to go further, Selenium is one way to handle Cloudflare-protected links

    - sometimes even headless Selenium is not enough, and a full-blown browser driven by Selenium is needed to fool its protection

    - sometimes even that is not enough

    - then I started to wonder why some RSS feeds are so well protected by Cloudflare, but who am I to judge?

    - sometimes it helps to disguise the user agent. I feel bad for setting my user agent to Chrome, but again, why are RSS feeds so well protected?

    - you cannot read and parse the entire Internet, so you always need to think about compromises. For example, in one of my projects I narrowed my searches to domains only. Now I can find most of the common domains, and I sort them by their "importance"

    - RSS links do change. There needs to be an automated way to disable some feeds, to avoid checking inactive domains

    - I do not see any configurable timeout for reading a page, but I am not familiar with aiohttp. Some pages might waste your time

    - I hate that some RSS feeds are not configured properly. Some sites do not provide a valid "link" tag with "application/rss+xml". Some RSS feeds have naive titles like "Home", or no title at all. Such a waste of opportunity
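
The last note refers to feed autodiscovery, where a page advertises its feed through a "link" tag in the HTML head. As a minimal sketch using only the Python standard library (this is not code from the Django-link-archive project; `FeedLinkFinder` and `find_feeds` are hypothetical names), discovery could look like this:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect (absolute_url, title) pairs from feed autodiscovery link tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower()
        href = attr.get("href", "")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append((urljoin(self.base_url, href), attr.get("title", "")))


def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" title="Home" href="/feed.xml">'
    '<link rel="stylesheet" href="/style.css">'
    '</head><body></body></html>'
)
print(find_feeds(page, "https://example.com/blog/"))
```

A site that omits the tag, or gives it an unhelpful title such as "Home", yields an empty or uninformative result here, which is the wasted opportunity described above.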

    My RSS feed parser, link archiver, web crawler: https://github.com/rumca-js/Django-link-archive. The file rsshistory/webtools.py may be especially interesting. It is not advanced programming craft, but it gets the job done.

    Additionally, in another project I have collected around 2378 personal sites. I collect domains in https://github.com/rumca-js/Internet-Places-Database/tree/ma... . The files are JSON. All personal sites have the tag "personal".

    Most of them were collected from:

    https://nownownow.com/

    https://searchmysite.net/

    I also wanted to process domains from https://downloads.marginalia.nu/, but I haven't had time to study the structure of the files
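
Two of the notes above, a per-page timeout and automatically disabling dead feeds, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crawler from the original post: `fetch_with_deadline`, `FeedRegistry`, and the failure threshold are hypothetical, and the `fetch` argument stands in for whatever coroutine actually downloads a page. (aiohttp itself also accepts an `aiohttp.ClientTimeout` object via the `timeout` argument of `ClientSession`.)

```python
import asyncio


async def fetch_with_deadline(fetch, url, seconds):
    """Await fetch(url), but give up and return None after `seconds`."""
    try:
        return await asyncio.wait_for(fetch(url), timeout=seconds)
    except asyncio.TimeoutError:
        return None


class FeedRegistry:
    """Disable a feed after several consecutive failed checks."""

    def __init__(self, max_failures=3):  # threshold is an assumed value
        self.max_failures = max_failures
        self.failures = {}
        self.disabled = set()

    def record(self, url, ok):
        if ok:
            self.failures[url] = 0  # a success resets the streak
            return
        self.failures[url] = self.failures.get(url, 0) + 1
        if self.failures[url] >= self.max_failures:
            self.disabled.add(url)

    def is_active(self, url):
        return url not in self.disabled


async def slow_page(url):
    # Stand-in for a server that takes too long to answer.
    await asyncio.sleep(1)
    return "body"


async def fast_page(url):
    return "body"


registry = FeedRegistry(max_failures=2)
for _ in range(2):
    body = asyncio.run(fetch_with_deadline(slow_page, "https://example.com/rss", 0.01))
    registry.record("https://example.com/rss", body is not None)
```

After two timeouts in a row the example feed is marked inactive and a crawler could skip it on later runs. Whether a disabled feed should ever be retried is a design choice this sketch leaves open.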

  • Internet-Places-Database

    Database of Internet places. Mostly domains

  • www.mechaelephant.com

    website for www.mechaelephant.com

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
