A search engine in 80 lines of Python

lofi-dx

2 7 8.6 TypeScript

A small, fast, local-first, searchable index for client-side apps, written in TypeScript. Supports required, negated, and phrase queries.

Hey, I tackled phrase matching in my toy project here: https://github.com/vasilionjea/lofi-dx/blob/main/test/search...
I think I tested it thoroughly but any feedback would be appreciated!
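
Phrase matching of the kind mentioned above is commonly built on a positional inverted index. The following is a minimal Python sketch of that general technique, not the lofi-dx implementation (which is written in TypeScript); the names `tokenize`, `build_positional_index`, and `phrase_search` are hypothetical, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the phrase's terms consecutively."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in candidates:
        # Position sets of the second, third, ... term in this document.
        later = [set(index[term][doc_id]) for term in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # The phrase matches if term k sits exactly k positions after the first term.
            if all(start + k in positions for k, positions in enumerate(later, start=1)):
                matches.append(doc_id)
                break
    return sorted(matches)
```

Given the documents `{1: "a search engine in python", 2: "python search in an engine", 3: "the search engine"}`, searching for "search engine" returns documents 1 and 3: document 2 contains both words, but not next to each other.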

searcharray

4 159 9.7 Python

Full text search in your Pandas dataframe

This is really cool. I have a pretty fast BM25 search engine in Pandas I've been working on for local testing.
https://github.com/softwaredoug/searcharray
Why Pandas? Because BM25 is one thing, but you also want to combine it with other factors (recency, popularity, etc.) that are easily computed in pandas / numpy...
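
To make the idea of combining BM25 with other factors concrete, here is a rough sketch in plain Python. It does not use searcharray or its API; `bm25_scores` and `apply_recency` are hypothetical helpers, the IDF is the Lucene-style variant, and the exponential decay with a 30-day half-life is just one assumed way to fold recency into a score.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Lucene-style IDF, which is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def apply_recency(scores, age_days, half_life_days=30.0):
    """Multiply each score by an exponential decay based on document age."""
    return [s * 0.5 ** (age / half_life_days) for s, age in zip(scores, age_days)]
```

With the scores held in a pandas or numpy column instead of a list, the recency step becomes a single vectorized multiplication, which is the convenience the comment is pointing at.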

Django-link-archive

12 13 9.6 Python

Link archive for a NAS drive

I have dabbled a little in this subject myself. Some of my notes:
- Some RSS feeds are protected by Cloudflare. It is true, however, that this is not necessary for the majority of blogs. If you would like to do more, then Selenium would be a way to handle Cloudflare-protected links.
- Sometimes even headless Selenium is not enough, and a full-blown browser driven by Selenium is necessary to fool its protection.
- Sometimes even that is not enough.
- Then I started to wonder why some RSS feeds are so well protected by Cloudflare, but who am I to judge?
- Sometimes it is beneficial to disguise the user agent. I feel bad for setting my user agent to Chrome, but again, why are RSS feeds so well protected?
- You cannot parse and read the entire Internet, so you always need to think about compromises. For example, in one of my projects I have narrowed the area of my searches to domains only. Now I can find most of the common domains, and I sort them by their "importance".
- RSS links do change. There needs to be an automated way to disable some feeds, to avoid checking inactive domains.
- I do not see any configurable timeout for reading a page, but I am not familiar with aiohttp. Some pages might waste your time.
- I hate that some RSS feeds are not configured properly. Some sites do not provide a valid "link" metadata tag with "application/rss+xml". Some RSS feeds have naive titles like "Home", or no title at all. Such a waste of opportunity.
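
The last note refers to feed autodiscovery: a page advertises its feed through a "link" tag with rel="alternate" and a feed MIME type. Below is a small sketch of how a crawler might look for such tags using only the Python standard library. It is not taken from Django-link-archive; `FeedLinkParser` and `discover_feeds` are hypothetical names, and accepting "application/atom+xml" alongside "application/rss+xml" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds; Atom is included as an assumption.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect (absolute_url, title) pairs from feed autodiscovery link tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attributes without a value arrive as None; normalize them to "".
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            url = urljoin(self.base_url, attr["href"])
            self.feeds.append((url, attr.get("title", "")))


def discover_feeds(html, base_url):
    """Return every feed advertised in `html`, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

An empty result means the site advertises no feed at all, and an empty or generic title such as "Home" is exactly the kind of misconfiguration the note complains about.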
My RSS feed parser, link archiver, and web crawler: https://github.com/rumca-js/Django-link-archive. The file rsshistory/webtools.py could be especially interesting. It is not advanced programming craft, but it got the job done.
Additionally, in another project I have collected around 2378 personal sites. I collect domains in https://github.com/rumca-js/Internet-Places-Database/tree/ma... . These files are JSON. All personal sites have the tag "personal".
Most of the entries are collected from:
https://nownownow.com/
https://searchmysite.net/
I also wanted to process domains from https://downloads.marginalia.nu/, but I haven't had time to study the structure of the files.
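
One of the notes above asks for an automated way to disable feeds that have gone inactive. A simple approach, sketched here under assumptions, is to count consecutive failed fetches and switch a feed off once a threshold is reached. `FeedStatus`, `record_fetch`, and the default threshold of 5 are hypothetical and not taken from either project.

```python
from dataclasses import dataclass


@dataclass
class FeedStatus:
    """Bookkeeping for one feed URL."""
    url: str
    consecutive_failures: int = 0
    enabled: bool = True


def record_fetch(feed, ok, max_failures=5):
    """Update `feed` after a fetch attempt and disable it after repeated failures."""
    if ok:
        # Any success resets the counter, so only unbroken failure runs count.
        feed.consecutive_failures = 0
        return feed
    feed.consecutive_failures += 1
    if feed.consecutive_failures >= max_failures:
        feed.enabled = False
    return feed
```

A crawler would call `record_fetch` after every attempt and skip feeds whose `enabled` flag is false. Note that in this sketch a disabled feed stays disabled until something else re-enables it.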

Internet-Places-Database

11 21 9.3

Database of Internet places. Mostly domains

www.mechaelephant.com

3 1 8.8 JavaScript

website for www.mechaelephant.com

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Recess: Interactive RSS Aggregator
1 project | news.ycombinator.com | 22 Apr 2024

Recess Manifesto-Ish
1 project | news.ycombinator.com | 22 Apr 2024

DatoRSS - RSS Search Engine without frills
1 project | /r/coolgithubprojects | 15 Jan 2021

DatoRSS - like google but for RSS
1 project | /r/opensource | 27 Dec 2020

Show HN: OpenOrb, a curated search engine for Atom and RSS feeds
7 projects | news.ycombinator.com | 22 Apr 2024

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
search-engine RSS Aggregator Pandas JavaScript
Post date: 7 Feb 2024
