A Unix-style personal search engine and web crawler for your digital footprint

falcon

Mentions: 5 | Stars: 1,805 | Activity: 0.0 | Language: JavaScript

Chrome extension for full text history search! (by lengstrom)

Someone in the issues says it works even in Firefox: you just need to change the file's extension. I haven't tried it yet, though.
https://github.com/lengstrom/falcon/issues/73#issuecomment-6...

apollo

Mentions: 7 | Stars: 1,360 | Activity: 0.0 | Language: Go

A Unix-style personal search engine and web crawler for your digital footprint. (by amirgamil)
ripgrep-all

Mentions: 43 | Stars: 6,164 | Activity: 8.0 | Language: Rust

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

Looks very much like one of the ideas I've been thinking of building! My plan was to use an approach similar to rga for files ( https://github.com/phiresky/ripgrep-all ), have a webextension pull every webpage I visit (filtered via something like https://github.com/mozilla/readability ), and dump that into either SQLite with FTS5 or Postgres with FTS for search.
A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience.
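
The plan above (capture each visited page, reduce it to readable text, index it in SQLite with FTS5) might look roughly like the sketch below. It assumes the extraction step has already produced plain text and that the local SQLite build includes FTS5. The `pages` table and the `open_index`, `add_page`, and `search` helpers are hypothetical names, not part of rga or readability.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a full-text index of visited pages."""
    conn = sqlite3.connect(path)
    # FTS5 virtual table; the url column is stored but not tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(url UNINDEXED, title, body)"
    )
    return conn


def add_page(conn, url, title, body):
    """Store the already-extracted text of one visited page."""
    conn.execute(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()


def search(conn, query, limit=10):
    """Return (url, title) pairs for matching pages, best match first."""
    rows = conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_index()
    add_page(conn, "https://example.com/a", "Search notes", "searching PDFs with rga")
    add_page(conn, "https://example.com/b", "Cooking", "how to bake bread")
    print(search(conn, "rga"))
```

Keeping the URL as an `UNINDEXED` column means it is returned with each hit but never matched against, so a query only searches the title and body text.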
---
Two things I'd mention are:
1. Digital footprint usually means your info on other sites, not just things you've accessed. If I read a blog, that is not part of my footprint; but if I leave a comment on that blog, that comment is. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint
2. I don't really get what makes it UNIX-style (or what exactly do you mean by that? There seem to be many definitions), and the readme does not clarify much; it seems to expect me to notice it by myself.

readability

Mentions: 51 | Stars: 8,056 | Activity: 6.3 | Language: JavaScript

A standalone version of the readability lib


dogsheep-beta

Mentions: 2 | Stars: 178 | Activity: 0.0 | Language: Python

Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette

My version of this is https://dogsheep.github.io/ - the idea is to pull your digital footprint from various different sources (Twitter, Foursquare, GitHub etc) into SQLite database files, then run Datasette on top to explore them.
On top of that I built a search engine called Dogsheep Beta which builds a full-text search index across all of the different sources and lets you search in one place: https://github.com/dogsheep/dogsheep-beta
You can see a live demonstration of that search engine on the Datasette website: https://datasette.io/-/beta?q=dogsheep
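
The idea of one index spanning several sources could be sketched as follows. This is not Dogsheep Beta's actual schema or code: the `tweets` and `commits` tables, their columns, and the `search_index` layout are made up for illustration, and FTS5 is assumed to be available in the local SQLite build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two hypothetical source tables, standing in for imported data.
    CREATE TABLE tweets (id INTEGER PRIMARY KEY, full_text TEXT);
    CREATE TABLE commits (sha TEXT PRIMARY KEY, message TEXT);
    INSERT INTO tweets VALUES (1, 'Shipping a new datasette plugin');
    INSERT INTO tweets VALUES (2, 'Lunch was great');
    INSERT INTO commits VALUES ('abc123', 'Fix datasette template bug');

    -- One combined index; "source" records where each row came from.
    CREATE VIRTUAL TABLE search_index
        USING fts5(source UNINDEXED, item_id UNINDEXED, body);
    INSERT INTO search_index (source, item_id, body)
        SELECT 'tweets', CAST(id AS TEXT), full_text FROM tweets;
    INSERT INTO search_index (source, item_id, body)
        SELECT 'commits', sha, message FROM commits;
    """
)


def search(query):
    """Return matching (source, item_id) pairs across every source."""
    return conn.execute(
        "SELECT source, item_id FROM search_index "
        "WHERE search_index MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


def facet_counts(query):
    """Count matches per source, which gives a simple facet."""
    rows = conn.execute(
        "SELECT source, COUNT(*) FROM search_index "
        "WHERE search_index MATCH ? GROUP BY source",
        (query,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(search("datasette"))
    print(facet_counts("datasette"))
```

Storing the source name next to each indexed row is what lets a single query return hits from every table and still be broken down by origin.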

Shiori

Mentions: 58 | Stars: 8,685 | Activity: 8.3 | Language: Go

Simple bookmark manager built with Go
hn-search

Mentions: 1,617 | Stars: 524 | Activity: 2.9 | Language: TypeScript

Hacker News Search

It's failed to make the homepage a few times in the past: https://hn.algolia.com/?q=dogsheep - the one time it did make it was this one about Dogsheep Photos: https://news.ycombinator.com/item?id=23271053

parser

Mentions: 12 | Stars: 5,245 | Activity: 1.1 | Language: JavaScript

📜 Extract meaningful content from the chaos of a web page

Sadly not - I'd love it to do that, but the Pocket API doesn't make that available.
I've been contemplating building an add-on for Dogsheep that can do this for any given URL (from Pocket or other sources) by shelling out to an archive script such as https://github.com/postlight/mercury-parser - I collected some suggestions for libraries to use here: https://twitter.com/simonw/status/1401656327869394945
That way you could save a URL using Pocket, browser bookmarks, Pinboard, or anything else that I can extract saved URLs from, and a separate script could then archive the full contents for you.
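
Shelling out to an external extractor for each saved URL could look like the sketch below. The command is passed in as a parameter because the choice of tool is left open in the comment above; the sketch only assumes that the tool accepts a URL as its last argument and prints a JSON object on standard output. The `archive_url` helper, the `some-parser` placeholder, and the output file layout are hypothetical.

```python
import json
import subprocess
from hashlib import sha256
from pathlib import Path


def archive_url(url, command, out_dir, timeout=60):
    """Run an external extractor on `url` and save its JSON output.

    `command` is the extractor's argv without the URL, e.g. ["some-parser"]
    (a placeholder name, not a specific real tool).
    """
    result = subprocess.run(
        [*command, url],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    data = json.loads(result.stdout)  # raises if the output is not valid JSON

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after a hash of the URL so any URL maps to a safe filename.
    target = out_dir / (sha256(url.encode("utf-8")).hexdigest() + ".json")
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return target
```

Because the saved URLs can come from any source, the function takes only a URL and does not care whether it originated in Pocket, a browser bookmark, or somewhere else.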

go-find-hexagonal

Mentions: 1 | Stars: 1 | Activity: 0.0 | Language: Go

Applying what I learned from https://www.youtube.com/watch?v=oL6JBUk6tj0
zotero

Mentions: 254 | Stars: 9,176 | Activity: 9.9 | Language: JavaScript

Zotero is a free, easy-to-use tool to help you collect, organize, annotate, cite, and share your research sources.
notational-fzf-vim

Mentions: 13 | Stars: 1,116 | Activity: 0.0 | Language: Vim Script

Notational velocity for vim.

I use nb https://github.com/xwmx/nb for both bookmarks and notetaking. nb downloads a shallow copy of the link and stores it along with the bookmark.
All notes (and consequently bookmarks and their contents) are stored as plain-text markdown files - so there's no dependency on a proprietary format, and all the content becomes searchable.
If you're a vim-user, you can also get the notational-fzf-vim plugin (https://github.com/alok/notational-fzf-vim) and point it to the notes/bookmarks folder, and have full fuzzy search over all the content.
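
Because the notes are plain Markdown files, even a few lines of code can search them. The sketch below is a simple case-insensitive substring search, not the fuzzy matching that fzf provides and not part of nb; the `search_notes` helper and the assumption that notes end in `.md` are illustrative only.

```python
from pathlib import Path


def search_notes(notes_dir, term):
    """Yield (path, line_number, line) for each line containing `term`."""
    needle = term.casefold()
    for path in sorted(Path(notes_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                yield path, number, line.strip()
```

Calling `list(search_notes("path/to/notes", "sqlite"))` would collect every matching line along with the file and line number it came from.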

Camlistore

Mentions: 29 | Stars: 6,390 | Activity: 7.9 | Language: Go

Perkeep (née Camlistore) is your personal storage system for life: a way of storing, syncing, sharing, modelling and backing up content.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Search, File Sharing and Synchronization, Bookmarks and Link Sharing, Rollup, Vim
Post date: 26 Jul 2021