A Unix-style personal search engine and web crawler for your digital footprint

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • falcon

    Chrome extension for full text history search! (by lengstrom)

  • In the issues, someone says that it works even in Firefox. You just need to change the extension of the file. I haven't tried it yet, though.

    https://github.com/lengstrom/falcon/issues/73#issuecomment-6...

  • apollo

    A Unix-style personal search engine and web crawler for your digital footprint. (by amirgamil)

  • ripgrep-all

    rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

  • Looks very much like one of the ideas I've been thinking of building! The way I planned to do it was to use a similar approach to rga for files ( https://github.com/phiresky/ripgrep-all ) and have a web extension pull all the webpages I visit (filtered via something like https://github.com/mozilla/readability ), then dump that into either SQLite with FTS5 or Postgres with FTS for search.

    A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience.

    ---

    Two things I'd mention are:

    1. Digital footprint usually means your info on other sites, not just things you've accessed. If I read a blog, that is not part of my footprint, but if I leave a comment on that blog, that comment is part of it. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint

    2. I don't really get what makes it UNIX-style, or what exactly you mean by that. There seem to be many definitions, and the readme does not clarify much; it seems to expect me to notice it by myself.
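
The storage half of the plan in the comment above (dump extracted page text into SQLite with FTS5, then search it) could be sketched roughly as follows. This is a minimal illustration, not code from apollo or rga. The table name `pages`, its columns, and the helper names are assumptions made for the example, and it assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a database with one FTS5 table for saved pages."""
    conn = sqlite3.connect(path)
    # `pages` and its columns are hypothetical names for this sketch.
    # The url column is stored but not tokenized, so queries match only
    # the title and body text.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(url UNINDEXED, title, body)"
    )
    return conn


def add_page(conn, url, title, body):
    """Store the readable text of one visited page."""
    conn.execute(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()


def search(conn, query, limit=10):
    """Return (url, title) pairs for pages matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

In this sketch the browser extension would call something like `add_page` with the output of a readability-style filter. Postgres full-text search would fill the same role with different SQL.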

  • readability

    A standalone version of the readability lib

  • dogsheep-beta

    Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette

  • My version of this is https://dogsheep.github.io/ - the idea is to pull your digital footprint from various different sources (Twitter, Foursquare, GitHub etc) into SQLite database files, then run Datasette on top to explore them.

    On top of that I built a search engine called Dogsheep Beta which builds a full-text search index across all of the different sources and lets you search in one place: https://github.com/dogsheep/dogsheep-beta

    You can see a live demonstration of that search engine on the Datasette website: https://datasette.io/-/beta?q=dogsheep
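
The general idea described above, one full-text index built across several SQLite tables and then searched in one place with per-source counts, might look something like the sketch below. It is not how Dogsheep Beta is actually implemented. The `search_index` table, the `build_search_index` and `facet_counts` helpers, and the shape of the `sources` mapping are all assumptions for illustration, and the sketch assumes SQLite was built with FTS5.

```python
import sqlite3


def build_search_index(conn, sources):
    """Rebuild a single FTS5 index from several source tables.

    `sources` maps a source label (e.g. "tweet") to a SELECT statement
    that yields the columns item_id, title, and body. The statements are
    assumed to be trusted configuration, not user input.
    """
    conn.execute("DROP TABLE IF EXISTS search_index")
    conn.execute(
        "CREATE VIRTUAL TABLE search_index "
        "USING fts5(source UNINDEXED, item_id UNINDEXED, title, body)"
    )
    for label, select_sql in sources.items():
        conn.execute(
            "INSERT INTO search_index (source, item_id, title, body) "
            f"SELECT ?, item_id, title, body FROM ({select_sql})",
            (label,),
        )
    conn.commit()


def search(conn, query):
    """Return (source, item_id, title) rows matching the query, best match first."""
    return conn.execute(
        "SELECT source, item_id, title FROM search_index "
        "WHERE search_index MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


def facet_counts(conn, query):
    """Count matching rows per source, which is the basis of a faceted view."""
    rows = conn.execute(
        "SELECT source, COUNT(*) FROM search_index "
        "WHERE search_index MATCH ? GROUP BY source",
        (query,),
    ).fetchall()
    return dict(rows)
```

Each entry in `sources` adapts one table to the shared three-column shape, so adding a new data source only means writing one more SELECT.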

  • Shiori

    Simple bookmark manager built with Go

  • It's failed to make the homepage a few times in the past: https://hn.algolia.com/?q=dogsheep - the one time it did make it was this one about Dogsheep Photos: https://news.ycombinator.com/item?id=23271053

  • parser

    📜 Extract meaningful content from the chaos of a web page

  • Sadly not - I'd love it to do that, but the Pocket API doesn't make that available.

    I've been contemplating building an add-on for Dogsheep that can do this for any given URL (from Pocket or other sources) by shelling out to an archive script such as https://github.com/postlight/mercury-parser - I collected some suggestions for libraries to use here: https://twitter.com/simonw/status/1401656327869394945

    That way you could save a URL using Pocket or browser bookmarks or Pinboard or anything else that I can extract saved URLs from, and a separate script could then archive the full contents for you.
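
The "separate script" idea above, shelling out to an external archiver for each saved URL, could be sketched as below. This is only an illustration of the pattern, not an existing Dogsheep add-on. The `archive_urls` helper is hypothetical, and the archiver command line is left as a parameter because the exact invocation of any particular tool is not given in the comment.

```python
import subprocess


def archive_urls(urls, command, timeout=60):
    """Run an external archiver once per URL and collect what it prints.

    `command` is the archiver's argument list (an assumption supplied by
    the caller); the URL is appended as the final argument. URLs whose
    command exits with a non-zero status are skipped.
    """
    archived = {}
    for url in urls:
        result = subprocess.run(
            command + [url],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode == 0:
            archived[url] = result.stdout
    return archived
```

The returned mapping of URL to captured output could then be written to a SQLite table alongside the bookmark records, so the archived text becomes searchable with everything else.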

  • go-find-hexagonal

    Applying what I learned from https://www.youtube.com/watch?v=oL6JBUk6tj0

  • zotero

    Zotero is a free, easy-to-use tool to help you collect, organize, annotate, cite, and share your research sources.

  • notational-fzf-vim

    Notational velocity for vim.

  • I use nb https://github.com/xwmx/nb for both bookmarks and notetaking. nb downloads a shallow copy of the link and stores it along with the bookmark.

    All notes (and consequently bookmarks and their contents) are stored as plain-text markdown files - so there's no dependency on a proprietary format, and all the content becomes searchable.

    If you're a vim-user, you can also get the notational-fzf-vim plugin (https://github.com/alok/notational-fzf-vim) and point it to the notes/bookmarks folder, and have full fuzzy search over all the content.

  • Camlistore

    Perkeep (née Camlistore) is your personal storage system for life: a way of storing, syncing, sharing, modelling and backing up content.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
