Show HN: Full text search Project Gutenberg (60m paragraphs)

This page summarizes the projects mentioned and recommended in the original post on

  • rum

    RUM access method - inverted index with additional information in posting lists

    I suggest having a look at RUM if you haven’t yet. It solves the issue of slow ranking in PostgreSQL FTS.
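To illustrate why keeping extra information in the posting lists can help with ranking, here is a toy sketch in Python. It is not RUM's implementation or API; the function names and the scoring rule (plain term frequency) are assumptions made for the example. The point it shows is only that when each posting carries positions, a score can be computed from the index alone, without re-reading the documents.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a posting list of the form {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_ranked(index, terms):
    """Return ids of documents containing every term, best score first.

    The score is the total number of occurrences of the query terms,
    read directly from the posting lists (hypothetical scoring rule).
    """
    postings = [index.get(t.lower(), {}) for t in terms]
    if not postings:
        return []
    # Documents that appear in every term's posting list.
    matches = set(postings[0]).intersection(*postings[1:])
    scores = {d: sum(len(p[d]) for p in postings) for d in matches}
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Tiny made-up corpus, for illustration only.
DOCS = {1: "the whale the sea", 2: "the sea", 3: "a whale"}
INDEX = build_index(DOCS)
```

With this corpus, `search_ranked(INDEX, ["the"])` ranks document 1 ahead of document 2 because its posting list records two occurrences.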

  • react-virtualized

    React components for efficiently rendering large lists and tabular data

  • tatoeba2

    Official repository for main codebase for Tatoeba, a multilingual sentence/translation database.

    For the character mappings, it might be useful to have a look at the config for (or rather, the PHP script that generates the config):

    There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.)

    For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same:

    It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.
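As a rough illustration of the mappings described above, the following Python sketch expands a range rule written in the 'U+1F08..U+1F0F->U+1F00..U+1F07' notation into a per-character table and applies it with `str.translate`. The Latin folding (lowercase, then j to i and v to u) is an assumption about what such an exception could look like, not the rule Tatoeba actually uses, and all function names here are hypothetical.

```python
import re

# One rule of the form 'U+XXXX..U+YYYY->U+AAAA..U+BBBB'.
RULE = re.compile(
    r"U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)->"
    r"U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)"
)


def parse_range_rule(rule):
    """Expand a range rule into a {source_codepoint: target_codepoint} dict."""
    match = RULE.fullmatch(rule.strip())
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    src_lo, src_hi, dst_lo, dst_hi = (int(g, 16) for g in match.groups())
    if src_hi - src_lo != dst_hi - dst_lo:
        raise ValueError("source and target ranges differ in length")
    return {src_lo + i: dst_lo + i for i in range(src_hi - src_lo + 1)}


# The Greek rule quoted in the comment: capital alphas with psili
# (and their seven accented variants) map to the small forms.
GREEK = parse_range_rule("U+1F08..U+1F0F->U+1F00..U+1F07")

# Hypothetical Latin folding (an assumption, not Tatoeba's config):
# after lowercasing, treat j as i and v as u.
LATIN = str.maketrans({"j": "i", "v": "u"})


def fold_latin(text):
    """Fold a Latin string so classical and modern spellings compare equal."""
    return text.lower().translate(LATIN)
```

Under these assumptions, `"\u1f08".translate(GREEK)` yields U+1F00, and `fold_latin` gives the same result for "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar".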

  • recoll

    recoll with webui in a docker container

    This is really cool. Something like this should exist.

    It seems like you could do it more easily, and have faster search responses, with the following steps:

    1. Mirror the current gutenberg archive (e.g. rsync -av --del gutenberg

    2. Install recoll-webui from or using docker-recoll-webui:

  • gutensearch

    Search engine for Project Gutenberg books

    Thanks! I had the exact same problem and eventually it got me to do something about it. It is particularly bad with writers from antiquity or writers with a lot of popular appeal.

    I've begun adding to this repository; it'll come in piece by piece as I clean up the code:
