Show HN: Full text search Project Gutenberg (60m paragraphs)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • rum

    RUM access method - inverted index with additional information in posting lists

    I suggest having a look at https://github.com/postgrespro/rum if you haven't yet. It addresses the slow ranking in PostgreSQL FTS.
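
    As a rough illustration of that suggestion, the sketch below builds a RUM index over a tsvector column and orders results with RUM's `<=>` distance operator. The table and column names (`paragraphs`, `id`, `body`, `tsv`) and the DB-API cursor using psycopg-style named parameters are assumptions for the example, not details from the original post.

```python
# Hypothetical schema: paragraphs(id, body, tsv tsvector).
# The index statements follow the usage shown in the RUM README.
SETUP_SQL = [
    "CREATE EXTENSION IF NOT EXISTS rum",
    "CREATE INDEX IF NOT EXISTS paragraphs_tsv_rum "
    "ON paragraphs USING rum (tsv rum_tsvector_ops)",
]

# Filter with @@ and order by the <=> distance operator, which a RUM
# index can serve directly because posting lists carry the extra
# information needed for ranking.
SEARCH_SQL = """
SELECT id, body
FROM paragraphs
WHERE tsv @@ plainto_tsquery('english', %(q)s)
ORDER BY tsv <=> plainto_tsquery('english', %(q)s)
LIMIT %(limit)s
"""


def create_rum_index(cursor):
    """Run the one-time setup statements on a DB-API cursor."""
    for statement in SETUP_SQL:
        cursor.execute(statement)


def search(cursor, query, limit=10):
    """Return up to `limit` (id, body) rows, closest matches first."""
    cursor.execute(SEARCH_SQL, {"q": query, "limit": limit})
    return cursor.fetchall()
```

    With a real connection this would be called as `search(conn.cursor(), "white whale")`; whether it beats a GIN index plus `ts_rank` for a given dataset would need to be measured.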

  • react-virtualized

    React components for efficiently rendering large lists and tabular data

  • tatoeba2

    Official repository for main codebase for Tatoeba, a multilingual sentence/translation database.

    For the character mappings, it might be useful to have a look at the config for https://tatoeba.org (or rather, the PHP script that generates the config): https://github.com/Tatoeba/tatoeba2/blob/dev/src/Shell/Sphin...

    There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
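
    To make the range notation concrete, here is a minimal sketch that expands a single rule such as 'U+1F08..U+1F0F->U+1F00..U+1F07' into a translation table. The helper name `expand_range_rule` is hypothetical, and the parser handles only this one rule shape, not the full syntax of the linked config.

```python
import re

# Matches exactly one rule of the form 'U+A..U+B->U+C..U+D' (hex code points).
RULE = re.compile(
    r"U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)"
    r"->U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)"
)


def expand_range_rule(rule):
    """Expand a range rule into a {source: target} dict for str.translate."""
    match = RULE.fullmatch(rule.strip())
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    src_start, src_end, dst_start, dst_end = (int(g, 16) for g in match.groups())
    if src_end - src_start != dst_end - dst_start:
        raise ValueError("source and target ranges differ in length")
    # Pair the i-th code point of the source range with the i-th of the target.
    return {src_start + i: dst_start + i for i in range(src_end - src_start + 1)}


table = expand_range_rule("U+1F08..U+1F0F->U+1F00..U+1F07")
print("\u1f08".translate(table))  # U+1F08 becomes U+1F00, the example given above
```

    Characters outside the source range pass through unchanged, so several such tables can be merged into one dict before translating.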

    For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
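
    One way to get that behaviour, shown here only as an assumed simplification and not as the rules in the linked config, is to lowercase the text and then fold j to i and v to u before indexing and querying:

```python
# Assumed folding rules for illustration: treat J/I and V/U as the same letter.
LATIN_FOLD = str.maketrans({"j": "i", "v": "u"})


def fold_latin(text):
    """Lowercase the text, then map j to i and v to u."""
    return text.lower().translate(LATIN_FOLD)


# Both spellings reduce to the same key.
print(fold_latin("GAIVS IVLIVS CAESAR") == fold_latin("Gaius Julius Caesar"))  # True
```

    The fold has to be applied to both the indexed text and the search query, otherwise the two sides no longer compare equal.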

    It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.

  • recoll

    recoll with webui in a docker container

    This is really cool. Something like this should exist.

    It seems like you could do it more easily, and have faster search responses, with the following steps:

    1. Mirror the current Gutenberg archive (e.g. rsync -av --del aleph.gutenberg.org::gutenberg gutenberg)

    2. Install recoll-webui from https://www.lesbonscomptes.com/recoll/pages/recoll-webui-ins... or using docker-recoll-webui: https://github.com/sunde41/recoll

  • gutensearch

    Search engine for Project Gutenberg books

    Thanks! I had the exact same problem and eventually it got me to do something about it. It is particularly bad for writers from antiquity or those with a lot of popular appeal.

    I've begun adding to this repository; it will come in piece by piece as I clean up the code: https://github.com/cordb/gutensearch

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
