Show HN: Full text search Project Gutenberg (60m paragraphs)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • SurveyJS - Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • rum

    RUM access method - inverted index with additional information in posting lists (by postgrespro)

  • I suggest to have a look at https://github.com/postgrespro/rum if you haven’t yet. It solves the issue of slow ranking in PostgreSQL FTS.

  • react-virtualized

    React components for efficiently rendering large lists and tabular data

  • [2] https://github.com/bvaughn/react-virtualized

  • SurveyJS

    Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App. With SurveyJS form UI libraries, you can build and style forms in a fully-integrated drag & drop form builder, render them in your JS app, and store form submission data in any backend, inc. PHP, ASP.NET Core, and Node.js.

    SurveyJS logo
  • tatoeba2

    Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.

  • For the character mappings, it might be useful to have a look at the config for https://tatoeba.org (or rather, the PHP script that generates the config): https://github.com/Tatoeba/tatoeba2/blob/dev/src/Shell/Sphin...

    There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...

    For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...

    It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.

  • recoll

    recoll with webui in a docker container

  • This is really cool. Something like this should exist.

    It seems like you could do it more easily, and have faster search responses, with the following steps:

    1. Mirror the current gutenberg archive (e.g. rsync -av --del aleph.gutenberg.org::gutenberg gutenberg

    2. Install recoll-webui from https://www.lesbonscomptes.com/recoll/pages/recoll-webui-ins... or using docker-recoll-webui: https://github.com/sunde41/recoll

  • gutensearch

    Search engine for Project Gutenberg books

  • Thanks! I had the exact same problem and eventually it got me to do something about it. It is particularly bad with writers from antiquity or with a lot of popular appeal.

    I've begun adding to this repository, it'll come in piece by piece as I clean up the code: https://github.com/cordb/gutensearch

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts