| | tika-docker | tantivy |
|---|---|---|
| Mentions | 20 | 18 |
| Stars | 103 | 5,829 |
| Growth | 4.9% | - |
| Activity | 4.1 | 9.3 |
| Last commit | about 1 month ago | over 2 years ago |
| Language | Shell | Rust |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
tika-docker
- Text Extraction from Documents
- Apache Tika – Extract text and metadata from doc types (the backbone of RAG)
- Demystifying Text Data with the Unstructured Python Library
  If you accept running Java, Apache Tika is extremely good at parsing content (https://tika.apache.org/)
- Help with a Search Engine
- How do you manage and find large amount of files?
  Apache Tika can spit out text from lots of formats. I've used it with grep (or rg) to do small-scale searching of local folders. Tika does a really good job at OCR for finding if text is in a file.
- 40 Containers & Counting...
  https://tika.apache.org Metadata from things.
- Hosted app to manage server inventory
- Best FOSS (ideally Docker) that can split PDF files?
- OK, ElasticSearch works, text files are indexed. How about images? Can images be indexed in NextCloud and full-text searched?
- Document Parsing - an unsolved problem?
  At my previous job we had the same problem, which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding.
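The Tika-plus-grep workflow mentioned in the comments above can be sketched roughly as follows. This is a minimal sketch, not the commenter's actual setup: it assumes a Tika server (for example one started from the tika-docker image) is listening on `http://localhost:9998` and that its `PUT /tika` endpoint returns plain text when asked with `Accept: text/plain`. The helper names `extract_text`, `grep_texts`, and `search_folder` are hypothetical and not part of Tika.

```python
import re
import urllib.request
from pathlib import Path

# Assumption: a Tika server is reachable at this address (9998 is its default port).
TIKA_URL = "http://localhost:9998/tika"


def extract_text(path, tika_url=TIKA_URL):
    """Send one file to the Tika server and return the extracted plain text."""
    request = urllib.request.Request(
        tika_url,
        data=Path(path).read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def grep_texts(texts, pattern):
    """Yield (name, line_number, line) for every line matching the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    for name, text in texts.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield name, number, line.strip()


def search_folder(folder, pattern, extract=extract_text):
    """Extract text from every file under `folder`, then grep the results."""
    texts = {str(p): extract(p) for p in Path(folder).rglob("*") if p.is_file()}
    return list(grep_texts(texts, pattern))
```

Calling `search_folder("~/Documents", "invoice")` would need an expanded path and a running server; the `extract` parameter exists so the search step can be tried with any other text extractor.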
tantivy
- Hey y'all back again w/ the personal, self-hosted search engine
  Backend uses tantivy to index the web pages, sqlite3 to hold metadata / crawl queue
- Ask HN: What are some good rust code to read to learn the language?
- Looking for recommendations of well maintained open source rust codebases that I can look through/contribute to
  Tantivy is a very well-made library and also follows a lot of the best practices. If you like search, you'll like this: https://github.com/quickwit-inc/tantivy
- self hosted elasticsearch alternative
  tantivy - More of a search engine library than an out-of-the-box solution
- What's your favourite open source Rust project that needs more recognition?
  Tantivy search engine.
- Is there a library for instant arbitrary text searching?
  You could try the Tantivy crate, with an n-gram tokenizer, which would split and index your text in sliding groups of n characters.
- Zest: a CLI tool for zettelkasten-like note management
  I had to look up the "tantivy" that the README mentions. https://github.com/tantivy-search/tantivy. Might want to add a link to the project in your README.
- Are you using Rust at work? If yes, for what?
  We're using Rust for a domain-specific search engine. When I first learned Rust some years ago my first thought was that this language is perfect for heavy text processing. IMO, &str is that single killer feature that got me sold :) The search engine that we're building is based on https://github.com/tantivy-search/tantivy.
- Tantivy, a full-text search engine library in Rust inspired by Apache Lucene
- Tantivy v0.15 released! Now backed by Quickwit Inc.!
  Well spotted. Like IPFS, there's a comment about that here: https://github.com/tantivy-search/tantivy/pull/1067#issuecomment-853139923 that points to the distributed wikipedia mirror project https://github.com/ipfs/distributed-wikipedia-mirror/issues/76
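One of the comments above suggests an n-gram tokenizer that splits text into sliding groups of n characters. The idea can be illustrated with a small sketch in plain Python. It does not use tantivy or its tokenizer API; the function names `char_ngrams`, `build_index`, and `search` are hypothetical, and the in-memory dictionary merely stands in for a real inverted index.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return every sliding window of n characters in text, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, docs, query, n=3):
    """Return sorted ids of documents that contain query as a substring."""
    grams = char_ngrams(query, n)
    if not grams:  # query shorter than n: fall back to scanning every document
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # N-gram hits only narrow the candidates; confirm with a real substring test.
    return sorted(d for d in candidates if query.lower() in docs[d].lower())
```

Because every n-character window is indexed, a query can match in the middle of a word, which is what makes this approach suitable for arbitrary substring search. The trade-off is a larger index than word-based tokenization would produce.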
What are some alternatives?
Paperless-ng - A supercharged version of paperless: scan, index and archive all your physical documents
sonic - 🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
sist2 - Lightning-fast file system indexer and search tool
tantivy-wasm
spyglass - A personal search engine: Create a searchable library from your personal documents, interests, and more!
pueue - Manage your shell commands.
yew - Rust / Wasm framework for creating reliable and efficient web applications
neon - Rust bindings for writing safe and fast native Node.js modules.
spacedrive - Spacedrive is an open source cross-platform file explorer, powered by a virtual distributed filesystem written in Rust.
neuron - Future-proof note-taking and publishing based on Zettelkasten (superseded by Emanote: https://github.com/srid/emanote)
self-hosted_docker_setups - A collection of my docker-compose files used to set up self-hosted services on Raspberry Pi 4 running 64-bit Raspberry Pi OS
zk - A plain text note-taking assistant