go-sstables VS 500lines

Compare go-sstables vs 500lines and see what their differences are.

go-sstables

Go library for protobuf-compatible sstables, a skiplist, a recordio format, and other database building blocks such as a write-ahead log. Now ships with an embedded key-value store. (by thomasjungblut)
                 go-sstables           500lines
Mentions         4                     2
Stars            255                   28
Growth           -                     -
Activity         4.0                   10.0
Last commit      2 months ago          almost 10 years ago
Language         Go                    Python
License          Apache License 2.0    GNU General Public License v3.0 or later
Mentions - the total number of mentions of the project we've tracked, plus the number of user-suggested alternatives.
Stars - the number of stars the project has on GitHub.
Growth - month-over-month growth in stars.
Activity - a relative measure of how actively a project is being developed; recent commits carry more weight than older ones. For example, an activity of 9.0 places a project among the top 10% of the most actively developed projects we track.

go-sstables

Posts with mentions or reviews of go-sstables. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-08-25.
  • GitHub - thomasjungblut/go-sstables: Go library for protobuf compatible sstables, a skiplist, a recordio format and other database building blocks like a write-ahead log. Ships now with an embedded key-value store.
    1 project | /r/databasedevelopment | 25 Aug 2022
  • I want to dive into how to make search engines
    16 projects | news.ycombinator.com | 25 Aug 2022
    I've never worked on a project that encompasses as many computer science algorithms as a search engine. There are a lot of topics you can look up in "Information Storage and Retrieval":

    - Tries (patricia, radix, etc...)

    - Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)

    - Consensus (raft, paxos, etc..)

    - Block storage (disk block size optimizations, mmap files, delta storage, etc..)

    - Probabilistic filters (hyperloglog, bloom filters, etc.; see the Bloom filter sketch after this list)

    - Binary Search (sstables, sorted inverted indexes, roaring bitmaps)

    - Ranking (pagerank, tf/idf, bm25, etc...)

    - NLP (stemming, POS tagging, subject identification, sentiment analysis etc...)

    - HTML (document parsing/lexing)

    - Images (exif extraction, removal, resizing / proxying, etc...)

    - Queues (SQS, NATS, Apollo, etc...)

    - Clustering (k-means, density, hierarchical, gaussian distributions, etc...)

    - Rate limiting (leaky bucket, windowed, etc...)

    - Compression

    - Applied linear algebra

    - Text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)

    - etc...

    I'm sure there is plenty more I've missed. There are lots of generic structures involved, like hashes, linked lists, skip lists, heaps, and priority queues, and this is just to get to 2000s-level basic tech.
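
    Since several of these are easy to miss if you've never implemented one, here is a minimal Bloom filter sketch in Python for the "probabilistic filters" entry; the bit count and number of hash functions are illustrative, not tuned:

        import hashlib

        class BloomFilter:
            def __init__(self, m=1024, k=3):
                self.m, self.k = m, k              # m bits, k hash functions
                self.bits = bytearray(m // 8)

            def _positions(self, item):
                for i in range(self.k):            # derive k bit indexes from salted hashes
                    h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                    yield int.from_bytes(h[:8], "big") % self.m

            def add(self, item):
                for p in self._positions(item):
                    self.bits[p // 8] |= 1 << (p % 8)

            def might_contain(self, item):         # false positives possible, never false negatives
                return all(self.bits[p // 8] & (1 << (p % 8))
                           for p in self._positions(item))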

    - https://github.com/quickwit-oss/tantivy

    - https://github.com/valeriansaliou/sonic

    - https://github.com/mosuka/phalanx

    - https://github.com/meilisearch/MeiliSearch

    - https://github.com/blevesearch/bleve

    - https://github.com/thomasjungblut/go-sstables

    A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage, so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.

  • What's the big deal about key-value databases like FoundationDB ands RocksDB?
    6 projects | news.ycombinator.com | 23 Aug 2022
    I highly recommend people comfortable with Go check out the building blocks at https://github.com/thomasjungblut/go-sstables

    This codebase shows how SSTables, WALs, memtables, skiplists, segment files, and plenty of other storage engine components work, in a digestible way. It includes a demo database showing how it all comes together.
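
    As a rough sketch of how those components fit together (this is not go-sstables' actual API, just the generic LSM write path), a put first appends to the write-ahead log for durability, then updates the in-memory memtable, and flushes an immutable sorted run when the memtable fills:

        class TinyLSM:
            def __init__(self, wal_path, memtable_limit=4):
                self.wal = open(wal_path, "a")
                self.memtable, self.limit, self.runs = {}, memtable_limit, []

            def put(self, key, value):
                self.wal.write(f"{key}\t{value}\n")   # 1. log first, for crash recovery
                self.wal.flush()
                self.memtable[key] = value            # 2. then update the memtable
                if len(self.memtable) >= self.limit:  # 3. flush an immutable sorted run
                    self.runs.append(sorted(self.memtable.items()))
                    self.memtable = {}

            def get(self, key):
                if key in self.memtable:
                    return self.memtable[key]
                for run in reversed(self.runs):       # newest run wins
                    for k, v in run:                  # real sstables binary-search an index
                        if k == key:
                            return v
                return None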

  • Understanding LSM Trees: What Powers Write-Heavy Databases
    3 projects | news.ycombinator.com | 9 Feb 2022

500lines

Posts with mentions or reviews of 500lines. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-07-21.
  • Ask HN: What are some 'cool' but obscure data structures you know about?
    54 projects | news.ycombinator.com | 21 Jul 2022
    I'm thinking that the bar for "obscure" here is maybe pretty low if bloom filters pass?

    Here's a few fun ones.

    — ⁂ —

    The Burrows-Wheeler transform takes the last character of each sorted (cyclic) rotation of a text, and there turns out to be a simple algorithm to recover the original text from the transform; live demo: http://canonical.org/~kragen/sw/bwt. The transformed text is very easily compressible, which is the basis of bzip2. As mentioned by Blahah and dahart, this is only one of many interesting things you can do with suffix arrays; they can also do efficient regular expression search, for example. This became more interesting once three practical linear-time and -space suffix array construction algorithms were discovered in the early 02000s.
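
    The naive quadratic construction and its inverse fit in a few lines of Python (the practical linear-time constructions mentioned above are far more involved):

        def bwt(s):
            s += "\0"                                  # unique, lexicographically smallest sentinel
            rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
            return "".join(r[-1] for r in rotations)   # last column of the sorted rotations

        def inverse_bwt(last):
            rows = [""] * len(last)
            for _ in range(len(last)):                 # repeatedly prepend-and-sort
                rows = sorted(c + row for c, row in zip(last, rows))
            return next(row for row in rows if row.endswith("\0")).rstrip("\0")

        assert inverse_bwt(bwt("banana")) == "banana"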

    — ⁂ —

    Pentti Kanerva devised a "fully distributed representation" in the 01990s, which has some features in common with bloom filters — every association stored in an FDR record is stored to an equal extent (probabilistically) in every bit of the record, so erasing some of the bits erases none of the associations but increases the error probability for all of them. You assign a random or pseudorandom bitvector (of, say, 16384 bits) to every atomic value to be stored, represent an association of two (or more) atoms as their XOR, and then take the bitwise majority rule of all the associations to be stored as the record. To retrieve an association, you XOR the record with an atom, then compute the correlation of the XOR result with all possible atoms; the correlation with the correct atoms will be many standard deviations away from the mean, unless the number of associations stored in the record is high enough that conventional data structures wouldn't have been able to store it either.
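
    A sketch of that store/retrieve cycle in NumPy, using brute-force correlation against every atom (exactly the cost discussed below; the atom count and dimensionality here are arbitrary choices):

        import numpy as np

        rng = np.random.default_rng(0)
        D, A = 16384, 100                          # bits per vector, number of atoms
        atoms = rng.integers(0, 2, size=(A, D), dtype=np.uint8)

        pairs = [(0, 1), (2, 3), (4, 5)]           # associations to store (odd count avoids ties)
        assocs = np.array([atoms[a] ^ atoms[b] for a, b in pairs])
        record = (2 * assocs.sum(axis=0) > len(assocs)).astype(np.uint8)  # bitwise majority rule

        probe = record ^ atoms[0]                  # ask: what is atom 0 associated with?
        dist = (probe ^ atoms).sum(axis=1).astype(int)
        scores = D - 2 * dist                      # correlation of the probe with every atom
        print(scores.argmax())                     # 1, many standard deviations above the rest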

    This and some related work has blossomed into a field called "hyperdimensional computing" and "vector symbolic architectures". But the problem of finding the nearest atom or atoms seems, on its face, to make this impractical: if you have 16384 possible atoms and 16384 bits in your bitvector, you would seem to need at least 16777216 bit operations to compute all the associations.

    I realized the other night that if you use the rows of a Walsh–Hadamard matrix as the "pseudorandom" atom representations, you can compute all the correlations in linearithmic time with the fast Walsh–Hadamard transform. Moreover, Cohn and Lempel's result from 01977 "on fast M-sequence transforms" is that after a simple transposition you can also use the FWHT to compute all the correlations with the complete output of a maximal LFSR (a so-called "M-sequence") in linearithmic time, so you can also use the cyclic shifts of an LFSR output for your atoms without losing efficiency. Probably someone else has already figured this out previously, but I hadn't found it in the literature.

    — ⁂ —

    I'm not sure if the quantities used in reduced affine arithmetic count as a "data structure"? They're a data type, at least, but they're not a container. The idea is that you execute some arbitrary numerical algorithm, not merely on floating-point numbers, or even on dual numbers as in forward-mode automatic differentiation, nor even yet on (min, max) intervals as in ordinary interval arithmetic, but on a vector of coefficients [a₀, a₁, a₂, ..., aₙ], which represents a linear function a₀x₀ + a₁x₁ + a₂x₂ + ... + aₙxₙ, where x₀ = 1 and each of the other xᵢ ∈ [-1, 1], so this function is really a representation of an interval. Then you can assign each of the xᵢ (except x₀ and xₙ) to be the unknown error in one of the input parameters to your algorithm, whose magnitude is determined by the aᵢ value in the value of that parameter; and when you introduce rounding error, you increase aₙ enough to account for it. (The non-reduced kind of affine arithmetic just increases n without limit to account for new roundoff errors.)

    — ⁂ —

    Reduced affine arithmetic has two advantages over the usual kind of interval arithmetic. First, it cancels errors correctly; if you calculate (y+1)(y+2)/(y+3) with interval arithmetic, with some uncertainty in the value of y, you usually get an unnecessarily loose bound on the result, because interval arithmetic assumes that the uncertainties in the numerator and the denominator are uncorrelated, when in fact for many values of y they partly cancel out. (You can't apply this cancellation to aₙ, the error stemming from all the roundoffs in your algorithm, just to the other aᵢ.) But the more interesting result is that RAA gives you a linear function that approximates the calculation result as a linear function of the inputs, which is pretty similar to calculating the gradient, but with error bounds on the approximation!
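
    A toy illustration of the cancellation point, with just enough of an affine type to compute (y+1) - y (scalar add and multiply only; full multiplication and division introduce fresh noise terms):

        class Affine:
            """x0 + sum(terms[i] * eps_i), with each eps_i in [-1, 1]."""
            def __init__(self, x0, terms=None):
                self.x0, self.terms = x0, dict(terms or {})

            @classmethod
            def from_interval(cls, lo, hi, name):
                return cls((lo + hi) / 2, {name: (hi - lo) / 2})

            def __add__(self, o):
                if not isinstance(o, Affine):
                    return Affine(self.x0 + o, self.terms)
                t = dict(self.terms)
                for k, v in o.terms.items():
                    t[k] = t.get(k, 0.0) + v
                return Affine(self.x0 + o.x0, t)

            def __sub__(self, o):
                return self + (o * -1 if isinstance(o, Affine) else -o)

            def __mul__(self, c):                  # scalar multiply only, for brevity
                return Affine(self.x0 * c, {k: v * c for k, v in self.terms.items()})

            def bounds(self):
                r = sum(abs(v) for v in self.terms.values())
                return self.x0 - r, self.x0 + r

        y = Affine.from_interval(2.0, 4.0, "y")
        print(((y + 1) - y).bounds())              # (1.0, 1.0): the shared eps_y cancels exactly
        # plain interval arithmetic gives [3, 5] - [2, 4] = [-1, 3]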

    — ⁂ —

    LSM trees are only about as "obscure" as bloom filters, since they underlie several prominent filesystems, LevelDB, and most of the world's search engines, but if you don't know about bloom filters, you might not know about LSM trees either. I wrote a tutorial explanation of LSM trees in https://github.com/kragen/500lines/tree/master/search-engine, though the name "LSM tree" hadn't yet gone mainstream.

    Binary array sets, as described in https://www.nayuki.io/page/binary-array-set, are closely related to LSM trees, to the point that you could describe them as an application of LSM trees. They are a simple data structure for the exact version of the problem bloom filters solve approximately: maintaining a set S of items supporting the operations S.add(item) and S.contains?(item). Insertion (.add) is amortized O(log N), and membership testing (.contains?) is O(log² N).
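
    A sketch of the structure: slot k holds either nothing or a sorted array of 2^k items, and insertion carries upward exactly like incrementing a binary counter:

        from bisect import bisect_left

        class BinaryArraySet:
            def __init__(self):
                self.slots = []                   # slots[k] is None or a sorted list of 2**k items

            def add(self, item):
                carry = [item]
                for k, slot in enumerate(self.slots):
                    if slot is None:
                        self.slots[k] = carry
                        return
                    carry = sorted(carry + slot)  # merge, like a binary-counter carry
                    self.slots[k] = None
                self.slots.append(carry)

            def contains(self, item):             # binary search each level: O(log^2 N)
                for slot in self.slots:
                    if slot:
                        i = bisect_left(slot, item)
                        if i < len(slot) and slot[i] == item:
                            return True
                return False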

    — ⁂ —

    A different and more efficient data structure for the exact set membership problem, limited to the case where the items to be stored are up to M integers in the range [0, N), supports constant-time creation, clearing, insertion, size measurement, and membership testing, and linear-time enumeration, but I think cannot be implemented in modern standard C because of its rules about reading uninitialized data. (By allowing creation to be linear time you can implement it in standard C.) It uses two arrays, A[M] and B[N], and a variable n, and i is a member of the set iff B[i] < n and A[B[i]] == i, from which it is straightforward to work out the other methods.

    I don't remember what this data structure is called, but I definitely didn't invent it. I described it and some uses for it in https://dercuano.github.io/notes/constant-time-sets-for-pixe.... You can associate additional values with this structure to use it as a sparse array.
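
    This structure is usually called a sparse set and is commonly credited to Briggs and Torczon. A Python rendering, where "uninitialized" just means B's initial contents never need to be meaningful:

        class SparseSet:
            def __init__(self, M, N):             # up to M integers in the range [0, N)
                self.A = [0] * M                  # dense array of the members
                self.B = [0] * N                  # sparse map: value -> index into A
                self.n = 0

            def add(self, i):
                if not self.contains(i):
                    self.A[self.n], self.B[i] = i, self.n
                    self.n += 1

            def contains(self, i):
                return self.B[i] < self.n and self.A[self.B[i]] == i

            def clear(self):
                self.n = 0                        # O(1): stale A/B entries become harmless

            def __iter__(self):
                return iter(self.A[:self.n])      # linear-time enumeration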

  • Understanding LSM Trees: What Powers Write-Heavy Databases
    3 projects | news.ycombinator.com | 9 Feb 2022
    Groom seems to have forgotten to mention the "merge" part of "log-structured merge" trees, and consequently the "log-structured" part too. He does talk about a "compaction process," but he sort of just doesn't mention the process of selecting when and what to compact, or why the compaction process can use purely sequential I/O, which are the crucial aspects of LSM-tree performance.
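
    To make the "merge" concrete: a compaction pass streams already-sorted runs through a k-way merge, with the newest run winning on duplicate keys, so it needs only sequential reads and one sequential write. A toy version (the run format is hypothetical):

        import heapq

        def compact(*runs):
            """Merge sorted (key, value) runs; pass the newest run first."""
            def tag(run, age):                    # tag entries so ties sort newest-first
                for k, v in run:
                    yield k, age, v
            merged = heapq.merge(*(tag(run, age) for age, run in enumerate(runs)))
            out, last_key = [], object()
            for k, _, v in merged:
                if k != last_key:                 # keep only the newest value per key
                    out.append((k, v))
                    last_key = k
            return out

        older = [("a", 1), ("c", 3)]
        newer = [("a", 9), ("b", 2)]
        print(compact(newer, older))              # [('a', 9), ('b', 2), ('c', 3)]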

    My unfinished attempt from 02014 to explain LSM-trees (and, in particular, full-text search engines using them) with an implementation in 250 lines of Python 2 is at https://github.com/kragen/500lines/tree/master/search-engine. I think it's a more complete explanation (except for tombstones), but it's longer (half an hour of reading) and there aren't as many illustrations.

    On the plus side, you can actually run it:

        $ make

What are some alternatives?

When comparing go-sstables and 500lines you can also consider the following projects:

search-engines - Reviewing alternative search engines

peg-bootstrap - A PEG that compiles itself.

phalanx - Phalanx is a cloud-native distributed search engine that provides endpoints through gRPC and traditional RESTful API.

multiversion-concurrency-contro

pytai - Kaitai Struct: Visualizer and Hex Viewer GUI in Python

hse - HSE: Heterogeneous-memory storage engine

mitta-screenshot - Mitta's Chrome extension for saving the current view of a website.

search-lib - A library of classes which can be used to build a search engine.

jina - ☁️ Build multimodal AI applications with cloud-native stack

MeiliSearch - A lightning-fast search API that fits effortlessly into your apps, websites, and workflow

sonic - 🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

now - 🧞 No-code tool for creating a neural search solution in minutes