500lines

500 Lines or Less (by kragen)


500lines reviews and mentions

Posts with mentions or reviews of 500lines. The most recent was on 2022-07-21.
  • Ask HN: What are some 'cool' but obscure data structures you know about?
    54 projects | news.ycombinator.com | 21 Jul 2022
    I'm thinking that the bar for "obscure" here is maybe pretty low if bloom filters pass?

    Here are a few fun ones.

    — ⁂ —

    The Burrows-Wheeler transform is the character preceding each suffix in a (cyclic) suffix array, equivalently the last column of the sorted rotations, and there turns out to be a simple algorithm to recover the original text from this transform; live demo: http://canonical.org/~kragen/sw/bwt. The transformed text is very easily compressible, which is the basis of bzip2. As mentioned by Blahah and dahart, this is only one of many interesting things you can do with suffix arrays; they can also do efficient regular expression search, for example. This became more interesting once three practical linear-time and linear-space suffix array construction algorithms were discovered in the early 02000s.
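
    For concreteness, here is a tiny Python sketch of both directions (my own illustration, not the code behind the demo above): the forward transform takes the last column of the sorted rotations, and the naive inverse rebuilds the rotation table by repeatedly prepending the transform and re-sorting. Real implementations build a suffix array and invert with the LF-mapping instead; the sentinel character is just a convenience that makes the transform trivially invertible.

        def bwt(s, eos="\0"):
            """Last column of the sorted cyclic rotations of s plus a sentinel."""
            s += eos
            rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
            return "".join(rotation[-1] for rotation in rotations)

        def inverse_bwt(r, eos="\0"):
            """Recover the text by repeatedly prepending r and sorting the rows."""
            table = [""] * len(r)
            for _ in range(len(r)):
                table = sorted(r[i] + table[i] for i in range(len(r)))
            original = next(row for row in table if row.endswith(eos))
            return original[:-1]          # drop the sentinel

        assert inverse_bwt(bwt("banana")) == "banana"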

    — ⁂ —

    Pentti Kanerva devised a "fully distributed representation" in the 01990s, which has some features in common with bloom filters — every association stored in an FDR record is stored to an equal extent (probabilistically) in every bit of the record, so erasing some of the bits erases none of the associations but increases the error probability for all of them. You assign a random or pseudorandom bitvector (of, say, 16384 bits) to every atomic value to be stored, represent an association of two (or more) atoms as their XOR, and then take the bitwise majority rule of all the associations to be stored as the record. To retrieve an association, you XOR the record with an atom, then compute the correlation of the XOR result with all possible atoms; the correlation with the correct atoms will be many standard deviations away from the mean, unless the number of associations stored in the record is high enough that conventional data structures wouldn't have been able to store it either.

    This and some related work has blossomed into a field called "hyperdimensional computing" and "vector symbolic architectures". But the problem of finding the nearest atom or atoms seems, on its face, to make this impractical: if you have 16384 possible atoms and 16384 bits in your bitvector, you would seem to need at least 16384 × 16384 = 268435456 bit operations to compute all the correlations.
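
    Here is a toy Python sketch of the scheme described above (my own illustration; the vector width and atom names are arbitrary choices). The retrieval loop at the end is exactly the brute-force all-atoms correlation that looks impractical:

        import random

        N = 16384                              # bits per vector
        random.seed(0)

        def atom():
            """A random dense bitvector standing for one atomic value."""
            return [random.getrandbits(1) for _ in range(N)]

        def bind(a, b):
            """Associate two atoms by XOR."""
            return [x ^ y for x, y in zip(a, b)]

        def majority(vectors):
            """Superpose several associations into one record, bit by bit."""
            return [1 if sum(bits) * 2 > len(vectors) else 0
                    for bits in zip(*vectors)]

        def correlation(a, b):
            """Agreement minus disagreement; near zero for unrelated vectors."""
            return sum(1 if x == y else -1 for x, y in zip(a, b))

        atoms = {name: atom() for name in "abcdef"}
        record = majority([bind(atoms["a"], atoms["b"]),
                           bind(atoms["c"], atoms["d"]),
                           bind(atoms["e"], atoms["f"])])

        probe = bind(record, atoms["a"])       # "what was associated with a?"
        scores = {name: correlation(probe, v) for name, v in atoms.items()}
        assert max(scores, key=scores.get) == "b"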

    I realized the other night that if you use the rows of a Walsh–Hadamard matrix as the "pseudorandom" atom representations, you can compute all the correlations in linearithmic time with the fast Walsh–Hadamard transform. Moreover, Cohn and Lempel's result from 01977, "On fast M-sequence transforms", is that after a simple transposition you can also use the FWHT to compute all the correlations with the complete output of a maximal LFSR (a so-called "M-sequence") in linearithmic time, so you can also use the cyclic shifts of an LFSR output for your atoms without losing efficiency. Probably someone else has figured this out before, but I haven't found it in the literature.
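
    A sketch of that trick (again mine, using the natural Hadamard ordering and a toy size): the FWHT of the probe gives its correlation with every Hadamard-row atom at once, in n log n additions instead of n² bit operations. In the scheme above you would first map the bits of record XOR atom to ±1 and then transform.

        def fwht(v):
            """In-place fast Walsh-Hadamard transform; len(v) must be a power of 2.

            Afterwards v[j] is the dot product of the input with row j of the
            Hadamard matrix, i.e. its correlation with atom j."""
            h = 1
            while h < len(v):
                for i in range(0, len(v), h * 2):
                    for j in range(i, i + h):
                        v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
                h *= 2
            return v

        def hadamard_row(j, n):
            """Atom j: row j of the n-by-n Hadamard matrix as a +1/-1 vector."""
            return [(-1) ** bin(i & j).count("1") for i in range(n)]

        n = 1024
        probe = hadamard_row(3, n)             # pretend the probe is exactly atom 3
        correlations = fwht(probe[:])          # all n correlations at once
        assert correlations.index(max(correlations)) == 3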

    — ⁂ —

    I'm not sure if the quantities used in reduced affine arithmetic count as a "data structure"? They're a data type, at least, but they're not a container. The idea is that you execute some arbitrary numerical algorithm, not merely on floating-point numbers, or even on dual numbers as in forward-mode automatic differentiation, nor even yet on (min, max) intervals as in ordinary interval arithmetic, but on a vector of coefficients [a₀, a₁, a₂...aₙ], which represents the linear function a₀x₀ + a₁x₁ + a₂x₂ + ... + aₙxₙ, where x₀ = 1 and each of the other xᵢ ∈ [-1, 1], so this function is really a representation of an interval. Then you can assign each of the xᵢ (except x₀ and xₙ) to be the unknown error in one of the input parameters to your algorithm, whose magnitude is determined by the coefficient aᵢ in the affine value of that parameter; and when you introduce rounding error, you increase aₙ enough to account for it. (The non-reduced kind of affine arithmetic just increases n without limit to account for new roundoff errors.)

    — ⁂ —

    Reduced affine arithmetic has two advantages over the usual kind of interval arithmetic. First, it cancels errors correctly; if you calculate (y+1)(y+2)/(y+3) with interval arithmetic, with some uncertainty in the value of y, you usually get an unnecessarily loose bound on the result because interval arithmetic assumes that the uncertainties in the numerator and the denominator are uncorrelated, when in fact for many values of y they partly cancel out. (You can't apply this cancellation to aₙ, the error stemming from all the roundoffs in your algorithm, just to the other aᵢ.) But the more interesting result is that RAA gives you a linear function of the inputs that approximates the calculation result, which is pretty similar to calculating the gradient, but with error bounds on the approximation!
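
    Here is a minimal Python sketch of reduced affine arithmetic (my own, and deliberately incomplete: division, which the (y+1)(y+2)/(y+3) example needs, requires an affine approximation of 1/x and is omitted). It is just enough to show the representation above and the error cancellation that plain intervals can't do:

        from itertools import count

        _fresh = count(1)          # allocator of new noise-symbol indices

        class Affine:
            """a0 + sum of coeffs[i]*x_i with each x_i in [-1, 1], plus a lumped
            roundoff/nonlinearity coefficient err (the a_n of the text above)."""

            def __init__(self, a0, coeffs=None, err=0.0):
                self.a0, self.coeffs, self.err = a0, dict(coeffs or {}), err

            @classmethod
            def from_interval(cls, lo, hi):
                """A fresh input spanning [lo, hi], with its own noise symbol."""
                return cls((lo + hi) / 2, {next(_fresh): (hi - lo) / 2})

            def radius(self):
                return sum(abs(c) for c in self.coeffs.values()) + self.err

            def interval(self):
                return self.a0 - self.radius(), self.a0 + self.radius()

            def _coerce(self, x):
                return x if isinstance(x, Affine) else Affine(x)

            def __add__(self, other):
                other = self._coerce(other)
                coeffs = dict(self.coeffs)
                for i, c in other.coeffs.items():
                    coeffs[i] = coeffs.get(i, 0.0) + c
                return Affine(self.a0 + other.a0, coeffs, self.err + other.err)

            def __neg__(self):
                return Affine(-self.a0, {i: -c for i, c in self.coeffs.items()},
                              self.err)

            def __sub__(self, other):
                return self + (-self._coerce(other))

            def __mul__(self, other):
                other = self._coerce(other)
                coeffs = {i: self.a0 * other.coeffs.get(i, 0.0)
                             + other.a0 * self.coeffs.get(i, 0.0)
                          for i in set(self.coeffs) | set(other.coeffs)}
                # Conservative bound on the nonlinear remainder and old err terms.
                err = (abs(self.a0) * other.err + abs(other.a0) * self.err
                       + self.radius() * other.radius())
                return Affine(self.a0 * other.a0, coeffs, err)

        y = Affine.from_interval(1.9, 2.1)
        print(((y + 1) - (y - 1)).interval())   # (2.0, 2.0): the uncertainty cancels
        # Naive interval arithmetic gives [2.9, 3.1] - [0.9, 1.1] = [1.8, 2.2] instead.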

    — ⁂ —

    LSM trees are only about as "obscure" as bloom filters, since they underlie several prominent filesystems, LevelDB, and most of the world's search engines, but if you don't know about bloom filters, you might not know about LSM trees either. I wrote a tutorial explanation of LSM trees in https://github.com/kragen/500lines/tree/master/search-engine, though the name "LSM tree" hadn't yet gone mainstream.

    Binary array sets, as described in https://www.nayuki.io/page/binary-array-set, are closely related to LSM trees, to the point that you could describe them as an application of LSM trees. They are a simple data structure for the exact version of the problem bloom filters solve approximately: maintaining a set S of items supporting the operations S.add(item) and S.contains?(item). Insertion (.add) is amortized constant time, and membership testing (.contains?) is O(log² N).
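
    Here's a sketch in Python, mostly following the description on that page (the details are mine; for example, a real implementation would do a linear merge where this re-sorts): the structure is a list of sorted arrays of sizes 1, 2, 4, ..., and adding an element propagates carries exactly like incrementing a binary counter.

        from bisect import bisect_left

        class BinaryArraySet:
            """Sorted arrays of sizes 1, 2, 4, ..., one per set bit of the count."""

            def __init__(self):
                self.levels = []    # levels[k] is [] or a sorted list of 2**k items

            def add(self, item):
                carry = [item]
                for k in range(len(self.levels)):
                    if not self.levels[k]:
                        self.levels[k] = carry
                        return
                    # Occupied: merge and carry upward, like 1 + 1 = 10 in binary.
                    carry = sorted(carry + self.levels[k])  # a real merge is linear
                    self.levels[k] = []
                self.levels.append(carry)

            def __contains__(self, item):   # one binary search per level: O(log^2 N)
                for level in self.levels:
                    i = bisect_left(level, item)
                    if i < len(level) and level[i] == item:
                        return True
                return False

        s = BinaryArraySet()
        for x in [5, 3, 9, 1, 7]:
            s.add(x)
        assert 7 in s and 4 not in s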

    — ⁂ —

    A different and more efficient data structure for the exact set-membership problem, limited to the case where the items to be stored are up to M integers in the range [0, N), supports constant-time creation, clearing, insertion, size measurement, and membership testing, and linear-time enumeration, but I think it cannot be implemented in modern standard C because of its rules about reading uninitialized data. (By allowing creation to be linear time you can implement it in standard C.) It uses two arrays, A[M] and B[N], and a variable n, and i is a member of the set iff B[i] < n and A[B[i]] == i, from which it is straightforward to work out the other methods.

    I don't remember what this data structure is called, but I definitely didn't invent it. I described it and some uses for it in https://dercuano.github.io/notes/constant-time-sets-for-pixe.... You can associate additional values with this structure to use it as a sparse array.
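
    Here is that structure in Python (the class and method names are mine; in C you would leave A and B uninitialized, which is exactly the part Python can't express, so creation here is linear rather than constant time):

        class SparseSet:
            """At most M distinct integers drawn from range(N)."""

            def __init__(self, N, M):
                self.A = [0] * M    # A[0:n] lists the members in insertion order
                self.B = [0] * N    # B[i] says where i claims to live in A
                self.n = 0

            def __contains__(self, i):
                return self.B[i] < self.n and self.A[self.B[i]] == i

            def add(self, i):
                if i not in self:
                    self.A[self.n] = i
                    self.B[i] = self.n
                    self.n += 1

            def __len__(self):
                return self.n

            def clear(self):        # constant time: just forget the member count
                self.n = 0

            def __iter__(self):     # linear in the number of members, not in N
                return iter(self.A[:self.n])

        s = SparseSet(N=1000, M=10)
        s.add(17); s.add(42)
        assert 17 in s and 99 not in s and len(s) == 2
        s.clear()
        assert 17 not in s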

  • Understanding LSM Trees: What Powers Write-Heavy Databases
    3 projects | news.ycombinator.com | 9 Feb 2022
    Groom seems to have forgotten to mention the "merge" part of "log-structured merge" trees, and consequently the "log-structured" part too. He does talk about a "compaction process," but he never quite explains how the tree decides when and what to compact, or why the compaction process can use purely sequential I/O, which are the crucial aspects of LSM-tree performance.
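
    To make those two pieces concrete, here is a toy sketch (mine, not taken from either the article or the repository linked below): a memtable that flushes to immutable sorted runs, plus the simplest possible compaction policy, merging equal-sized runs. On disk both the flush and the merge only read and write whole runs front to back, which is why the I/O is sequential.

        from bisect import bisect_left

        class TinyLSM:
            """Toy LSM tree: a mutable memtable plus immutable sorted runs."""

            def __init__(self, memtable_limit=4):
                self.memtable = {}
                self.runs = []                  # sorted lists of (key, value) pairs
                self.memtable_limit = memtable_limit

            def put(self, key, value):
                self.memtable[key] = value
                if len(self.memtable) >= self.memtable_limit:
                    self._flush()

            def get(self, key, default=None):
                if key in self.memtable:
                    return self.memtable[key]
                for run in reversed(self.runs): # newest run wins
                    i = bisect_left(run, (key,))
                    if i < len(run) and run[i][0] == key:
                        return run[i][1]
                return default

            def _flush(self):
                # Writing out a whole sorted run at once is sequential I/O.
                self.runs.append(sorted(self.memtable.items()))
                self.memtable = {}
                self._compact()

            def _compact(self):
                # Policy: whenever the two newest runs are the same size, merge them.
                # A real merge streams both runs in order (again sequential I/O);
                # here a dict stands in, with newer values shadowing older ones.
                while (len(self.runs) >= 2
                       and len(self.runs[-1]) == len(self.runs[-2])):
                    newer, older = dict(self.runs.pop()), dict(self.runs.pop())
                    older.update(newer)
                    self.runs.append(sorted(older.items()))

        db = TinyLSM()
        for i in range(20):
            db.put(i % 7, i)
        assert db.get(3) == 17 and db.get(100) is None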

    My unfinished attempt from 02014 to explain LSM-trees (and, in particular, full-text search engines using them) with an implementation in 250 lines of Python 2 is at https://github.com/kragen/500lines/tree/master/search-engine. I think it's a more complete explanation (except for tombstones), but it's longer (half an hour of reading) and there aren't as many illustrations.

    On the plus side, you can actually run it:

        $ make

Stats

Basic 500lines repo stats

kragen/500lines is an open source project licensed under GNU General Public License v3.0 or later, which is an OSI-approved license.

The primary programming language of 500lines is Python.

