Performance comparison: counting words in Python, Go, C++, C, Awk, Forth, Rust

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • countwords

    Discontinued Playing with counting word frequencies (and performance) in various languages.

  • The source code of the trie program I wrote is here: https://github.com/benhoyt/countwords/blob/master/rust/optim...

    The comments explain why I think it's slower. The TL;DR is that using a trie requires more memory accesses than a hash table (per byte of input), and those memory accesses slow the whole enterprise down.

    But, not all memory accesses are created equal. So perhaps a more clever representation that exploits locality better would do better. "Obvious" techniques for shrinking memory usage made a big difference and brought the performance of the trie program closer to C: https://github.com/benhoyt/countwords/pull/2

  • CPython

    The Python programming language

  • > if the Python solution were unable to leverage its rich and well optimised standard-library

    collections.Counter is a very straightforward collection backed by dict. The code isn't specially optimized, you can write the same code yourself with dict and even avoid (probably a negligible amount of) overhead.

    Now, dict is optimized, but that's a builtin type. You don't need PSL.

    https://github.com/python/cpython/blob/2fe408497e6838b6fb761...

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • adix

    An Adaptive Index Library for Nim

  • In Nim there is this more general (multi-thread capable with mmap and not folding punctuation in) as a demo in adix/tests [1].

    For me just now, single thread ran slightly faster than `wc -w` and about 1.5x faster than optimized.c. Scales pretty linearly with thread count, too. This topic was also discussed in (at least) [2], but Knuth-McIlroy does come up a lot.

    [1] https://github.com/c-blake/adix/blob/master/tests/wf.nim

    [2] https://news.ycombinator.com/item?id=24817594

  • wordcount

    Counting words in different programming languages.

  • There is a similar great project here [1]. The performance of Java there is super impressive.

    [1] https://github.com/juditacs/wordcount

  • Here is a similar excercise. Ad-hoc programs in different languages generate a long list of random numbers. How long does it take? https://github.com/posch/generate-random-numbers

  • word_frequency_nim

    The word frequency program, written in simple nim.

  • Thanks for doing a nim version! I've been learning nim, and am very much a beginner, so I thought I'd tackle a simple version [0]. I'd appreciate any critiques... I don't know how idiomatic it is.

    Honestly, the optimized version you created is kind of opaque; I don't want people to think that that's what "typical" nim looks like.

    I'll do a pull request on the project for mine as simple, yours as optimized, if you don't mind.

    [0]: https://github.com/csterritt/word_frequency_nim/blob/master/...

  • raikv

    Persistent key value store, serverless shared memory caching

  • Amusingly, I've done a multi-threaded version of the word counting program in order to test a shm kv store. I needed benchmark that created a lot of cross thread concurrent accesses to keys and I found a blog about this test. My version has serious constraints though, you have to create a shared memory map with enough space to hold all of the keys beforehand, as it doesn't resize the shm kv map as it runs.

    This is the source for it:

    https://github.com/raitechnology/raikv/blob/master/test/ctes...

    The speedup of the multi-threaded version vs the single-threaded version is about linear. The single threaded version uses 2 threads, one to read stdin and one to hash the keys, the 16 threaded version uses one thread to read, 16 to hash.

    $ time ctest -t 1 < ~/data/enwiki-p10p30303

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • KindleClippingsTranslator

    Czytacz slowek

  • Sure here it is[0]. This one is for parsing Kindle clippings but it's quite similar (except no sorting of all words, I was usually using some service for mobi to txt switch before feeding script). I should probably clean it up and switch implementation to english (but don't use it anymore).

    [0] https://github.com/Machiaweliczny/KindleClippingsTranslator/...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts