Performance comparison: counting words in Python, C/C++, Awk, Rust, and more

countwords

43 209 5.9 Rust

Playing with counting word frequencies (and performance) in various languages. (Discontinued)

I don't think the poor performance is due to startup time at all. I cloned the repo, ran the benchmark, and found that the Swift version's execution time grows much faster than the size of the input.
The benchmark tests each executable by piping in the full King James Bible duplicated 10 times[1] (each copy is 4.13 MB[2]). With just a single copy of the input text, the execution time dropped to 58-59 milliseconds; with the unmodified benchmark it jumped to over 4 seconds, roughly 70 times longer for 10 times the input. For comparison, a hello world script runs in about 13 milliseconds. The Swift team actually boasts about its quick startup time on the official website[3].
[1] https://github.com/benhoyt/countwords/blob/master/test.sh#L5
[2] https://github.com/benhoyt/countwords/blob/master/kjvbible.t...
[3] https://www.swift.org/server/
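A rough way to check this kind of scaling claim is to time the same workload on one copy and on ten copies of the input, then divide the growth in time by the growth in input. The sketch below is not the benchmark's own harness: `count_words`, `timed_count`, and `superlinear_factor` are hypothetical names, and the word counter is only a stand-in workload.

```python
import time
from collections import Counter


def count_words(text):
    """Count lowercase, whitespace-separated words (a stand-in workload)."""
    return Counter(text.lower().split())


def timed_count(text, copies):
    """Run count_words on `copies` repetitions of text; return (seconds, counts)."""
    data = text * copies
    start = time.perf_counter()
    counts = count_words(data)
    return time.perf_counter() - start, counts


def superlinear_factor(base_time, scaled_time, input_factor):
    """Time growth divided by input growth; about 1.0 means linear scaling."""
    return (scaled_time / base_time) / input_factor


if __name__ == "__main__":
    sample = "In the beginning God created the heaven and the earth.\n" * 1000
    t1, _ = timed_count(sample, 1)
    t10, _ = timed_count(sample, 10)
    print(f"measured factor: {superlinear_factor(t1, t10, 10):.2f}")
    # Figures quoted in the comment: about 58.5 ms for one copy, over 4 s for ten.
    print(f"quoted Swift figures: {superlinear_factor(0.0585, 4.0, 10):.1f}")
```

A factor near 1 suggests the run time is dominated by work proportional to the input; a factor well above 1, as with the quoted figures, points at something other than fixed startup cost.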

CPython

1,319 59,856 10.0 Python

The Python programming language

“Pure Python” commonly means implemented using only the Python language. Something written in pure Python ought to be portable across Python implementations. I was merely pointing out that this line
https://github.com/python/cpython/blob/4395ff1e6a18fb26c7a66...
isn’t exactly pure Python, because under a different runtime (e.g. PyPy) the code would take a different path: the “pure Python” implementation of _count_elements instead of the C implementation.

countwords

1 0 0.0 Rust

Playing with counting word frequencies (and performance) in various languages. (by ClickHouse)
countwords

5 4 2.6 Rust

Playing with counting word frequencies (and performance) in various languages. (by kimono-koans)

In case anyone is interested, I wrote an optimized but much simpler Rust implementation just today[0], which is faster than the optimized implementation on my machine. No indexing into arrays of bytes, etc., and no "code golf" measures.
It looks like idiomatic Rust, which I think is interesting. It shows there is more than one way to skin a cat.
[0]: https://github.com/kimono-koans/countwords/blob/master/rust/...

gccontent-benchmark

8 55 0.0 Rust

Benchmarking different languages for a simple bioinformatics task (Counting the GC fraction of DNA in a FASTA file)

Fun stuff! I ran a similar exercise before with a simple bioinformatics problem (calculating the ratio of G and C bases to the total A+G+C+T):
https://github.com/samuell/gccontent-benchmark#readme
It's really hard, or impossible, to arrive at a single definitive number for one language, but the whole exercise is a lot of fun and quite informative IMO :)
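For readers unfamiliar with the task, here is a minimal sketch of the GC-fraction calculation, assuming FASTA input where header lines start with `>` and every other line holds sequence letters. The function name `gc_fraction` and the choice to ignore characters other than A, C, G, and T are assumptions for this illustration, not taken from the linked benchmark.

```python
def gc_fraction(fasta_lines):
    """Return (G + C) / (A + C + G + T) over all sequence lines."""
    gc = at = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip().upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    total = gc + at
    # Avoid dividing by zero on empty or header-only input.
    return gc / total if total else 0.0


if __name__ == "__main__":
    example = [">seq1 hypothetical record\n", "GGCC\n", "AATT\n", "NNNN\n"]
    print(gc_fraction(example))
```

Passing an open file object works as well, since iterating over it yields lines one at a time.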

robin-hood-hashing

23 1,465 0.0 C++

Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20 (Discontinued)

Got a slightly better C++ version here, which uses a few libraries (boost, fmt, and https://github.com/martinus/robin-hood-hashing) instead of the std:: equivalents: https://gist.github.com/jcelerier/74dfd473bccec8f1bd5d78be5a...
    $ g++ -I robin-hood-hashing/src/include -O2 -flto -std=c++20 -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -lfmt

countwords

1 1 0.0

Playing with counting word frequencies (and performance) in various languages. (by BurntSushi)

$ git clone -b ag/test-kimono https://github.com/BurntSushi/countwords

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Performance comparison: counting words in Python, Go, C++, C, AWK, Forth, and Rust
2 projects | /r/programming | 15 Mar 2021

I've been loving Benchmarking lately, but the Framework does this one quirky thing with the first result of a set. Specifically, the first return is always unusually high.
1 project | /r/laravel | 6 Dec 2023

Pinpoint performance regressions with CI-Integrated differential profiling
4 projects | dev.to | 23 Oct 2023

If this isn't the perfect data structure, why?
3 projects | /r/C_Programming | 22 Oct 2023

unordered_dense: A Fast & Densely Stored Hashmap And Hashset Based On Robin-Hood Backward Shift Deletion
1 project | /r/programming | 11 Jul 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
C++ Bioinformatics Performance hash-tables Benchmarking
Post date: 24 Jul 2022
