db-benchmark vs julia

db-benchmark

reproducible benchmark of database-like ops (by h2oai)

h2oai.github.io

julia

The Julia Programming Language (by JuliaLang)

julia-language Julia Scientific HPC Numerical Machine Learning programming-language Science HacktoberFest Julialang

julialang.org

Project	db-benchmark	julia
Mentions	91	350
Stars	319	44,469
Growth	0.9%	0.8%
Activity	0.0	10.0
Latest Commit	10 months ago	2 days ago
Language	R	Julia
License	Mozilla Public License 2.0	MIT License

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

db-benchmark

Posts with mentions or reviews of db-benchmark. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-01-08.

Database-Like Ops Benchmark
1 project | news.ycombinator.com | 28 Jan 2024
Polars
11 projects | news.ycombinator.com | 8 Jan 2024

Real-world performance is complicated since data science covers a lot of use cases.
If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.
Per this (old) benchmark, there are differences once you get into 500MB+ territory: https://h2oai.github.io/db-benchmark/
DuckDB performance improvements with the latest release
8 projects | news.ycombinator.com | 6 Nov 2023

I do think it was important for duckdb to put out a new version of the results as the earlier version of that benchmark [1] went dormant with a very old version of duckdb with very bad performance, especially against polars.
[1] https://h2oai.github.io/db-benchmark/
Show HN: SimSIMD vs. SciPy: How AVX-512 and SVE make SIMD cleaner and ML faster
16 projects | news.ycombinator.com | 7 Oct 2023

https://news.ycombinator.com/item?id=33270638 :
> Apache Ballista and Polars do Apache Arrow and SIMD.
> The Polars homepage links to the "Database-like ops benchmark" of {Polars, data.table, DataFrames.jl, ClickHouse, cuDF, spark, (py)datatable, dplyr, pandas, dask, Arrow, DuckDB, Modin,} but not yet PostgresML? https://h2oai.github.io/db-benchmark/ *
LLM -> Vector database: https://en.wikipedia.org/wiki/Vector_database
/? inurl:awesome site:github.com "vector database"
Pandas vs. Julia – cheat sheet and comparison
7 projects | news.ycombinator.com | 17 May 2023

I agree with your conclusion but want to add that switching from Julia may not make sense either.
According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.
For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.
Any faster Python alternatives?
6 projects | /r/learnprogramming | 12 Apr 2023

Same. Numba does wonders for me in most scenarios. Yesterday I discovered pola-rs and it looks like I will add it to the stack. Its API is similar to pandas. Have a look at the benchmarks of cuDF, spark, dask, pandas compared to it: Benchmarks
Pandas 2.0 (with pyarrow) vs Pandas 1.3 - Performance comparison
1 project | /r/datascience | 8 Apr 2023

The syntax has similarities with dplyr in terms of the way you chain operations, and it’s around an order of magnitude faster than pandas and dplyr (there’s a nice benchmark here). It’s also more memory-efficient and can handle larger-than-memory datasets via streaming if needed.
Pandas v2.0 Released
5 projects | news.ycombinator.com | 3 Apr 2023

If interested in benchmarks comparing different dataframe implementations, here is one:
https://h2oai.github.io/db-benchmark/
Database-like ops benchmark
1 project | /r/dataengineering | 16 Feb 2023
Python "programmers" when I show them how much faster their naive code runs when translated to C++ (this is a joke, I love python)
2 projects | /r/ProgrammerHumor | 17 Jan 2023

Bad examples. Both numpy and pandas are notoriously un-optimized packages, losing handily to pretty much all their competitors (R, Julia, kdb+, vaex, polars). See https://h2oai.github.io/db-benchmark/ for a partial comparison.

julia

Posts with mentions or reviews of julia. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-03-06.

Top Paying Programming Technologies 2024
19 projects | dev.to | 6 Mar 2024

34. Julia - $74,963
Optimize sgemm on RISC-V platform
6 projects | news.ycombinator.com | 28 Feb 2024

I don't believe there is any official documentation on this, but https://github.com/JuliaLang/julia/pull/49430 for example added prefetching to the marking phase of a GC which saw speedups on x86, but not on M1.
Dart 3.3
2 projects | news.ycombinator.com | 15 Feb 2024

3. dispatch on all the arguments
the first solution is clean, but people really like dispatch.
the second makes calling functions in the function call syntax weird, because the first argument is privileged semantically but not syntactically.
the third makes calling functions in the method call syntax weird because the first argument is privileged syntactically but not semantically.
the closest things to this i can think of off the top of my head in remotely popular programming languages are: nim, lisp dialects, and julia.
nim navigates the dispatch conundrum by providing different ways to define free functions for different dispatch-ness. the tutorial gives a good overview: https://nim-lang.org/docs/tut2.html
lisps of course lack UFCS.
see here for a discussion on the lack of UFCS in julia: https://github.com/JuliaLang/julia/issues/31779
so to sum up the answer to the original question: because it's only obvious how to make it nice and tidy like you're wanting if you sacrifice function dispatch, which is ubiquitous for good reason!
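
To make "dispatch on all the arguments" concrete, here is a minimal Python sketch. It is only an illustration, not how Julia or Nim implement dispatch: the `register` and `collide` helpers and the `Asteroid`/`Ship` classes are hypothetical names invented for this example, and the lookup matches exact runtime types only (no subtyping, no ambiguity resolution).

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: maps a tuple of argument types to an implementation.
_implementations: Dict[Tuple[type, ...], Callable] = {}


def register(*types: type) -> Callable:
    """Register an implementation for one exact combination of argument types."""
    def decorator(fn: Callable) -> Callable:
        _implementations[types] = fn
        return fn
    return decorator


def collide(a, b):
    """Pick an implementation from the runtime types of BOTH arguments."""
    key = (type(a), type(b))
    try:
        fn = _implementations[key]
    except KeyError:
        names = ", ".join(t.__name__ for t in key)
        raise TypeError(f"no collide method for ({names})") from None
    return fn(a, b)


class Asteroid:
    pass


class Ship:
    pass


@register(Asteroid, Ship)
def _asteroid_ship(a, b):
    return "asteroid hits ship"


@register(Ship, Asteroid)
def _ship_asteroid(a, b):
    return "ship hits asteroid"


print(collide(Asteroid(), Ship()))
print(collide(Ship(), Asteroid()))
```

Neither argument is privileged here: swapping the two arguments selects a different implementation, which is the property that a method-call syntax with a special first argument obscures.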
Julia 1.10 Highlights
1 project | news.ycombinator.com | 27 Dec 2023

https://github.com/JuliaLang/julia/blob/release-1.10/NEWS.md
Best Programming languages for Data Analysis📊
4 projects | dev.to | 7 Dec 2023

Visit official site: https://julialang.org/
Potential of the Julia programming language for high energy physics computing
10 projects | news.ycombinator.com | 4 Dec 2023

No. It runs natively on ARM.
julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Rust std:fs slower than Python
7 projects | news.ycombinator.com | 29 Nov 2023

https://github.com/JuliaLang/julia/issues/51086#issuecomment...
So while this "fixes" the issue, it'll introduce a confusing time delay between you freeing the memory and you observing that in `htop`.
But according to https://jemalloc.net/jemalloc.3.html you can set `opt.muzzy_decay_ms = 0` to remove the delay.
Still, the musl author has some reservations against making `jemalloc` the default:
https://www.openwall.com/lists/musl/2018/04/23/2
> It's got serious bloat problems, problems with undermining ASLR, and is optimized pretty much only for being as fast as possible without caring how much memory you use.
With the above-mentioned tunables, this should be mitigated to some extent, but the general "theme" (focusing on e.g. performance vs memory usage) will likely still mean "it's a tradeoff" or "it's no tradeoff, but only if you set tunables to what you need".
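
As a rough sketch of how the tunable mentioned above might be applied: jemalloc reads `opt.*` settings from the `MALLOC_CONF` environment variable as comma-separated `name:value` pairs, without the `opt.` prefix. The helper name `with_muzzy_decay` and the commented-out `./my_program` binary are hypothetical, and the setting only has an effect if the launched program actually uses jemalloc (builds with a symbol prefix may read a differently named variable).

```python
import os


def with_muzzy_decay(env, decay_ms=0):
    """Return a copy of env with muzzy_decay_ms set in MALLOC_CONF."""
    new_env = dict(env)
    # Keep any other options already present, dropping an older muzzy_decay_ms.
    options = [
        opt
        for opt in new_env.get("MALLOC_CONF", "").split(",")
        if opt and not opt.startswith("muzzy_decay_ms:")
    ]
    options.append(f"muzzy_decay_ms:{decay_ms}")
    new_env["MALLOC_CONF"] = ",".join(options)
    return new_env


env = with_muzzy_decay(os.environ)
print(env["MALLOC_CONF"])
# To launch a jemalloc-linked program with this setting (hypothetical binary):
# import subprocess
# subprocess.run(["./my_program"], env=env, check=True)
```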
Eleven strategies for making reproducible research the norm
1 project | news.ycombinator.com | 25 Nov 2023

I have asked about Julia's reproducibility story on the Guix mailing list in the past, and at the time Simon Tournier didn't think it was promising. I seem to recall Julia itself didn't have a reproducible build. All I know now is that the GitHub issue is still not closed.
https://github.com/JuliaLang/julia/issues/34753
Julia as a unifying end-to-end workflow language on the Frontier exascale system
5 projects | news.ycombinator.com | 19 Nov 2023

I don't really know what kind of rebuttal you're looking for, but I will link my HN comments from when this was first posted for some thoughts: https://news.ycombinator.com/item?id=31396861#31398796. As I said in the linked post, I'm quite skeptical of the business of trying to assess relative bugginess of programming in different systems, because that has strong dependencies on what you consider core vs packages and what exactly you're trying to do.
However, bugs in general suck and we've been thinking a fair bit about what additional tooling the language could provide to help people avoid the classes of bugs that Yuri encountered in the post.
The biggest class of problems in the blog post is that `@inbounds` (and I will extend this to `@assume_effects`, even though that wasn't around when Yuri wrote his post) is clearly problematic, because it's too hard to write. My proposal for what to do instead is at https://github.com/JuliaLang/julia/pull/50641.
Another common theme is that while Julia is great at composition, it's not clear what's expected to work and what isn't, because the interfaces are informal and not checked. This is a hard design problem, because it's quite close to the reasons why Julia works well. My current thoughts on that are here: https://github.com/Keno/InterfaceSpecs.jl but there are other proposals as well.
Getaddrinfo() on glibc calls getenv(), oh boy
10 projects | news.ycombinator.com | 16 Oct 2023

Doesn't musl have the same issue? https://github.com/JuliaLang/julia/issues/34726#issuecomment...
I also wonder about OSX's libc. Newer versions seem to have some sort of locking https://github.com/apple-open-source-mirror/Libc/blob/master...
but older versions (from 10.9) don't have any locking: https://github.com/apple-oss-distributions/Libc/blob/Libc-99...

What are some alternatives?

When comparing db-benchmark and julia you can also consider the following projects:

polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust

jax - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

arrow-datafusion - Apache DataFusion SQL Query Engine

NetworkX - Network Analysis in Python

Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Lua - Lua is a powerful, efficient, lightweight, embeddable scripting language. It supports procedural programming, object-oriented programming, functional programming, data-driven programming, and data description.

databend - Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com

rust-numpy - PyO3-based Rust bindings of the NumPy C-API

DataFramesMeta.jl - Metaprogramming tools for DataFrames

Numba - NumPy aware dynamic Python compiler using LLVM

sktime - A unified framework for machine learning with time series

F# - Please file issues or pull requests here: https://github.com/dotnet/fsharp
