sqlite_scanner
db-benchmark
Our great sponsors
sqlite_scanner | db-benchmark | |
---|---|---|
4 | 91 | |
184 | 319 | |
2.2% | 0.9% | |
8.3 | 0.0 | |
15 days ago | 10 months ago | |
C | R | |
MIT License | Mozilla Public License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
sqlite_scanner
-
DuckDB 0.7.0
It's not a dumb question at all. I'm pretty knowledgeable with DBs and still find it very difficult to understand how many of these front-end/pass-through engines work.
Checkout Postgres Foreign Data Wrappers. That might be the most well known approach for accessing one database through another. The Supabase team wrote an interesting piece about this recently.
https://supabase.com/blog/postgres-foreign-data-wrappers-rus...
You might also want to try out duckdb's approach to reading other DBs (or DB files). They talk about how they can "import" a sqlite DB in the above 0.7.0 announcement, but also have some other examples in their duckdblabs github project. Check out their "...-scanner" repos:
https://github.com/duckdblabs/postgres_scanner
https://github.com/duckdblabs/sqlite_scanner
-
Notes on the SQLite DuckDB Paper
DuckDB can actually read SQLite or Postgres directly! In the SQLite case, something like Litestream plus DuckDB could work really well!
Also, with Pyarrow's help, DuckDB can already do this with Delta tables!
https://github.com/duckdblabs/sqlite_scanner
https://github.com/duckdblabs/postgresscanner
-
Friendlier SQL with DuckDB
Excellent question! I'll jump in - I am a part of the DuckDB team though, so if other users have thoughts it would be great to get other perspectives as well.
First things first - we really like quite a lot about the SQLite approach. DuckDB is similarly easy to install and is built without dependencies, just like SQLite. It also runs in the same process as your application just like SQLite does. SQLite is excellent as a transactional database - lots of very specific inserts, updates, and deletes (called OLTP workloads). DuckDB can also read directly out of SQLite files as well, so you can mix and match them! (https://github.com/duckdblabs/sqlitescanner)
DuckDB is much faster than SQLite when doing analytical queries (OLAP) like when calculating summaries or trends over time, or joining large tables together. It can use all of your CPU cores for sometimes ~100x speedup over SQLite.
DuckDB also has some enhancements with respect to data transfer in and out of it. It can natively read Pandas, R, and Julia dataframes, and can read parquet files directly also (meaning without inserting first!).
Does that help? Happy to add more details!
- DuckDB extension to read SQLite databases
db-benchmark
- Database-Like Ops Benchmark
-
Polars
Real-world performance is complicated since data science covers a lot of use cases.
If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.
Per this (old) benchmark, there are differences once you get into 500MB+ territory: https://h2oai.github.io/db-benchmark/
-
DuckDB performance improvements with the latest release
I do think it was important for duckdb to put out a new version of the results as the earlier version of that benchmark [1] went dormant with a very old version of duckdb with very bad performance, especially against polars.
[1] https://h2oai.github.io/db-benchmark/
-
Show HN: SimSIMD vs. SciPy: How AVX-512 and SVE make SIMD cleaner and ML faster
https://news.ycombinator.com/item?id=33270638 :
> Apache Ballista and Polars do Apache Arrow and SIMD.
> The Polars homepage links to the "Database-like ops benchmark" of {Polars, data.table, DataFrames.jl, ClickHouse, cuDF, spark, (py)datatable, dplyr, pandas, dask, Arrow, DuckDB, Modin,} but not yet PostgresML? https://h2oai.github.io/db-benchmark/ *
LLM -> Vector database: https://en.wikipedia.org/wiki/Vector_database
/? inurl:awesome site:github.com "vector database"
-
Pandas vs. Julia โ cheat sheet and comparison
I agree with your conclusion but want to add that switching from Julia may not make sense either.
According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.
For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.
-
Any faster Python alternatives?
Same. Numba does wonders for me in most scenarios. Yesterday I've discovered pola-rs and looks like I will add it to the stack. It's API is similar to pandas. Have a look at the benchmarks of cuDF, spark, dask, pandas compared to it: Benchmarks
-
Pandas 2.0 (with pyarrow) vs Pandas 1.3 - Performance comparison
The syntax has similarities with dplyr in terms of the way you chain operations, and itโs around an order of magnitude faster than pandas and dplyr (thereโs a nice benchmark here). Itโs also more memory-efficient and can handle larger-than-memory datasets via streaming if needed.
-
Pandas v2.0 Released
If interested in benchmarks comparing different dataframe implementations, here is one:
https://h2oai.github.io/db-benchmark/
- Database-like ops benchmark
-
Python "programmers" when I show them how much faster their naive code runs when translated to C++ (this is a joke, I love python)
Bad examples. Both numpy and pandas are notoriously un-optimized packages, losing handily to pretty much all their competitors (R, Julia, kdb+, vaex, polars). See https://h2oai.github.io/db-benchmark/ for a partial comparison.
What are some alternatives?
ibis - the portable Python dataframe library
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
react-native-quick-sqlite - Fast SQLite for react-native.
datafusion - Apache DataFusion SQL Query Engine
ClickBench - ClickBench: a Benchmark For Analytical Databases
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
sqlitestudio - A free, open source, multi-platform SQLite database manager.
databend - ๐๐ฎ๐๐ฎ, ๐๐ป๐ฎ๐น๐๐๐ถ๐ฐ๐ & ๐๐. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
duckdb - DuckDB is an in-process SQL OLAP Database Management System
DataFramesMeta.jl - Metaprogramming tools for DataFrames
budibase - Budibase is an open-source low code platform that helps you build internal tools in minutes ๐
sktime - A unified framework for machine learning with time series