| | web-tables-demo | puffin |
|---|---|---|
| Mentions | 3 | 1 |
| Stars | 14 | 277 |
| Growth | - | 0.0% |
| Activity | 0.8 | 7.6 |
| Last Commit | about 1 year ago | about 1 year ago |
| License | - | MIT License |
- Stars: the number of stars that a project has on GitHub.
- Growth: month-over-month growth in stars.
- Activity: a relative number indicating how actively a project is being developed, where recent commits have a higher weight than older ones. For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects that we are tracking.
web-tables-demo
-
Solving the 100GB Game Download
We've recently made a similar thing for databases: https://github.com/ClickHouse/web-tables-demo
It downloads the database on demand and optionally caches it on the client side: https://presentations.clickhouse.com/meetup73/new_features/#...
But it is not a new idea; for example, it was already implemented for SQLite a long time ago.
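As a rough illustration of the on-demand approach described above, here is a minimal Python sketch that attaches a static MergeTree table served over HTTP and queries it through ClickHouse's HTTP interface. It assumes a local ClickHouse server on the default port 8123; the table name, UUID, column list, and endpoint path are illustrative placeholders modeled on the web-tables-demo README, not exact values from it.

```python
import requests

# Assumption: a local ClickHouse server listening on the default HTTP port.
CLICKHOUSE_URL = "http://localhost:8123"

# Attach a read-only MergeTree table whose data parts live on a static web
# host. The UUID, columns, and endpoint below are placeholders: the real
# values come from the metadata published alongside the data files.
ATTACH_SQL = """
ATTACH TABLE uk_price_paid UUID '00000000-0000-0000-0000-000000000000'
(
    price UInt32,
    date Date,
    town LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (town, date)
SETTINGS disk = disk(
    type = web,
    endpoint = 'https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
)
"""

def run(sql: str) -> str:
    """POST a query to the ClickHouse HTTP interface and return the raw body."""
    resp = requests.post(CLICKHOUSE_URL, data=sql)
    resp.raise_for_status()
    return resp.text

run(ATTACH_SQL)
# Only the columns and ranges this query touches are fetched over HTTP,
# which is what "downloads the database on demand" refers to.
print(run("SELECT town, count() FROM uk_price_paid GROUP BY town ORDER BY count() DESC LIMIT 10"))
```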
-
Cloud Backed SQLite
I recommend checking out a similar project, storing ClickHouse databases on GitHub Pages for serverless analytical queries: https://github.com/ClickHouse/web-tables-demo
-
Throwing lots of data at DuckDB and Athena
I've added a benchmark of ClickHouse and Athena to ClickBench:
https://pastila.nl/?0198061e/f2e0e7b2d61d0fe322607b58fc7200b...
Here ClickHouse operates in a "data lake" mode, simply processing a bunch of Parquet files on S3. Obviously, it is faster than Athena. But I also want to add Presto, Trino, Spark, Databricks, Redshift Spectrum, and Boilingdata, which are currently missing from the benchmark.
Please help me add them: https://github.com/ClickHouse/ClickBench
Also, it includes another mode of ClickHouse, named "web": MergeTree tables hosted on an HTTP server (which is more efficient than Parquet). See https://github.com/ClickHouse/web-tables-demo
About R2: it is currently slow, and also incompatible with S3 (e.g., no multipart uploads).
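For concreteness, here is a hedged sketch of the "data lake" mode mentioned above: the s3() table function lets ClickHouse scan Parquet files on S3 in place, with no ingestion step. The bucket URL and column names are placeholders rather than the actual ClickBench dataset, and the bucket is assumed to be publicly readable, so no credentials are passed.

```python
import requests

# Assumption: a local ClickHouse server on the default HTTP port.
CLICKHOUSE_URL = "http://localhost:8123"

# Scan Parquet files directly on S3 with the s3() table function.
# The bucket, path, and columns are illustrative placeholders.
SQL = """
SELECT UserID, count() AS hits
FROM s3('https://example-bucket.s3.amazonaws.com/hits/*.parquet', 'Parquet')
GROUP BY UserID
ORDER BY hits DESC
LIMIT 10
"""

resp = requests.post(CLICKHOUSE_URL, data=SQL)
resp.raise_for_status()
print(resp.text)
```

The "web" mode differs in that the data stays in MergeTree's native format, so primary-key indexes and marks are preserved, which is presumably why it is described above as more efficient than Parquet.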
puffin
-
Throwing lots of data at DuckDB and Athena
[3] https://github.com/sutoiku/puffin
One possible thing to look into would be whether this dataset is partitioned too much. My understanding is that the recommended file size for individual Parquet files is 512 MB to 1 GB, whereas here they are 50 MB. It would be interesting to see the impact of the partitioning strategy on these benchmarks.
[4] https://parquet.apache.org/docs/file-format/configurations/
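One way to test that hypothesis would be to coalesce the small files into larger ones and rerun the queries. Below is a minimal sketch using PyArrow's dataset writer; "small_parquet/" and "coalesced_parquet/" are placeholder paths, and the row limits are guesses to be tuned until the output files land in the recommended 512 MB to 1 GB range for this particular schema.

```python
import pyarrow.dataset as ds

# Read the existing ~50 MB Parquet files as one logical dataset.
dataset = ds.dataset("small_parquet/", format="parquet")

# Rewrite them as fewer, larger files. The per-file and per-row-group
# limits are placeholders: adjust until file sizes reach ~512 MB to 1 GB.
ds.write_dataset(
    dataset,
    "coalesced_parquet/",
    format="parquet",
    max_rows_per_file=10_000_000,
    max_rows_per_group=1_000_000,
)
```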
What are some alternatives?
- transcript.fish - Unofficial No Such Thing As A Fish episode transcripts
- ClickBench - a benchmark for analytical databases
- s3fs-fuse - FUSE-based file system backed by Amazon S3
- win10-storage-spaces - Scripts for creating and destroying fault-tolerant multi-tier storage spaces on Windows 10
- sql.js-httpvfs - Hosting read-only SQLite databases on static file hosters like GitHub Pages