A few years ago, I was challenged with a task: given a database table with production data, generate fake data with the same structure that preserves most of the probability distributions and inter-column dependencies, while keeping the compression ratios.
This turned out to be hard. Every attempt was either too random, not anonymized enough, or too slow.
After experimenting with five different approaches (explicit distributions, Markov models, Feistel networks, LSTMs, compressed-data mutation), I implemented the result in a tool named `clickhouse-obfuscator`.
It works directly on files and does not depend on any particular database: the data can come from ClickHouse, Snowflake, Redshift, DuckDB, SQLite, PostgreSQL...
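As a sketch of typical usage: the tool reads a file in a standard format, with the table structure passed on the command line. The file names and the column structure below are hypothetical; the flags follow the tool's documented interface.

```shell
# Hedged sketch: obfuscate a TSV export of a table.
# 'CounterID UInt32, URL String' is a hypothetical table structure.
clickhouse-obfuscator \
    --seed "$(head -c16 /dev/urandom | base64)" \
    --input-format TSV --output-format TSV \
    --structure 'CounterID UInt32, URL String' \
    < production.tsv > obfuscated.tsv
```

The output has the same row count and schema, but the values are replaced with synthetic ones that keep the original statistical properties.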
Source code: https://github.com/ClickHouse/ClickHouse/tree/master/program...
Install:
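One way to get the tool (assuming the standard single-binary ClickHouse distribution, which bundles the obfuscator):

```shell
# Download the self-contained ClickHouse binary, then invoke the
# obfuscator as a subcommand of it.
curl https://clickhouse.com/ | sh
./clickhouse obfuscator --help
```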
You can also use this tool to amplify the data volume for tests.
For example, based on a dataset of 100 million records from ClickBench, I created a dataset of 100 billion records. Here is a description of how to generate this dataset:
https://github.com/ClickHouse/ClickBench/tree/main/clickhous...
Basically, you train a model on the existing dataset, then run the generator multiple times in parallel with different seeds.
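The parallel-seed step above can be sketched as follows. Note that each invocation here re-reads the source data (i.e., re-trains the model); the real ClickBench pipeline trains once and reuses the model, but I'm not asserting the exact flags for that here. File names and structure are hypothetical.

```shell
# Hedged sketch: amplify a dataset by running the generator N times
# in parallel, each with a different seed, then concatenating.
for seed in $(seq 1 8); do
    clickhouse-obfuscator \
        --seed "$seed" \
        --input-format TSV --output-format TSV \
        --structure 'CounterID UInt32, URL String' \
        < source.tsv > "part_${seed}.tsv" &
done
wait
cat part_*.tsv > amplified.tsv  # roughly 8x the original volume
```

Because the per-seed runs are independent, they can also be spread across machines instead of local background jobs.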