A few years ago, I was challenged with a task: given a database table with production data, generate fake data with the same structure that preserves most of the probability distributions and inter-column dependencies, while keeping the compression ratios.
This turned out to be hard. Every attempt was either too random, not anonymized enough, or too slow.
After experimenting with five different approaches (explicit distributions, Markov models, Feistel networks, LSTMs, compressed-data mutation), I implemented the result in a tool named `clickhouse-obfuscator`.
It works directly on files and does not depend on any particular database: the data can come from ClickHouse, Snowflake, Redshift, DuckDB, SQLite, PostgreSQL...
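As a sketch of typical usage: the tool reads a file in a standard format, with the table structure passed on the command line. The file names and the column structure below are hypothetical; the flags follow the tool's documented interface.

```shell
# Hedged sketch: obfuscate a TSV export of a table.
# 'CounterID UInt32, URL String' is a hypothetical table structure.
clickhouse-obfuscator \
    --seed "$(head -c16 /dev/urandom | base64)" \
    --input-format TSV --output-format TSV \
    --structure 'CounterID UInt32, URL String' \
    < production.tsv > obfuscated.tsv
```

The output has the same row count and schema, but the values are replaced with synthetic ones that keep the original statistical properties.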
Source code: https://github.com/ClickHouse/ClickHouse/tree/master/program...
Install:
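One way to get the tool (assuming the standard single-binary ClickHouse distribution, which bundles the obfuscator):

```shell
# Download the self-contained ClickHouse binary, then invoke the
# obfuscator as a subcommand of it.
curl https://clickhouse.com/ | sh
./clickhouse obfuscator --help
```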
You can also use this tool to amplify the data volume for tests.
For example, based on a dataset of 100 million records from ClickBench, I created a dataset of 100 billion records. Here is a description of how to generate this dataset:
https://github.com/ClickHouse/ClickBench/tree/main/clickhous...
Basically, you train a model on the existing dataset, then run the generator multiple times in parallel with different seeds.
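The parallel-seed step above can be sketched as follows. Note that each invocation here re-reads the source data (i.e., re-trains the model); the real ClickBench pipeline trains once and reuses the model, but I'm not asserting the exact flags for that here. File names and structure are hypothetical.

```shell
# Hedged sketch: amplify a dataset by running the generator N times
# in parallel, each with a different seed, then concatenating.
for seed in $(seq 1 8); do
    clickhouse-obfuscator \
        --seed "$seed" \
        --input-format TSV --output-format TSV \
        --structure 'CounterID UInt32, URL String' \
        < source.tsv > "part_${seed}.tsv" &
done
wait
cat part_*.tsv > amplified.tsv  # roughly 8x the original volume
```

Because the per-seed runs are independent, they can also be spread across machines instead of local background jobs.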