| | amazon-s3-find-and-forget | data-toolset |
|---|---|---|
| Mentions | 3 | 1 |
| Stars | 232 | 1 |
| Growth | 0.9% | - |
| Activity | 7.3 | 6.8 |
| Last commit | 8 days ago | about 2 months ago |
| Language | Python | Python |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
amazon-s3-find-and-forget
- Deleting particular data from S3 External Tables
  Take a look at https://github.com/awslabs/amazon-s3-find-and-forget, which we use for GDPR compliance: it opens a file, deletes a row, and packs the file back. It modifies the file, so watch out if you are using Glue job bookmarks. Because you are using external tables, the manifest file also has to be updated with the correct length for the new, updated file. If you have hundreds of tables and thousands of files, and you need to do this on a regular basis, this would be the scalable solution, but if you have only a few files, honestly, I would do it manually.
- Update S3 Files
  Have a look at S3 Find and Forget.
- How to handle GDPR requests for data stored in S3?
  S3 Find and Forget is probably worth looking into, even if just to get ideas on how to implement a similar solution for yourself.
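The "open a file, delete a row, pack it back, then fix the manifest length" flow from the first post above can be sketched locally. This is not the tool's own code. It is a standard-library illustration that assumes gzipped JSON Lines data and a manifest shaped like Redshift Spectrum's (`entries[].meta.content_length`); the post does not say which query engine or file format is in use. The function names, the bucket URL, and the `customer_id` column are all hypothetical.

```python
import gzip
import json


def forget_rows(packed: bytes, column: str, match_ids: set) -> bytes:
    """Unpack gzipped JSON Lines, drop rows whose column matches, repack the rest."""
    lines = gzip.decompress(packed).decode("utf-8").splitlines()
    kept = [
        line
        for line in lines
        if line.strip() and str(json.loads(line).get(column)) not in match_ids
    ]
    body = "".join(line + "\n" for line in kept)
    return gzip.compress(body.encode("utf-8"))


def update_manifest(manifest: dict, url: str, new_length: int) -> dict:
    """Return a copy of the manifest with content_length refreshed for one file."""
    entries = []
    for entry in manifest["entries"]:
        if entry["url"] == url:
            meta = {**entry.get("meta", {}), "content_length": new_length}
            entry = {**entry, "meta": meta}
        entries.append(entry)
    return {**manifest, "entries": entries}


# Hypothetical data standing in for an object stored in S3.
rows = [{"customer_id": "42", "name": "A"}, {"customer_id": "7", "name": "B"}]
original = gzip.compress("".join(json.dumps(r) + "\n" for r in rows).encode("utf-8"))
url = "s3://example-bucket/customers/part-0000.json.gz"  # assumed path
manifest = {"entries": [{"url": url, "meta": {"content_length": len(original)}}]}

# Remove the data subject's row, then record the rewritten file's byte length.
rewritten = forget_rows(original, "customer_id", {"42"})
manifest = update_manifest(manifest, url, len(rewritten))
```

In a real pipeline the rewritten bytes would be uploaded over the original object and the manifest written back in the same step, since a stale `content_length` is exactly the mismatch the post warns about.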
data-toolset
What are some alternatives?
DataEngineeringProject - Example end to end data engineering project.
prql-query - Query and transform data with PRQL
isp-data-pollution - ISP Data Pollution to Protect Private Browsing History with Obfuscation
dbd - dbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.
awesome-aws - A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources. Featuring the Fiery Meter of AWSome.
rill - Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.
s3-credentials - A tool for creating credentials for accessing S3 buckets
petastorm - Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
pystore - Fast data store for Pandas time-series data
DataProfiler - What's in your data? Extract schema, statistics and entities from datasets