What's in your data? Extract schema, statistics and entities from datasets (by capitalone)



capitalone/DataProfiler is an open-source project licensed under the Apache License 2.0, which is an OSI-approved license.

DataProfiler Alternatives

Similar projects and alternatives to DataProfiler

  • GitHub repo sheet2dict

    Simple XLSX and CSV to dictionary converter

  • GitHub repo pandas-profiling

    Create HTML profiling reports from pandas DataFrame objects

  • GitHub repo

    The Python error steamroller.

  • GitHub repo fuckitjs

    The Original Javascript Error Steamroller

  • GitHub repo XlsxWriter

    A Python module for creating Excel XLSX files.

  • GitHub repo pyWhat

    🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

  • GitHub repo chardet

    Python character encoding detector

  • GitHub repo datatable

    A Python package for manipulating 2-dimensional tabular data structures

  • GitHub repo usaddress

    :us: a python library for parsing unstructured address strings into address components

  • GitHub repo probablepeople

    :family: a python library for parsing unstructured western names into name components.

  • GitHub repo hachoir

    Hachoir is a Python library to view and edit a binary stream field by field

  • GitHub repo visions

    Type System for Data Analysis in Python

NOTE: The number of mentions on this list indicates mentions on common posts. Hence, a higher number means a better DataProfiler alternative or higher similarity.


Posts where DataProfiler has been mentioned. We have used some of these posts to build our list of alternatives and similar projects - the last one was on 2021-06-16.
  • PyWhat: Identify Anything | 2021-06-16
    We built a similar tool that uses a CNN. It works on structured (and unstructured) data and provides additional info.

    The cool part is that you can “extend” the internal named-entity recognition model by refitting it with new data.

    Out of the box, the DataProfiler detects something like 18 entities, including most of the PII data.

  • Ask HN: What is the best tool to infer data type of tabular data? | 2021-06-08
  • DataProfiler - What's in your data? Extract schema, statistics and entities from datasets
  • Show HN: DataProfiler – What's in your data? Extract schema, stats and entities | 2021-05-10
  • Show HN: DataProfiler – A Replacement for Pandas-Profiling | 2021-05-04
  • Show HN: DataProfiler – Sensitive Data Detection and Profiling | 2021-05-03
  • Show HN: DataProfiler – Extract schema, statistics and PPI / NPI detection | 2021-04-30
  • Show HN: The DataProfiler – What's in your data? | 2021-04-26
  • [P] Data Profiler | What's in your data?
    Our team has been working on a Python library called the DataProfiler. The main objective was to create a library that could quickly, accurately, and cheaply identify sensitive data (PII/NPI) in datasets.
  • Show HN: Sheet2dict – simple Python XLSX/CSV reader/to dictionary converter | 2021-04-21
    I maintain a similar project: load any CSV, manipulate it and get stats, detect sensitive data, etc.

    My question: how do you do header detection? That's a _very_ difficult problem.

  • Apt Encounters of the Third Kind | 2021-03-26
    It was actually a response to your comment:

    > One thing I didn't get is this magical PII thing. How does the author look at a random network packet -- nay, just packet headers -- and assign a PII:true/false label? I think many corporations would sacrifice the right hand of a sysadmin if that was the way to get this tech.

    Check out Amazon Macie or Microsoft Presidio, or try actually using the library I linked?

    It’s usually used in a constrained way and is in no way perfect. But it helps investigators track suspected cases of data exfiltration. You can pull something that looks suspect (say, a credit card number), compare it against an internal dataset, and see if it’s legit.

    In the repo I linked you can see the citation for an earlier model on synthetic and real-world datasets:

    So I don’t really understand the hostility.
  • Which Tech/Library Stack you use in NLP from training to Production
  • What's in your data? We built a library to easily load files and extract schema, statistics and entities | 2021-03-03
    GitHub repository:
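
On the header-detection question raised in the Sheet2dict thread above, one simple baseline can be sketched with Python's standard library. csv.Sniffer.has_header guesses whether the first row is a header by comparing it with the rows that follow (for example, a text label sitting on top of a numeric column). This is only an illustration of the problem, not how DataProfiler or sheet2dict detect headers; the function name looks_like_header and the sample rows are made up for this sketch.

```python
import csv


def looks_like_header(sample: str) -> bool:
    """Guess whether the first row of a CSV text sample is a header row.

    Hypothetical helper for illustration. It delegates to the standard
    library heuristic, which checks whether the first row differs in
    type or length from the columns that stay consistent below it.
    """
    return csv.Sniffer().has_header(sample)


# Made-up sample data: the same three records, with and without a header row.
with_header = (
    "name,age,city\n"
    "Ada,36,London\n"
    "Alan,41,Wilmslow\n"
    "Grace,85,Arlington\n"
)
without_header = (
    "Ada,36,London\n"
    "Alan,41,Wilmslow\n"
    "Grace,85,Arlington\n"
)

if __name__ == "__main__":
    print("with header:", looks_like_header(with_header))
    print("without header:", looks_like_header(without_header))
```

The heuristic leans on columns whose values share a type or length, so it can misfire on all-text files or very short samples, which is part of why the commenter calls header detection a very difficult problem.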