DataProfiler Alternatives

Similar projects and alternatives to DataProfiler

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a better DataProfiler alternative or higher similarity.

Reviews and mentions

Posts with mentions or reviews of DataProfiler. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-10-07.
  • Show HN: Graphsignal – Production Model Monitoring | 2021-10-07
    We built a very similar application internally with our open source library:

    Effectively, you can monitor changes between profiles:

    # Load a CSV file

  • Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON | 2021-08-24
    Not exactly the same, but we wrote a library that easily loads any type of delimited file and finds the header (even if it is not the first row). It also loads JSON, Parquet, and Avro files into a dataframe. Not exactly a CLI, but pretty easy:

    Anyway, Miller CLI looks pretty interesting.

  • Launch HN: Lightly (YC S21): Label only the data which improves your ML model | 2021-08-09
    Having built a model to identify sensitive data, having a solid data labeling solution would be awesome.

    In this space, Prodigy really dominates:

    We actually built our own internal system, which integrates and can export the labels (it does predictive labeling, etc.). Of course, we are only focused on text data at the moment.

  • DataProfiler - What's in your data? Extract schema, stats and entities | 2021-07-30
    We made a library called DataProfiler - designed to replace pandas-profiling.
  • DataProfiler – What's in your data? Extract schema, stats and entities
  • Launch HN: Exams, tasks, K8, eCommerce, cell sites, health, travel, data quality (YC S21) - Real-time data quality monitoring | 2021-07-23

    Looks interesting! I worked on

    We are looking to monitor correlation changes over time, see if sensitive data gets entered, track schema changes, etc., and see the impact on downstream modeling.

    I'm curious how heavy the input is, because these systems usually take a lot of effort to set up. Any idea?

  • Modeling Libraries Don’t Matter | 2021-07-22
    My team and I wrote an NLP application to detect sensitive data and detect / validate schemas, etc., as well as the other items provided by pandas-profiling.

    That being said, we noted the same thing. It shouldn't matter what modeling you use. The data pipelining is where 99% of the work typically is. Modeling itself always needs the same basic input, a matrix of data, and outputs a matrix of data.

    Some libraries are good at specific components. Others have improved speed-ups, etc. But it's all so new that it's effectively going to change month-to-month. So I always tell the team to build what you can as fast as you can, with the tools you have. We can always update it later, once the pipeline is in place.

  • PyWhat: Identify Anything | 2021-06-16
    We built a similar tool, utilizing a CNN. It works on structured (and unstructured) data and provides additional info.

    The cool part is that you can “extend” the internal named-entity recognition model by refitting it with new data.

    Out of the box, the DataProfiler detects something like 18 entities, including most of the PII data.

  • Ask HN: What is the best tool to infer data type of tabular data? | 2021-06-08
    [1] -
  • DataProfiler - What's in your data? Extract schema, statistics and entities from datasets
  • Show HN: DataProfiler – What's in your data? Extract schema, stats and entities | 2021-05-10
  • Show HN: DataProfiler – A Replacement for Pandas-Profiling | 2021-05-04
  • Show HN: DataProfiler – Sensitive Data Detection and Profiling | 2021-05-03
  • Show HN: DataProfiler – Extract schema, statistics and PPI / NPI detection | 2021-04-30
  • Show HN: The DataProfiler – What's in your data? | 2021-04-26
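
Several of the comments above describe monitoring changes between profiles (schema changes, type changes, newly appearing nulls). As a rough illustration of that idea only, here is a minimal standard-library Python sketch. It does not use or reproduce the DataProfiler API: `infer_type`, `profile_csv`, `diff_profiles`, and the three-level type ladder are hypothetical names and simplifications invented for this example.

```python
import csv
import io

# Hypothetical type ladder: a column takes the widest type seen in its values.
TYPE_ORDER = ["int", "float", "string"]


def infer_type(value):
    """Classify one non-empty cell as int, float, or string."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Return {column: {"type": ..., "nulls": ...}} for CSV text with a header row."""
    profile = {}
    for row in csv.DictReader(io.StringIO(text)):
        for name, value in row.items():
            if name is None:  # extra cells beyond the header; ignored in this sketch
                continue
            col = profile.setdefault(name, {"type": None, "nulls": 0})
            if value is None or value == "":
                col["nulls"] += 1
            else:
                seen = infer_type(value)
                if col["type"] is None or TYPE_ORDER.index(seen) > TYPE_ORDER.index(col["type"]):
                    col["type"] = seen
    return profile


def diff_profiles(old, new):
    """Report added columns, removed columns, and per-column field changes."""
    changed = {}
    for name in sorted(set(old) & set(new)):
        delta = {
            key: (old[name][key], new[name][key])
            for key in ("type", "nulls")
            if old[name][key] != new[name][key]
        }
        if delta:
            changed[name] = delta
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": changed,
    }


# Two snapshots of the "same" dataset: a column was renamed and `id` stopped being numeric.
old = profile_csv("id,amount,note\n1,2.5,a\n2,3.0,\n")
new = profile_csv("id,amount,email\n1,2.5,x@y.z\nabc,4,q@r.s\n")
print(diff_profiles(old, new))
```

A real profiler would track far more per column (distributions, correlations, detected entities); the point here is only that once each snapshot is reduced to a small dictionary, spotting drift becomes a plain dictionary comparison.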


Basic DataProfiler repo stats
8 days ago

capitalone/DataProfiler is an open source project licensed under Apache License 2.0, which is an OSI-approved license.