pyWhat VS DataProfiler

Compare pyWhat and DataProfiler and see how they differ.

pyWhat

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️ (by bee-san)

DataProfiler

What's in your data? Extract schema, statistics and entities from datasets (by capitalone)
                pyWhat          DataProfiler
Mentions        8               24
Stars           4,598           654
Growth          -               2.8%
Activity        9.5             9.4
Last commit     4 days ago      10 days ago
Language        Python          Python
License         MIT License     Apache License 2.0
The number of mentions is the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

pyWhat

Posts with mentions or reviews of pyWhat. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-06-22.
  • Is there an application or way to find hashes?
    reddit.com/r/HowToHack | 2021-06-24
    Do you mean something like this: https://github.com/bee-san/pyWhat
  • Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is
    reddit.com/r/OSINT | 2021-06-16
    reddit.com/r/Python | 2021-06-16
  • IT Pro Tuesday #155 - Carrier Lookup, Network Podcast, Identification Tool & More
    pyWhat enables you to easily identify emails, IP addresses and more. Feed it a .pcap file or some mysterious text or hex of a file, and it will tell you what it is. The tool is recursive, so it can identify everything in text, files and more. A shout out to the tool's author for sharing his creation.
  • PyWhat: Identify Anything
    news.ycombinator.com | 2021-06-16
  • pyWhat - the easiest way to identify anything
    news.ycombinator.com | 2021-05-31
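
The kind of identification these posts describe can be illustrated with a minimal sketch. It uses only the Python standard library and is not pyWhat's own code or API; the `identify` function, the labels, and the two regular expressions are assumptions made for illustration, and they cover far fewer cases than the real tool.

```python
import ipaddress
import re

# Hypothetical, deliberately simple patterns (not taken from pyWhat).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_CANDIDATE_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def identify(text):
    """Return a list of (label, match) pairs found in the text."""
    found = []
    for match in EMAIL_RE.findall(text):
        found.append(("Email Address", match))
    for candidate in IPV4_CANDIDATE_RE.findall(text):
        # The regex only finds dotted-number candidates; ipaddress
        # rejects out-of-range octets such as 999.
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        found.append(("IPv4 Address", candidate))
    return found


sample = "contact admin@example.com from 192.168.0.1 or 999.1.1.1"
result = identify(sample)
# result == [("Email Address", "admin@example.com"), ("IPv4 Address", "192.168.0.1")]
```

Validating each candidate with `ipaddress` after the regex match keeps the pattern short while still rejecting strings that only look like addresses.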

DataProfiler

Posts with mentions or reviews of DataProfiler. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-10-07.
  • Show HN: Graphsignal – Production Model Monitoring
    news.ycombinator.com | 2021-10-07
    We built a very similar application internally with our open source library: https://github.com/capitalone/dataprofiler

    Effectively, you can monitor changes between profiles:

    # Load a CSV file

  • Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON
    news.ycombinator.com | 2021-08-24
    Not exactly the same, but we wrote a library that easily loads any delimited file type and finds the header (even if it is not the first row). It also loads JSON, Parquet, and AVRO files into a dataframe. Not exactly a CLI, but pretty easy:

    https://github.com/capitalone/dataprofiler

    Anyway, pretty interesting Miller CLI

  • Launch HN: Lightly (YC S21): Label only the data which improves your ML model
    news.ycombinator.com | 2021-08-09
    Having built a model to identify sensitive data, having a solid data labeling solution would be awesome.

    https://github.com/capitalone/DataProfiler

    In this space, Prodigy really dominates:

    https://prodi.gy/

    We actually built our own internal system which integrates and can export the labels (does predictive labeling, etc). Of course, we only focused on text data at the moment.

  • DataProfiler - What's in your data? Extract schema, stats and entities
    reddit.com/r/Python | 2021-07-30
    We made a library called DataProfiler - designed to replace pandas-profiling.
  • DataProfiler – What's in your data? Extract schema, stats and entities
  • Launch HN: Exams, tasks, K8, eCommerce, cell sites, health, travel, data quality
    news.ycombinator.com | 2021-07-23
    Telm.ai (YC S21) - Real-time data quality monitoring

    Looks interesting! I worked on https://github.com/capitalone/DataProfiler

    We are looking to monitor correlation changes over time, see if sensitive data gets entered, track schema changes, etc and see the impact of downstream modeling, etc

    I'm curious how heavy the input is, because these systems usually take a lot of effort to set up. Any idea?

  • Modeling Libraries Don’t Matter
    news.ycombinator.com | 2021-07-22
    My team and I wrote an NLP application to detect sensitive data and detect / validate schemas, etc as well as the other items provided by pandas-profiling.

    https://github.com/capitalone/DataProfiler

    That being said, we noted the same thing. It shouldn't matter what modeling you use. It's the data pipelining where 99% of the work typically is. Modeling itself always needs the same basic input -- matrix of data and outputs a matrix of data.

    Some libraries are good at specific components. Others have improved speed-ups, etc. But it's all so new it's effectively going to change month-to-month. So I always tell the team to build what you can as fast as you can, with the tools you have. We can always update it later, once the pipeline is in place.

  • PyWhat: Identify Anything
    news.ycombinator.com | 2021-06-16
    We built a similar tool, utilizing a CNN. It works on structured (and unstructured) data and provides additional info.

    https://github.com/capitalone/DataProfiler

    The cool part is you can “extend” the internal named-entity recognition model by refitting with the new data.

    Out of the box, the DataProfiler does something like 18 entities, including most of the PII data.

  • Ask HN: What is the best tool to infer data type of tabular data?
    news.ycombinator.com | 2021-06-08
    [1] - https://github.com/capitalone/DataProfiler
  • DataProfiler - What's in your data? Extract schema, statistics and entities from datasets
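
Several of the posts above mention profiling a dataset and monitoring changes between profiles, such as schema changes. As a rough illustration of that idea, here is a minimal sketch using only the Python standard library. It is not DataProfiler's API: `profile_csv` and `diff_profiles` are hypothetical helpers, and the "profile" here is just an inferred type, a row count, and a mean for numeric columns.

```python
import csv
import io
import statistics


def profile_csv(text):
    """Build a tiny per-column profile from CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    if not rows:
        return profile
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        try:
            numbers = [float(value) for value in values]
        except ValueError:
            # At least one value is not numeric, so treat the column as text.
            profile[column] = {"type": "string", "count": len(values)}
        else:
            profile[column] = {
                "type": "float",
                "count": len(values),
                "mean": statistics.mean(numbers),
            }
    return profile


def diff_profiles(old, new):
    """Report added, removed, and retyped columns between two profiles."""
    changes = {}
    for column in sorted(set(old) | set(new)):
        if column not in old:
            changes[column] = "added"
        elif column not in new:
            changes[column] = "removed"
        elif old[column]["type"] != new[column]["type"]:
            changes[column] = f"type {old[column]['type']} -> {new[column]['type']}"
    return changes


old_profile = profile_csv("id,age\n1,30\n2,40\n")
new_profile = profile_csv("id,age,email\n1,n/a,a@b.c\n2,41,d@e.f\n")
changes = diff_profiles(old_profile, new_profile)
# changes == {"age": "type float -> string", "email": "added"}
```

Comparing two such profiles flags that `age` stopped being numeric and that an `email` column appeared, which is the kind of schema change the posts describe wanting to track.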

What are some alternatives?

When comparing pyWhat and DataProfiler you can also consider the following projects:

pandas-profiling - Create HTML profiling reports from pandas DataFrame objects

arkime - Arkime (formerly Moloch) is an open source, large scale, full packet capturing, indexing, and database system.

XlsxWriter - A Python module for creating Excel XLSX files.

BruteShark - Network Analysis Tool

chepy - Chepy is a python lib/cli equivalent of the awesome CyberChef tool.

ViperMonkey - A VBA parser and emulation engine to analyze malicious macros.

rawsec-cybersecurity-inventory - An inventory of tools and resources about CyberSecurity that aims to help people to find everything related to CyberSecurity.

usaddress - 🇺🇸 a python library for parsing unstructured United States address strings into address components

visions - Type System for Data Analysis in Python