On Data Quality

This page summarizes the projects mentioned and recommended in the original dev.to post.

  • Fake-News-Classification

    Exploration of Natural Language Processing techniques to create a prediction model using the LIAR dataset.

    For our capstone project at Flatiron School we had to not only pitch the project we wanted to do, but also find a dataset that would let us accomplish it. I chose to build a fake news classifier and pitched a dataset I had found on Kaggle. My instructor was quick to turn it down: it had no information on how the data was acquired or how it was labelled, and I had no way of verifying it. After more research I found the LIAR dataset, which contains thousands of data points labelled by human editors from politifact.com on a truthfulness scale, along with extensive metadata for each instance, making it verifiable. Once I settled on my final model, I trained a version of it on the rejected dataset out of curiosity. Its accuracy on that dataset's test data was far higher than that of the model trained on the LIAR dataset. Why? The model wasn't actually making better predictions; it was just better at reproducing that dataset's unverified labels. A minimal sketch of this train-and-evaluate comparison appears after the project list below.

  • liar_dataset

    The LIAR dataset: a benchmark dataset for fake news detection, labelled via politifact.com.

  • awesome-public-datasets

    A topic-centric list of high-quality open datasets.

    Covers sources such as: UCI Machine Learning Repository; Recommender Systems and Personalization Datasets; The Stanford Open Policing Project; Labor Force Statistics from the Current Population Survey; UNICEF Data; Climate Data; National Centers for Environmental Information; Google Cloud Healthcare API public datasets; WHO Data Collections; USA Census Bureau; US Government Open Data
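
To make the accuracy comparison in the Fake-News-Classification entry concrete, here is a minimal sketch of training a classifier on the LIAR train split and scoring it on the test split. It is not the author's original model: the TF-IDF plus logistic-regression pipeline, the file names train.tsv / test.tsv, and the column positions for the label and statement fields are assumptions about the published LIAR release, so adjust them to your copy of the data.

    # Minimal sketch (not the original capstone code): fit a simple text
    # classifier on the LIAR train split and report accuracy on the test split.
    # Assumes tab-separated train.tsv / test.tsv files with no header row,
    # where column 1 is the truthfulness label and column 2 is the statement.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import make_pipeline

    def load_liar_split(path):
        df = pd.read_csv(path, sep="\t", header=None)
        return df[2], df[1]  # (statements, labels) -- adjust indices if your copy differs

    X_train, y_train = load_liar_split("train.tsv")
    X_test, y_test = load_liar_split("test.tsv")

    # A deliberately plain baseline: the point is comparing datasets, not models.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print("LIAR test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Running the same two steps (fit on the train split, score on the test
    # split) against the unverified Kaggle dataset is what produced the
    # suspiciously high accuracy described above: the model was learning that
    # dataset's unvalidated labels rather than how to detect fake news.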

NOTE: The number of mentions on this list indicates mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
