hate-speech-and-offensive-language vs cia
Metric | hate-speech-and-offensive-language | cia
---|---|---
Mentions | 2 | 2
Stars | 779 | 3
Growth | - | -
Activity | 1.9 | 0.0
Last commit | over 1 year ago | over 2 years ago
Language | Jupyter Notebook | Jupyter Notebook
License | MIT License | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
hate-speech-and-offensive-language
- How to make a class column for a classifier from sentiment analysis results?
I've used NRCLex to perform sentiment analysis on some Twitter data. I have hate speech classifier code (https://github.com/t-davidson/hate-speech-and-offensive-language/blob/master/classifier/final_classifier.ipynb) that I want to pass the dataset through, but first I need a "class" column for the model. For those not familiar, NRCLex returns scores for 10 emotions: anticipation, joy, anger, fear, surprise, disgust, positive, negative, sadness and trust. The table looks like this (letters denoting emotions):
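One way to approach the question, sketched under stated assumptions: derive a "class" value per tweet by taking the highest-scoring emotion. The row layout, the example scores, and the helper names `dominant_emotion` and `add_class_column` are hypothetical and not taken from the post or from NRCLex. Note also that the linked classifier is trained on the labels 0 (hate speech), 1 (offensive language) and 2 (neither), which emotion scores do not supply, so this only illustrates the mechanical step of building a label column.

```python
# The ten emotions NRCLex reports, as listed in the post.
EMOTIONS = ["anticipation", "joy", "anger", "fear", "surprise",
            "disgust", "positive", "negative", "sadness", "trust"]


def dominant_emotion(scores):
    """Return the highest-scoring emotion, or "none" if every score is zero.

    Ties are broken by the order of EMOTIONS (an arbitrary choice).
    """
    best = max(EMOTIONS, key=lambda e: scores.get(e, 0.0))
    return best if scores.get(best, 0.0) > 0 else "none"


def add_class_column(rows):
    """Return copies of the rows with a "class" key holding the dominant emotion."""
    return [{**row, "class": dominant_emotion(row)} for row in rows]


# Made-up scores for two tweets, standing in for real NRCLex output.
rows = [
    {"tweet": "example tweet 1", "anger": 0.5, "disgust": 0.3, "negative": 0.2},
    {"tweet": "example tweet 2", "joy": 0.6, "positive": 0.4},
]
labelled = add_class_column(rows)
```

If the scores already sit in a pandas DataFrame with one column per emotion, `df[EMOTIONS].idxmax(axis=1)` should give a comparable result, assuming no row is entirely missing values.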
- Where do we go from here and who is going to step up to help us?
Some of this exists, and both Quora and Facebook (among others) use it extensively. Both hate speech and porn are good targets for machine learning. It needs supervision, but it can take a lot of load off human moderators.
Open source implementations exist, e.g.:
https://github.com/t-davidson/hate-speech-and-offensive-lang...
I suspect more message boards will want to start applying these sooner rather than later. Most have already figured out that they need anti-spam tools, rather than it coming as a surprise when they roll things out and it fills up with bots. The technology is similar.
You mention being able to share that information across boards, and I don't know of any widespread implementation of that. You can, at least, let somebody else handle your authentication, which slightly slows their ability to create new accounts when you blacklist one. I'd like to see those sites distinguish "aged" accounts, so that it at least takes some effort or cost to use a new account.
cia
- CIA Factbook - 250 countries & 66 Columns of Dataset & API
You can download and use this dataset for free from https://github.com/woosal1337/cia or from the Kaggle page; both will be kept up to date with the latest versions.
The individual files and folders are available separately, as is a single file in which all of the columns (over 66) are combined.
What are some alternatives?
hashformers - Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).
visuallayer - Simplify Your Visual Data Ops. Find and visualize issues with your computer vision datasets such as duplicates, anomalies, data leakage, mislabels and others.
Tegridy-MIDI-Dataset - Tegridy MIDI Dataset for precise and effective Music AI models creation.
covid19za - Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa
toxicity - The world's largest social media toxicity dataset.
shabby-pages - ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions appropriate for use in training models to reverse distortions and recover to original denoised documents.
PLOD-AbbreviationDetection - This repository contains the PLOD Dataset for Abbreviation Detection released with our LREC 2022 publication
openbrewerydb - 🍻 An open-source dataset of breweries, cideries, brewpubs, and bottleshops.
ThoughtSource - A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/
whylogs - An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
airline-sentiment-streaming - Streaming with Airline Sentiment. Utilizing Cloudera Machine Learning, Apache NiFi, Apache Hue, Apache Impala, Apache Kudu
DataProfiler - What's in your data? Extract schema, statistics and entities from datasets