data-centric-ai
DikeDataset
 | data-centric-ai | DikeDataset
---|---|---
Mentions | 1 | 2
Stars | 1,068 | 77
Growth | 1.5% | -
Activity | 0.0 | 0.0
Last commit | 5 months ago | 9 months ago
Language | TeX | TeX
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
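The definitions above can be illustrated with a minimal sketch. The site's exact formulas are not given here, so everything below is an assumption: growth is taken as the percentage change between two monthly star snapshots, commits are weighted with a hypothetical 30-day half-life, and the 0-10 activity score is taken as a percentile rank among tracked projects. All function names and sample numbers are hypothetical.

```python
from datetime import date


def month_over_month_growth(stars_prev: int, stars_now: int) -> float:
    """Percentage change in stars between two monthly snapshots (assumed formula)."""
    if stars_prev <= 0:
        raise ValueError("previous star count must be positive")
    return (stars_now - stars_prev) / stars_prev * 100.0


def weighted_commit_activity(commit_dates, today, half_life_days=30.0):
    """Sum of commit weights; a commit's weight halves every half_life_days.

    The exponential decay and the half-life value are assumptions chosen only
    to make recent commits count more than older ones.
    """
    return sum(0.5 ** ((today - d).days / half_life_days) for d in commit_dates)


def activity_score(raw, all_raw):
    """Map a raw activity value to a 0-10 scale by percentile rank (assumed).

    A score of 9.0 means 90% of tracked projects have a lower raw value,
    i.e. the project is in the top 10%.
    """
    below = sum(1 for r in all_raw if r < raw)
    return round(10.0 * below / len(all_raw), 1)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from either repository.
    print(month_over_month_growth(1000, 1015))
    today = date(2024, 1, 31)
    print(weighted_commit_activity([date(2024, 1, 31), date(2024, 1, 1)], today))
    print(activity_score(9, list(range(10))))
```

Under these assumptions, a project with no recent commits ends up with a raw value near zero, which is consistent with both repositories showing an activity of 0.0.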
data-centric-ai
[P] Rubrix: Open-source Python framework for NLP data annotation, exploration, and monitoring
In line with initiatives like Data-centric AI (https://https-deeplearning-ai.github.io/data-centric-comp/, https://github.com/HazyResearch/data-centric-ai), we firmly believe that iterating on datasets (finding label errors, dataset slicing, QA, etc.) will become more and more important, and tools for making this easier and involving different roles are needed.
DikeDataset
Dataset with labeled benign and malicious files
Hi, Reddit! While implementing the project for my bachelor's thesis [1], a piece of software (named dike, after the Greek goddess of justice) capable of analyzing malicious programs using artificial intelligence techniques, I was unable to locate an open source dataset with labeled malware samples in the public domain. As a result, I created DikeDataset, a dataset with labeled PE and OLE samples [2]. Because it was not the main focus of my thesis, the samples' attributes are not evenly distributed (the benign-malicious and OLE-PE ratios are quite low), but the dataset aided greatly in the research process.

This week, I was surprised to see that the public GitHub repository (which was used only for storage, without any promotion in communities like this) gained some organic reach (views, clones, and stars). Furthermore, I was thrilled to learn that it was used in a research article published in 2021 [3]! I'd therefore like to share this project in the hope that it will be useful to some members of the community.

[1] dike
[2] DikeDataset
[3] Toward Identifying APT Malware through API System Calls
What are some alternatives?
argilla - Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.
dike - Platform for automatic analysis of malicious applications using artificial intelligence algorithms ⚖️
pytorch-lightning - The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. [Moved to: https://github.com/PyTorchLightning/pytorch-lightning]
CPPE-Dataset - Code for our paper CPPE - 5 (Medical Personal Protective Equipment), a new challenging object detection dataset
prometheus-spec - Cryptoeconomically-safe trustless high-load computing on top of Bitcoin
public-apis - A collective list of free APIs
pytorch-lightning - Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). [Moved to: https://github.com/Lightning-AI/lightning]
theZoo - A repository of LIVE malwares for your own joy and pleasure. theZoo is a project created to make the possibility of malware analysis open and available to the public.
data-centric-AI - A curated, but incomplete, list of data-centric AI resources.
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) in Python
autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python