DikeDataset
data-centric-ai
Metric | DikeDataset | data-centric-ai |
---|---|---|
Mentions | 2 | 1 |
Stars | 78 | 1,070 |
Growth | - | 1.4% |
Activity | 0.0 | 0.0 |
Last commit | 10 months ago | 5 months ago |
Language | TeX | TeX |
License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
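The two derived metrics can be illustrated with a minimal sketch. It rests on assumptions, because the page does not give exact formulas: growth is taken to be the percentage change in stars between two consecutive months, and the activity score is taken to be a linear 0-10 mapping of a project's percentile rank (so that the top 10% starts at 9.0). The function names and the previous-month star count are hypothetical, and the weighting of recent commits is not modeled because it is not specified.

```python
def month_over_month_growth(stars_prev: int, stars_now: int) -> float:
    """Percentage change in stars from the previous month to the current one.

    Assumption: growth = (now - prev) / prev * 100. The site may compute it differently.
    """
    if stars_prev <= 0:
        raise ValueError("stars_prev must be positive")
    return (stars_now - stars_prev) / stars_prev * 100.0


def activity_score(percentile_rank: float) -> float:
    """Map a percentile rank in [0, 1] to a 0-10 score.

    Assumption: a linear scale, so a rank of 0.9 (top 10%) gives 9.0.
    """
    if not 0.0 <= percentile_rank <= 1.0:
        raise ValueError("percentile_rank must be between 0 and 1")
    return round(percentile_rank * 10.0, 1)


if __name__ == "__main__":
    # 1,055 is a hypothetical previous-month count chosen for illustration only.
    print(f"growth: {month_over_month_growth(1055, 1070):.1f}%")
    print(f"activity: {activity_score(0.9)}")
```

Under these assumptions, a project that is more active than 90% of tracked projects scores 9.0, which matches the example given above.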
DikeDataset
Dataset with labeled benign and malicious files
[2] DikeDataset
Hi, Reddit. While implementing the project for my bachelor's thesis [1], a piece of software (named dike, after the Greek goddess of justice) capable of analyzing malicious programs using artificial intelligence techniques, I was unable to locate an open-source dataset with labeled malware samples in the public domain. As a result, I created DikeDataset, a dataset with labeled PE and OLE samples [2]. Because it was not the main focus of my thesis, the sample attributes are not evenly distributed (the benign-malicious and OLE-PE ratios are quite low), but the dataset aided greatly in the research process.
This week, I was surprised to see that the public GitHub repository (which was used only for storage, without any promotion in communities like this one) gained some organic reach (views, clones and stars). Furthermore, I was thrilled to learn that it was used in a research article published in 2021 [3]! I'd therefore like to share this project with the community in the hope that some of its members will find it useful.
[1] dike
[2] DikeDataset
[3] Toward Identifying APT Malware through API System Calls
data-centric-ai
[P] Rubrix: Open-source Python framework for NLP data annotation, exploration, and monitoring
In line with initiatives like Data-centric AI (https://https-deeplearning-ai.github.io/data-centric-comp/, https://github.com/HazyResearch/data-centric-ai), we firmly believe that iterating on datasets (finding label errors, dataset slicing, QA, etc.) will become more and more important, and that tools which make this easier and involve different roles are needed.
What are some alternatives?
dike - Platform for automatic analysis of malicious applications using artificial intelligence algorithms ⚖️
argilla - Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.
CPPE-Dataset - Code for our paper CPPE - 5 (Medical Personal Protective Equipment), a new challenging object detection dataset
pytorch-lightning - The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. [Moved to: https://github.com/PyTorchLightning/pytorch-lightning]
public-apis - A collective list of free APIs
prometheus-spec - Cryptoeconomically-safe trustless high-load computing on top of Bitcoin
theZoo - A repository of LIVE malwares for your own joy and pleasure. theZoo is a project created to make the possibility of malware analysis open and available to the public.
data-centric-AI - A curated, but incomplete, list of data-centric AI resources.
pytorch-lightning - Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). [Moved to: https://github.com/Lightning-AI/lightning]
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) in Python
autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python