CKAN
Activeloop Hub
DISCONTINUED
Our great sponsors
CKAN | Activeloop Hub | |
---|---|---|
5 | 31 | |
3,823 | 4,807 | |
1.0% | - | |
9.8 | 9.9 | |
6 days ago | 8 months ago | |
Python | Python | |
GNU General Public License v3.0 or later | Mozilla Public License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
CKAN
-
Metadata Store - Which one to Choose ? OpenMetadata vs Datahub ?
We use Kubernetes as our deployment platform. Any feedback on one of these open source data catalogs ? - https://atlas.apache.org/#/ - https://opendatadiscovery.org/ - https://open-metadata.org/ - https://marquezproject.github.io/marquez/ - https://datahubproject.io/ - https://www.amundsen.io/ - https://ckan.org/ - https://magda.io/
-
How to start Data Science and Machine Learning Career?
Ckan
-
We are digitisers at the Natural History Museum in London, on a mission to digitise 80 million specimens and free their data to the world. Ask us anything!
We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.
Activeloop Hub
-
[D] NLP has HuggingFace, what does Computer Vision have?
u/Remote_Cancel_7977 we just launched 100+ computer vision datasets via Activeloop Hub yesterday on r/ML (#1 post for the day!). Note: we do not intend to compete with HuggingFace (we're building the database for AI). Accessing computer vision datasets via Hub is much faster than via HuggingFace though, according to some third-party benchmarks. :)
-
[N] [P] Access 100+ image, video & audio datasets in seconds with one line of code & stream them while training ML models with Activeloop Hub (more at docs.activeloop.ai, description & links in the comments below)
u/gopietz good question. htype="class_label" will work, but querying doesn't support multi-dimensional labels yet. Would you mind opening an issue requesting that feature?
We've recently added a Huggingface integration that allows ingestion of HuggingFace datasets.
-
[P] Database for AI: Visualize, version-control & explore image, video and audio datasets
Hub, our open-source package, lets you stream datasets while training to PyTorch/TensorFlow. Check out how we achieved 95% GPU utilization while training on ImageNet at 50% less cost. We're building the Database for AI, with everything it should contain. If there's an adjacent feature that would make it more useful for your workflow, do let us know!
Our early users love the tool and I hope you'll love it too. We have many more features other than visualization on the roadmap (the current feature list includes querying, version control UI, and integrates through our open-source package Hub (dataset format for AI) with TensorFlow, PyTorch, Sagemaker, other tools on the roadmap.
Please take a look at our open-source dataset format https://github.com/activeloopai/hub and a tutorial on htypes https://docs.activeloop.ai/how-hub-works/visualization-and-htype
The platform allows to: - Inspect the data with all its bounding boxes, masks, etc, and have important stats such as distribution of the labels (adding more stuff in the future to fight bias and improve data quality). - Query datasets to create new, highly specific ones - Version control datasets (while visualizing the changes). I'm confident that if you've ever worked on iteratively improving your models, dataset versioning is probably something you've done. - Stream computer vision datasets while training in PyTorch/Tensorflow via Hub, our open source package (we might add an even more straightforward way to the UI). - For larger organizations access management is important, and we do take care of that.
The visualization interfaces with our open-source dataset format for AI, enabling workflows such as querying/filtering to create datasets/inspect subsamples, tracking changes to the data with data version control visualization (e.g. cross-referencing if the transformations applied had intended effects), and will have integrations with other tools (e.g. experiment tracking, labelling) very soon.
Yes, we're not entirely relevant for your use case, especially if the data is not that big/complex, and benefits that you'd get from switching to Hub format are not as pronounced in case of text as they are in case of computer vision datasets (actually, we still have a couple of diehard NLP community members, but they have ridiculously big text datasets). I presume your university system doesn't use unstructured data like videos/images/audio, either, so our product wouldn't be very helpful in that regard. I do wish you tons of luck and patience though (>10ˆ6?! good Lord...)
-
The hand-picked selection of the best Python libraries released in 2021
Hub.
What are some alternatives?
dvc - 🦉 Data Version Control | Git for Data & Models | ML Experiments Management
ArchivesSpace - The ArchivesSpace archives management tool
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Archivematica - Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.
Access to Memory (AtoM) - Open-source, web application for archival description and public access.
Collective Access: Providence - Cataloguing and data/media management application
petastorm - Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
datasets - TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
CKAN-meta - Metadata files for the CKAN
kaggle-environments
TileDB - The Universal Storage Engine
kuwala - Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Point of Interests from Open Street Map c) Google Popular Times