Darwin Core: a standard for sharing information about biological diversity.
With regard to longevity: when planning our infrastructure and how we store our digital data, we have to think in the very long term (100+ years), much as we do when considering how to store the physical specimens. We currently manage our own data centre, which stores all our collection and image data, though we're exploring cloud options. For the data itself, we map to well-known standards and ontologies (such as Darwin Core: https://dwc.tdwg.org/) to ensure our data is interoperable and can be managed using community standards. On the Data Portal specifically, we use a versioning system to make sure that data remains available long term, even if it's been changed since it was first made public (this happens regularly, as taxonomists love to reclassify specimens!). This is particularly important when users cite our data using DOIs, which should be persistent and always resolvable.
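The mapping-to-standards step can be sketched as a simple field rename from an institution's internal schema onto Darwin Core terms. The internal field names below are hypothetical placeholders; the Darwin Core term names (`catalogNumber`, `scientificName`, `eventDate`, `locality`) are real terms from the standard.

```python
# Hypothetical internal record as it might sit in a collections database.
internal_record = {
    "reg_number": "NHMUK010101",
    "sci_name": "Panthera leo",
    "collected_on": "1898-03-14",
    "site": "Tsavo, Kenya",
}

# Map internal field names onto standard Darwin Core terms
# (https://dwc.tdwg.org/terms/) so the data is interoperable
# with other institutions' datasets.
DWC_MAPPING = {
    "reg_number": "catalogNumber",
    "sci_name": "scientificName",
    "collected_on": "eventDate",
    "site": "locality",
}

def to_darwin_core(record: dict) -> dict:
    """Rename fields to their Darwin Core equivalents, dropping
    any fields that have no mapping."""
    return {DWC_MAPPING[k]: v for k, v in record.items() if k in DWC_MAPPING}

dwc_record = to_darwin_core(internal_record)
```

Because every publisher renames onto the same shared vocabulary, downstream aggregators can merge records from many institutions without per-source glue code.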
CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data and data.humdata.org, among many other sites.
We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.
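CKAN exposes datasets through its action API, including a `datastore_search` endpoint for querying tabular resources. A minimal sketch of building such a query against the Data Portal's API base, assuming the placeholder `resource_id` below is replaced with the id of a real resource:

```python
from urllib.parse import urlencode

# CKAN's action API lives under /api/3/action/; datastore_search
# queries the rows of a single tabular resource.
BASE = "https://data.nhm.ac.uk/api/3/action/datastore_search"

def build_search_url(resource_id: str, query: str, limit: int = 5) -> str:
    """Build a CKAN datastore_search URL. Fetching it with any HTTP
    client returns JSON with the matching rows under result -> records."""
    params = {"resource_id": resource_id, "q": query, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

# The resource id here is a made-up placeholder, not a real Data Portal id.
url = build_search_url("00000000-0000-0000-0000-000000000000", "Panthera leo")
```

The same endpoint shape works against any CKAN instance, which is one of the practical benefits of building on a widely used open-source framework.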
Data Lake for Deep Learning. Build, manage, query, version, and visualize datasets. Stream data in real time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake] (by activeloopai)
A bit of a shameless plug and a question/offer. My team and I at https://github.com/activeloopai/Hub have created a way to make unstructured datasets of any size accessible from any machine at any scale, and to seamlessly stream data to machine learning frameworks like PyTorch and TF as if it were local. We've seen huge success making Waymo's dataset public, and we'll be sharing other major ones very soon. The main benefit is that users can work without the hassle of downloading the entire dataset (and it sounds like it would also help you capture information from specimen images and their labels).
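The streaming idea described above can be sketched generically: rather than downloading a whole dataset before training, fetch fixed-size chunks on demand and hand samples to the training loop as if the data were local. This is not Hub's actual API (which wraps the pattern behind calls roughly like `hub.load(...)`); `fetch_chunk` here is a stand-in for a real remote read, such as an HTTP range request against object storage.

```python
from typing import Iterator, List

def fetch_chunk(dataset_size: int, start: int, chunk_size: int) -> List[int]:
    """Stand-in for a remote read: returns the sample indices in
    [start, start + chunk_size), clipped to the dataset size."""
    return list(range(start, min(start + chunk_size, dataset_size)))

def stream_samples(dataset_size: int, chunk_size: int = 4) -> Iterator[int]:
    """Lazily yield samples chunk by chunk. Only one chunk is held
    at a time, no matter how large the dataset is."""
    for start in range(0, dataset_size, chunk_size):
        yield from fetch_chunk(dataset_size, start, chunk_size)

# A training loop can consume the stream like a local iterable,
# touching only as much data as it actually uses.
first_ten = [s for _, s in zip(range(10), stream_samples(dataset_size=1000))]
```

The payoff is that a user exploring a multi-terabyte image collection pays only for the samples they actually read.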
Activeloop hub VS TileDB-Py - a user suggested alternative
2 projects | 20 Oct 2021