We are digitisers at the Natural History Museum in London, on a mission to digitise 80 million specimens and free their data to the world. Ask us anything!

This page summarizes the projects mentioned and recommended in the original post on /r/datasets

  • dwc

    Darwin Core standard for sharing of information about biological diversity.

  • With regards to longevity, when we're planning our infrastructure and how we're actually going to store our digital data, we have to think in the long, long term (100+ years), much as we do when considering how to store the physical specimens. We currently manage our own data centre, which stores all our collections and image data, but we're exploring cloud options. In terms of how we store the actual data, we try to map to well-known standards and ontologies (such as Darwin Core - https://dwc.tdwg.org/) to ensure our data is interoperable with others and can be managed using community standards (see the mapping sketch after this list). On the Data Portal specifically, we use a versioning system to make sure that data is available long term, even if it's been changed since it was originally made public (this happens regularly, as taxonomists love to reclassify specimens!). This is particularly important when users cite our data using DOIs, which should be persistent and always available.

  • CKAN

    CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

  • We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki, it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data (see the CKAN query sketch after this list).

  • Activeloop Hub

    Discontinued: Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake] (by activeloopai)

  • A bit of a shameless plug and a question/offer. My team and I at https://github.com/activeloopai/Hub have created a way to make unstructured datasets of any size accessible from any machine at any scale, and to seamlessly stream data to machine learning frameworks like PyTorch and TF as if it were local. We've seen huge success publicizing Waymo's dataset, and we'll be sharing other major ones very soon. The main benefit is that users can work without the hassle of downloading the entire dataset (and it sounds like it would also help you capture information from specimen images and their labels); see the streaming sketch after this list.

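To make the Darwin Core mapping mentioned above concrete, here is a minimal sketch in Python. The internal field names and the specimen values are invented for illustration; only the Darwin Core terms (catalogNumber, scientificName, recordedBy, eventDate, decimalLatitude, decimalLongitude) come from the standard at https://dwc.tdwg.org/.

```python
# Minimal sketch: mapping a hypothetical internal specimen record onto
# Darwin Core terms. The internal field names below are illustrative,
# not the Museum's actual schema.
INTERNAL_TO_DWC = {
    "barcode": "catalogNumber",
    "accepted_name": "scientificName",
    "collector": "recordedBy",
    "collection_date": "eventDate",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
}

def to_darwin_core(record: dict) -> dict:
    """Return a Darwin Core-keyed copy of a record, dropping unmapped fields."""
    return {
        dwc_term: record[field]
        for field, dwc_term in INTERNAL_TO_DWC.items()
        if field in record
    }

# Example specimen (values are made up for the sketch).
specimen = {
    "barcode": "NHMUK000000001",
    "accepted_name": "Carcharodon carcharias",
    "collector": "A. Collector",
    "collection_date": "1905-07-21",
    "latitude": -34.35,
    "longitude": 18.47,
}
print(to_darwin_core(specimen))
```

Because every publisher keys its records with the same terms, records mapped this way can be aggregated and compared across institutions, which is the interoperability benefit described above.
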
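The Data Portal runs on CKAN, so its datasets can be queried programmatically. Below is a minimal sketch using CKAN's standard datastore_search action, assuming the Portal exposes the default action API; the resource_id and search term are placeholders, and the real identifier for a dataset is shown on its page at data.nhm.ac.uk.

```python
# Minimal sketch: querying a CKAN-backed dataset via the standard
# datastore_search action. resource_id is a placeholder, not a real
# Data Portal identifier.
import requests

API = "https://data.nhm.ac.uk/api/3/action/datastore_search"
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

response = requests.get(
    API,
    params={"resource_id": RESOURCE_ID, "q": "Carcharodon", "limit": 5},
    timeout=30,
)
response.raise_for_status()
result = response.json()["result"]  # CKAN wraps the payload under "result"

for record in result["records"]:
    # Field names depend on the dataset; Darwin Core terms are typical here.
    print(record.get("scientificName"), record.get("catalogNumber"))
```
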
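For the streaming workflow described in the Activeloop comment, here is a minimal sketch assuming the deeplake package (the successor to the Hub repository linked above) and its 3.x API. The dataset path points at Activeloop's public MNIST copy and the tensor names ("images", "labels") follow that example; nothing here is NHM-specific.

```python
# Minimal sketch: streaming a remote dataset into PyTorch without a full
# download, using deeplake (formerly Hub). Dataset path and tensor names
# are illustrative.
import deeplake

ds = deeplake.load("hub://activeloop/mnist-train")  # opens lazily, no bulk download

# Wrap the remote dataset as a PyTorch DataLoader; samples are fetched
# and decoded on demand as the loader iterates.
dataloader = ds.pytorch(batch_size=32, shuffle=True, num_workers=2)

for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    print(images.shape, labels.shape)
    break  # one batch is enough for the sketch
```
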
NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives; a higher number means a more popular project.
