Darwin Core: a standard for sharing information about biological diversity.
With regard to longevity: when planning our infrastructure and how we store our digital data, we have to think in the very long term (100+ years), much as we do when considering how to store the physical specimens. We currently manage our own data centre, which stores all our collection and image data, though we're exploring cloud options. For the data itself, we map to well-known standards and ontologies (such as Darwin Core: https://dwc.tdwg.org/) to ensure our data is interoperable and can be managed using community standards. On the Data Portal specifically, we use a versioning system to make sure that data remains available long term, even if it's been changed since it was first made public (this happens regularly, as taxonomists love to reclassify specimens!). This is particularly important when users cite our data using DOIs, which should be persistent and always resolvable.
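The mapping-to-standards step can be sketched as a simple field rename from an institution's internal schema onto Darwin Core terms. The internal field names below are hypothetical placeholders; the Darwin Core term names (`catalogNumber`, `scientificName`, `eventDate`, `locality`) are real terms from the standard.

```python
# Hypothetical internal record as it might sit in a collections database.
internal_record = {
    "reg_number": "NHMUK010101",
    "sci_name": "Panthera leo",
    "collected_on": "1898-03-14",
    "site": "Tsavo, Kenya",
}

# Map internal field names onto standard Darwin Core terms
# (https://dwc.tdwg.org/terms/) so the data is interoperable
# with other institutions' datasets.
DWC_MAPPING = {
    "reg_number": "catalogNumber",
    "sci_name": "scientificName",
    "collected_on": "eventDate",
    "site": "locality",
}

def to_darwin_core(record: dict) -> dict:
    """Rename fields to their Darwin Core equivalents, dropping
    any fields that have no mapping."""
    return {DWC_MAPPING[k]: v for k, v in record.items() if k in DWC_MAPPING}

dwc_record = to_darwin_core(internal_record)
```

Because every publisher renames onto the same shared vocabulary, downstream aggregators can merge records from many institutions without per-source glue code.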
CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data and data.humdata.org, among many other sites.
We publish all our data on the [Data Portal](https://data.nhm.ac.uk), a Museum project that's been running since 2014. Instead of MediaWiki it runs on an open-source Python framework called [CKAN](https://ckan.org), which is designed for hosting datasets - though we've had to adapt it in various ways so that it can handle such large amounts of data.
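CKAN exposes datasets through its action API, including a `datastore_search` endpoint for querying tabular resources. A minimal sketch of building such a query against the Data Portal's API base, assuming the placeholder `resource_id` below is replaced with the id of a real resource:

```python
from urllib.parse import urlencode

# CKAN's action API lives under /api/3/action/; datastore_search
# queries the rows of a single tabular resource.
BASE = "https://data.nhm.ac.uk/api/3/action/datastore_search"

def build_search_url(resource_id: str, query: str, limit: int = 5) -> str:
    """Build a CKAN datastore_search URL. Fetching it with any HTTP
    client returns JSON with the matching rows under result -> records."""
    params = {"resource_id": resource_id, "q": query, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

# The resource id here is a made-up placeholder, not a real Data Portal id.
url = build_search_url("00000000-0000-0000-0000-000000000000", "Panthera leo")
```

The same endpoint shape works against any CKAN instance, which is one of the practical benefits of building on a widely used open-source framework.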
Data Lake for Deep Learning. Build, manage, query, version, and visualize datasets. Stream data in real time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake] (by activeloopai)
A bit of a shameless plug and a question/offer. My team and I at https://github.com/activeloopai/Hub have created a way to make unstructured datasets of any size accessible from any machine at any scale, and to seamlessly stream data to machine learning frameworks like PyTorch and TF as if it were local. We've seen huge success making Waymo's dataset public, and we'll be sharing other major ones very soon. The main benefit is that users can work without the hassle of downloading the entire dataset (and it sounds like it would also help you capture information from specimen images and their labels).
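The streaming idea described above can be sketched generically: rather than downloading a whole dataset before training, fetch fixed-size chunks on demand and hand samples to the training loop as if the data were local. This is not Hub's actual API (which wraps the pattern behind calls roughly like `hub.load(...)`); `fetch_chunk` here is a stand-in for a real remote read, such as an HTTP range request against object storage.

```python
from typing import Iterator, List

def fetch_chunk(dataset_size: int, start: int, chunk_size: int) -> List[int]:
    """Stand-in for a remote read: returns the sample indices in
    [start, start + chunk_size), clipped to the dataset size."""
    return list(range(start, min(start + chunk_size, dataset_size)))

def stream_samples(dataset_size: int, chunk_size: int = 4) -> Iterator[int]:
    """Lazily yield samples chunk by chunk. Only one chunk is held
    at a time, no matter how large the dataset is."""
    for start in range(0, dataset_size, chunk_size):
        yield from fetch_chunk(dataset_size, start, chunk_size)

# A training loop can consume the stream like a local iterable,
# touching only as much data as it actually uses.
first_ten = [s for _, s in zip(range(10), stream_samples(dataset_size=1000))]
```

The payoff is that a user exploring a multi-terabyte image collection pays only for the samples they actually read.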
Activeloop hub VS TileDB-Py - a user suggested alternative
2 projects | 20 Oct 2021