LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • PD-Diffusion

  • The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.

    For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.

    If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating

    [0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...

    [1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.

    [2] https://github.com/kmeisthax/PD-Diffusion

    [3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...
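
The scraping step described in the comment above could start from something like the following sketch. It is not taken from the PD-Diffusion repository. It assumes the standard MediaWiki Action API on Wikimedia Commons (`generator=categorymembers` combined with `prop=imageinfo`), reads the file-level `LicenseShortName` field, and uses a hypothetical allow-list pattern for license names. The exact category title must be supplied by the caller, and no network request is made here.

```python
import re
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category, limit=50, cont=None):
    """Build an API URL that lists the files in one Commons category.

    `cont` is the "continue" object from a previous response, if any;
    its keys are merged into the query to fetch the next page.
    """
    params = {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|extmetadata",
    }
    params.update(cont or {})
    return COMMONS_API + "?" + urlencode(params)


def extract_files(response):
    """Yield (title, url, license_name) from one decoded JSON response."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        infos = page.get("imageinfo") or []
        if not infos:
            continue  # page carries no file information; skip it
        info = infos[0]
        meta = info.get("extmetadata", {})
        license_name = meta.get("LicenseShortName", {}).get("value", "")
        yield page.get("title", ""), info.get("url", ""), license_name


# Hypothetical allow-list: public domain, CC0, CC BY and CC BY-SA only.
ALLOWED_LICENSE = re.compile(
    r"(public domain|cc0|cc by(-sa)?( \d+(\.\d+)?)?)", re.IGNORECASE
)


def is_allowed(license_name):
    """Return True if the license short name matches the allow-list."""
    return ALLOWED_LICENSE.fullmatch(license_name.strip()) is not None
```

A caller would fetch the URL with an HTTP client of its choice, decode the JSON, pass it to `extract_files`, and keep requesting pages for as long as the response contains a `continue` object. License strings on Commons vary, so the allow-list pattern would need checking against real data before being relied on.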

  • fastdup

    fastdup is a free tool for rapidly extracting insights from image and video datasets. It helps you improve the quality of your dataset's images and labels and reduce data operations costs at scale.

  • The creators of fastdup, a data quality tool for computer vision, continue to improve their free release: https://github.com/visual-layer/fastdup

    Here's a short video of some recent results for LAION 400M https://www.youtube.com/watch?v=dlRCm29Upu4
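
For readers unfamiliar with this kind of data quality check, here is a minimal sketch of near-duplicate detection. It is not fastdup's API or implementation. It assumes each image has already been reduced to a feature vector by some embedding model, and the function names and the 0.95 threshold are illustrative choices only.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(embeddings, threshold=0.95):
    """Return (name_a, name_b, similarity) for every pair at or above threshold.

    `embeddings` maps an image name to its feature vector. All pairs are
    compared, so the cost grows quadratically with the number of images.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity >= threshold:
            pairs.append((name_a, name_b, similarity))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: -pair[2])


if __name__ == "__main__":
    toy = {"a.jpg": [1.0, 0.0], "b.jpg": [0.99, 0.05], "c.jpg": [0.0, 1.0]}
    for name_a, name_b, similarity in near_duplicates(toy):
        print(name_a, name_b, round(similarity, 3))
```

The all-pairs loop is only practical for small collections. At the scale of LAION 400M, an approximate nearest-neighbour index would be needed instead.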


Related posts

  • Visualize your dataset using DINOv2 embedding

    1 project | news.ycombinator.com | 2 May 2023
  • Visualize your dataset using DINOv2 embedding

    2 projects | /r/computervision | 1 May 2023
  • [R][P] How to extract feature vectors of large datasets using DINOv2 on CPU

    1 project | /r/MachineLearning | 26 Apr 2023
  • Computer Vision pre-trained model for finding how similar two photos of a room are

    2 projects | /r/computervision | 23 Mar 2023
  • Find image duplicates and outliers – A free, scalable, efficient tool

    1 project | /r/computervision | 21 Mar 2023