Laion-5B: A New Era of Open Large-Scale Multi-Modal Datasets

PD-Diffusion

2 1 6.5 Python

The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.
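As a rough illustration of the scraping step, here is a minimal sketch that lists file titles in a Commons category through the MediaWiki Action API (`action=query`, `list=categorymembers`). It is not taken from PD-Diffusion. The category string is copied from the comment and may not match the real category title on Commons, and the `user_agent` value is a placeholder you would supply. The sketch only enumerates direct file members; it does not recurse into subcategories, download images, or check licenses.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_query_url(category, cmcontinue=None, limit=500):
    """Build a `list=categorymembers` URL for the files in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # assumed title, e.g. "Category:PD-Art-old-100"
        "cmtype": "file",      # skip pages and subcategories
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return API_URL + "?" + urlencode(params)


def parse_page(response):
    """Return (file titles, continuation token or None) from one response."""
    members = response.get("query", {}).get("categorymembers", [])
    titles = [member["title"] for member in members]
    token = response.get("continue", {}).get("cmcontinue")
    return titles, token


def iter_category_files(category, user_agent):
    """Yield every file title in the category, following continuation tokens."""
    token = None
    while True:
        request = Request(
            build_query_url(category, token),
            headers={"User-Agent": user_agent},
        )
        with urlopen(request) as resp:
            titles, token = parse_page(json.load(resp))
        yield from titles
        if token is None:
            break
```

Keeping URL construction and response parsing separate from the network loop makes both easy to check against a saved response before running a long crawl.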
For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.
If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating
[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...
[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.
[2] https://github.com/kmeisthax/PD-Diffusion
[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...
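The comment above breaks off before describing how attribution would work, so the following is only one plausible reading, not the author's design: rank training images by cosine similarity between their embeddings and the embedding of a generated image. The sketch assumes the embeddings have already been computed by some image encoder such as CLIP; the `attribute` helper and the toy vectors in the usage note are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(generated, training, top_k=3):
    """Return the top_k (name, similarity) pairs, most similar first.

    `generated` is the embedding of the generated image; `training` maps a
    training image name to its embedding. Both are assumed to come from the
    same encoder.
    """
    scored = [
        (name, cosine_similarity(generated, embedding))
        for name, embedding in training.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For example, `attribute([2.0, 0.1], {"a.jpg": [1.0, 0.0], "b.jpg": [0.0, 1.0]}, top_k=1)` ranks `a.jpg` first. A linear scan like this is only practical for small training sets; a larger one would need an approximate nearest-neighbor index.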

fastdup

18 1,403 9.4 Python

fastdup is a free tool designed to rapidly extract insights from your image and video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at scale.

Creators of the data quality tool for computer vision, fastdup, continue to improve on their free release https://github.com/visual-layer/fastdup
Here's a short video of some recent results for LAION 400M https://www.youtube.com/watch?v=dlRCm29Upu4

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Visualize your dataset using DINOv2 embedding

1 project | news.ycombinator.com | 2 May 2023
Visualize your dataset using DINOv2 embedding

2 projects | /r/computervision | 1 May 2023
[R][P] How to extract feature vectors of large datasets using DINOv2 on CPU

1 project | /r/MachineLearning | 26 Apr 2023
Computer Vision pre-trained model for finding how similar two photos of a room are

2 projects | /r/computervision | 23 Mar 2023
Find image duplicates and outliers – A free, scalable, efficient tool

1 project | /r/computervision | 21 Mar 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
data-curation Dataset Deep Learning image-duplicate-detection Machine Learning
Post date: 12 Dec 2022
