Is supervised learning dead for computer vision?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Segment-Everything-Everywhere-All-At-Once

    [NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"

  • Foundation models are generally trained on internet-scale data. They have seen billions of images, so they will have seen some medical images, for example ones pulled from public datasets or textbooks. Still, they may not be specialized to your use case. You can fine-tune the model on a handful of examples to tailor it to what you need; having a foundation model does not rule out training, and your data is still valuable. In fact, fine-tuning the larger model can give better performance than using your training data alone to train a model from scratch.

    Also, for the medical domain, I think vision-text segmentation models like SEEM (https://github.com/UX-Decoder/Segment-Everything-Everywhere-...) are really cool. You could, for example, ask “Where is the tumor located in this image?” and have the tumor highlighted in the picture.
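
    Not SEEM itself, but here is a minimal sketch of the text-prompted-segmentation idea using CLIPSeg from Hugging Face transformers as a stand-in (the checkpoint name and call signatures are my assumptions; SEEM's own interface differs):

      # Rough sketch: highlight a region of an image from a text prompt (CLIPSeg standing in for SEEM).
      import torch
      from PIL import Image
      from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

      processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
      model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

      image = Image.open("scan.png").convert("RGB")            # hypothetical medical image
      inputs = processor(text=["a tumor"], images=[image], return_tensors="pt")

      with torch.no_grad():
          logits = model(**inputs).logits                      # low-resolution mask logits

      mask = torch.sigmoid(logits) > 0.5                       # binary mask you can overlay on the image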

  • LLaVA

    [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

  • Hey Everyone,

    I’ve been diving deep into the world of computer vision recently, and I’ve gotta say, things are getting pretty exciting! I stumbled upon this vision-language model called LLaVA (https://github.com/haotian-liu/LLaVA), and it’s been nothing short of impressive.

    In the past, if you wanted to teach a model to recognize the color of your car in an image, you’d have to go through the tedious process of training it from scratch. But now, with models like LLaVA, all you need to do is prompt it with a question like “What’s the color of the car?” and bam – you get your answer, zero-shot style (there’s a rough code sketch of this a few paragraphs down).

    It’s kind of like what we’ve seen in the NLP world. People aren’t training language models from the ground up anymore; they’re taking pre-trained models and fine-tuning them for their specific needs. And it looks like we’re headed in the same direction with computer vision.

    Imagine being able to extract insights from images with just a simple text prompt. Need to step it up a notch? A bit of fine-tuning can do wonders, and from my experiments, it can even outperform models trained from scratch. It’s like getting the best of both worlds!

    But here’s the real kicker: these foundation models, thanks to their extensive training on massive datasets, have an incredible grasp of image representations. This means you can fine-tune them with just a handful of examples, saving you the trouble of collecting thousands of images. Indeed, they can sometimes learn from a single example (https://www.fast.ai/posts/2023-09-04-learning-jumps).
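
    To make the zero-shot prompting concrete, here is a rough sketch using a llava-hf checkpoint through Hugging Face transformers (the checkpoint name, prompt template, and call signatures are my assumptions; the LLaVA repo also ships its own CLI and demo server):

      # Rough sketch: ask a vision-language model about an image, zero-shot.
      import torch
      from PIL import Image
      from transformers import AutoProcessor, LlavaForConditionalGeneration

      model_id = "llava-hf/llava-1.5-7b-hf"
      processor = AutoProcessor.from_pretrained(model_id)
      model = LlavaForConditionalGeneration.from_pretrained(
          model_id, torch_dtype=torch.float16, device_map="auto"
      )

      prompt = "USER: <image>\nWhat is the color of the car? ASSISTANT:"
      inputs = processor(text=prompt, images=Image.open("car.jpg"), return_tensors="pt").to(
          model.device, torch.float16
      )

      output = model.generate(**inputs, max_new_tokens=20)
      print(processor.decode(output[0], skip_special_tokens=True))  # e.g. "... ASSISTANT: The car is red."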

  • datasaurus

    Do computer vision with 1000x less data (by datasaurus-ai)

  • And let’s talk about development speed. By using text prompts to interact with your images, you can whip up a computer vision prototype in seconds. It’s fast, it’s efficient, and it’s changing the game.

    So, what do you all think? Are we moving towards a future where foundational models take the lead in computer vision, or is there still a place for training models from scratch?

    P.S. Shameless plug: I’ve been working on an open-source platform called Datasaurus (https://github.com/datasaurus-ai/datasaurus) that taps into the power of vision-language models. It’s all about helping engineers get the insights they need from images, fast. Just wanted to share some thoughts and start a conversation. Let’s talk about the future of computer vision!

  • guidance

    A guidance language for controlling large language models.

  • Thanks for your comment.

    I did not know about "Betteridge's law of headlines", quite interesting. Thanks for sharing :)

    You raise some interesting points.

    1) Safety: It is true that LVMs and LLMs have unknown biases and could potentially create unsafe content. However, this is not unique to them; for example, Google had the same problem with a supervised learning model (https://www.theverge.com/2018/1/12/16882408/google-racist-go...). It all depends on the original data. I believe we need systems on top of our models to ensure safety. It is also possible to restrict the output domain of our models (https://github.com/guidance-ai/guidance): instead of allowing our LVMs to output any words, we could restrict them to answering only "red, green, blue..." when giving the color of a car (see the short constrained-output sketch at the end of this comment).

    2) Cost: You are right that, right now, LVMs are quite expensive to run. As you said, they are a great way to go to market faster, but they cannot run on low-cost hardware for the moment. However, they can help with training those smaller models. Indeed, we see in the NLP domain that a lot of smaller models are trained on data created with GPT models. You can still distill the knowledge of your LVM into a custom smaller model that runs on embedded devices. The advantage is that you can use the LVM to generate data when it is scarce and fall back to it when your smaller model is uncertain of the answer.

    3) Labeling data: I don't think labeling data is necessarily cheap. First, you have to collect the data, which, depending on the frequency of your events, could take months of monitoring if you want to build a large-scale dataset. Second, not all labeling is cheap: I worked at a semiconductor company where labeled data was scarce because it required expert knowledge and could only be produced by experienced employees. Not all labeling can be outsourced.

    However, both approaches are complementary, and I think the systems that work best will rely on both.

    Thanks again for the thought-provoking discussion. I hope this answers some of the concerns you raised.
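
    On the guidance point in 1), here is a minimal sketch of constraining the answer to a closed set of colors (the model name is a placeholder, and guidance's API has changed between versions, so treat this as illustrative rather than exact):

      # Rough sketch: constrain a language model's output to a fixed label set with guidance.
      from guidance import models, select

      lm = models.Transformers("meta-llama/Llama-2-7b-hf")   # placeholder: any local causal LM

      lm += "The color of the car in the image is: "
      lm += select(["red", "green", "blue", "black", "white"], name="color")

      print(lm["color"])   # guaranteed to be one of the allowed answers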

  • LoRA

    Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"

  • Yes, your understanding is correct. However, instead of adding a head on top of the network, most fine-tuning is currently done with LoRA (https://github.com/microsoft/LoRA). This injects low-rank matrices between different layers of your model; those matrices are then trained on your data while the rest of the model's weights stay frozen.
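
    For example, a minimal sketch with loralib (layer sizes and rank are made up; many people use higher-level wrappers such as PEFT that do the same thing automatically):

      # Rough sketch: add LoRA to one layer and train only the low-rank matrices.
      import torch.nn as nn
      import loralib as lora

      class Classifier(nn.Module):
          def __init__(self):
              super().__init__()
              self.backbone = nn.Linear(768, 768)       # stands in for the frozen pretrained layers
              self.head = lora.Linear(768, 10, r=8)     # LoRA-augmented layer, rank 8

          def forward(self, x):
              return self.head(self.backbone(x))

      model = Classifier()
      lora.mark_only_lora_as_trainable(model)           # freeze everything except the LoRA matrices

      # ... regular training loop goes here ...

      checkpoint = lora.lora_state_dict(model)          # only the small LoRA weights need to be saved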

  • autodistill

    Images to inference with no labeling (use foundation models to train supervised models).

  • The places in which a vision model is deployed are different from those in which a language model is deployed.

    A vision model may be deployed on cameras without an internet connection, with data retrieved later; it may run on camera streams in a factory; it may process sports broadcasts where low latency matters. In many cases, real-time, or close to real-time, performance is needed.

    Fine-tuned models can deliver the requisite performance for vision tasks with relatively low computational power compared to the LLM equivalent. The weights are small relative to LLM weights.

    LLMs are often deployed via API. This is practical for some vision applications (e.g. bulk processing), but for many use cases, not being able to run on the edge is a dealbreaker.

    Foundation models certainly have a place.

    CLIP, for example, is fast and can be used for tasks like video classification. Where I see opportunity right now is in using foundation models to train fine-tuned models: the foundation model acts as an automatic labeling tool, giving you a dataset on which to train a small supervised model. (Disclosure: I co-maintain a Python package that lets you do this, Autodistill: https://github.com/autodistill/autodistill. There is a short sketch at the end of this comment.)

    SAM (segmentation), CLIP (embeddings, classification), Grounding DINO (zero-shot object detection) in particular have a myriad of use cases, one of which is automated labeling.

    I'm looking forward to seeing foundation models improve, and to all the opportunities that will bring!
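
    Here is a short sketch of that auto-labeling flow with Autodistill (the ontology, folder paths, and plugin names are illustrative; check the repo for the exact packages):

      # Rough sketch: a foundation model (Grounded SAM) auto-labels images,
      # then a small supervised detector (YOLOv8) is trained on the result.
      from autodistill.detection import CaptionOntology
      from autodistill_grounded_sam import GroundedSAM
      from autodistill_yolov8 import YOLOv8

      # Map text prompts (what the foundation model sees) to class names (what the dataset uses).
      ontology = CaptionOntology({"shipping container": "container"})

      base_model = GroundedSAM(ontology=ontology)
      base_model.label(input_folder="./images", output_folder="./dataset")   # auto-labeled dataset

      target_model = YOLOv8("yolov8n.pt")
      target_model.train("./dataset/data.yaml", epochs=50)                   # small model for the edge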
