Ask HN: How does deploying a fine-tuned model work

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • dstack

    An open-source container orchestration engine for running AI workloads in any cloud or data center. https://discord.gg/u8SmfwPpMd

  • You can use https://github.com/dstackai/dstack to deploy your model to the most affordable GPU clouds. It supports auto-scaling and other features.

    Disclaimer: I'm the creator of dstack.
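
    As a rough sketch of the client side once such a deployment is live (assuming the served container exposes an OpenAI-compatible endpoint, as servers like vLLM commonly do; the URL, key, and model name below are placeholders):

      # Hypothetical client call against a fine-tuned model served on a GPU cloud.
      # Assumes an OpenAI-compatible endpoint; URL, key, and model name are placeholders.
      from openai import OpenAI

      client = OpenAI(
          base_url="https://your-deployment.example.com/v1",  # placeholder endpoint
          api_key="YOUR_TOKEN",                                # placeholder credential
      )

      response = client.chat.completions.create(
          model="my-finetuned-model",  # placeholder model name
          messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
      )
      print(response.choices[0].message.content)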

  • OpenPipe

    Turn expensive prompts into cheap fine-tuned models

  • - Fireworks: $0.20

    If you're looking for an end-to-end flow that will help you gather the training data, validate it, run the fine-tune, and then define evaluations, you could also check out my company, OpenPipe (https://openpipe.ai/). In addition to hosting your model, we help you organize your training data, relabel it if necessary, define evaluations on the finished fine-tune, and monitor its performance in production. Our inference prices are higher than the above providers, but once you're happy with your model you can always export your weights and host them on one of the above!
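
    For context, "organizing your training data" for a chat model usually means collecting examples in a JSONL chat format along these lines (a generic illustration of the common format, not any specific provider's schema; the file name and examples are placeholders):

      # Generic illustration of chat-style fine-tuning data in JSONL.
      # Not any specific provider's schema; file name and examples are placeholders.
      import json

      examples = [
          {
              "messages": [
                  {"role": "system", "content": "You are a support ticket classifier."},
                  {"role": "user", "content": "My invoice is wrong."},
                  {"role": "assistant", "content": "billing"},
              ]
          },
      ]

      with open("train.jsonl", "w") as f:
          for ex in examples:
              f.write(json.dumps(ex) + "\n")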

  • next-token-prediction

    Next-token prediction in JavaScript: build fast language and diffusion models.

  • GPU vs CPU:

    It's faster to use a GPU. Think of playing a game on a laptop with onboard graphics versus a good dedicated graphics card: it might technically work either way, but a good GPU gives you more processing power and VRAM, which makes for a much faster experience.

    When is GPU needed:

    You need it both for initial training (which it sounds like you've done) and for when someone prompts the LLM and it processes their query (called inference). So to answer your question: the web server that handles incoming LLM queries also needs a good GPU, because with any amount of user activity it will effectively be running 24/7, as users continually prompt it the same way they'd use any other site you have online.
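
    To make that concrete, here's a minimal sketch of what such a serving process can look like (assuming a Hugging Face-format fine-tuned checkpoint; the path and parameters are placeholders). Every request runs a generation pass on the GPU:

      # Minimal sketch of a GPU-backed inference endpoint (placeholder model path).
      # Each incoming request triggers a generation pass on the GPU.
      import torch
      from fastapi import FastAPI
      from pydantic import BaseModel
      from transformers import AutoModelForCausalLM, AutoTokenizer

      MODEL_PATH = "./my-finetuned-model"  # placeholder path to your checkpoint
      tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
      model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to("cuda")

      app = FastAPI()

      class Prompt(BaseModel):
          text: str

      @app.post("/generate")
      def generate(prompt: Prompt):
          inputs = tokenizer(prompt.text, return_tensors="pt").to("cuda")
          output = model.generate(**inputs, max_new_tokens=128)
          return {"completion": tokenizer.decode(output[0], skip_special_tokens=True)}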

    When is GPU not needed:

    Computationally, inference is just "next token prediction", but depending on how the user enters their prompt, the model can sometimes serve those predictions (called completions) from pre-computed embeddings - in other words, by performing a simple lookup - without invoking the GPU. For example, in this autocompletion/token-prediction library I wrote that uses an n-gram language model (https://github.com/bennyschmidt/next-token-prediction), the GPU is only needed for initial training on text data; there's no heavy inference step, so completions are fast, don't invoke the GPU, and are effectively lookups. A language model like this can be trained offline and deployed cheaply, with no cloud GPU needed. You'll notice LLMs sometimes behave this way too, especially on follow-up prompts once they already have the needed embeddings from the initial prompt - for some responses, an LLM is fast like this.
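
    To illustrate the lookup idea in a few lines (a generic Python sketch of an n-gram completion table, not the API of the JavaScript library linked above):

      # Generic sketch of n-gram "completion as lookup": train once, then
      # predictions are dictionary lookups with no GPU involved.
      from collections import Counter, defaultdict

      def train_bigrams(text):
          counts = defaultdict(Counter)
          words = text.lower().split()
          for prev, nxt in zip(words, words[1:]):
              counts[prev][nxt] += 1
          return counts

      def complete(counts, word, n=3):
          # A plain lookup: return the n most frequent followers of `word`.
          return [w for w, _ in counts[word].most_common(n)]

      table = train_bigrams("the cat sat on the mat and the cat slept")
      print(complete(table, "the"))  # ['cat', 'mat']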

    On-prem:

    Beyond the GPU requirement, it's not fundamentally different from any other web server. You can buy or build a gaming PC with a decent GPU, forward ports, get a domain, install a cert, run your model locally, and now you have an LLM server online. If you like the Raspberry Pi, you might look into the NVIDIA Jetson Nano (https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...) - it's basically a tiny computer like the Pi, but with a GPU and designed for AI. So you can cheaply and easily get an AI/LLM server running out of your apartment.
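
    If you go that route, exposing a server like the sketch above on your forwarded port with your cert can look roughly like this (the module name, port, and cert/key paths are placeholders):

      # Rough sketch: serve the FastAPI app from the earlier example on all
      # interfaces with TLS, behind whatever port you forwarded.
      import uvicorn

      uvicorn.run(
          "app:app",  # placeholder module:variable of the FastAPI app
          host="0.0.0.0",
          port=8443,
          ssl_certfile="/etc/letsencrypt/live/example.com/fullchain.pem",  # placeholder
          ssl_keyfile="/etc/letsencrypt/live/example.com/privkey.pem",     # placeholder
      )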

    Cloud & serverless:

    Hosting is not very different from hosting a conventional web server, except that the hardware has more VRAM and the software stack is built around LLM serving rather than a typical web backend (different database technologies, different frameworks/libraries). AWS already has options for deploying your own models, and there are a number of tutorials showing how to deploy Ollama on EC2. There are also serverless providers - Replicate, Lightning.AI - these are your Vercels and Herokus: you might pay a little more, but you get the convenience of being up and running quickly.
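
    For example, once Ollama is running on an EC2 box, calling it is just an HTTP request to its generate endpoint (the hostname and model name below are placeholders):

      # Calling a remote Ollama instance over its HTTP API (host and model are placeholders).
      import requests

      resp = requests.post(
          "http://ec2-your-host.example.com:11434/api/generate",
          json={"model": "my-finetuned-model", "prompt": "Hello!", "stream": False},
          timeout=120,
      )
      print(resp.json()["response"])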

    TLDR: It's like any other web server, except you need more GPU/VRAM for training and inference. Whether you run it yourself on-prem, host it in the cloud, use a PaaS, etc., those decisions are mostly the same as for any other project.

  • aime-api-server

    AIME API Server - Scalable AI Model Inference API Server

  • Here is a queueing API server for self-hosted inference backends, from a friend of mine: https://github.com/aime-team/aime-api-server. It's very lightweight and easy to use. You can even serve models from Jupyter Notebooks with it without worrying about overwhelming the server - it just gets slower the more load you send to it.
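
    The general pattern - queue incoming requests and feed them to the backend one at a time, so heavy load shows up as latency instead of crashes - looks roughly like this generic sketch (an illustration of the idea, not aime-api-server's actual API):

      # Generic sketch of the queueing pattern: requests wait in line for a single
      # worker, so heavy load makes responses slower rather than overwhelming the model.
      import asyncio
      from fastapi import FastAPI

      app = FastAPI()
      queue: asyncio.Queue = asyncio.Queue()

      async def run_inference(prompt: str) -> str:
          await asyncio.sleep(1)  # stand-in for the actual model call
          return f"echo: {prompt}"

      async def worker():
          while True:
              prompt, future = await queue.get()
              future.set_result(await run_inference(prompt))
              queue.task_done()

      @app.on_event("startup")
      async def start_worker():
          asyncio.create_task(worker())

      @app.post("/generate")
      async def generate(prompt: str):
          future = asyncio.get_running_loop().create_future()
          await queue.put((prompt, future))
          return {"completion": await future}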

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Show HN: FileKitty - Combine and label text files for LLM prompt contexts

    5 projects | news.ycombinator.com | 1 May 2024
  • Alternative Chunking Methods

    1 project | news.ycombinator.com | 30 Apr 2024
  • FLaNK AI Weekly for 29 April 2024

    44 projects | dev.to | 29 Apr 2024
  • Show HN: I made an app to use local AI as daily driver

    31 projects | news.ycombinator.com | 27 Feb 2024
  • Show HN: I Built an Open Source API with Insanely Fast Whisper and Fly GPUs

    3 projects | news.ycombinator.com | 18 Feb 2024