Ask HN: How does deploying a fine-tuned model work

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • dstack

    An open-source container orchestration engine for running AI workloads in any cloud or data center. https://discord.gg/u8SmfwPpMd

  • You can use https://github.com/dstackai/dstack to deploy your model to the most affordable GPU clouds. It supports auto-scaling and other features.

    Disclaimer: I'm the creator of dstack.
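
    As a rough sketch of the client side once such a deployment is live (assuming the served container exposes an OpenAI-compatible endpoint, as servers like vLLM commonly do; the URL, key, and model name below are placeholders):

      # Hypothetical client call against a fine-tuned model served on a GPU cloud.
      # Assumes an OpenAI-compatible endpoint; URL, key, and model name are placeholders.
      from openai import OpenAI

      client = OpenAI(
          base_url="https://your-deployment.example.com/v1",  # placeholder endpoint
          api_key="YOUR_TOKEN",                                # placeholder credential
      )

      response = client.chat.completions.create(
          model="my-finetuned-model",  # placeholder model name
          messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
      )
      print(response.choices[0].message.content)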

  • OpenPipe

    Turn expensive prompts into cheap fine-tuned models

  • - Fireworks: $0.20

    If you're looking for an end-to-end flow that will help you gather the training data, validate it, run the fine-tune, and then define evaluations, you could also check out my company, OpenPipe (https://openpipe.ai/). In addition to hosting your model, we help you organize your training data, relabel it if necessary, define evaluations on the finished fine-tune, and monitor its performance in production. Our inference prices are higher than the above providers, but once you're happy with your model you can always export your weights and host them on one of the above!
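
    For context, "organizing your training data" for a chat model usually means collecting examples in a JSONL chat format along these lines (a generic illustration of the common format, not any specific provider's schema; the file name and examples are placeholders):

      # Generic illustration of chat-style fine-tuning data in JSONL.
      # Not any specific provider's schema; file name and examples are placeholders.
      import json

      examples = [
          {
              "messages": [
                  {"role": "system", "content": "You are a support ticket classifier."},
                  {"role": "user", "content": "My invoice is wrong."},
                  {"role": "assistant", "content": "billing"},
              ]
          },
      ]

      with open("train.jsonl", "w") as f:
          for ex in examples:
              f.write(json.dumps(ex) + "\n")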

  • next-token-prediction

    Next-token prediction in JavaScript: build fast language and diffusion models.

  • GPU vs CPU:

    It's faster to use a GPU. Think of playing a game on a laptop with onboard graphics versus a good dedicated graphics card: it might technically work either way, but a good GPU gives you more processing power and VRAM, which makes for a much faster experience.

    When is GPU needed:

    You need it both for initial training (which it sounds like you've done) and for when someone prompts the LLM and it processes their query (called inference). So to answer your question: the web server that handles incoming LLM queries also needs a good GPU, because with any amount of user activity it will effectively be running 24/7, as users continually prompt it the same way they'd use any other site you have online.
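
    To make that concrete, here's a minimal sketch of what such a serving process can look like (assuming a Hugging Face-format fine-tuned checkpoint; the path and parameters are placeholders). Every request runs a generation pass on the GPU:

      # Minimal sketch of a GPU-backed inference endpoint (placeholder model path).
      # Each incoming request triggers a generation pass on the GPU.
      import torch
      from fastapi import FastAPI
      from pydantic import BaseModel
      from transformers import AutoModelForCausalLM, AutoTokenizer

      MODEL_PATH = "./my-finetuned-model"  # placeholder path to your checkpoint
      tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
      model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to("cuda")

      app = FastAPI()

      class Prompt(BaseModel):
          text: str

      @app.post("/generate")
      def generate(prompt: Prompt):
          inputs = tokenizer(prompt.text, return_tensors="pt").to("cuda")
          output = model.generate(**inputs, max_new_tokens=128)
          return {"completion": tokenizer.decode(output[0], skip_special_tokens=True)}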

    When is GPU not needed:

    Computationally, inference is just "next token prediction", but depending on how the user enters their prompt, the model can sometimes serve those predictions (called completions) from pre-computed embeddings - in other words, by performing a simple lookup - without invoking the GPU. For example, in this autocompletion/token-prediction library I wrote that uses an n-gram language model (https://github.com/bennyschmidt/next-token-prediction), the GPU is only needed for initial training on text data; there's no heavy inference step, so completions are fast, don't invoke the GPU, and are effectively lookups. A language model like this can be trained offline and deployed cheaply, with no cloud GPU needed. You'll notice LLMs sometimes behave this way too, especially on follow-up prompts once they already have the needed embeddings from the initial prompt - for some responses, an LLM is fast like this.
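
    To illustrate the lookup idea in a few lines (a generic Python sketch of an n-gram completion table, not the API of the JavaScript library linked above):

      # Generic sketch of n-gram "completion as lookup": train once, then
      # predictions are dictionary lookups with no GPU involved.
      from collections import Counter, defaultdict

      def train_bigrams(text):
          counts = defaultdict(Counter)
          words = text.lower().split()
          for prev, nxt in zip(words, words[1:]):
              counts[prev][nxt] += 1
          return counts

      def complete(counts, word, n=3):
          # A plain lookup: return the n most frequent followers of `word`.
          return [w for w, _ in counts[word].most_common(n)]

      table = train_bigrams("the cat sat on the mat and the cat slept")
      print(complete(table, "the"))  # ['cat', 'mat']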

    On-prem:

    Beyond the GPU requirement, it's not fundamentally different from any other web server. You can buy or build a gaming PC with a decent GPU, forward ports, get a domain, install a cert, run your model locally, and now you have an LLM server online. If you like the Raspberry Pi, you might look into the NVIDIA Jetson Nano (https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...) - it's basically a tiny computer like the Pi, but with a GPU and designed for AI. So you can cheaply and easily get an AI/LLM server running out of your apartment.
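
    If you go that route, exposing a server like the sketch above on your forwarded port with your cert can look roughly like this (the module name, port, and cert/key paths are placeholders):

      # Rough sketch: serve the FastAPI app from the earlier example on all
      # interfaces with TLS, behind whatever port you forwarded.
      import uvicorn

      uvicorn.run(
          "app:app",  # placeholder module:variable of the FastAPI app
          host="0.0.0.0",
          port=8443,
          ssl_certfile="/etc/letsencrypt/live/example.com/fullchain.pem",  # placeholder
          ssl_keyfile="/etc/letsencrypt/live/example.com/privkey.pem",     # placeholder
      )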

    Cloud & serverless:

    Hosting is not very different from hosting a conventional web server, except that the hardware has more VRAM and the software stack is built around LLM serving rather than a typical web backend (different database technologies, different frameworks/libraries). AWS already has options for deploying your own models, and there are a number of tutorials showing how to deploy Ollama on EC2. There are also serverless providers - Replicate, Lightning.AI - these are your Vercels and Herokus: you might pay a little more, but you get the convenience of being up and running quickly.
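
    For example, once Ollama is running on an EC2 box, calling it is just an HTTP request to its generate endpoint (the hostname and model name below are placeholders):

      # Calling a remote Ollama instance over its HTTP API (host and model are placeholders).
      import requests

      resp = requests.post(
          "http://ec2-your-host.example.com:11434/api/generate",
          json={"model": "my-finetuned-model", "prompt": "Hello!", "stream": False},
          timeout=120,
      )
      print(resp.json()["response"])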

    TLDR: It's like any other web server, except you need more GPU/VRAM for training and inference. Whether you run it yourself on-prem, host it in the cloud, use a PaaS, etc., those decisions are mostly the same as for any other project.

  • aime-api-server

    AIME API Server - Scalable AI Model Inference API Server

  • Here is a queueing API server for self-hosted inference backends, from a friend of mine: https://github.com/aime-team/aime-api-server. It's very lightweight and easy to use. You can even serve models from Jupyter Notebooks with it without worrying about overwhelming the server - it just gets slower the more load you send to it.
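
    The general pattern - queue incoming requests and feed them to the backend one at a time, so heavy load shows up as latency instead of crashes - looks roughly like this generic sketch (an illustration of the idea, not aime-api-server's actual API):

      # Generic sketch of the queueing pattern: requests wait in line for a single
      # worker, so heavy load makes responses slower rather than overwhelming the model.
      import asyncio
      from fastapi import FastAPI

      app = FastAPI()
      queue: asyncio.Queue = asyncio.Queue()

      async def run_inference(prompt: str) -> str:
          await asyncio.sleep(1)  # stand-in for the actual model call
          return f"echo: {prompt}"

      async def worker():
          while True:
              prompt, future = await queue.get()
              future.set_result(await run_inference(prompt))
              queue.task_done()

      @app.on_event("startup")
      async def start_worker():
          asyncio.create_task(worker())

      @app.post("/generate")
      async def generate(prompt: str):
          future = asyncio.get_running_loop().create_future()
          await queue.put((prompt, future))
          return {"completion": await future}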

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Show HN: FileKitty - Combine and label text files for LLM prompt contexts

    5 projects | news.ycombinator.com | 1 May 2024
  • Alternative Chunking Methods

    1 project | news.ycombinator.com | 30 Apr 2024
  • FLaNK AI Weekly for 29 April 2024

    44 projects | dev.to | 29 Apr 2024
  • Show HN: I made an app to use local AI as daily driver

    31 projects | news.ycombinator.com | 27 Feb 2024
  • Show HN: I Built an Open Source API with Insanely Fast Whisper and Fly GPUs

    3 projects | news.ycombinator.com | 18 Feb 2024