Show HN: Fine-tune your own Llama 2 to replace GPT-3.5/4

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • OpenPipe

    Turn expensive prompts into cheap fine-tuned models

  • There has been a lot of interest on HN in fine-tuning open-source LLMs recently (eg. Anyscale's post at https://news.ycombinator.com/item?id=37090632). I've been playing around with fine-tuning models for a couple of years, and wanted to share some insights and practical code. I’ve condensed what I’ve learned into a small set of notebooks at https://github.com/OpenPipe/OpenPipe/tree/main/examples/classify-recipes, covering labeling data, fine-tuning, running efficient inference, and evaluating costs/performance. The 7B model we train here matches GPT-4’s labels 95% of the time on the test set, and for the 5% of cases where they disagree it’s often because the correct answer is genuinely ambiguous.

    What is fine-tuning? You can think of it as a more-powerful form of prompting, where instead of writing your instructions in text you actually encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Fine-tuning can work with as few as 50 examples but I usually try to get 1000+ if possible.

    Prompting still has some big advantages over fine-tuning. It's way easier/faster to iterate on your instructions than label data and re-train a model. And operationally it's easier to deploy one big model and just adjust its behavior as necessary vs deploying many small fine-tuned models that will likely each get lower utilization.

    Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!

    For example, classifying the 2M recipes at https://huggingface.co/datasets/corbt/all-recipes with GPT-4 would cost $23k. Even with GPT-3.5 it would cost over $1k. The model we fine-tuned performs similarly to GPT-4 and costs just $19 to run over the entire dataset.

    Disclaimer: My brother David and I are working on an open-source product called OpenPipe to help engineers adopt fine-tuning as simply as possible. But none of the information above depends on our startup. The current post is just about sharing information that we’ve learned about fine-tuning. I hope it’s useful!

    If you're interested in finding out more you can watch our repo at https://github.com/openpipe/openpipe, and if you'd just like to chat about a fine-tuning project you're thinking about you can also just email me at kyle@openpipe.ai!

  • axolotl

    Go ahead and axolotl questions

  • There are already many hundreds of finetunes on huggingface, and many excellent UIs to run them in, like KoboldCPP and Text-gen-ui: https://huggingface.co/models?sort=modified&search=13B

    There is even a crowdsourced version of the UI like artbot: https://lite.koboldai.net/#

    And there are some excellent extant finetuning frameworks, like Aoxotol, that run on consumer GPUs: https://github.com/OpenAccess-AI-Collective/axolotl

    IIRC Text-gen-ui had a QLORA finetuning UI too.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

  • We're finding that when running vLLM (https://github.com/vllm-project/vllm) on an A40 GPU we're getting consistently lower time-to-first-token and lower average token generation time than GPT-3.5, even when processing multiple requests in parallel. A40s are pretty easy to get your hands on these days (much easer than A100s anyway).

    The 50x cheaper (that's 2% of the cost, not 50% of the cost) number does assume 100% utilization, which may or may not be realistic for your use case. If you're doing batch processing as part of a data pipeline, which is not an unusual use case, you can run your GPU at 100% utilization and turn it off when the batch finishes.

    If you've got a highly variable workload then you're right, you'll have much lower utilization numbers. But if you work with an aggregator that can quickly hot swap LoRA fine-tunes (as a disclaimer, my company OpenPipe works in this space) you can get back a lot of that lost efficiency since we can increase/decrease GPU capacity only when our aggregate usage changes, which smooths things out.

  • gpt-llm-trainer

  • Very nice, thanks!

    Check out what Matt Shumer put together as well: https://github.com/mshumer/gpt-llm-trainer.

    I have used his trainer for auto distillation of GPT-4 into GPT3.5 fine tunes, but plan to do the same for Llama as well.

    Cheers!

  • Llama-2-Onnx

  • System: Here's some docs, answer concisely in a sentence.

    YMMV on cost still, depends on cloud vendor, and my intuition & viewpoint agrees with yours, GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model.

    It strikes me now that _very_ likely and not just our intuition: OpenAI's $/GPU hour is likely <= any other vendor's.

    The next big step will come from formalizing the stuff rolling around the local LLM community, for months now it's either been one-off $X.c stunts that run on desktop, and the vast majority of the _actual_ usage and progress is coming from porn-y stuff, like all nascent tech.

    Microsoft has LLaMa-2 ONNX available on GitHub[1]. There's budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4 .

    It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on, the sample code even still needs a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.

    [1] https://github.com/microsoft/Llama-2-Onnx

  • mlc-llm

    Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

  • you already have TVM for the cross platform stuff

    see https://tvm.apache.org/docs/how_to/deploy/android.html

    or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...

    or https://github.com/mlc-ai/mlc-llm

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts