Llama-2-Onnx vs mlc-llm

| | Llama-2-Onnx | mlc-llm |
|---|---|---|
| Mentions | 3 | 89 |
| Stars | 998 | 17,555 |
| Growth (stars, month over month) | 1.5% | 3.4% |
| Activity | 6.7 | 9.9 |
| Last commit | 5 months ago | 3 days ago |
| Language | Python | Python |
| License | GNU General Public License v3.0 or later | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
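The site doesn't publish its exact formula, but a recency-weighted score of this kind can be sketched in a few lines of Python; the half-life and the percentile mapping below are illustrative assumptions, not the real parameters:

```python
import time

def activity_score(commit_timestamps, half_life_days=90):
    """Toy recency-weighted commit score: each commit counts less as it
    ages, halving in weight every half_life_days."""
    now = time.time()
    return sum(
        0.5 ** ((now - ts) / 86400 / half_life_days)
        for ts in commit_timestamps
    )

def relative_activity(score, all_scores):
    """Map a raw score onto a 0-10 scale by percentile rank, so that 9.0
    means the project is more active than ~90% of tracked projects."""
    rank = sum(s <= score for s in all_scores) / len(all_scores)
    return round(10 * rank, 1)
```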
Llama-2-Onnx
- Show HN: Fine-tune your own Llama 2 to replace GPT-3.5/4
System: Here's some docs, answer concisely in a sentence.
YMMV on cost still; it depends on the cloud vendor, and my intuition and viewpoint agree with yours: GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model.
It strikes me now that this is _very_ likely fact and not just our intuition: OpenAI's $/GPU-hour is likely <= any other vendor's.
The next big step will come from formalizing the stuff rolling around the local LLM community; for months now it's been one-off $X.c stunts that run on desktop, while the vast majority of the _actual_ usage and progress comes from porn-y stuff, like all nascent tech.
Microsoft has LLaMa-2 ONNX available on GitHub[1]. There are budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4.
It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on; the sample code even still needs a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.
[1] https://github.com/microsoft/Llama-2-Onnx
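For a sense of what such a wrapper has to do, here's a minimal single-step sketch with onnxruntime; the file name and the `input_ids` input name are hypothetical (each export defines its own), and a real decode loop additionally needs a tokenizer and KV-cache plumbing:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file name; the real export comes from the
# microsoft/Llama-2-Onnx release you download.
session = ort.InferenceSession(
    "llama2-7b.onnx",
    providers=["CPUExecutionProvider"],  # or CUDA/DirectML where available
)

# Hypothetical input name and a toy batch of already-tokenized ids;
# the repo's exports also expect/return KV-cache tensors, which a
# real decode loop must thread through between steps.
token_ids = np.array([[1, 15043, 3186]], dtype=np.int64)
outputs = session.run(None, {"input_ids": token_ids})

# Greedy choice of the next token from the last position's logits.
logits = outputs[0]
next_token = int(np.argmax(logits[0, -1]))
print(next_token)
```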
- FLaNK Stack Weekly for 14 Aug 2023
- Llama 2 on ONNX runs locally
mlc-llm
- FLaNK 04 March 2024
- AI on an Android phone?
This one uses the GPU; it doesn't support Mistral yet: https://github.com/mlc-ai/mlc-llm
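For reference, driving mlc-llm from Python looks roughly like the sketch below, based on the project's ChatModule quickstart; the package layout has shifted between releases (mlc_chat vs. mlc_llm), and the model id must match prebuilt weights you have installed:

```python
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# The model id must name compiled weights you have locally, e.g. one
# of the prebuilt q4f16_1 quantizations MLC publishes on HuggingFace.
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
cm.generate(
    prompt="Why is the sky blue?",
    progress_callback=StreamToStdout(callback_interval=2),
)
```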
- MLC vs llama.cpp
I have tried running Mistral 7B with MLC on my M1 (Metal), and it kept crashing (GitHub issue filed with a description). Memory inefficiency problems.
- [Project] Scaling LLama2 70B with Multi NVIDIA and AMD GPUs under 3k budget
Project: https://github.com/mlc-ai/mlc-llm
- Scaling LLama2-70B with Multi Nvidia/AMD GPU
- AMD May Get Across the CUDA Moat
For LLM inference, a shoutout to MLC LLM, which runs LLMs on basically any API that's widely available: https://github.com/mlc-ai/mlc-llm
- ROCm Is AMD's #1 Priority, Executive Says
One of your problems might be that gfx1032 is not supported by AMD's ROCm packages, which have a laughably short list of supported hardware: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...
The normal workaround is to assign the closest supported architecture, e.g. gfx1030, so `HSA_OVERRIDE_GFX_VERSION=10.3.0` might help.
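A quick way to sanity-check that the override took, for the PyTorch-based projects at least (a sketch assuming a ROCm build of PyTorch, which exposes the GPU through the torch.cuda namespace):

```python
import os

# Pretend the unsupported gfx1032 silicon is gfx1030; this must happen
# before any ROCm library loads, same as exporting it in the shell.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

import torch  # ROCm builds of PyTorch reuse the torch.cuda API

print(torch.cuda.is_available())          # True if the override worked
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # the GPU's marketing name
```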
Also, it looks like some of your tested projects are OpenCL? For me, I do something like: `yay -S rocm-hip-sdk rocm-ml-sdk rocm-opencl-sdk` to cover all the bases.
My recent interest has been LLMs, and this is my general step-by-step guide (llama.cpp, exllama) for those interested: https://llm-tracker.info/books/howto-guides/page/amd-gpus
I didn't port the docs back in, but also here's a step-by-step w/ my adventures getting TVM/MLC working w/ an APU: https://github.com/mlc-ai/mlc-llm/issues/787
From my experience, ROCm is improving, but there's a good reason that Nvidia has 90% market share even at big price premiums.
- Show HN: Ollama for Linux – Run LLMs on Linux with GPU Acceleration
Maybe they're talking about https://github.com/mlc-ai/mlc-llm which is used for web-llm (https://github.com/mlc-ai/web-llm)? Seems to be using TVM.
- Show HN: Fine-tune your own Llama 2 to replace GPT-3.5/4
You already have TVM for the cross-platform stuff,
see https://tvm.apache.org/docs/how_to/deploy/android.html
or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...
or https://github.com/mlc-ai/mlc-llm
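The flow those links describe boils down to compiling a model once per target. A minimal sketch using TVM's Relay ONNX frontend, with placeholder model file, input name, and shape:

```python
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Placeholder model file and input signature.
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}
)

# One compiler, many targets: swap the target string to cross-compile,
# e.g. "llvm -mtriple=aarch64-linux-android" for the Android doc above.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
```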
- Ask HN: Are you training and running custom LLMs and how are you doing it?
What are some alternatives?
vllm - A high-throughput and memory-efficient inference and serving engine for LLMs
llama.cpp - LLM inference in C/C++
pkgx - the last thing you’ll install
ggml - Tensor library for machine learning
onnx-coreml - ONNX to Core ML Converter
tvm - Open deep learning compiler stack for cpu, gpu and specialized accelerators
awesome-data-temporality - A curated list to help you manage temporal data across many modalities 🚀.
text-generation-webui - A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
OpenPipe - Turn expensive prompts into cheap fine-tuned models
llama-cpp-python - Python bindings for llama.cpp
ollama - Get up and running with Llama 3, Mistral, Gemma, and other large language models.