Ollama is awesome. I am, however, still waiting for support for controlling the model cache location: https://github.com/jmorganca/ollama/issues/153
This would be useful either for backups or for sharing model files with other applications. Those model files are large!
> run their own LLMs on Linux and the unfortunate answer was always that the existing options were slightly complicated
What about https://github.com/ggerganov/llama.cpp ?
It compiles and runs easily on Linux.
Maybe they're talking about https://github.com/mlc-ai/mlc-llm which is used for web-llm (https://github.com/mlc-ai/web-llm)? Seems to be using TVM.
Give Dumbar a try, since you're on macOS! https://github.com/JerrySievert/Dumbar
Ollama is awesome. I am part of a team building a code AI application[1], and we want to give devs the option to run it locally instead of only supporting external LLMs from Anthropic, OpenAI, etc. Those big remote LLMs are incredibly powerful and probably the right choice for most devs, but it's good for devs to have a local option as well—for security, privacy, cost, latency, simplicity, freedom, etc.
As app devs, we have two choices:
(1) Build our own support for LLMs, GPU/CPU execution, model downloading, inference optimizations, etc.
(2) Just tell users "run Ollama" and have our app hit the Ollama API on localhost (or shell out to `ollama`).
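For concreteness, option (2) can be very little code. Here's a minimal stdlib-only sketch against Ollama's `/api/generate` REST endpoint (it assumes `ollama serve` is running locally and the model has already been pulled; the model name is just an example default):

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default local port

def build_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build a POST to Ollama's /api/generate endpoint.

    stream=False asks for a single JSON response instead of a stream
    of partial-token objects.
    """
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str, model: str = "llama2") -> str:
    """Run one completion. Requires a running Ollama server."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Why is the sky blue?"))
```

That's the whole integration surface: one JSON endpoint on localhost, no GPU detection or model-loading code in your app.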
Obviously, choice 2 is much, much simpler. There are options in between, like less polished wrappers around llama.cpp, but Ollama is the only one that 100% of the people I've recommended it to have been able to install without any problems.
That's huge because it's finally possible to build real apps that use local LLMs—and still reach a big userbase. Your userbase is now (pretty much) "anyone who can download and run a desktop app and who has a relatively modern laptop", which is a big population.
I'm really excited to see what people build on Ollama.
(And Ollama will simplify deploying server-side LLM apps as well, but right now, from participating in the community, it seems most people are thinking of it only for local apps. I expect that to change when people realize they can ship a self-contained server app that runs on a cheap AWS/GCP instance and uses an Ollama-executed LLM for various features.)
[1] Shameless plug for the WIP PR where I'm implementing Ollama support in Cody, our code AI app: https://github.com/sourcegraph/cody/pull/905.
Koboldcpp does this: https://github.com/LostRuins/koboldcpp/releases/tag/v1.44.2
They basically just ship executables for the different llama.cpp backends and select the correct one with a Python script, which is fine, since the executables are really small.
This one is basically SOTA for AMD, if you can get ROCm installed properly:
https://github.com/YellowRoseCx/koboldcpp-rocm
There's a ton of cool opportunity in the runtime layer. I've been keeping my eye on the compiler-based approaches. From what I've gathered, many of the larger "production" inference tools use compilers:
- https://github.com/openai/triton
- https://github.com/NVIDIA/TensorRT
TVM and other compiler-based approaches seem to perform really well and make supporting different backends really easy. A good friend who's been in this space for a while told me llama.cpp is sort of a "hand-crafted" version of what these compilers could output, which I think speaks to the craftsmanship Georgi and the ggml team have put into llama.cpp, but also to the opportunity to "compile" versions of llama.cpp for other model architectures or platforms.