Our great sponsors
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
Hi, we’ve been working to support AMD GPUs directly via ROCm. It’s still under development but if you build from source it does work:
https://github.com/ollama/ollama/blob/main/docs/development....
For anyone else who missed the announcement a few hours ago, open-webui is the rebranding of the project formerly known as ollama-webui [0].
I can vouch for it as a solid frontend for Ollama. It works really well and has had an astounding pace of development. Every few weeks I pull the latest docker images and am always surprised by how much has improved.
[0] https://github.com/open-webui/open-webui/discussions/764
But these are typically filling the usecases of productivity applications, not ‘engines’.
Microsoft Word doesn’t run its grammar checker as an external service and shunt JSON over a localhost socket to get spelling and style suggestions.
Photoshop doesn’t install a background service to host filters.
The closest pattern I can think of is the ‘language servers’ model used by IDEs to handle autosuggest - see https://microsoft.github.io/language-server-protocol/ - but the point of that is to enable many to many interop - multiple languages supporting multiple IDEs. Is that the expected usecase for local language assistants and image generators?
Cody (https://github.com/sourcegraph/cody) supports using Ollama for autocomplete in VS Code. See the release notes at https://sourcegraph.com/blog/cody-vscode-1.1.0-release for instructions. And soon it'll support Ollama for chat/refactoring as well (https://twitter.com/sqs/status/1750045006382162346/video/1).
Disclaimer: I work on Cody.
If you just check out https://github.com/ggerganov/llama.cpp and run make, you’ll wind up with an executable called ‘main’ that lets you run any gguf language model you choose. Then:
./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt “say hello”
On my M2 MacBook, the first run takes a few seconds before it produces anything, but after that subsequent runs start outputting tokens immediately.
You can run LLM models right inside a short lived process.
But the majority of humans don’t want to use a single execution of a command line to access LLM completions. They want to run a program that lets them interact with an LLM. And to do that they will likely start and leave running a long-lived process with UI state - which can also serve as a host for a longer lived LLM context.
Neither usecase particularly seems to need a server to function. My curiosity about why people are packaging these things up like that is completely genuine.