text-generation-webui
A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
On my 32 GB M2 Pro Mac, I can run models of up to about 30B parameters using 4-bit quantization. It is fast unless I am generating a lot of text: if I ask a 30B model to generate 5 pages of text, it can take over a minute. Running smaller models like Mistral 7B is very fast.
Install Ollama from https://ollama.ai and experiment with it using the command line interface. I mostly use Ollama’s local API from Common Lisp or Racket - so simple to do.
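For anyone who wants to try the local API from Python instead of Lisp, here is a minimal sketch using only the standard library. It assumes Ollama is listening on its default local port (11434) and that a model named "mistral" has already been pulled; the function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="mistral", url=OLLAMA_URL):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="mistral", url=OLLAMA_URL, timeout=120):
    """Send the request to a running Ollama server and return the generated text."""
    req = build_generate_request(prompt, model, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # With "stream": False the server replies with a single JSON object
        # whose "response" field holds the full completion.
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `print(generate("Why is the sky blue?"))` should print the completion. Splitting request building from sending makes it easy to inspect the payload without a server.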
In my experience, the award for easiest to run locally goes to llamafile models: https://github.com/Mozilla-Ocho/llamafile.
I'd like to see a comparison to nitro https://github.com/janhq/nitro which has been fantastic for running a local LLM.
I don’t know if Ollama can do this but https://gpt4all.io/ can.
I made something pretty similar over winter break so I could have something read books to me. ... Then it turned into a prompting mechanism, of course! It uses Whisper, Ollama, and TTS from CoquiAI. It's written in shell and should hopefully be POSIX-compliant, but it does use zenity from Ubuntu; I'm not sure how widely used zenity is.
https://github.com/jcmccormick/runtts
There are "smaller" models, for example TinyLlama 1.1B (tiny seems like an exaggeration). Phi-2 has 2.7B parameters. I can't name a 500M parameter model but there is probably one.
The problem is that they are all still broadly trained, so they end up being jacks of all trades, masters of none. You'd have to fine-tune them if you want them to be good at some narrow task, and other than code completion I don't know that anyone has done that.
If you want to generate JSON or other structured output, there is Outlines (https://github.com/outlines-dev/outlines), which constrains the output to match a regex. That guarantees, for example, that the model will generate a syntactically valid API call, although the content could still be nonsense if the model doesn't understand the task; it will just match the regex. There are other similar tools around.
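To make the "valid but possibly nonsense" point concrete, here is a toy sketch of constrained decoding. This is not how Outlines is implemented and does not use its API; the per-position character sets, `toy_model`, and `constrained_generate` are all made up for illustration. The idea is simply that at each step the model's preferred candidates are filtered down to the ones the pattern still allows.

```python
import re
import string

# Target format, e.g. a phone-number fragment like "123-4567".
PATTERN = re.compile(r"\d{3}-\d{4}")

# Hand-written per-position character sets equivalent to PATTERN.
# (A real tool would derive the allowed next tokens from the regex itself.)
ALLOWED = [set(string.digits)] * 3 + [{"-"}] + [set(string.digits)] * 4


def toy_model(prefix):
    """Hypothetical stand-in for an LLM: candidate characters, most preferred first."""
    return list("a5-x0 1")


def constrained_generate(model, allowed=ALLOWED):
    """Pick, at each position, the model's top candidate that the pattern permits."""
    out = ""
    for allowed_chars in allowed:
        # Raises StopIteration if the model offers no permitted candidate.
        choice = next(c for c in model(out) if c in allowed_chars)
        out += choice
    return out


result = constrained_generate(toy_model)
print(result)                           # 555-5555
print(bool(PATTERN.fullmatch(result)))  # True
```

The output always matches the pattern, even though this "model" wanted to start with a letter, and the resulting number is meaningless. That is the same caveat as above: the format is guaranteed, the content is not.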
Not really. You can use small models for tasks like text classification (traditional NLP), and those run on pretty much anything. We're talking about BERT-like models, DistilBERT for example.
Now, models that have "reasoning" as an emergent property... I haven't seen anything under 3B that's capable of producing anything useful. The smallest I've seen is litellama, and while it's not 100% useless, it's really just an experiment.
Also, everything requires new and/or expensive hardware. For a GPU you are really looking at about 1k€ at minimum for something decent for running models. CPU inference is way slower, and forget about anything that has no AVX, and preferably AVX2.
I try models on my old ThinkPad X260 with 8 GB of RAM, which is perfectly capable for development and for the small task-oriented models I mentioned. But even though I've tried everything under the sun, quantization included, it's safe to say that right now you can only run decent LLMs at a decent inference speed on expensive hardware.
Now, if you want tasks like language detection, classifying text into categories, or very basic question answering, then go on Hugging Face and try for yourself; you'll be able to run most models on modest hardware.
In fact, I have a website (https://github.com/iagovar/cometocoruna/tree/main) where I'm using a small flask server in my data pipeline to extract event information from text blobs I get scraping sites.
Experts in the field say that might change (somewhat) with mamba models, but I can't really say more.
I've been playing with the idea of dumping some money into this. But I'm 36, unemployed, and just got into coding about 1.5 years ago, so until I secure some income I don't want to hit my savings hard. This is not the US, where I can land a job easily (junior looking for a job, just in case someone here needs one).
Same question here. Ollama is fantastic as it makes it very easy to run models locally, but if you already have a lot of code that processes OpenAI API responses (with retry, streaming, async, caching etc), it would be nice to be able to simply switch the API client to Ollama, without having a whole other branch of code that handles Ollama API responses. One way to do an easy switch is to use the litellm library as a go-between, but it's not ideal (and I also recently found issues with their chat formatting for Mistral models).
For an OpenAI-compatible API, my current favorite method is to spin up models using oobabooga TGW (text-generation-webui). Your OpenAI API code then works seamlessly by simply switching out the api_base to the ooba endpoint. Regarding chat formatting, even ooba's Mistral formatting has issues [1], so I am doing my own in Langroid using HuggingFace tokenizer.apply_chat_template [2].
[1] https://github.com/oobabooga/text-generation-webui/issues/53...
[2] https://github.com/langroid/langroid/blob/main/langroid/lang...
Related question: I assume Ollama auto-detects and applies the right chat formatting template for a model?
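The api_base switch described above can be sketched with the standard library alone. This is only an illustration, not Langroid's or ooba's code: the local address and port are assumptions (check what your server actually prints at startup), and the function names are hypothetical. The request and response shapes follow the OpenAI chat completions format, which is what an OpenAI-compatible server is expected to mirror.

```python
import json
import urllib.request

OPENAI_BASE = "https://api.openai.com/v1"
# Assumption: a local OpenAI-compatible server exposed at this address.
LOCAL_BASE = "http://127.0.0.1:5000/v1"


def build_chat_request(api_base, model, messages, api_key="not-needed"):
    """Build a chat completions request; only api_base decides where it goes."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        api_base.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(api_base, model, messages, api_key="not-needed", timeout=120):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(api_base, model, messages, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # Same response shape regardless of which backend answered.
    return body["choices"][0]["message"]["content"]
```

The retry, streaming, and caching code around `chat` stays untouched; only the `api_base` argument changes between the hosted API and the local server. Whether the server applies the right chat template for the model is a separate question, as noted above.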