Ollama releases OpenAI API compatibility

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • They have manual install instructions [0], and judging by those, what it does is set up a systemd service that runs automatically on startup. But if you're just looking to play around, I found that downloading [1], making it executable (chmod +x ollama-linux-amd64), and then running it worked just fine, all without needing root.

    [0] https://github.com/ollama/ollama/blob/main/docs/linux.md#man...

    [1] https://ollama.ai/download/ollama-linux-amd64
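
    As a rough sketch of what the OpenAI compatibility in the headline looks like from Python, assuming the binary above is serving on Ollama's default port 11434 and a model (mistral here, purely as a placeholder) has already been pulled:

      # Sketch: talk to a local Ollama server through its OpenAI-compatible /v1 API.
      # Assumes `pip install openai` and an Ollama server listening on localhost:11434.
      from openai import OpenAI

      client = OpenAI(
          base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
          api_key="ollama",                      # placeholder; Ollama does not check it
      )

      resp = client.chat.completions.create(
          model="mistral",  # placeholder: any model you have pulled locally
          messages=[{"role": "user", "content": "Say hello in one sentence."}],
      )
      print(resp.choices[0].message.content)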

  • llamafile

    Distribute and run LLMs with a single file.

  • The improvements in ease of use for locally hosting LLMs over the last few months have been amazing. I was ranting about how easy https://github.com/Mozilla-Ocho/llamafile is just a few hours ago [1]. Now I'm torn as to which one to use :)

    1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/

  • YetAnotherChatUI

    Yet another ChatGPT UI. Bring your own API key.

  • I had trouble installing Ollama the last time I tried; I'm going to try again tomorrow.

    I've already got a web UI that "should" work with anything that matches OpenAI's chat API, though I'm sure everyone here knows how reliable air-quotes like that are when a developer says them.

    https://github.com/BenWheatley/YetAnotherChatUI
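
    To make the "matches OpenAI's chat API" expectation concrete, this is roughly the request/response shape such a UI relies on (the base URL and model name below are placeholders for whichever local server you run):

      # Sketch of the OpenAI-style chat completion call a compatible UI issues.
      import requests

      BASE_URL = "http://localhost:11434/v1"  # placeholder: Ollama, llama.cpp server, etc.
      payload = {
          "model": "mistral",                 # placeholder model name
          "messages": [{"role": "user", "content": "Hello!"}],
      }
      r = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
      r.raise_for_status()
      print(r.json()["choices"][0]["message"]["content"])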

  • llama.cpp

    LLM inference in C/C++

  • llama.cpp now also ships with an OpenAI-compatible server implementation that you could point your UI at (if you wanted to run leaner without Ollama).

    https://github.com/ggerganov/llama.cpp/blob/master/examples/...
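
    Pointing a UI at it is mostly a base-URL swap; a small sketch, assuming the example server from that link is running on its default 127.0.0.1:8080, with streaming since that's what a chat UI typically wants:

      # Sketch: stream tokens from llama.cpp's OpenAI-compatible server.
      from openai import OpenAI

      client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
      stream = client.chat.completions.create(
          model="local",  # placeholder; the server answers with whatever model it was launched with
          messages=[{"role": "user", "content": "Write a haiku about local inference."}],
          stream=True,
      )
      for chunk in stream:
          delta = chunk.choices[0].delta.content
          if delta:
              print(delta, end="", flush=True)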

  • tensorrtllm_backend

    The Triton TensorRT-LLM Backend

  • Nvidia Triton Inference Server with the TensorRT-LLM backend:

    https://github.com/triton-inference-server/tensorrtllm_backe...

    It’s used by Mistral, AWS, Cloudflare, and countless others.

    vLLM, HF TGI, Ray Serve, etc. are certainly viable, but Triton has many truly unique and very powerful features (not to mention performance).

    100k DAU doesn't mean much on its own; you'd need a better understanding of the application: input tokens, generated output tokens, request rates, peaks, and so on, not to mention the required time to first token and tokens per second.

    Anyway, the point is Triton is just about the only thing out there for use in this general range and up.
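
    A quick back-of-envelope sketch of the kind of sizing that comment is getting at (every number below is hypothetical, purely to show the arithmetic):

      # Hypothetical sizing sketch: turn traffic assumptions into a throughput target.
      dau = 100_000          # daily active users (hypothetical)
      reqs_per_user = 5      # chat requests per user per day (hypothetical)
      out_tokens = 300       # generated tokens per request (hypothetical)
      peak_factor = 4        # peak-hour load vs. the daily average (hypothetical)

      avg_rps = dau * reqs_per_user / 86_400
      peak_rps = avg_rps * peak_factor
      peak_tok_per_s = peak_rps * out_tokens

      print(f"average {avg_rps:.1f} req/s, peak ~{peak_rps:.1f} req/s")
      print(f"peak decode throughput ~{peak_tok_per_s:,.0f} tokens/s across the fleet")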

  • lookma

    LookMa connects Android devices to locally-run LLMs

  • Ollama is great. If you want a GUI, LM Studio and Jan are great too.

    I'm building a React Native app to connect mobile devices to local LLM servers run with these programs.

    https://github.com/sampatt/lookma

  • model_navigator

    Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs.

  • - While keeping power utilization below X

    Model Navigator [3] will take the exported model and dynamically deploy the package to a Triton instance running on your actual inference-serving hardware, then generate requests against your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA parameters, and it will automatically regenerate the configuration for you.

    - Performance on a completely different level. TensorRT-LLM especially is extremely new and very early, but at high scale you can already start to see > 10k RPS on a single node.

    - gRPC support. Especially when using pre/post processing, ensembles, etc., you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine: it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, etc. can send/receive raw data/numpy/tensors.

    - DALI support [5]. Combined with everything above, you can add DALI to the processing chain to do things like take input image/audio/etc., copy it to the GPU once, GPU-accelerate scaling/conversion/resampling/whatever, and get output.

    vLLM and HF TGI are very cool and I use them in certain cases. The fact that you can give them an HF model and they just fire up with a single command and offer good performance is very impressive, but there are any number of reasons these providers use Triton. It's in a class of its own.

    [0] - https://mistral.ai/news/la-plateforme/

    [1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...

    [2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...

    [3] - https://github.com/triton-inference-server/model_navigator

    [4] - https://github.com/triton-inference-server/client/blob/main/...

    [5] - https://github.com/triton-inference-server/dali_backend
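
    To make the gRPC point [4] concrete, a minimal client-side sketch (the model and tensor names are placeholders patterned after a typical TensorRT-LLM ensemble; the real ones come from your config.pbtxt, and 8001 is Triton's default gRPC port):

      # Sketch: call a Triton model over gRPC (pip install tritonclient[grpc]).
      # "ensemble", "text_input", "max_tokens", and "text_output" are placeholders
      # taken from a typical TensorRT-LLM ensemble; check your own config.pbtxt.
      import numpy as np
      import tritonclient.grpc as grpcclient

      client = grpcclient.InferenceServerClient(url="localhost:8001")  # default gRPC port

      text = np.array([["What is Triton Inference Server?"]], dtype=object)
      max_tokens = np.array([[128]], dtype=np.int32)

      inputs = [
          grpcclient.InferInput("text_input", [1, 1], "BYTES"),
          grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
      ]
      inputs[0].set_data_from_numpy(text)
      inputs[1].set_data_from_numpy(max_tokens)

      result = client.infer(
          model_name="ensemble",  # placeholder: the ensemble chain, or an individual model
          inputs=inputs,
          outputs=[grpcclient.InferRequestedOutput("text_output")],
      )
      print(result.as_numpy("text_output"))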

  • client

    Triton Python, C++ and Java client libraries, and gRPC-generated client examples for Go, Java and Scala. (by triton-inference-server)

  • dali_backend

    The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's python API.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Show HN: Ellipsis – Automated PR reviews and bug fixes

    6 projects | news.ycombinator.com | 9 May 2024
  • Setup REST-API service of AI by using Local LLMs with Ollama

    3 projects | dev.to | 9 May 2024
  • Ollama v0.1.34 Is Out

    1 project | news.ycombinator.com | 8 May 2024
  • Ask HN: What do you use local LLMs for?

    2 projects | news.ycombinator.com | 7 May 2024
  • How to use Google Gemini AI for Agriculture Productivity

    2 projects | dev.to | 7 May 2024