Ollama releases OpenAI API compatibility

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • They have manual install instructions [0], and judging by those, what it does is set up a systemd service that runs automatically on startup. But if you're just looking to play around, I found that downloading [1], making it executable (chmod +x ollama-linux-amd64), and then running it worked just fine, all without needing root.

    [0] https://github.com/ollama/ollama/blob/main/docs/linux.md#man...

    [1] https://ollama.ai/download/ollama-linux-amd64
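
    As a rough sketch of what the OpenAI compatibility in the headline looks like from Python, assuming the binary above is serving on Ollama's default port 11434 and a model (mistral here, purely as a placeholder) has already been pulled:

      # Sketch: talk to a local Ollama server through its OpenAI-compatible /v1 API.
      # Assumes `pip install openai` and an Ollama server listening on localhost:11434.
      from openai import OpenAI

      client = OpenAI(
          base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
          api_key="ollama",                      # placeholder; Ollama does not check it
      )

      resp = client.chat.completions.create(
          model="mistral",  # placeholder: any model you have pulled locally
          messages=[{"role": "user", "content": "Say hello in one sentence."}],
      )
      print(resp.choices[0].message.content)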

  • llamafile

    Distribute and run LLMs with a single file.

  • The improvements in ease of use for locally hosting LLMs over the last few months have been amazing. I was ranting about how easy https://github.com/Mozilla-Ocho/llamafile is just a few hours ago [1]. Now I'm torn as to which one to use :)

    1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/

  • YetAnotherChatUI

    Yet another ChatGPT UI. Bring your own API key.

  • I had trouble installing Ollama the last time I tried; I'm going to try again tomorrow.

    I've already got a web UI that "should" work with anything that matches OpenAI's chat API, though I'm sure everyone here knows how reliable air-quotes like that are when a developer says them.

    https://github.com/BenWheatley/YetAnotherChatUI
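
    To make the "matches OpenAI's chat API" expectation concrete, this is roughly the request/response shape such a UI relies on (the base URL and model name below are placeholders for whichever local server you run):

      # Sketch of the OpenAI-style chat completion call a compatible UI issues.
      import requests

      BASE_URL = "http://localhost:11434/v1"  # placeholder: Ollama, llama.cpp server, etc.
      payload = {
          "model": "mistral",                 # placeholder model name
          "messages": [{"role": "user", "content": "Hello!"}],
      }
      r = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
      r.raise_for_status()
      print(r.json()["choices"][0]["message"]["content"])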

  • llama.cpp

    LLM inference in C/C++

  • llama.cpp now also ships with an OpenAI-compatible server implementation that you could point your UI at (if you wanted to run leaner without Ollama).

    https://github.com/ggerganov/llama.cpp/blob/master/examples/...
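
    Pointing a UI at it is mostly a base-URL swap; a small sketch, assuming the example server from that link is running on its default 127.0.0.1:8080, with streaming since that's what a chat UI typically wants:

      # Sketch: stream tokens from llama.cpp's OpenAI-compatible server.
      from openai import OpenAI

      client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
      stream = client.chat.completions.create(
          model="local",  # placeholder; the server answers with whatever model it was launched with
          messages=[{"role": "user", "content": "Write a haiku about local inference."}],
          stream=True,
      )
      for chunk in stream:
          delta = chunk.choices[0].delta.content
          if delta:
              print(delta, end="", flush=True)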

  • tensorrtllm_backend

    The Triton TensorRT-LLM Backend

  • Nvidia Triton Inference Server with the TensorRT-LLM backend:

    https://github.com/triton-inference-server/tensorrtllm_backe...

    It’s used by Mistral, AWS, Cloudflare, and countless others.

    vLLM, HF TGI, Ray Serve, etc. are certainly viable, but Triton has many truly unique and very powerful features (not to mention performance).

    100k DAU doesn't mean much on its own; you'd need a better understanding of the application: input tokens, generated output tokens, request rates, peaks, and so on, not to mention the required time to first token and tokens per second.

    Anyway, the point is Triton is just about the only thing out there for use in this general range and up.
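
    A quick back-of-envelope sketch of the kind of sizing that comment is getting at (every number below is hypothetical, purely to show the arithmetic):

      # Hypothetical sizing sketch: turn traffic assumptions into a throughput target.
      dau = 100_000          # daily active users (hypothetical)
      reqs_per_user = 5      # chat requests per user per day (hypothetical)
      out_tokens = 300       # generated tokens per request (hypothetical)
      peak_factor = 4        # peak-hour load vs. the daily average (hypothetical)

      avg_rps = dau * reqs_per_user / 86_400
      peak_rps = avg_rps * peak_factor
      peak_tok_per_s = peak_rps * out_tokens

      print(f"average {avg_rps:.1f} req/s, peak ~{peak_rps:.1f} req/s")
      print(f"peak decode throughput ~{peak_tok_per_s:,.0f} tokens/s across the fleet")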

  • lookma

    LookMa connects Android devices to locally-run LLMs

  • Ollama is great. If you want a GUI, LM Studio and Jan are great too.

    I'm building a React Native app to connect mobile devices to local LLM servers run with these programs.

    https://github.com/sampatt/lookma

  • model_navigator

    Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs.

  • - While keeping power utilization below X

    Model Navigator [3] will take the exported model and dynamically deploy the package to a Triton instance running on your actual inference-serving hardware, then generate requests against your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA parameters, and it will automatically regenerate the configuration for you.

    - Performance on a completely different level. TensorRT-LLM especially is extremely new and very early, but at high scale you can already start to see > 10k RPS on a single node.

    - gRPC support. Especially when using pre/post processing, ensembles, etc., you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine: it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, etc. can send/receive raw data/numpy/tensors.

    - DALI support [5]. Combined with everything above, you can add DALI to the processing chain to do things like take input image/audio/etc., copy it to the GPU once, GPU-accelerate scaling/conversion/resampling/whatever, and get output.

    vLLM and HF TGI are very cool and I use them in certain cases. The fact that you can give them an HF model and they just fire up with a single command and offer good performance is very impressive, but there are any number of reasons these providers use Triton. It's in a class of its own.

    [0] - https://mistral.ai/news/la-plateforme/

    [1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...

    [2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...

    [3] - https://github.com/triton-inference-server/model_navigator

    [4] - https://github.com/triton-inference-server/client/blob/main/...

    [5] - https://github.com/triton-inference-server/dali_backend
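
    To make the gRPC point [4] concrete, a minimal client-side sketch (the model and tensor names are placeholders patterned after a typical TensorRT-LLM ensemble; the real ones come from your config.pbtxt, and 8001 is Triton's default gRPC port):

      # Sketch: call a Triton model over gRPC (pip install tritonclient[grpc]).
      # "ensemble", "text_input", "max_tokens", and "text_output" are placeholders
      # taken from a typical TensorRT-LLM ensemble; check your own config.pbtxt.
      import numpy as np
      import tritonclient.grpc as grpcclient

      client = grpcclient.InferenceServerClient(url="localhost:8001")  # default gRPC port

      text = np.array([["What is Triton Inference Server?"]], dtype=object)
      max_tokens = np.array([[128]], dtype=np.int32)

      inputs = [
          grpcclient.InferInput("text_input", [1, 1], "BYTES"),
          grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
      ]
      inputs[0].set_data_from_numpy(text)
      inputs[1].set_data_from_numpy(max_tokens)

      result = client.infer(
          model_name="ensemble",  # placeholder: the ensemble chain, or an individual model
          inputs=inputs,
          outputs=[grpcclient.InferRequestedOutput("text_output")],
      )
      print(result.as_numpy("text_output"))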

  • client

    Triton Python, C++ and Java client libraries, and gRPC-generated client examples for Go, Java and Scala. (by triton-inference-server)

  • dali_backend

    The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's python API.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Show HN: Ellipsis – Automated PR reviews and bug fixes

    6 projects | news.ycombinator.com | 9 May 2024
  • Setup REST-API service of AI by using Local LLMs with Ollama

    3 projects | dev.to | 9 May 2024
  • Ollama v0.1.34 Is Out

    1 project | news.ycombinator.com | 8 May 2024
  • Ask HN: What do you use local LLMs for?

    2 projects | news.ycombinator.com | 7 May 2024
  • How to use Google Gemini AI for Agriculture Productivity

    2 projects | dev.to | 7 May 2024