S-LoRA Alternatives
Similar projects and alternatives to S-LoRA
- text-generation-webui: A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
- FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
- litellm: Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs).
- ollama-webui: Discontinued. A ChatGPT-style web UI for LLMs (formerly Ollama WebUI). Moved to: https://github.com/open-webui/open-webui
- big-AGI: Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, and much more. Deploy on-prem or in the cloud.
S-LoRA reviews and mentions
- Representation Engineering: Mistral-7B on Acid
  You can also batch requests using different LoRAs. See "S-LoRA: Serving Thousands of Concurrent LoRA Adapters": https://arxiv.org/abs/2311.03285
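  The batching trick referenced in that comment is the core of S-LoRA: the frozen base weight is shared by every request in the batch, while each request gathers its own small adapter matrices. Below is a minimal PyTorch sketch of that idea; the shapes, names, and gather-based dispatch are illustrative assumptions, not the paper's actual paged-memory kernels.

  ```python
  import torch

  def batched_lora_linear(x, W, A, B, adapter_ids, scaling=1.0):
      """Shared base matmul plus a per-request LoRA delta.

      x           : (batch, d_in)  one row of activations per request
      W           : (d_in, d_out)  frozen base weight, shared by all requests
      A           : (n_adapters, d_in, r)  per-adapter down-projections
      B           : (n_adapters, r, d_out) per-adapter up-projections
      adapter_ids : (batch,)  which adapter each request uses
      """
      base = x @ W                         # one shared matmul for the whole batch
      A_sel = A[adapter_ids]               # (batch, d_in, r)
      B_sel = B[adapter_ids]               # (batch, r, d_out)
      # Per-request low-rank delta: x @ A_i @ B_i
      delta = torch.bmm(torch.bmm(x.unsqueeze(1), A_sel), B_sel).squeeze(1)
      return base + scaling * delta

  # Four requests in one batch, each routed to its own adapter.
  d_in, d_out, r, n_adapters = 64, 64, 8, 3
  x = torch.randn(4, d_in)
  W = torch.randn(d_in, d_out)
  A = torch.randn(n_adapters, d_in, r) * 0.01
  B = torch.randn(n_adapters, r, d_out) * 0.01
  y = batched_lora_linear(x, W, A, B, torch.tensor([0, 1, 2, 0]))
  print(y.shape)  # torch.Size([4, 64])
  ```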
- LM Studio – Discover, download, and run local LLMs
  Depending on what you mean by "production", you'll probably want to look at "real" serving implementations such as HF TGI, vLLM, lmdeploy, and Triton Inference Server (tensorrt-llm). There are also more bespoke implementations for things like serving large numbers of LoRA adapters [0].
  These are heavily optimized for efficient memory usage, performance, and responsiveness when serving large numbers of concurrent requests/users, in addition to features like model versioning, hot load/reload, and Prometheus metrics.
  One major difference is that at this level, many of the more aggressive memory-optimization techniques, along with CPU support, aren't even considered. Generally speaking you get GPTQ and possibly AWQ quantization, plus their optimizations, and CUDA only. The target users are often running A100/H100s and just trying to need fewer of them; support for lower-VRAM cards, older CUDA compute architectures, and so on comes second to that (for the most part).
  [0] https://github.com/S-LoRA/S-LoRA
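  From the client side, servers that expose many LoRA adapters behind an OpenAI-compatible API (vLLM's multi-LoRA mode works this way) typically let the request's `model` field select the adapter. A hedged sketch, assuming a local server on port 8000 and two hypothetical adapter names:

  ```python
  from openai import OpenAI

  # Point the standard OpenAI client at a local OpenAI-compatible server.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  # Requests naming different adapters can be in flight at the same time;
  # the server batches them against the single shared base model.
  for adapter in ("sql-adapter", "chat-adapter"):  # hypothetical names
      completion = client.completions.create(
          model=adapter,  # the adapter's registered name selects the LoRA
          prompt="Translate to SQL: list all users",
          max_tokens=64,
      )
      print(adapter, ":", completion.choices[0].text)
  ```

  This is the design point the comment above is getting at: an additional adapter costs a small amount of extra memory on top of one shared base model, rather than another full deployment.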
Stats
S-LoRA/S-LoRA is an open source project licensed under the Apache License 2.0, an OSI-approved license.
The primary programming language of S-LoRA is Python.