I've been experimenting with deploying a model on two platforms: vLLM and TGI. With the standard fp16 version, both platforms perform comparably. However, I observed a significant performance gap when deploying the GPTQ 4-bit version on TGI as opposed to vLLM.
The models are TheBloke/Llama2-7B-fp16 and TheBloke/Llama2-7B-GPTQ. I'm benchmarking with 1000 prompts at a request rate (number of requests per second) of 10. By default, vLLM does not support GPTQ, so I'm using this version: vLLM-GPTQ.
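For context on the load pattern: serving benchmarks like this typically don't send requests at fixed intervals but model arrivals as a Poisson process, drawing exponentially distributed gaps with mean 1/rate. A minimal sketch of that pacing logic (the function name and seed are my own, not from any benchmark script):

```python
import random

def poisson_arrival_times(num_prompts: int, request_rate: float, seed: int = 0):
    """Return cumulative send times (seconds) for num_prompts requests
    arriving as a Poisson process with the given mean rate (req/s)."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_prompts):
        # Exponential inter-arrival gap with mean 1/request_rate.
        t += rng.expovariate(request_rate)
        times.append(t)
    return times

# 1000 prompts at 10 req/s: dispatching alone takes roughly 100 s,
# so throughput differences between backends show up as queueing delay.
times = poisson_arrival_times(1000, 10.0)
```

Because requests are dispatched on this schedule regardless of how fast the server responds, a slower backend (here, TGI with GPTQ) accumulates a growing queue, which is why the gap shows up in latency rather than just raw throughput.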