[D] Would a Tesla M40 provide cheap inference acceleration for self-hosted LLMs?

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  1. llama.cpp

    LLM inference in C/C++

    Not what you asked, but there are projects out there that run LLaMA and its descendants on CPU only, using system RAM. See https://github.com/ggerganov/llama.cpp/discussions/643 for a discussion about running Vicuna-13B in particular. The 4-bit quantized model should run comfortably within a 16 GB system.
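
    For a concrete sense of what that looks like, here is a minimal sketch using the llama-cpp-python bindings (a Python wrapper around llama.cpp); the model filename below is a placeholder for whichever 4-bit quantized file you downloaded.

        from llama_cpp import Llama

        # Load a 4-bit quantized model entirely into system RAM (no GPU needed).
        # The filename is a placeholder; point this at your own quantized model.
        llm = Llama(model_path="./models/vicuna-13b-q4_0.bin", n_threads=8)

        # Run a single completion on the CPU.
        output = llm("Q: Is a Tesla M40 still useful for inference? A:", max_tokens=64)
        print(output["choices"][0]["text"])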

  2. nvidia-docker

    (Discontinued) Build and run Docker containers leveraging NVIDIA GPUs
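
    Its replacement, the NVIDIA Container Toolkit, exposes GPUs to containers through a device request. As a rough sketch using the Docker SDK for Python (assumes the toolkit is installed on the host; the image tag is illustrative):

        import docker

        client = docker.from_env()

        # Request all available GPUs for the container, the modern equivalent
        # of what the old nvidia-docker wrapper did.
        output = client.containers.run(
            "nvidia/cuda:12.2.0-base-ubuntu22.04",  # illustrative image tag
            "nvidia-smi",
            device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
            remove=True,
        )
        print(output.decode())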

  3. nvidia-patch

    This patch removes the restriction on the maximum number of simultaneous NVENC video encoding sessions that Nvidia imposes on consumer-grade GPUs.

  4. vGPU_LicenseBypass

    A simple script that works around Nvidia vGPU licensing with a scheduled task.

  5. text-generation-webui

    A Gradio web UI for Large Language Models with support for multiple inference backends.

    I've been using the oobabooga text-generation web UI (the one-click installer) and it works mostly fine. You just need to make sure you have a model that works with it, such as one of the 4-bit, group-size-128 quantized models, and set the right arguments for the model in start-webui.bat, as sketched below.
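
    For illustration, the launch line in start-webui.bat might look something like the following; the exact flag names have varied between versions of the web UI, and the model name is a placeholder:

        rem Illustrative only: flags for a 4-bit, group-size-128 GPTQ model.
        call python server.py --model vicuna-13b-4bit-128g --wbits 4 --groupsize 128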

  6. turbopilot

    (Discontinued) TurboPilot is an open-source, large-language-model-based code-completion engine that runs locally on CPU.

    I don't know if this applies to your use case, but it would probably work if you are looking for an LLM to help with programming. I haven't really played around with it, but it may also work for general LLM tasks; it doesn't have a web UI, though.
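
    Since it has no web UI, you would talk to it over HTTP instead. A hedged sketch follows; the port and endpoint path are assumptions based on its FauxPilot-compatible API, so check the project README for the exact URL.

        import requests

        # Port 18080 and this endpoint path are assumptions based on
        # turbopilot's FauxPilot-compatible API; verify against the README.
        resp = requests.post(
            "http://localhost:18080/v1/engines/codegen/completions",
            json={"prompt": "def fibonacci(n):", "max_tokens": 32},
        )
        print(resp.json()["choices"][0]["text"])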

  7. simpleAI

    An easy way to host your own AI API and expose alternative models, while being compatible with "open" AI clients.

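    Because it mimics the OpenAI API, existing client code can simply be pointed at it. A minimal sketch using the pre-1.0 openai package interface, assuming a simpleAI instance serving locally (the base URL and model id are placeholders, not project defaults):

        import openai

        # Point the standard OpenAI client at a local simpleAI server.
        openai.api_base = "http://localhost:8080"  # placeholder URL
        openai.api_key = "unused-but-required"

        completion = openai.Completion.create(
            model="my-local-llama",  # hypothetical model registered with simpleAI
            prompt="Explain GPU memory bandwidth in one sentence.",
            max_tokens=60,
        )
        print(completion.choices[0].text)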

NOTE: The number of mentions on this list reflects mentions in common posts plus user-suggested alternatives, so a higher number generally indicates a more popular project.


Related posts

  • Plex setup through Docker + Nvidia card, but hardware acceleration stops working after some time

    2 projects | /r/PleX | 3 Jun 2023
  • Seeking Guidance on Leveraging Local Models and Optimizing GPU Utilization in containerized packages

    1 project | /r/LocalLLaMA | 21 May 2023
  • Which GPU for HW transcoding in PMS: Intel Arc or Nvidia?

    1 project | /r/PleX | 20 Apr 2023
  • Help! Accelerated-GPU with Cuda and CuPy

    1 project | /r/wsl2 | 8 Apr 2023
  • Plex Transcode (VC1 (HW) 1080p H264 (HW) 1080p) on Pixel 7 Pro

    1 project | /r/PleX | 3 Apr 2023
