Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • core

    Production-ready AI assistant framework (by cheshire-cat-ai)

  • I haven't personally tried this for anything serious yet, but to get the thread started:

    Cheshire Cat [0] looks promising. It's a framework for building AI assistants: you feed it documents, which it stores as "memories" that can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.

    [0] https://github.com/cheshire-cat-ai/core

  • Verba

    Retrieval Augmented Generation (RAG) chatbot powered by Weaviate

  • So far the recommendations are mostly hosted, so here's one local: https://github.com/weaviate/Verba

    I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either the GPT API or your local models through LiteLLM. (I'm running Ollama + dolphin-mixtral.)
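
For reference, a minimal sketch of the kind of LiteLLM routing the commenter describes, pointing chat calls at a local Ollama model. The model string matches the commenter's setup, and the endpoint is Ollama's default port; this is an untested illustration, not Verba's internals.

```python
# Sketch: route an OpenAI-style chat call to a local Ollama model via LiteLLM.
# Assumes `ollama serve` is running and dolphin-mixtral has been pulled.

def build_messages(question, context=""):
    """Build an OpenAI-style message list; context holds any retrieved text."""
    content = f"{context}\n\n{question}".strip()
    return [{"role": "user", "content": content}]

def ask(question):
    from litellm import completion  # lazy import; pip install litellm
    response = completion(
        model="ollama/dolphin-mixtral",     # provider/model format LiteLLM expects
        messages=build_messages(question),
        api_base="http://localhost:11434",  # Ollama's default local endpoint
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Summarize my notes on the Q3 roadmap."))
```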

  • langroid

    Harness LLMs with Multi-Agent Programming

  • Many services/platforms are careless or disingenuous when they claim to “train” on your documents, when what they actually do is RAG.

    An under-appreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are, in principle, automatically or manually verifiable). You lose this citation ability when you finetune on your documents.

    See Langroid (the Multi-Agent framework from ex-CMU/UW-Madison researchers): https://github.com/langroid/langroid
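
To make the citation point concrete, here is a toy, dependency-free sketch of the RAG pattern: because retrieval keeps every chunk paired with its source, the prompt (and therefore the answer) can cite where the evidence came from. Word overlap stands in for real embeddings; everything here is illustrative, not Langroid's implementation.

```python
# Toy RAG: retrieval tracks which source each chunk came from, so the
# final answer can cite its sources -- something finetuning cannot do.

def tokens(text):
    """Lowercase word set with trailing punctuation stripped (toy tokenizer)."""
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(query, corpus, k=2):
    """Rank (source, chunk) pairs by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda item: len(q & tokens(item[1])), reverse=True)
    return ranked[:k]

def build_prompt(query, hits):
    """Ask the LLM to answer from context and cite sources in [brackets]."""
    context = "\n".join(f"[{src}] {text}" for src, text in hits)
    return ("Answer using only the context below; cite sources in [brackets].\n"
            f"Context:\n{context}\n\nQuestion: {query}")

corpus = [
    ("handbook.md", "Vacation requests go through the HR portal."),
    ("faq.md", "The HR portal password resets every 90 days."),
    ("lunch.md", "The cafeteria serves tacos on Tuesdays."),
]
hits = retrieve("How do I request vacation time?", corpus)
prompt = build_prompt("How do I request vacation time?", hits)
```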

  • khoj

    Your AI second brain. A copilot to get answers to your questions, whether they be from your own notes or from the internet. Use powerful, online (e.g. GPT-4) or private, local (e.g. Mistral) LLMs. Self-host locally or use our web app. Access from Obsidian, Emacs, the desktop app, the web, or WhatsApp.

  • I'm a fan of Khoj. Been using it for months. https://github.com/khoj-ai/khoj

  • private-gpt

    Interact with your documents using the power of GPT, 100% privately, no data leaks

  • Run https://github.com/imartinez/privateGPT

    Then

    make ingest /path/to/folder/with/files

    Then chat to the LLM.

    Done.

    Docs: https://docs.privategpt.dev/overview/welcome/quickstart

  • gpt4all

    gpt4all: run open-source LLMs anywhere

  • Gpt4all is a local desktop app with a Python API that can be trained on your documents: https://gpt4all.io/
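
A hedged sketch of GPT4All's Python bindings is below. Note that "trained on your documents" here really means grounding answers in them: the sketch just stuffs document text into the prompt (the desktop app's LocalDocs feature handles documents more directly). The model filename is illustrative.

```python
# Sketch: load a local model with gpt4all's Python bindings and answer a
# question grounded in a document by prepending its text to the prompt.

def grounded_prompt(doc_text, question, limit=2000):
    """Prepend (truncated) document text as context for the question."""
    return f"Context:\n{doc_text[:limit]}\n\nQuestion: {question}"

def main():
    from gpt4all import GPT4All  # pip install gpt4all
    model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")  # downloads on first use
    doc_text = open("notes.txt").read()
    with model.chat_session():
        print(model.generate(grounded_prompt(doc_text, "What changed in Q3?"),
                             max_tokens=256))

if __name__ == "__main__":
    main()
```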

  • gpt-researcher

    GPT based autonomous agent that does online comprehensive research on any given topic

  • Hey, GPT Researcher shows exactly how to do that with RAG. See here https://github.com/assafelovic/gpt-researcher

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • Since no one has mentioned it so far: I did just this recently with txtai in a few lines of code.

    https://neuml.github.io/txtai/
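
In the same spirit, a hedged sketch of that few-lines workflow with txtai. The (id, text, tags) tuple format and the result fields follow txtai's documented defaults, but this is an untested illustration; file paths double as document ids so hits point back to their source.

```python
# Sketch: index local text files with txtai, then search them semantically.

from pathlib import Path

def load_documents(folder):
    """Read every .txt file in a folder into (path, text) pairs."""
    return [(str(p), p.read_text()) for p in sorted(Path(folder).glob("*.txt"))]

def main():
    from txtai import Embeddings  # pip install txtai
    embeddings = Embeddings(content=True)  # store the text alongside the vectors
    docs = load_documents("notes")
    embeddings.index((path, text, None) for path, text in docs)
    for hit in embeddings.search("vacation policy", 3):
        print(hit["id"], round(hit["score"], 3), hit["text"][:80])

if __name__ == "__main__":
    main()
```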

  • h2ogpt

    Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports Ollama, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

  • As others have said you want RAG.

    The most feature complete implementation I've seen is h2ogpt[0] (not affiliated).

    The code is kind of a mess (most of the logic is in an ~8,000-line Python file) but it supports ingestion of everything from YouTube videos to docx, pdf, etc. - either offline or from the web interface. It uses langchain and a ton of additional open source libraries under the hood. It can run directly on Linux, via Docker, or with one-click installers for Mac and Windows.

    It has various model hosting implementations built in - transformers, exllama, llama.cpp as well as support for model serving frameworks like vLLM, HF TGI, etc or just OpenAI.

    You can also define your preferred embedding model along with various other parameters but I've found the out of box defaults to be pretty sane and usable.

    [0] - https://github.com/h2oai/h2ogpt

  • anything-llm

    The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.

  • anything-llm looks pretty interesting and easy to use https://github.com/Mintplex-Labs/anything-llm

  • SecureAI-Tools

    Private and secure AI tools for everyone's productivity.

  • Try https://github.com/SecureAI-Tools/SecureAI-Tools -- it's an open-source application layer for Retrieval-Augmented Generation (RAG). It allows you to use any LLM -- you can use OpenAI APIs, or run models locally with Ollama.

  • embedchain

    Personalizing LLM Responses

    You can use embedchain[1] to connect various data sources and then get a RAG application running locally and in production very easily. Embedchain is an open-source RAG framework that follows a conventional but configurable approach.

    The conventional approach suits software engineers who may be less familiar with AI. The configurable approach suits ML engineers who have sophisticated use cases and want to configure chunking, indexing, and retrieval strategies.

    [1]: https://github.com/embedchain/embedchain
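
A hedged sketch of the conventional path described above: create an app with defaults, add sources, query. The source paths are placeholders, and embedchain's defaults assume an OpenAI key in the environment; untested here.

```python
# Sketch: embedchain's conventional approach -- default chunking, embedding,
# and storage; you just add sources and query.

def clean_sources(sources):
    """Drop empty entries and stray whitespace (trivial illustrative helper)."""
    return [s.strip() for s in sources if s and s.strip()]

def main():
    from embedchain import App  # pip install embedchain
    app = App()  # conventional defaults; the configurable path uses a config file
    for src in clean_sources(["./docs/handbook.pdf", "https://example.com/faq"]):
        app.add(src)  # embedchain infers the loader from the source type
    print(app.query("What is the vacation policy?"))

if __name__ == "__main__":
    main()
```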

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
