Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

core

Mentions: 12 | Stars: 1,949 | Activity: 9.8 | Language: Python

Production ready AI assistant framework (by cheshire-cat-ai)

I haven't personally tried this for anything serious yet, but to get the thread started:
Cheshire Cat [0] looks promising. It's a framework for building AI assistants: you give it documents, which it stores as "memories" that it can retrieve later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.
[0] https://github.com/cheshire-cat-ai/core

Verba

Mentions: 4 | Stars: 2,228 | Activity: 8.9 | Language: Python

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate

So far the recommendations are mostly hosted, so here's one local: https://github.com/weaviate/Verba
I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either the GPT API or your local models through LiteLLM. (I'm running Ollama + dolphin-mixtral.)
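
For readers who have not used a local model server before, here is a minimal sketch of sending a prompt to a locally running Ollama instance with only the Python standard library. It assumes Ollama is listening on its default port (11434) and that the dolphin-mixtral model has already been pulled. The helper names `build_request` and `generate` are made up for this example, and this is not how Verba itself talks to the model (per the comment above, Verba goes through LiteLLM).

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="dolphin-mixtral"):
    """Build a POST request for Ollama's generate endpoint (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="dolphin-mixtral", timeout=120):
    """Send the prompt and return the generated text (needs a running Ollama)."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Summarize this paragraph: ...")` only works while the Ollama server is up; `build_request` can be inspected without any server running.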

langroid

Mentions: 15 | Stars: 1,509 | Activity: 9.9 | Language: Python

Harness LLMs with Multi-Agent Programming

Many services/platforms are careless/disingenuous when they claim they “train” on your documents, when they actually mean they do RAG.
An underappreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are, in principle, automatically or manually verifiable). You lose this citation ability when you finetune on your documents.
In Langroid (the Multi-Agent framework from ex-CMU/UW-Madison researchers) https://github.com/langroid/langroid
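
The citation benefit of RAG mentioned in this comment can be illustrated with a small, self-contained sketch. This is not Langroid's API: it is a toy retriever that scores documents by bag-of-words cosine similarity (real systems typically use embedding models) and then builds a prompt in which every passage carries a numbered source label the LLM can cite. The corpus, file names, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus of (source name, text) pairs; contents are made up.
DOCS = [
    ("handbook.md", "Employees accrue vacation days monthly and may carry over five days."),
    ("security.md", "Rotate API keys every ninety days and store them in the vault."),
    ("onboarding.md", "New employees receive a laptop and vault access on day one."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return up to k (source, text) pairs that share vocabulary with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), source, text) for source, text in docs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, text) for score, source, text in scored[:k] if score > 0]


def build_prompt(query, hits):
    """Number each retrieved passage so the model can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(hits, 1)
    )
    return (
        "Answer using only the numbered passages and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {query}"
    )


question = "How often should I rotate API keys?"
hits = retrieve(question, DOCS)
prompt = build_prompt(question, hits)
```

Because the source name travels with each passage, a reader (or a script) can check a cited answer against the original file, which is the property that is lost when the documents are baked into model weights by finetuning.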

khoj

Mentions: 50 | Stars: 4,786 | Activity: 9.9 | Language: Python

Your AI second brain. A copilot to get answers to your questions, whether they be from your own notes or from the internet. Use powerful, online (e.g. gpt4) or private, local (e.g. mistral) LLMs. Self-host locally or use our web app. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.

I'm a fan of Khoj. Been using it for months. https://github.com/khoj-ai/khoj

private-gpt

Mentions: 131 | Stars: 51,732 | Activity: 9.2 | Language: Python

Interact with your documents using the power of GPT, 100% privately, no data leaks

Run https://github.com/imartinez/privateGPT
Then
make ingest /path/to/folder/with/files
Then chat to the LLM.
Done.
Docs: https://docs.privategpt.dev/overview/welcome/quickstart

gpt4all

Mentions: 139 | Stars: 64,046 | Activity: 9.8 | Language: C++

gpt4all: run open-source LLMs anywhere

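GPT4All is a local desktop app with a Python API. It can answer questions over your documents through its LocalDocs feature, which retrieves relevant passages rather than training the model: https://gpt4all.io/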

gpt-researcher

Mentions: 4 | Stars: 8,463 | Activity: 9.6 | Language: Python

GPT based autonomous agent that does online comprehensive research on any given topic

Hey, GPT Researcher shows exactly how to do that with RAG. See here https://github.com/assafelovic/gpt-researcher

txtai

Mentions: 355 | Stars: 6,990 | Activity: 9.3 | Language: Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Since no one has mentioned it so far: I did just this recently with txtai in a few lines of code.
https://neuml.github.io/txtai/

h2ogpt

Mentions: 28 | Stars: 10,398 | Activity: 10.0 | Language: Python

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports Ollama, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

As others have said, you want RAG.
The most feature-complete implementation I've seen is h2ogpt[0] (not affiliated).
The code is kind of a mess (most of the logic is in an ~8000-line Python file), but it supports ingestion of everything from YouTube videos to docx, pdf, etc., either offline or from the web interface. It uses langchain and a ton of additional open source libraries under the hood. It can run directly on Linux, via Docker, or with one-click installers for Mac and Windows.
It has various model hosting implementations built in (transformers, exllama, llama.cpp), as well as support for model serving frameworks like vLLM and HF TGI, or just OpenAI.
You can also define your preferred embedding model along with various other parameters, but I've found the out-of-the-box defaults to be pretty sane and usable.
[0] - https://github.com/h2oai/h2ogpt

anything-llm

Mentions: 21 | Stars: 11,955 | Activity: 9.7 | Language: JavaScript

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.

anything-llm looks pretty interesting and easy to use: https://github.com/Mintplex-Labs/anything-llm

SecureAI-Tools

Mentions: 11 | Stars: 1,377 | Activity: 8.9 | Language: TypeScript

Private and secure AI tools for everyone's productivity.

Try https://github.com/SecureAI-Tools/SecureAI-Tools -- it's an open-source application layer for Retrieval-Augmented Generation (RAG). It allows you to use any LLM -- you can use OpenAI APIs, or run models locally with Ollama.

embedchain

Mentions: 6 | Stars: 8,434 | Activity: 9.8 | Language: Python

Personalizing LLM Responses

You can use embedchain[1] to connect various data sources and then get a RAG application running locally and in production very easily. Embedchain is an open source RAG framework, and it follows a conventional but configurable approach.
The conventional approach suits software engineers who may be less familiar with AI. The configurable approach suits ML engineers who have sophisticated use cases and want to configure chunking, indexing, and retrieval strategies.
[1]: https://github.com/embedchain/embedchain
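
To make "configuring chunking" concrete, here is a minimal sketch of one common strategy: fixed-size word chunks with a configurable overlap, so that a sentence cut at a chunk boundary still appears whole in a neighboring chunk. This is a generic illustration, not embedchain's API; the function name `chunk_text` and its parameters are assumptions made for this example.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words. Both parameters are the kind of
    knob a configurable RAG pipeline would expose (hypothetical names).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

Smaller chunks tend to give more precise retrieval hits but less context per hit, and larger overlap reduces boundary losses at the cost of indexing more duplicated text, which is why these values are usually left tunable.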

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
llm chatgpt AI semantic-search openai-api
Post date: 24 Dec 2023
