-
khoj
Your AI second brain. Get answers to your questions, whether they're online or in your own notes. Use online AI models (e.g. GPT-4) or private, local LLMs (e.g. Llama 3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, the desktop app, the web, or WhatsApp.
-
gpt-researcher
GPT-based autonomous agent that does comprehensive online research on any given topic.
-
txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
-
h2ogpt
Private chat with a local GPT over documents, images, video, and more. 100% private, Apache 2.0. Supports Ollama, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
-
anything-llm
The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
I haven't personally tried this for anything serious yet, but to get the thread started:
Cheshire Cat [0] looks promising. It's a framework for building AI assistants: you feed an assistant documents, which it stores as "memories" that can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.
[0] https://github.com/cheshire-cat-ai/core
So far the recommendations are mostly hosted, so here's one local: https://github.com/weaviate/Verba
I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either the GPT API or your local models through LiteLLM. (I'm running Ollama + dolphin-mixtral.)
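For the LiteLLM piece, here's a minimal sketch of what the routing looks like, assuming Ollama is serving dolphin-mixtral on its default port (the model tag is just what I happen to run):

    from litellm import completion

    # Route an OpenAI-style chat request to a local Ollama model.
    response = completion(
        model="ollama/dolphin-mixtral",         # any Ollama model tag works here
        messages=[{"role": "user", "content": "Summarize my notes on RAG."}],
        api_base="http://localhost:11434",      # default Ollama endpoint
    )
    print(response.choices[0].message.content)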
Many services/platforms are careless/disingenuous when they claim they "train" on your documents, when what they actually do is RAG.
An under-appreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are in principle automatically/manually verifiable). You lose this citation ability when you finetune on your documents.
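To make the citation point concrete, here's a toy, self-contained sketch: each retrieved chunk keeps its source path, and the prompt asks the model to cite those tags. The keyword-overlap retriever is a stand-in for a real embeddings index, and the file paths are made up:

    # Chunks tagged with their source; a real system would chunk real files.
    CHUNKS = [
        ("notes/budget-2023.md", "The 2023 hardware budget was cut by 15%."),
        ("notes/roadmap.md", "The team plans to ship local inference in Q3."),
    ]

    def retrieve(question, k=2):
        # Naive keyword-overlap scoring, purely to keep the sketch runnable.
        words = set(question.lower().split())
        return sorted(CHUNKS, key=lambda c: -len(words & set(c[1].lower().split())))[:k]

    question = "What happened to the hardware budget?"
    context = "\n".join(
        f"[{i + 1}] ({src}) {text}" for i, (src, text) in enumerate(retrieve(question))
    )
    prompt = (
        "Answer using only the sources below and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    print(prompt)  # feed this to any LLM; each [n] maps back to a file path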
Another option is Langroid, a multi-agent framework from ex-CMU/UW-Madison researchers: https://github.com/langroid/langroid
I'm a fan of Khoj. Been using it for months. https://github.com/khoj-ai/khoj
Run https://github.com/imartinez/privateGPT, then:

    make ingest /path/to/folder/with/files

Then chat to the LLM. Done.
Docs: https://docs.privategpt.dev/overview/welcome/quickstart
Gpt4all is a local desktop app with a Python API that can answer questions over your documents (via retrieval, not actual training): https://gpt4all.io/
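If you just want the Python side, the bindings are pip-installable. A hedged sketch (the model file name is an assumption, and it downloads on first run; as far as I know the document Q&A "LocalDocs" feature lives in the desktop app, while the Python API is plain generation):

    from gpt4all import GPT4All

    # Loads (and on first run downloads) a local GGUF model; name is an assumption.
    model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
    with model.chat_session():
        print(model.generate("What is retrieval-augmented generation?", max_tokens=200))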
Hey, GPT Researcher shows exactly how to do that with RAG. See here: https://github.com/assafelovic/gpt-researcher
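For reference, the pip package's documented usage is roughly this (it expects OPENAI_API_KEY and TAVILY_API_KEY in the environment; the query is a placeholder):

    import asyncio
    from gpt_researcher import GPTResearcher

    async def main():
        # Conduct web research on the query, then write it up as a report.
        researcher = GPTResearcher(query="What is RAG?", report_type="research_report")
        await researcher.conduct_research()
        print(await researcher.write_report())

    asyncio.run(main())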
Since no one has mentioned it so far: I did just this recently with txtai in a few lines of code.
https://neuml.github.io/txtai/
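Roughly the few lines in question (the model choice and sample text are mine; pass the top hits to whatever LLM you like for the RAG step):

    from txtai.embeddings import Embeddings

    docs = ["First document text...", "Second document text..."]

    # content=True stores the original text alongside the vectors.
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
    embeddings.index([(i, text, None) for i, text in enumerate(docs)])

    for result in embeddings.search("what do my notes say about budgets?", 3):
        print(result["score"], result["text"])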
As others have said, you want RAG.
The most feature-complete implementation I've seen is h2ogpt [0] (not affiliated).
The code is kind of a mess (most of the logic is in an ~8,000-line Python file), but it supports ingestion of everything from YouTube videos to docx, pdf, etc., either offline or from the web interface. It uses LangChain and a ton of additional open-source libraries under the hood. It can run directly on Linux, via Docker, or with one-click installers for Mac and Windows.
It has various model-hosting implementations built in (Transformers, ExLlama, llama.cpp) as well as support for model-serving frameworks like vLLM and HF TGI, or just OpenAI.
You can also define your preferred embedding model along with various other parameters, but I've found the out-of-the-box defaults to be pretty sane and usable.
[0] - https://github.com/h2oai/h2ogpt
anything-llm looks pretty interesting and easy to use: https://github.com/Mintplex-Labs/anything-llm
Try https://github.com/SecureAI-Tools/SecureAI-Tools -- it's an open-source application layer for Retrieval-Augmented Generation (RAG). It works with any LLM: you can use the OpenAI API, or run models locally with Ollama.
You can use embedchain [1] to connect various data sources and then get a RAG application running locally and in production very easily (a minimal sketch follows the link below). Embedchain is an open-source RAG framework that follows a conventional but configurable approach.
The conventional approach suits software engineers who may be less familiar with AI; the configurable approach suits ML engineers who have sophisticated use cases and want to configure chunking, indexing, and retrieval strategies.
[1]: https://github.com/embedchain/embedchain
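The conventional path is about this small (the defaults use OpenAI, so OPENAI_API_KEY must be set; the sources are placeholders):

    from embedchain import App

    app = App()
    app.add("https://example.com/some-article")  # web page; type is auto-detected
    app.add("/path/to/local/file.pdf")           # local PDF
    print(app.query("What are the key points across these sources?"))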