Show HN: We got fine-tuning Mistral-7B to not suck

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • unsloth

    Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory

  • Unsloth’s Colab notebooks for fine-tuning Mistral-7B are super easy to use and run fine in just about any Colab instance:

    https://github.com/unslothai/unsloth

    It’s my default now for experimenting and basic training. If I want to get into the weeds with the training, I use axolotl, but 9/10, it’s not really necessary.
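
    For context, a minimal sketch of what those notebooks roughly boil down to, assuming the unsloth and trl packages (argument names vary across trl versions; the model name, LoRA hyperparameters, and toy dataset below are illustrative placeholders, not the notebooks' exact settings):

    ```python
    from unsloth import FastLanguageModel
    from datasets import Dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # Load a 4-bit quantised Mistral-7B base (placeholder model name).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/mistral-7b-bnb-4bit",
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach LoRA adapters so only a small fraction of weights are trained.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing=True,
    )

    # Toy dataset: any HF Dataset with a formatted "text" column works.
    dataset = Dataset.from_dict({"text": [
        "### Instruction:\nSay hello.\n\n### Response:\nHello!",
    ]})

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            max_steps=60,
            learning_rate=2e-4,
            output_dir="outputs",
        ),
    )
    trainer.train()
    ```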

  • helix

    Multi-node production AI stack. Run the best of open source AI easily on your own servers. Create your own AI by fine-tuning open source models (by helixml)

  • If you look at the source [1] you can see how they solved their "what are the doctors going to do?" problem. It is literally included in one of the prompts now:

    Users tend to ask broad, vague questions of the document in order to test that the system is working. We want those queries to work well. For example, a user would ask "what are the doctors going to do?" of a document that is about a junior doctors' strike. Take this into account when generating the questions - in particular, refer to noun phrases by less specific descriptions, so for example instead of "junior doctors", say "doctors" in your questions.

    [1]: https://github.com/helixml/helix/blob/main/api/pkg/dataprep/...
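
    To make that concrete, here is a rough sketch (not Helix's actual code; the instruction text is paraphrased from the quote above) of how that guidance can be baked into a question-generation prompt when preparing fine-tuning data:

    ```python
    # Illustrative only: a question-generation prompt that embeds the
    # "broad, vague questions" guidance quoted above. Not the Helix source.
    QGEN_SYSTEM_PROMPT = (
        "You generate question/answer pairs from a document so a model can be "
        "fine-tuned on them. Users tend to ask broad, vague questions to test "
        "that the system is working, and those queries should still work well. "
        "Refer to noun phrases by less specific descriptions: if the document "
        "says 'junior doctors', ask about 'doctors'."
    )

    def build_qgen_messages(document_chunk: str, num_questions: int = 5) -> list[dict]:
        """Assemble OpenAI-style chat messages for a question-generation call."""
        return [
            {"role": "system", "content": QGEN_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Generate {num_questions} question/answer pairs "
                f"for this document:\n\n{document_chunk}"
            )},
        ]
    ```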

  • lamini-sdk

  • https://github.com/lamini-ai/lamini-sdk/tree/main/03_RAG

    You can implement RAG in 80 lines of Python and 0 libraries.
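
    In that spirit, here is a standard-library-only sketch of the retrieval-and-prompt half of RAG (this is not the Lamini code; the final generation call is whatever model or endpoint you already use):

    ```python
    import math
    import re
    from collections import Counter

    def tokenize(text: str) -> Counter:
        """Lowercased bag-of-words counts; no external tokenizer needed."""
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two bag-of-words vectors."""
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
        """Return the k document chunks most similar to the query."""
        q = tokenize(query)
        ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
        return ranked[:k]

    def build_prompt(query: str, chunks: list[str]) -> str:
        """Stuff the retrieved context into a plain instruction prompt."""
        context = "\n---\n".join(retrieve(query, chunks))
        return (f"Answer using only the context below.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

    # Pass build_prompt(...) to whichever LLM you already have access to.
    ```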

  • vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

  • Great question! Scheduling workloads onto GPUs in a way that uses VRAM efficiently was quite the challenge.

    What we found was that the I/O latency of loading model weights into VRAM will kill responsiveness if you don't "re-use" sessions (i.e. keep the model weights loaded and run multiple inference sessions over the same loaded weights).

    Obviously projects like https://github.com/vllm-project/vllm exist but we needed to build out a scheduler that can run a fleet of GPUs for a matrix of text/image vs inference/finetune sessions.

    disclaimer: I work on Helix
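
    To illustrate the session re-use idea (a toy sketch, not Helix's scheduler): keep loaded models resident in VRAM keyed by model ID and evict the least recently used one only when a VRAM budget is exceeded, so repeat requests skip the weight-loading latency entirely.

    ```python
    from collections import OrderedDict

    class WarmModelCache:
        """Toy illustration of session re-use: keep loaded models resident in
        VRAM and evict the least recently used one when the budget is exceeded."""

        def __init__(self, vram_budget_gb: float, loader):
            self.budget = vram_budget_gb
            self.loader = loader         # loader(model_id) -> (runner, vram_gb)
            self.warm = OrderedDict()    # model_id -> (runner, vram_gb)
            self.used = 0.0

        def get(self, model_id: str):
            if model_id in self.warm:
                self.warm.move_to_end(model_id)   # mark as most recently used
                return self.warm[model_id][0]     # warm path: no weight-loading I/O
            runner, size = self.loader(model_id)  # cold path: load weights into VRAM
            while self.used + size > self.budget and self.warm:
                _, (_, freed) = self.warm.popitem(last=False)   # evict LRU model
                self.used -= freed
            self.warm[model_id] = (runner, size)
            self.used += size
            return runner

    # Demo with a fake loader; a real one would construct e.g. a vLLM engine.
    cache = WarmModelCache(24.0, loader=lambda mid: (f"<runner for {mid}>", 14.0))
    cache.get("mistral-7b")   # cold: loads weights
    cache.get("mistral-7b")   # warm: reuses the already-loaded weights
    ```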
