ragrank vs uptrain

| | ragrank | uptrain |
|---|---|---|
| Mentions | 1 | 35 |
| Stars | 23 | 2,059 |
| Growth | - | 2.9% |
| Activity | 9.5 | 9.6 |
| Latest commit | 21 days ago | 8 days ago |
| Language | Python | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
ragrank
-
I created Ragrank 🎯 - an open-source ecosystem to evaluate LLMs and RAG.
Feel free to contribute on GitHub 💚
uptrain
-
A Developer's Guide to Evaluating LLMs!
You can create an account with UpTrain and generate an API key for free. Please visit https://uptrain.ai/
-
Evaluation of OpenAI Assistants
Currently seeking feedback on the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
-
Integrating Spade: Synthesizing Assertions for LLMs into My OSS Project
d. Using an integer programming optimizer to find the optimal evaluation set with maximum coverage while respecting failure, accuracy, and subsumption constraints
Their results are impressive. You can look at the SPADE paper for more details: https://arxiv.org/pdf/2401.03038.pdf
2. Running these evaluations reliably is tricky: Recently, using LLMs as evaluators has emerged as a promising alternative to human evaluation and has proven quite effective in improving the accuracy of LLM applications. However, it is still difficult to run these evals reliably, i.e., with high correlation to human judgments and stability across multiple runs. UpTrain is an open-source framework for evaluating LLM applications that provides high-quality scores. It allows one to define custom evaluations via the GuidelineAdherence check, where one specifies any custom guideline in plain English and checks whether the LLM follows it. Additionally, it provides an easy interface to run these evaluations on production responses with a single API call. This allows one to systematically leverage frameworks like UpTrain to catch wrong LLM outputs.
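For illustration, here is a minimal sketch of what such a guideline check could look like; the class and parameter names (EvalLLM, GuidelineAdherence, guideline, guideline_name) follow the repo's README at the time and should be read as assumptions rather than a fixed API.

```python
# Minimal sketch (assumed UpTrain interface): score responses against a
# plain-English guideline using an LLM judge.
from uptrain import EvalLLM, GuidelineAdherence

eval_llm = EvalLLM(openai_api_key="sk-...")  # the judge LLM needs an API key

data = [{
    "question": "How do I reset my password?",
    "response": "Go to Settings > Account > Reset Password and follow the email link.",
}]

check = GuidelineAdherence(
    guideline="The response must give self-serve steps before suggesting contacting support.",
    guideline_name="self_serve_first",
)

# One call evaluates the whole batch; each row gets a score plus an explanation.
results = eval_llm.evaluate(data=data, checks=[check])
print(results)
```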
I am one of the maintainers of UpTrain, and we recently integrated the SPADE framework into our open-source repo (https://github.com/uptrain-ai/uptrain/). The idea is simple:
-
Sharing learnings from evaluating Million+ LLM responses
b. Task-dependent: tonality match with the given persona, creativity, interestingness, etc. Your prompt can play a big role here.
3. Evaluating Reasoning Capabilities: Includes dimensions like logical correctness (right conclusions), logical robustness (consistency under minor input changes), logical efficiency (shortest solution path), and common-sense understanding (grasping common concepts). One can't do much here beyond prompting techniques like CoT; performance primarily depends on the LLM chosen.
4. Custom Evaluations: Many applications require customized metrics tailored to their specific needs, such as adherence to custom guidelines, checks for certain keywords, etc. (a sketch combining these checks follows below).
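As a rough sketch of how these dimensions map onto checks in the open-source package, the snippet below combines a built-in relevance check, a persona/tone critique, and a custom guideline in one evaluate call; the check names (Evals.RESPONSE_RELEVANCE, CritiqueTone, GuidelineAdherence) are taken from the repo's README and should be treated as assumptions, not a frozen spec.

```python
# Sketch (assumed check names): combine built-in and custom evaluations in one call.
from uptrain import EvalLLM, Evals, CritiqueTone, GuidelineAdherence

eval_llm = EvalLLM(openai_api_key="sk-...")

data = [{
    "question": "Explain overfitting to a 10-year-old.",
    "response": "Overfitting is like memorizing answers to one quiz and then failing a slightly different one.",
}]

results = eval_llm.evaluate(
    data=data,
    checks=[
        Evals.RESPONSE_RELEVANCE,                      # is the answer on-topic?
        CritiqueTone(llm_persona="friendly teacher"),  # tonality match with the given persona
        GuidelineAdherence(                            # custom, plain-English requirement
            guideline="Avoid jargon and use an everyday analogy.",
            guideline_name="kid_friendly",
        ),
    ],
)
print(results)
```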
You can read the full blog here (https://uptrain.ai/blog/how-to-evaluate-your-llm-applications). Hope you find it useful. I am one of the developers of UpTrain, an open-source package to evaluate LLM applications (https://github.com/uptrain-ai/uptrain).
Would love to get feedback from the HN community.
- Show HN: UpTrain (YC W23) – open-source tool to evaluate LLM response quality
-
Introducing UpTrain - Open-source LLM evaluator 🔎
Open-source repo: https://github.com/uptrain-ai/uptrain
-
Launching UpTrain - an open-source LLM testing tool to help check the performance of your LLM applications
You can check out the project at https://github.com/uptrain-ai/uptrain - we would love to hear feedback from the community
- [P] A Practical Guide to Enhancing Models for Custom Use-cases
-
[D] Any options for using GPT models with proprietary data?
I am building an open-source project which helps collect a high-quality retraining dataset for fine-tuning LLMs. Check out https://github.com/uptrain-ai/uptrain
-
[D] Should we draw inspiration from the deep learning/computer vision world for fine-tuning LLMs?
P.S. I am building an open-source project, UpTrain (https://github.com/uptrain-ai/uptrain), which helps data scientists do so. We just wrote a blog on how this principle can be applied to fine-tune an LLM for a conversation summarization task. Check it out here: https://github.com/uptrain-ai/uptrain/tree/main/examples/coversation_summarization
What are some alternatives?
lora - Using Low-rank adaptation to quickly fine-tune diffusion models.
stanford_alpaca - Code and documentation to train Stanford's Alpaca models, and generate the data.
aim - Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
deepchecks - Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.
nannyml - nannyml: post-deployment data science in python
frouros - Frouros: an open-source Python library for drift detection in machine learning systems.