haltt4llm
This project is an attempt to create a common metric for testing LLMs' progress in eliminating hallucinations, which are currently the most serious obstacle to the widespread adoption of LLMs for many real purposes.
Hmm, but if I'm reading this code correctly, an answer is also marked correct if the text of the correct answer appears anywhere in the output, even if other, incorrect answers appear alongside it.
https://github.com/manyoso/haltt4llm/blob/main/take_test.py#...
So the above answer would have been marked correct were it not for the fact that it said "doubled its" rather than "double it".
Without seeing the log of answers marked correct, I'm skeptical that GPT4All, which seems to produce rambling prose for all of its incorrect answers, is actually picking one of the multiple-choice options the rest of the time. It seems like a model could get 100% 'correct' just by repeating back all five options.
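To make the concern concrete, here is a minimal sketch of the substring-style grading the comment describes. This is an assumed reconstruction of the flaw, not the actual take_test.py code; the function name and the example question are hypothetical:

```python
# Hypothetical sketch of substring-based grading (assumed logic, not the
# actual take_test.py implementation): an answer counts as correct if the
# correct option's text appears anywhere in the model's output.

def is_marked_correct(model_output: str, correct_answer: str) -> bool:
    # Naive check: the correct answer merely has to appear as a substring.
    return correct_answer.lower() in model_output.lower()

options = ["Paris", "London", "Berlin", "Madrid", "Rome"]
correct = "Paris"

# A terse, genuinely correct answer passes, as expected:
print(is_marked_correct("The answer is Paris.", correct))  # True

# But a rambling response that repeats every option back also passes,
# even though it never commits to a single choice:
rambling = "It might be Paris, or London, or Berlin, or Madrid, or Rome."
print(is_marked_correct(rambling, correct))  # True
```

Under this grading scheme, a model that simply echoes all five options for every question would score 100%, which is why a log of the answers marked correct would be needed to trust the reported numbers.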