GPT4.5 or GPT5 being tested on LMSYS?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • pvmigrate

  • None of the fact-based queries I have asked so far has been answered 100% correctly by any LLM, including this one.

    Here are some examples of the worst performing:

    "What platform front rack fits a Stromer ST2?": The answer is the Racktime ViewIt. Nothing, not even Google, seems to get this one. Discord gives the right answer.

    "Is there a pre-existing controller or utility to migrate persistent volume claims from one storage class to another in the open source Kubernetes ecosystem?" It said no (wrong) and then suggested another approach, partly based on Velero, that was not correct if you know what Velero does in those particular commands. Discord communities give the right answer, such as `pvmigrate` (https://github.com/replicatedhq/pvmigrate).

    Here is something more representative:

    "What alternatives to Gusto would you recommend? Create a table showing the payroll provider in a column, the base monthly subscription price, the monthly price per employee, and the total cost for 3 full time employees, considering that the employees live in two different states" This and Claude do a good job, but do not correctly retrieve all the prices. Claude omitted Square Payroll, which is really the "right answer" to this query. Google would never be able to answer this "correctly."

  • gpt-3

    GPT-3: Language Models are Few-Shot Learners (discontinued)

  • >I wasn't talking about "state of the art LLMs," I am aware that commercial offerings are much better trained in Spanish. This was a thought experiment based on comments from people testing GPT-3.5 with Swahili.

    A thought experiment based on other people's comments about another language. So... no. Fabricating failure modes from their constructed ideas about how LLMs work seems to be a frustratingly common occurrence in these kinds of discussions.

    >Frustratingly, just a few months ago I read a paper describing how LLMs excessively rely on English-language representations of ideas, but now I can't find it.

    Most LLMs are trained overwhelmingly on English. GPT-3's training dataset was 92.6% English. https://github.com/openai/gpt-3/blob/master/dataset_statisti...

    That the models are as proficient as they are is evidence enough of knowledge transfer clearly happening. https://arxiv.org/abs/2108.13349. If you trained a model on the Catalan tokens GPT-3 was trained on alone, you'd just get a GPT-2 level gibberish model at best.

    Anyway, these are some interesting papers:

    How do languages influence each other? Studying cross-lingual data sharing during LLM fine-tuning - https://arxiv.org/pdf/2305.13286

    Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer - https://arxiv.org/abs/2404.04042

    Multilingual LLMs are Better Cross-lingual In-context Learners with Alignment - https://arxiv.org/abs/2305.05940

    It's not like there is perfect transfer, but the idea that there's none at all seemed so ridiculous to me (which is why I asked the first question). Models would be utterly useless in multilingual settings if that were really the case.

  • FastChat

    An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

  • gpt2-chatbot isn't the only "mystery model" on LMSYS. Another is "deluxe-chat".

    When asked about it in October last year, LMSYS replied [0]: "It is an experiment we are running currently. More details will be revealed later."

    One distinguishing feature of "deluxe-chat" is that, although it gives high-quality answers, it is very slow, so slow that the arena displays a warning whenever it is invoked.

    [0] https://github.com/lm-sys/FastChat/issues/2527

