Hi folks, back with an update to the HumanEval+ programming ranking I posted the other day, incorporating your feedback - and some closed models for comparison! Now with improved generation params and new models: Falcon, Starcoder, Codegen, Claude+, Bard, OpenAssistant, and more

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA

  • llm-humaneval-benchmarks

  • Additionally, I found one more unnecessary whitespace character in both my Alpaca and Vicuna prompts and removed it to better match the recommended prompts. Third, I tested a much broader set of prompt configurations. For each model in the chart above, I only included the best prompt configuration (which is noted after the model name). You can find the corresponding prompts here (a rough sketch of what these templates look like follows this list): https://github.com/my-other-github-account/llm-humaneval-benchmarks/blob/main/templates.py

  • can-ai-code

    Self-evaluating interview for AI coders

  • Starcoder/Codegen: As you all expected, the coding models do quite well at code! Of the OSS models, these perform the best. I still fall a few percent short of the advertised HumanEval+ results that some of these report in their papers using my prompt, settings, and parser - but it is important to note that I am simply counting the pass rate of single attempts for each of these models. So this is not directly comparable to the pass@1 metric as defined in the Codex paper (for reasons they discuss in said paper): my N is 1, their N is 200. If you see anyone provide pass@1 in their peer-reviewed papers, those results will be more reliable than mine - and mine are expected to have higher variance (see the pass@k sketch after this list). Also, in the case of Starcoder, I am using an IFT variation of their model, so it is slightly different from the version in their paper - it is more dialogue-tuned. I expected Starcoderplus to outperform Starcoder, but as a generalist model it actually appears to perform worse at Python - and better at everything else instead. There is a great benchmark in development that covers multiple languages (and unlike HumanEval is also not developed by OpenAI - which is a huge plus in my book), so this will be interesting to keep an eye on, especially for models like Starcoderplus: https://github.com/the-crypt-keeper/can-ai-code

  • WizardLM

    Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder and WizardMath (discontinued)

  • I just saw this WizardCoder: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/README.md
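
To make the whitespace point above concrete, here is a rough sketch of what Alpaca- and Vicuna-style prompt templates look like. This is a hypothetical illustration - the names ALPACA_TEMPLATE, VICUNA_TEMPLATE, and build_prompt are mine, not the repo's - so see templates.py at the link above for the actual strings used in the benchmark. A stray space after "### Response:" is exactly the kind of mismatch described above, since it changes how the model's expected continuation is tokenized.

```python
# Hypothetical sketch of Alpaca/Vicuna-style templates; the actual strings
# live in templates.py in the llm-humaneval-benchmarks repo and may differ.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"  # no trailing space - an extra " " here changes tokenization
)

VICUNA_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: {instruction} ASSISTANT:"
)

def build_prompt(template: str, instruction: str) -> str:
    """Fill a template with a HumanEval+ problem statement."""
    return template.format(instruction=instruction)

print(build_prompt(ALPACA_TEMPLATE, "Write a Python function that reverses a string."))
```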

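And to make the pass@1 caveat concrete: the Codex paper (Chen et al., 2021) estimates pass@k from n samples per problem with an unbiased combinatorial estimator, 1 - C(n-c, k)/C(n, k), whereas a single attempt (n = 1) is just the raw 0-or-1 pass rate. A minimal sketch of that estimator - the function name is mine, but the formula follows the one published with the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper (Chen et al., 2021).
    n = total samples generated for a problem, c = samples that pass,
    k = attempt budget. Computes 1 - C(n-c, k) / C(n, k) stably as a product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n = 1 (a single attempt, as in the chart above) this collapses to the
# raw pass rate: pass_at_k(1, 1, 1) == 1.0 and pass_at_k(1, 0, 1) == 0.0.
# The Codex paper uses n = 200, which gives a far lower-variance estimate:
print(pass_at_k(200, 50, 1))  # 0.25 - the product telescopes to 1 - 150/200
```
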
NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

