Hi folks, back with an update to the HumanEval+ programming ranking I posted the other day, incorporating your feedback - and some closed models for comparison! Now with improved generation params and new models: Falcon, Starcoder, Codegen, Claude+, Bard, OpenAssistant, and more

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA

  • llm-humaneval-benchmarks

  • Additionally, I found one more unnecessary whitespace character in both my Alpaca and Vicuna prompts and removed it to better match the recommended prompts. Third, I tested a much broader set of prompt configurations. For each model in the chart above, I only included the best prompt configuration (which is noted after the model name). You can find the corresponding prompts here (a rough sketch of what these templates look like follows this list): https://github.com/my-other-github-account/llm-humaneval-benchmarks/blob/main/templates.py

  • can-ai-code

    Self-evaluating interview for AI coders

  • Starcoder/Codegen: As you all expected, the coding models do quite well at code! Of the OSS models, these perform the best. I still fall a few percent short of the advertised HumanEval+ results that some of these report in their papers using my prompt, settings, and parser - but it is important to note that I am simply counting the pass rate of single attempts for each of these models. So this is not directly comparable to the pass@1 metric as defined in the Codex paper (for reasons they discuss in said paper): my N is 1, their N is 200. If you see anyone provide pass@1 in their peer-reviewed papers, those results will be more reliable than mine - and mine are expected to have higher variance (see the pass@k sketch after this list). Also, in the case of Starcoder, I am using an IFT variation of their model, so it is slightly different from the version in their paper - it is more dialogue-tuned. I expected Starcoderplus to outperform Starcoder, but as a generalist model it actually appears to perform worse at Python - and better at everything else instead. There is a great benchmark in development that covers multiple languages (and unlike HumanEval is also not developed by OpenAI - which is a huge plus in my book), so this will be interesting to keep an eye on, especially for models like Starcoderplus: https://github.com/the-crypt-keeper/can-ai-code

  • WizardLM

    Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder and WizardMath (discontinued)

  • I just saw this WizardCoder: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/README.md
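
To make the whitespace point above concrete, here is a rough sketch of what Alpaca- and Vicuna-style prompt templates look like. This is a hypothetical illustration - the names ALPACA_TEMPLATE, VICUNA_TEMPLATE, and build_prompt are mine, not the repo's - so see templates.py at the link above for the actual strings used in the benchmark. A stray space after "### Response:" is exactly the kind of mismatch described above, since it changes how the model's expected continuation is tokenized.

```python
# Hypothetical sketch of Alpaca/Vicuna-style templates; the actual strings
# live in templates.py in the llm-humaneval-benchmarks repo and may differ.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"  # no trailing space - an extra " " here changes tokenization
)

VICUNA_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions. USER: {instruction} ASSISTANT:"
)

def build_prompt(template: str, instruction: str) -> str:
    """Fill a template with a HumanEval+ problem statement."""
    return template.format(instruction=instruction)

print(build_prompt(ALPACA_TEMPLATE, "Write a Python function that reverses a string."))
```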

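And to make the pass@1 caveat concrete: the Codex paper (Chen et al., 2021) estimates pass@k from n samples per problem with an unbiased combinatorial estimator, 1 - C(n-c, k)/C(n, k), whereas a single attempt (n = 1) is just the raw 0-or-1 pass rate. A minimal sketch of that estimator - the function name is mine, but the formula follows the one published with the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper (Chen et al., 2021).
    n = total samples generated for a problem, c = samples that pass,
    k = attempt budget. Computes 1 - C(n-c, k) / C(n, k) stably as a product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n = 1 (a single attempt, as in the chart above) this collapses to the
# raw pass rate: pass_at_k(1, 1, 1) == 1.0 and pass_at_k(1, 0, 1) == 0.0.
# The Codex paper uses n = 200, which gives a far lower-variance estimate:
print(pass_at_k(200, 50, 1))  # 0.25 - the product telescopes to 1 - 150/200
```
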
NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

