Petaflops to the People: From Personal Compute Cluster to Person of Compute

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama.cpp

    LLM inference in C/C++
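
    As a rough illustration of what "LLM inference in C/C++" looks like from the application side, here is a minimal, hypothetical sketch that drives a local quantized model through the llama-cpp-python bindings (a Python wrapper around llama.cpp); the model path and prompt are placeholders, not anything from the original post:

        # Hedged sketch: local inference with llama.cpp via the llama-cpp-python bindings.
        # The model path is a placeholder; any quantized GGML/GGUF checkpoint would do.
        from llama_cpp import Llama

        llm = Llama(model_path="models/llama-7b.ggml.q5_1.bin", n_ctx=2048)

        out = llm(
            "Q: Name the planets in the solar system. A:",
            max_tokens=64,
            stop=["Q:", "\n"],
            echo=False,
        )
        print(out["choices"][0]["text"].strip())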

  • > how everyone is in this mad quantization rush but nobody's putting up benchmarks to show that it works (tinybox is resolutely supporting non quantized LLaMA)

    I don't think this is true. llama.cpp has historically been very conscientious about benchmarking perplexity. Here's a detailed chart of baseline FP16 vs the new k-quants: https://github.com/ggerganov/llama.cpp/pull/1684

    While most eval suites aren't currently comparing performance across quantization levels, there are two that are:

    * Gotzmann LLM Score: https://docs.google.com/spreadsheets/d/1ikqqIaptv2P4_15Ytzro...

    * llm-jeopardy: https://github.com/aigoopy/llm-jeopardy - You can see that the same Airoboros 65B model drops from 81.62% to 80.00% when going from an 8_0 to a 5_1 quant, and the 65B 5_1 still solidly beats the 33B 8_0, as expected.

    Also, GPTQ, SpQR, AWQ, and SqueezeLLM all have arXiv papers, and every team is running its own perplexity tests.

    That said, every codebase seems to calculate perplexity slightly differently; I've recently been working on untangling them all so the implementations can be compared apples to apples.
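
    To make the comparisons above concrete, here is a minimal, hypothetical perplexity loop in the Hugging Face transformers style; the model name, corpus file, and chunking scheme are placeholder assumptions, and this is not the exact procedure llama.cpp or any of the papers use (which is precisely the point about implementations differing):

        # Hedged sketch: corpus perplexity of a causal LM over non-overlapping chunks.
        # Real comparisons (FP16 vs. quantized) would run the same loop once per
        # checkpoint; windowing/stride choices are one place implementations differ.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "gpt2"                              # placeholder checkpoint
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id).eval()

        text = open("wiki.test.raw").read()            # e.g. a WikiText-2 test split
        ids = tok(text, return_tensors="pt").input_ids
        chunk_len, nlls, n_tokens = 1024, [], 0

        for start in range(0, ids.size(1), chunk_len):
            chunk = ids[:, start:start + chunk_len]
            if chunk.size(1) < 2:
                break
            with torch.no_grad():
                # loss is the mean NLL over the chunk's shifted (predicted) tokens
                loss = model(input_ids=chunk, labels=chunk).loss
            nlls.append(loss * (chunk.size(1) - 1))    # convert back to a summed NLL
            n_tokens += chunk.size(1) - 1

        print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())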

  • llm-jeopardy

    Automated prompting and scoring framework to evaluate LLMs using updated human knowledge prompts
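
    In the spirit of that framework, the sketch below shows what an automated prompt-and-score loop can look like; the clue file, model path, prompt template, and substring-match scoring rule are all hypothetical stand-ins, not llm-jeopardy's actual implementation:

        # Hedged sketch of a Jeopardy-style prompt-and-score loop (not the repo's code).
        # clues.json, the model path, and the match rule are placeholder assumptions.
        import json
        from llama_cpp import Llama

        llm = Llama(model_path="models/airoboros-65b.ggml.q5_1.bin", n_ctx=2048)

        def ask_model(prompt: str) -> str:
            out = llm(prompt, max_tokens=48, stop=["\n"])
            return out["choices"][0]["text"]

        def normalize(s: str) -> str:
            return "".join(c for c in s.lower() if c.isalnum() or c.isspace()).strip()

        clues = json.load(open("clues.json"))          # [{"clue": ..., "answer": ...}, ...]
        correct = 0
        for item in clues:
            prompt = f"Clue: {item['clue']}\nRespond with the answer only.\nAnswer:"
            if normalize(item["answer"]) in normalize(ask_model(prompt)):
                correct += 1

        print(f"score: {100 * correct / len(clues):.2f}% ({correct}/{len(clues)})")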

  • BIG-bench

    Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

    I question whether perplexity on WikiText is enough; I wonder if anyone has checked how well perplexity correlates with https://github.com/google/BIG-bench.

    Fundamentally, I'm not sure that perplexity on a dataset that has leaked pretty much everywhere properly measures the task performance of LLMs, but I'm open to being wrong on this.
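
    One way to probe that question empirically: collect a perplexity number and a task-suite score for each model in a lineup and check the rank correlation. The sketch below does exactly that on made-up placeholder numbers; real values would come from running each model through both a perplexity loop and a benchmark such as BIG-bench:

        # Hedged sketch: rank correlation between perplexity and benchmark accuracy.
        # The lists are fabricated placeholders, one entry per hypothetical model.
        from scipy.stats import spearmanr

        perplexity = [3.9, 4.2, 4.8, 5.6, 6.3]         # lower is better
        accuracy   = [0.61, 0.58, 0.55, 0.49, 0.47]    # higher is better

        rho, p = spearmanr(perplexity, accuracy)
        # A rho near -1 would mean perplexity ranks models the same way the benchmark does;
        # a rho near 0 would support the worry that perplexity misses task performance.
        print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")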
