-
> We can achieve up to 100 tokens per second single-stream while GPT-4 runs around 20 tokens per second at best.
Is that with batching? If so, that's quite impressive.
> certain challenging questions where it is capable of getting the right answer, the Phind Model might take more generations to get to the right answer than GPT-4.
Some of this is sampler tuning. Y'all should look at grammar-based sampling if you aren't using it already, as well as some of the "dynamic" samplers like mirostat and dynatemp: https://github.com/LostRuins/koboldcpp/pull/464
I'd guess you want a low temperature for coding, but it's still a tricky balance.
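For anyone curious what mirostat actually does: it replaces a fixed temperature with a feedback loop that keeps the observed per-token surprise near a target. Here's a rough numpy sketch of the mirostat v2 update rule from the paper (parameter defaults are illustrative, and this is not any particular backend's implementation):
```python
import numpy as np

def mirostat_v2_step(logits, mu, tau=5.0, eta=0.1):
    """One mirostat v2 sampling step; start with mu = 2 * tau."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprise = -np.log2(probs)          # per-token surprise in bits
    keep = np.where(surprise < mu)[0]   # truncate tokens above the threshold
    if keep.size == 0:
        keep = np.array([int(np.argmax(probs))])
    p = probs[keep] / probs[keep].sum()
    token = int(np.random.choice(keep, p=p))
    mu -= eta * (surprise[token] - tau)  # feedback toward target surprise tau
    return token, mu
```
Lowering tau pushes output toward more deterministic, coding-friendly text, which is why these dynamic samplers can be a better fit than hand-picking one temperature.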
-
It's definitely not impossible at least.
Someone is doing it in Python here:
https://pyatv.dev/
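For reference, here's roughly what using pyatv looks like, based on its documented scan/connect flow (exact API details may vary by version):
```python
import asyncio
import pyatv

async def main():
    loop = asyncio.get_running_loop()
    # Scan the local network for Apple TVs, then connect to the first one
    confs = await pyatv.scan(loop, timeout=5)
    if not confs:
        return
    atv = await pyatv.connect(confs[0], loop)
    try:
        print(await atv.metadata.playing())  # what's currently playing
    finally:
        atv.close()

asyncio.run(main())
```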
GPT-4 actually sent me here:
"Here is an example of a C# library that implements the HAP: CSharp.HomeKit (https://github.com/brutella/hkhomekit). You can use this library as a reference or directly use it in your project."
Which, to no surprise given my experiences with LLMs for programming, does not exist and doesn't seem to have ever existed.
I get that they aren't magic, but I guess I'm just bad at using LLMs to help with my programming. Apparently everything I do is too obscure, or I'm just not good enough at prompting. But I feel like that's also a reflection of a weakness of LLMs: they need such perfect, specific prompting to give good answers.
-
-
Without batching, I was actually thinking that's kind of modest.
ExllamaV2 will get 48 tokens/s on a 4090, a GPU that's much slower and cheaper than an H100:
https://github.com/turboderp/exllamav2#performance
I didn't test codellama, but the 3090 TI figures are in the ballpark of my generation speed on a 3090.
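If you want to compare numbers yourself, single-stream speed is easy to eyeball with a wall-clock timer. A crude sketch (generate_fn and max_new_tokens are stand-ins for whatever backend you're benchmarking, and it assumes no early stop):
```python
import time

def single_stream_tps(generate_fn, prompt, n_tokens=256):
    # Crude wall-clock throughput; includes prompt-processing time,
    # so use a long generation to amortize it
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens=n_tokens)
    return n_tokens / (time.perf_counter() - start)
```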
-
Too much money is being thrown around on BS in the LLM space, and hardly any of it is going where it matters.
For example, the researchers working hard on better text sampling techniques, on better constraint techniques (e.g. https://arxiv.org/abs/2306.03081), or on actual negative prompting/CFG in LLMs (e.g. https://github.com/huggingface/transformers/issues/24536) are doing far FAR more to advance the state of AI than the dozens of VC-backed LLM "prompt engineering" companies operating today.
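For the unfamiliar, CFG/negative prompting for LLMs is just an extrapolation in logit space, same idea as in diffusion models. A minimal sketch (the function name and scale value are mine, purely illustrative):
```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale=1.5):
    # Classifier-free guidance: extrapolate the conditional distribution
    # away from the unconditional (or negative-prompt) distribution
    return uncond_logits + scale * (cond_logits - uncond_logits)
```
It costs a second forward pass per token (one with the prompt, one without or with the negative prompt), which is probably why nobody productizes it.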
HN and the NLP community have some serious blind spots when it comes to exploiting their own technology. At least someone at Andreessen Horowitz got a clue and gave some funding to Oobabooga; still waiting for Automatic1111 to get any funding.
-
-
-
Take a look at the AutoExpert custom instructions: https://github.com/spdustin/ChatGPT-AutoExpert
It lets you specify verbosity from 1 to 5 (e.g. "V=1" in the prompt). Sometimes it just ignores that, but it does work most of the time. I use a verbosity of 1 or 2 when I just want a quick answer.
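The instructions are meant to be pasted into ChatGPT's custom instructions UI, but if you want the same behavior over the API, something like this should work (the file name and model are placeholders; the V= flag goes in the user message):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "autoexpert.md" is a placeholder for wherever you saved the repo's
# custom-instructions text
system_prompt = open("autoexpert.md").read()

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "V=1 How do I profile a Python script?"},
    ],
)
print(resp.choices[0].message.content)
```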