pyllms
Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2), with a built-in model performance benchmark.
claude-instant-v1 is one of the "best kept secrets".
Comparable quality to gpt-3.5-turbo, but four times faster and at half the price (which was ridiculously cheap to begin with).
We made this simple library [1] to talk to various LLMs (OpenAI, Anthropic, AI21, ...) and, as part of that, we designed an LLM benchmark. All open source.
[1] https://github.com/kagisearch/pyllms/blob/main/README.md#ben...
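The benchmark idea above — ranking models on a quality/latency/cost trade-off — can be sketched roughly like this. This is a minimal illustration of the mechanics, not pyllms' actual implementation; the model names, numbers, and scoring weights are all made up:

```python
from dataclasses import dataclass

@dataclass
class BenchResult:
    model: str
    quality: float     # fraction of benchmark tasks answered correctly (0..1)
    latency_s: float   # median seconds per completion
    usd_per_1k: float  # price per 1k output tokens

def score(r: BenchResult) -> float:
    # Reward quality, penalize latency and cost; the weighting is arbitrary.
    return r.quality / (r.latency_s * r.usd_per_1k)

# Made-up numbers purely to show how a cheaper, faster model with
# near-equal quality can come out on top.
results = [
    BenchResult("model-a", quality=0.82, latency_s=4.0, usd_per_1k=0.002),
    BenchResult("model-b", quality=0.80, latency_s=1.0, usd_per_1k=0.001),
]
ranked = sorted(results, key=score, reverse=True)
print([r.model for r in ranked])  # → ['model-b', 'model-a']
```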
If Apple wakes up to what's happening with llama.cpp and similar projects, I don't see such a big role for paying for remote access to big models via an API.
Currently a MacBook has a Neural Engine that sits idle 99% of the time and is only suitable for running limited models (poorly documented, opaque rules about which ops can be accelerated, a black-box compiler [1], and an apparent 3GB model size limit [2]).
OTOH you can buy a MacBook with 64GB of 'unified' memory and a Neural Engine today.
If you squint a bit, it's not hard to imagine a future Mx chip with a more capable Neural Engine and yet more RAM, able to run the largest GPT-3-class models locally (ideally with better developer tools, so other compilers can target the NE).
And then imagine it does that while leaving the CPU+GPU mostly free to run apps/games ... the whole experience of using a computer could change radically in that case.
I find it hard not to think this is coming within 5 years (although, equally, I can imagine it's not on Apple's roadmap at all currently).
[1] https://github.com/hollance/neural-engine
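A quick back-of-the-envelope calculation shows why "yet more RAM" matters for the largest GPT-3-class models: at 175B parameters, even aggressive 4-bit quantization needs more than 64GB for the weights alone. (The parameter count and bytes-per-parameter figures below are the standard rough numbers, not anything Apple-specific.)

```python
# Approximate memory needed just for the weights of a 175B-parameter model,
# at different numbers of bytes per parameter (fp16, int8, int4).
PARAMS = 175e9

def weights_gb(bytes_per_param: float) -> float:
    # Decimal GB (1e9 bytes) for simplicity; ignores activations and KV cache.
    return PARAMS * bytes_per_param / 1e9

for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{label}: ~{weights_gb(bpp):.0f} GB")
# fp16: ~350 GB, int8: ~175 GB, int4: ~88 GB — all above today's 64GB ceiling.
```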
There seems to be a limit to the size of model you can load before CoreML decides it has to run on CPU instead (see the second link in my previous comment)
If it could use the full 'unified' memory that would be a big step towards getting these models running on it (see the second link in my original comment)
I'm unsure how the performance compares to a beefy Intel CPU, but there are some numbers here [1] for running a variant of the small distilbert-base model on the Neural Engine... it's ~10x faster than running on the M1 CPU.
[1] https://github.com/anentropic/experiments-coreml-ane-distilb...
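Speedup figures like the ~10x above typically come from a simple wall-clock harness along these lines. The two workloads here are stand-in functions purely for illustration; in a real comparison they would be the same model invoked on two backends (e.g. CPU vs Neural Engine via CoreML compute-unit settings):

```python
import time

def median_latency(fn, runs: int = 50, warmup: int = 5) -> float:
    # Warm up first so one-time setup cost doesn't skew the numbers.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]  # median is more robust than mean

# Stand-in workloads of very different cost, in place of two real backends.
fast = lambda: sum(range(1_000))
slow = lambda: sum(range(100_000))

speedup = median_latency(slow) / median_latency(fast)
print(f"~{speedup:.0f}x")
```

Using the median over many runs (after warmup) avoids being misled by compilation, caching, or scheduler noise on the first few calls.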