Apple should get to work on a version of the Neural Engine that is useful for these models, and remove the 3GB size limit [1] to take full advantage of the 'unified' memory architecture. Game changer.
Currently it's a waste of die space.
[1] https://github.com/smpanaro/more-ane-transformers/blob/main/...
You can see for yourself (assuming you have the model weights) https://github.com/abetlen/llama-cpp-python
I get around 140 ms per token running a 13B-parameter model on a ThinkPad laptop with a 14-core Intel i7-9750 processor. Because it's CPU inference, the initial prompt takes longer to process, so total latency is still higher than I'd like, but I'm working on some caching solutions that should make this bearable for things like chat.
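The caching idea hinted at above could work by reusing already-processed context so only the new suffix of a chat prompt needs evaluation. A minimal sketch in plain Python (all names here are hypothetical; real KV-cache handling in llama.cpp is considerably more involved):

```python
# Hypothetical prompt-prefix cache: map an already-processed prompt
# prefix to an opaque model state, and on a new prompt find the
# longest cached prefix so only the remaining suffix is evaluated.

class PrefixCache:
    def __init__(self):
        self._cache = {}  # prompt prefix -> opaque model state

    def longest_prefix(self, prompt):
        """Return (prefix, state) for the longest cached prefix of prompt."""
        best = ""
        for prefix in self._cache:
            if prompt.startswith(prefix) and len(prefix) > len(best):
                best = prefix
        return best, self._cache.get(best)

    def store(self, prompt, state):
        self._cache[prompt] = state


cache = PrefixCache()
cache.store("System: be helpful.\nUser: hi", "state-1")

# A follow-up chat turn shares the earlier prompt as a prefix,
# so only the text after the cached prefix needs processing.
prompt = "System: be helpful.\nUser: hi\nUser: tell me more"
prefix, state = cache.longest_prefix(prompt)
suffix = prompt[len(prefix):]  # only this part is newly evaluated
```

In a chat loop this turns the expensive prompt-processing step into an incremental one, which is why the initial prompt dominates latency only on the first turn.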
Also worth checking out https://github.com/saharNooby/rwkv.cpp which is based on Georgi's library and offers support for the RWKV family of models which are Apache-2.0 licensed.
I’ve got some of their smaller Raven models running locally on my M1 (only 16GB of RAM).
I’m also in the middle of making it user-friendly to run these models on all platforms (built with Flutter). The first macOS release will be out before this weekend: https://github.com/BrutalCoding/shady.ai
tinygrad
https://github.com/geohot/tinygrad/tree/master/accel/ane
But I have not tested it on Linux since Asahi has not yet added support.
llama.cpp runs at 18ms per token (7B) and 200ms per token (65B) without quantization.
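For a sense of scale, a quick sketch of the throughput those quoted latencies imply (plain Python, numbers taken from the comment above):

```python
# Back-of-envelope: convert per-token latency into rough throughput.
def tokens_per_second(ms_per_token: float) -> float:
    return 1000.0 / ms_per_token

print(f"7B:  {tokens_per_second(18):.1f} tok/s")   # ~55.6 tok/s
print(f"65B: {tokens_per_second(200):.1f} tok/s")  # 5.0 tok/s
```

So the 65B model is roughly an order of magnitude slower per token than the 7B one, in line with the parameter-count ratio.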
These can be plugged in via USB Type-C.
https://coral.ai/products/accelerator/
Used for accelerating inference (offline) on Linux, macOS, and Windows.
I haven’t bought or used them, but I’ve had my eye on these for a little while!