-
tinygrad
[Discontinued] You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)
-
MeZO
[NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333
-
willow
Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative
-
willow-inference-server
Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
Its graph execution is still full of busy loops, e.g.:
https://github.com/ggerganov/llama.cpp/blob/44f906e8537fcec9...
I wonder how much more efficient it would be if the Taskflow library were used instead, or even Intel TBB.
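To make the comparison concrete, here is a minimal sketch of how two dependent graph nodes could be wired up with Taskflow so that idle workers sleep until their inputs are ready, instead of spinning on flags. The node names are hypothetical and this is not llama.cpp's actual graph code; it only assumes the header-only Taskflow library is available.

// Hedged sketch: expressing a dependency with Taskflow instead of a busy loop.
#include <taskflow/taskflow.hpp>
#include <cstdio>

int main() {
    tf::Executor executor;   // thread pool; workers sleep when there is no work
    tf::Taskflow taskflow;

    // Hypothetical compute nodes standing in for graph ops.
    tf::Task matmul = taskflow.emplace([] { std::puts("matmul"); });
    tf::Task add    = taskflow.emplace([] { std::puts("add"); });

    // Declare the dependency once; the scheduler wakes 'add' when 'matmul'
    // finishes, so no thread burns CPU polling a flag.
    matmul.precede(add);

    executor.run(taskflow).wait();
    return 0;
}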
Might be a silly question, but is GGML a similar/competing library to George Hotz's tinygrad [0]?
[0] https://github.com/geohot/tinygrad
If MeZO gets implemented, we are basically there: https://github.com/princeton-nlp/MeZO
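For anyone unfamiliar, the core trick in MeZO is zeroth-order optimization: estimate the gradient from two forward passes with a shared random perturbation, so no backprop (and no activation memory for it) is needed. Here is a toy sketch of that idea on a stand-in quadratic loss; it is not the actual MeZO code, just an illustration of the two-forward-pass update.

// Toy sketch of the MeZO/SPSA idea: gradient from two forward passes.
#include <cstdio>
#include <random>
#include <vector>

// Stand-in "model": loss(theta) = sum((theta_i - 1)^2), pretend it's a forward pass.
double loss(const std::vector<double>& theta) {
    double l = 0.0;
    for (double t : theta) l += (t - 1.0) * (t - 1.0);
    return l;
}

int main() {
    std::vector<double> theta(8, 0.0);   // parameters
    const double eps = 1e-3;             // perturbation scale
    const double lr  = 0.05;             // learning rate
    std::mt19937 rng(0);
    std::normal_distribution<double> gauss(0.0, 1.0);

    for (int step = 0; step < 500; ++step) {
        // Sample one random direction z (in MeZO only the RNG seed is kept,
        // so z can be regenerated instead of stored).
        std::vector<double> z(theta.size());
        for (double& zi : z) zi = gauss(rng);

        // Two forward passes: theta + eps*z and theta - eps*z.
        std::vector<double> plus = theta, minus = theta;
        for (size_t i = 0; i < theta.size(); ++i) {
            plus[i]  += eps * z[i];
            minus[i] -= eps * z[i];
        }
        double g = (loss(plus) - loss(minus)) / (2.0 * eps);  // scalar projected gradient

        // SGD step along z, scaled by the scalar estimate.
        for (size_t i = 0; i < theta.size(); ++i) theta[i] -= lr * g * z[i];
    }
    std::printf("final loss: %f\n", loss(theta));
    return 0;
}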
I don't know... Hippo is closed source for now.
It's comparable to Apache TVM's Vulkan in speed on CUDA; see https://github.com/mlc-ai/mlc-llm
But honestly, the biggest advantage of llama.cpp for me is being able to split a model so performantly. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s. That is crazy!
With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLaMa https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (and I totally use it), but it's still not as fast as running the models on GPU for now.
Shameless plug, I'm the founder of Willow[0].
In short you can:
1) Run a local Willow Inference Server[1]. Supports CPU or CUDA, just about the fastest implementation of Whisper out there for "real time" speech.
2) Run local command detection on device. We pull your Home Assistant entities on setup and define basic grammar for them, but any English commands (up to 400) that can be processed by Home Assistant are recognized directly on the $50 ESP BOX device and sent to Home Assistant (or openHAB, or a REST endpoint, etc.) for processing.
Whether using WIS or local command detection, our performance target is 500 ms from end of speech to command executed.
[0] - https://github.com/toverainc/willow
[1] - https://github.com/toverainc/willow-inference-server
whisper.cpp is optimized for Apple Silicon and is available as a Swift package
https://github.com/ggerganov/whisper.spm