transformers vs llama.cpp
| | transformers | llama.cpp |
| --- | --- | --- |
| Mentions | 220 | 921 |
| Stars | 148,940 | 85,794 |
| Growth | 1.0% | 2.7% |
| Activity | 10.0 | 10.0 |
| Latest commit | 3 days ago | 2 days ago |
| Language | Python | C++ |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
transformers
- Transformers 4.55: New OpenAI GPT OSS
- OpenAI Harmony
The new transformers release describes the model: https://github.com/huggingface/transformers/releases/tag/v4....
> GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.
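For reference, a minimal sketch of loading the smaller model with transformers, following the pattern in those release notes (the generation settings here are assumptions):

```python
# Sketch: load gpt-oss-20b via transformers, per the v4.55 release notes.
# Needs a recent transformers and ~16GB of memory for the 20b model.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",  # model id from the release notes
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}]
print(generator(messages, max_new_tokens=128)[0]["generated_text"])
```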
- How to Install Devstral Small 1.1 Locally?
pip install torch
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
pip install --upgrade vllm
pip install --upgrade mistral_common
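As a follow-up, a hedged sketch of offline generation with vLLM once the installs complete; the model id and the mistral tokenizer flag are assumptions based on Mistral's usual vLLM instructions, so check the official model card:

```python
# Sketch: offline generation with vLLM after the installs above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Devstral-Small-2507",  # assumed model id
    tokenizer_mode="mistral",               # uses mistral_common, installed above
)
params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Write a Python function that reverses a string."], params)
print(outputs[0].outputs[0].text)
```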
- How to Install DeepSeek Nano-VLLM Locally?
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
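Note that the listing above doesn't install nano-vllm itself; assuming it's installed from its repo and that it mirrors the vLLM interface (as its README suggests), usage would look roughly like this (the import name, call shape, and model are assumptions):

```python
# Sketch under the assumption that nano-vllm mirrors vLLM's API.
from nanovllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen3-0.6B", enforce_eager=True)  # placeholder model
params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Hello, Nano-vLLM."], params)
print(outputs[0]["text"])
```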
- Medical RAG Research with txtai
Substitute your own embeddings database to change the knowledge base. txtai supports running local LLMs via transformers or llama.cpp. It also supports a wide variety of LLMs via LiteLLM. For example, setting the 2nd RAG pipeline parameter below to gpt-4o along with the appropriate environment variables with access keys switches to a hosted LLM. See this documentation page for more on this.
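A minimal sketch of that switch, assuming txtai's RAG pipeline; the knowledge base and model names below are placeholders:

```python
# Sketch: the 2nd RAG pipeline parameter selects the LLM. A local
# transformers/llama.cpp model works, or a LiteLLM model name such as
# "gpt-4o" (with the access keys set in the environment).
from txtai import Embeddings, RAG

# Placeholder knowledge base; substitute your own embeddings database
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

rag = RAG(embeddings, "TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # local LLM
# rag = RAG(embeddings, "gpt-4o")                            # hosted via LiteLLM

print(rag("What treatments exist for hypertension?"))
```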
- What Are Vision-Language Models (VLMs) and How Do They Work?
- I have reimplemented Stable Diffusion 3.5 from scratch in pure PyTorch
Reference implementations are unmaintained and buggy.
For example, https://github.com/huggingface/transformers/issues/27961: OpenAI's tokenizer for CLIP is buggy. It's a reference implementation, it isn't the one they used for training, and the problems with it go unsolved and get copied endlessly by other projects.
What about Flux? They don't say their reference implementation was used for training (it wasn't), and it has bugs that break cudagraphs or similar, though those aren't that impactful. On the other hand, it uses CLIP, and CLIP is buggy, so this is buggy too...
- HuggingFace transformers will focus on PyTorch, deprecating TensorFlow and Flax
- None of the top 10 projects in GitHub is actually a software project 🤯
We see an addition to the AI community with AutoGPT. Along with TensorFlow, it represents the AI community in the software category, which is becoming relevant (2 out of 8). We can expect new AI projects such as Transformers or Ollama (currently ranked 34th and 36th, respectively) to enter the top 25 in the future.
- How to Install Foundation-Sec 8B by Cisco: The Ultimate Cybersecurity AI Model
pip install torch
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
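A hedged usage sketch once the installs complete; the model id below is an assumption, so check the model card for the real one:

```python
# Sketch: plain causal-LM generation with the packages installed above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fdtn-ai/Foundation-Sec-8B"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("CVE-2021-44228 is a vulnerability in", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```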
llama.cpp
- DeepSeek-v3.1 Release
I maintain a cross-platform llama.cpp client - you're right to point out that, generally, we expect nuking logits can take care of it.
There is a substantial performance cost to nuking; the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). Because the cost is so high, the default in the API* is to not artificially lower other logits, and to only do that if the first inference attempt yields a token that is invalid under the compiled grammar.
Similarly, I was hoping to be on target w/r/t what strict mode is in an API; I am sort of describing the "outer loop" of sampling.
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
- Guide: Running GPT-OSS with Llama.cpp
- Ollama and gguf
ik_llama.cpp is another fork of llama.cpp. I followed the development of GLM4.5 support in both projects.
The ik_llama.cpp developers had a working implementation earlier than llama.cpp, but their GGUFs were not compatible with the mainline.
After the changes in llama.cpp were merged into master, ik_llama.cpp reworked their implementation and ported it to align with upstream: https://github.com/ggml-org/llama.cpp/pull/14939#issuecommen...
>Many thanks to @sammcj, @CISC, and everyone who contributed! The code has been successfully ported and merged into ik_llama.
This is how it should be done.
- How to Install & Run GPT-OSS 20b and 120b GGUF Locally?
apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev git
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
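Once built, llama-server exposes an OpenAI-compatible API; a small sketch of querying it, assuming the server is already running on its default port (8080) with a gpt-oss GGUF loaded:

```python
# Sketch: query a locally running llama-server over its
# OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```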
- Mistral Integration Improved in Llama.cpp
llama.cpp still doesn't support gpt-oss tool calling. https://github.com/ggml-org/llama.cpp/pull/15158 (among other similar PRs)
But I also couldn't get vllm, or transformers serve, or ollama (400 response on /v1/chat/completions) working today with gpt-oss. OpenAI's cookbooks aren't really copy-paste instructions. They probably tested on a single platform with preinstalled Python packages which they forgot to mention :))
- How Attention Sinks Keep Language Models Stable
There was skepticism last time this was posted https://news.ycombinator.com/item?id=37740932
Implementation for gpt-oss this week showed 2-3x improvements https://github.com/ggml-org/llama.cpp/pull/15157 https://www.reddit.com/r/LocalLLaMA/comments/1mkowrw/llamacp...
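For context, a toy sketch of one common attention-sink formulation (a learned per-head sink logit folded into the softmax, roughly what gpt-oss uses); this is a simplification, not the llama.cpp kernel:

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Append a learned "sink" logit to the attention scores before
    softmax, then drop its probability mass. The sink gives a head
    somewhere to park attention when no token is relevant, which is
    the stabilizing effect the article describes."""
    z = np.concatenate([scores, [sink_logit]])
    z = z - z.max()                  # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p[:-1]                    # weights over real tokens only
```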
- OpenAI Open Models
Holy smokes, there's already llama.cpp support:
https://github.com/ggml-org/llama.cpp/pull/15091
- Llama.cpp: Add GPT-OSS
- My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)
MLX does have good software support. Targeting both iOS and Mac is a big win in itself.
I wonder what's possible and what the software situation is today with the PC NPUs. AMD's XDNA has been around for a while, and XDNA2 jumps from 10 to 40 TOPS. The "AMDXDNA" driver was merged in Linux 6.14 last winter: where are we now?
But I'm not seeing any evidence that there's support in any of the main frameworks. https://github.com/ggml-org/llama.cpp/issues/1499 https://github.com/ollama/ollama/issues/5186
Good news: AMD has an initial implementation for llama.cpp. I don't particularly know what it means in practice, but the first gen supports W4ABF16 quantization and newer chips support W8A16. https://github.com/ggml-org/llama.cpp/issues/14377 . There is also a Linux "xdna-driver": https://github.com/amd/xdna-driver
It would also be interesting to know how this compares to, say, the huge iGPU on Strix Halo. I don't know whether these NPUs can similarly work with very large models.
There's a lot of other folks also starting on their NPU journeys. ARM's Ethos, and Rockchip's RKNN recently shipped Linux kernel drivers, but it feels like that's just a start? https://www.phoronix.com/news/Arm-Ethos-NPU-Accel-Driver https://www.phoronix.com/news/Rockchip-NPU-Driver-RKNN-2025
- AMD teams contributing to the llama.cpp codebase
What are some alternatives?
sentence-transformers - State-of-the-Art Text Embeddings
ollama - Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
llama - Inference code for Llama models
mlc-llm - Universal LLM Deployment Engine with ML Compilation
text-generation-webui - LLM UI with advanced features, easy setup, and multiple backend support.