I made an app that runs Mistral 7B 0.2 LLM locally on iPhone Pros

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Cgml

    GPU-targeted vendor-agnostic AI library for Windows, and Mistral model implementation.

  • Is that explanation better? https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

    Same Mistral Instruct 0.2 model, different implementation.

  • llama.cpp

    LLM inference in C/C++

  • In short: 1) No Neural Engine API; 2) CoreML has challenges modeling LLMs efficiently right now; 3) Not Enough Benefit (For the Cost... Yet!)

    This is my best understanding based on my own work and research for a local LLM iOS app. Read on for more in-depth justifications of each point!

    ---

    1) No Neural Engine API

    - There is no developer API for targeting the Neural Engine directly, so CoreML is the only way to use it.

    2) CoreML has challenges modeling LLMs efficiently right now.

    - Its most-optimized use cases seem tailored to image models: it works best with fixed input lengths[1][2], which are quite limiting for general language modeling (are all prompts, sentences, and paragraphs the same number of tokens? do you want to pad every input?). Flexible shapes are supported but constrained; see the conversion sketch at the end of this section.

    - CoreML has limited support for the leading approaches to compressing LLMs (quantization, whether weights-only or activation-aware). Falcon-7b-instruct (fp32) in CoreML is 27.7GB[3] and Llama-2-chat (fp16) is 13.5GB[4]; neither will fit in memory on any currently shipping iPhone, and they'd only barely fit on the newest, highest-end iPad Pros. (The arithmetic sketch at the end of this section shows where these numbers come from.)

    - Hugging Face's swift-transformers[5] is a CoreML-focused library under active development that should eventually help developers with many of these problems, alongside an `exporters` CLI tool[6] that wraps Apple's `coremltools` to convert PyTorch and other models to CoreML.
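
    To make the fixed-versus-flexible shape tension concrete, here is a minimal conversion sketch using `coremltools` range dimensions, the mechanism the flexible-shapes guide[2] covers. `TinyLM` is a hypothetical stand-in model, not anything from the post:

```python
# Hedged sketch: export a toy PyTorch LM to CoreML with a flexible
# sequence length via ct.RangeDim (see the flexible-shapes guide [2]).
# TinyLM is a placeholder; a real transformer would be traced the same way.
import coremltools as ct
import numpy as np
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = TinyLM().eval()
example = torch.randint(0, 32000, (1, 64))   # (batch, seq_len)
traced = torch.jit.trace(model, example)

# Let seq_len vary from 1 to 2048 tokens instead of baking in one length.
seq_len = ct.RangeDim(lower_bound=1, upper_bound=2048, default=64)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids",
                          shape=ct.Shape(shape=(1, seq_len)),
                          dtype=np.int32)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("TinyLM.mlpackage")
```

    The catch, as I understand it, is that range-flexible inputs like this tend to be scheduled on CPU/GPU rather than the ANE, which prefers fixed or enumerated shapes; that is exactly the padding dilemma described above.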
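
    The memory figures above also fall straight out of parameter count times bytes per parameter. A quick back-of-the-envelope check (parameter counts approximate):

```python
# Weights-only memory estimate: params * bits / 8, in decimal GB.
# Ignores KV cache and runtime overhead, so real usage is higher.
def weight_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("Falcon-7B", 7.22e9), ("Llama-2-7B", 6.74e9)]:
    for label, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
        print(f"{name} {label}: ~{weight_gb(n_params, bits):.1f} GB")

# Falcon-7B fp32  -> ~28.9 GB (the post's CoreML export is 27.7 GB)
# Llama-2-7B fp16 -> ~13.5 GB (matches the cited 13.5 GB)
# int4 brings a 7B model to roughly 3.5 GB, which is why quantization
# support matters so much for phone-sized memory budgets.
```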

    3) Not Enough Benefit (For the Cost... Yet!)

    - ANE & GPU (Metal) have access to the same unified memory. They are both subject to the same restrictions on background execution (you simply can't use them in the background, or your app is killed[7]).

    - So the main benefit of unlocking the ANE would be multitasking: running an ML task in parallel with non-ML tasks that also need the GPU, e.g. SwiftUI Metal shaders, background audio processing (shoutout Overcast!), screen recording/sharing, etc. Absolutely worthwhile to achieve, but given the significant work required and the currently thin CoreML ecosystem for LLMs specifically, the benefit is less clear.

    - Apple's hot new ML library, MLX, only uses Metal for the GPU[8], just like Llama.cpp (see the device sketch below). More nuanced differences arise on closer inspection, related to MLX's focus on unified-memory optimizations. So perhaps some performance can be squeezed out of unified memory in Llama.cpp, but CoreML will be the only way to unlock the ANE, which is lower priority according to lead maintainer Georgi Gerganov as of late this past summer[9], likely for many of the reasons enumerated above.
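
    The Metal-only point is easy to verify: MLX's device model exposes just `cpu` and `gpu`, with no ANE target to select. A small sketch, assuming the `mlx` Python package on Apple silicon:

```python
# MLX exposes exactly two device types, cpu and gpu (Metal); there is
# no ANE device, which is the limitation tracked in the issue cited as [8].
import mlx.core as mx

print(mx.default_device())      # Device(gpu, 0) on Apple silicon
mx.set_default_device(mx.cpu)   # the only alternative is the CPU

a = mx.random.normal((1024, 1024))
b = a @ a        # arrays live in unified memory; ops are lazy...
mx.eval(b)       # ...and run on the selected device when evaluated
```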

    I've learned most of this while working on my own private LLM inference app, cnvrs[10] — would love to hear your feedback or thoughts!

    Britt

    ---

    [1] https://github.com/huggingface/exporters/pull/37

    [2] https://apple.github.io/coremltools/docs-guides/source/flexi...

    [3] https://huggingface.co/tiiuae/falcon-7b-instruct/tree/main/c...

    [4] https://huggingface.co/coreml-projects/Llama-2-7b-chat-corem...

    [5] https://github.com/huggingface/swift-transformers

    [6] https://github.com/huggingface/exporters

    [7] https://developer.apple.com/documentation/metal/gpu_devices_...

    [8] https://github.com/ml-explore/mlx/issues/18

    [9] https://github.com/ggerganov/llama.cpp/issues/1714#issuecomm...

    [10] https://testflight.apple.com/join/ERFxInZg

  • exporters

    Export Hugging Face models to Core ML and TensorFlow Lite

  • enchanted

    Enchanted is an iOS and macOS app for chatting with private, self-hosted language models such as Llama 2, Mistral, or Vicuna via Ollama.

  • llama_index

    LlamaIndex is a data framework for your LLM applications

  • Mistral Instruct does use a system prompt.

    You can see the raw format here: https://www.promptingguide.ai/models/mistral-7b#chat-templat... and you can see how LlamaIndex uses it here (as an example): https://github.com/run-llama/llama_index/blob/1d861a9440cdc9...
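
    For illustration, the raw format boils down to wrapping turns in `[INST]` blocks; since Mistral-7B-Instruct has no dedicated system role, the system prompt is conventionally prepended to the first user message (which is what LlamaIndex's template does). A hedged sketch, with a hypothetical helper name:

```python
# Hypothetical helper sketching the raw Mistral-7B-Instruct chat format:
# <s>[INST] ... [/INST] answer</s>[INST] follow-up [/INST]
# The system prompt rides inside the first [INST] block.
def format_mistral_instruct(system, turns):
    """turns: list of (user_msg, assistant_reply or None for the last turn)."""
    out = "<s>"
    for i, (user, assistant) in enumerate(turns):
        content = f"{system}\n\n{user}" if (i == 0 and system) else user
        out += f"[INST] {content} [/INST]"
        if assistant is not None:
            out += f" {assistant}</s>"
    return out

print(format_mistral_instruct(
    "You are a concise assistant.",
    [("What is the capital of France?", "Paris."),
     ("And of Spain?", None)]))
```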

  • swift-transformers

    Swift Package to implement a transformers-like API in Swift

  • mlx

    MLX: An array framework for Apple silicon

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • Ollama (https://ollama.ai/) is a popular choice for running local LLM models and should work fine on Intel. It wraps llama.cpp behind a Docker-like CLI, so it shouldn't require an M2/M3.
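
    For reference, once `ollama serve` is running and a model has been pulled (`ollama pull mistral`), a local completion is one HTTP call to Ollama's documented REST endpoint. A minimal sketch:

```python
# Minimal sketch: ask a locally running Ollama server for a completion.
# Assumes the default endpoint (http://localhost:11434) and a pulled model.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral",
          "prompt": "Explain unified memory in one sentence.",
          "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```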



Related posts

  • May 8, 2024 AI, Machine Learning and Computer Vision Meetup

    2 projects | dev.to | 1 May 2024
  • SB-1047 will stifle open-source AI and decrease safety

    2 projects | news.ycombinator.com | 29 Apr 2024
  • What can LLMs never do?

    4 projects | news.ycombinator.com | 27 Apr 2024
  • Voxel51 Is Hiring AI Researchers and Scientists — What the New Open Science Positions Mean

    1 project | dev.to | 26 Apr 2024
  • Machine Learning and AI Beyond the Basics Book

    1 project | news.ycombinator.com | 16 Apr 2024