I made an app that runs Mistral 7B 0.2 LLM locally on iPhone Pros

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Cgml

    GPU-targeted vendor-agnostic AI library for Windows, and Mistral model implementation.

  • Is that explanation better? https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

    Same Mistral Instruct 0.2 model, different implementation.

  • llama.cpp

    LLM inference in C/C++

  • In short: 1) No Neural Engine API; 2) CoreML has challenges modeling LLMs efficiently right now; 3) Not Enough Benefit (For the Cost... Yet!)

    This is my best understanding based on my own work and research for a local LLM iOS app. Read on for more in-depth justifications of each point!

    ---

    1) No Neural Engine API

    - There is no developer API for targeting the Neural Engine directly, so CoreML is the only way to use it.

    2) CoreML has challenges modeling LLMs efficiently right now.

    - Its most-optimized use cases seem tailored to image models: it works best with fixed input lengths[1][2], which are quite limiting for general language modeling (are all prompts, sentences, and paragraphs the same number of tokens? do you want to pad every input?). Flexible shapes are supported but constrained; see the conversion sketch at the end of this section.

    - CoreML has limited support for the leading approaches to compressing LLMs (quantization, whether weights-only or activation-aware). Falcon-7b-instruct (fp32) in CoreML is 27.7GB[3] and Llama-2-chat (fp16) is 13.5GB[4]; neither will fit in memory on any currently shipping iPhone, and they'd only barely fit on the newest, highest-end iPad Pros. (The arithmetic sketch at the end of this section shows where these numbers come from.)

    - Hugging Face's swift-transformers[5] is a CoreML-focused library under active development that should eventually help developers with many of these problems, alongside an `exporters` CLI tool[6] that wraps Apple's `coremltools` to convert PyTorch and other models to CoreML.
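
    To make the fixed-versus-flexible shape tension concrete, here is a minimal conversion sketch using `coremltools` range dimensions, the mechanism the flexible-shapes guide[2] covers. `TinyLM` is a hypothetical stand-in model, not anything from the post:

```python
# Hedged sketch: export a toy PyTorch LM to CoreML with a flexible
# sequence length via ct.RangeDim (see the flexible-shapes guide [2]).
# TinyLM is a placeholder; a real transformer would be traced the same way.
import coremltools as ct
import numpy as np
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = TinyLM().eval()
example = torch.randint(0, 32000, (1, 64))   # (batch, seq_len)
traced = torch.jit.trace(model, example)

# Let seq_len vary from 1 to 2048 tokens instead of baking in one length.
seq_len = ct.RangeDim(lower_bound=1, upper_bound=2048, default=64)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids",
                          shape=ct.Shape(shape=(1, seq_len)),
                          dtype=np.int32)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("TinyLM.mlpackage")
```

    The catch, as I understand it, is that range-flexible inputs like this tend to be scheduled on CPU/GPU rather than the ANE, which prefers fixed or enumerated shapes; that is exactly the padding dilemma described above.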
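
    The memory figures above also fall straight out of parameter count times bytes per parameter. A quick back-of-the-envelope check (parameter counts approximate):

```python
# Weights-only memory estimate: params * bits / 8, in decimal GB.
# Ignores KV cache and runtime overhead, so real usage is higher.
def weight_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("Falcon-7B", 7.22e9), ("Llama-2-7B", 6.74e9)]:
    for label, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
        print(f"{name} {label}: ~{weight_gb(n_params, bits):.1f} GB")

# Falcon-7B fp32  -> ~28.9 GB (the post's CoreML export is 27.7 GB)
# Llama-2-7B fp16 -> ~13.5 GB (matches the cited 13.5 GB)
# int4 brings a 7B model to roughly 3.5 GB, which is why quantization
# support matters so much for phone-sized memory budgets.
```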

    3) Not Enough Benefit (For the Cost... Yet!)

    - ANE & GPU (Metal) have access to the same unified memory. They are both subject to the same restrictions on background execution (you simply can't use them in the background, or your app is killed[7]).

    - So the main benefit of unlocking the ANE would be multitasking: running an ML task in parallel with non-ML tasks that also need the GPU, e.g. SwiftUI Metal shaders, background audio processing (shoutout Overcast!), screen recording/sharing, etc. Absolutely worthwhile to achieve, but given the significant work required and the currently thin CoreML ecosystem for LLMs specifically, the benefit is less clear.

    - Apple's hot new ML library, MLX, only uses Metal for the GPU[8], just like Llama.cpp (see the device sketch below). More nuanced differences arise on closer inspection, related to MLX's focus on unified-memory optimizations. So perhaps some performance can be squeezed out of unified memory in Llama.cpp, but CoreML will be the only way to unlock the ANE, which is lower priority according to lead maintainer Georgi Gerganov as of late this past summer[9], likely for many of the reasons enumerated above.
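
    The Metal-only point is easy to verify: MLX's device model exposes just `cpu` and `gpu`, with no ANE target to select. A small sketch, assuming the `mlx` Python package on Apple silicon:

```python
# MLX exposes exactly two device types, cpu and gpu (Metal); there is
# no ANE device, which is the limitation tracked in the issue cited as [8].
import mlx.core as mx

print(mx.default_device())      # Device(gpu, 0) on Apple silicon
mx.set_default_device(mx.cpu)   # the only alternative is the CPU

a = mx.random.normal((1024, 1024))
b = a @ a        # arrays live in unified memory; ops are lazy...
mx.eval(b)       # ...and run on the selected device when evaluated
```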

    I've learned most of this while working on my own private LLM inference app, cnvrs[10] — would love to hear your feedback or thoughts!

    Britt

    ---

    [1] https://github.com/huggingface/exporters/pull/37

    [2] https://apple.github.io/coremltools/docs-guides/source/flexi...

    [3] https://huggingface.co/tiiuae/falcon-7b-instruct/tree/main/c...

    [4] https://huggingface.co/coreml-projects/Llama-2-7b-chat-corem...

    [5] https://github.com/huggingface/swift-transformers

    [6] https://github.com/huggingface/exporters

    [7] https://developer.apple.com/documentation/metal/gpu_devices_...

    [8] https://github.com/ml-explore/mlx/issues/18

    [9] https://github.com/ggerganov/llama.cpp/issues/1714#issuecomm...

    [10] https://testflight.apple.com/join/ERFxInZg

  • exporters

    Export Hugging Face models to Core ML and TensorFlow Lite

  • enchanted

    Enchanted is an iOS and macOS app for chatting with private, self-hosted language models such as Llama 2, Mistral, or Vicuna via Ollama.

  • llama_index

    LlamaIndex is a data framework for your LLM applications

  • Mistral Instruct does use a system prompt.

    You can see the raw format here: https://www.promptingguide.ai/models/mistral-7b#chat-templat... and you can see how LlamaIndex uses it here (as an example): https://github.com/run-llama/llama_index/blob/1d861a9440cdc9...
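
    For illustration, the raw format boils down to wrapping turns in `[INST]` blocks; since Mistral-7B-Instruct has no dedicated system role, the system prompt is conventionally prepended to the first user message (which is what LlamaIndex's template does). A hedged sketch, with a hypothetical helper name:

```python
# Hypothetical helper sketching the raw Mistral-7B-Instruct chat format:
# <s>[INST] ... [/INST] answer</s>[INST] follow-up [/INST]
# The system prompt rides inside the first [INST] block.
def format_mistral_instruct(system, turns):
    """turns: list of (user_msg, assistant_reply or None for the last turn)."""
    out = "<s>"
    for i, (user, assistant) in enumerate(turns):
        content = f"{system}\n\n{user}" if (i == 0 and system) else user
        out += f"[INST] {content} [/INST]"
        if assistant is not None:
            out += f" {assistant}</s>"
    return out

print(format_mistral_instruct(
    "You are a concise assistant.",
    [("What is the capital of France?", "Paris."),
     ("And of Spain?", None)]))
```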

  • swift-transformers

    Swift Package to implement a transformers-like API in Swift

  • mlx

    MLX: An array framework for Apple silicon

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • Ollama (https://ollama.ai/) is a popular choice for running local LLM models and should work fine on Intel. It wraps llama.cpp behind a Docker-like CLI, so it shouldn't require an M2/M3.
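
    For reference, once `ollama serve` is running and a model has been pulled (`ollama pull mistral`), a local completion is one HTTP call to Ollama's documented REST endpoint. A minimal sketch:

```python
# Minimal sketch: ask a locally running Ollama server for a completion.
# Assumes the default endpoint (http://localhost:11434) and a pulled model.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral",
          "prompt": "Explain unified memory in one sentence.",
          "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```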



Related posts

  • May 8, 2024 AI, Machine Learning and Computer Vision Meetup

    2 projects | dev.to | 1 May 2024
  • SB-1047 will stifle open-source AI and decrease safety

    2 projects | news.ycombinator.com | 29 Apr 2024
  • What can LLMs never do?

    4 projects | news.ycombinator.com | 27 Apr 2024
  • Voxel51 Is Hiring AI Researchers and Scientists — What the New Open Science Positions Mean

    1 project | dev.to | 26 Apr 2024
  • Machine Learning and AI Beyond the Basics Book

    1 project | news.ycombinator.com | 16 Apr 2024