Experimenting with Local LLMs on macOS
I agree that it's kind of magical that you can download a ~10GB file and suddenly your laptop is running something that can summarize text, answer questions and even reason a bit.
The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.
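A rough back-of-the-envelope for why (a sketch only; it assumes ~4-bit quantization plus a couple of GB of overhead, so the numbers are estimates, not measurements):

```python
# Rough memory estimate for a quantized model: GB ~= billions of params
# x bytes per weight, plus overhead for KV cache and runtime buffers.
def estimated_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # ~bytes per weight
    overhead_gb = 2.0  # KV cache, context, runtime buffers (very rough guess)
    return weights_gb + overhead_gb

for size in (12, 20, 30):
    print(f"{size}B ~ {estimated_gb(size):.1f} GB")
# 12B ~ 8.8 GB, 20B ~ 13.2 GB, 30B ~ 18.9 GB
```

Once the OS and apps take their few GB of the 16GB, a 20B quant is already pushing it, which matches that ceiling.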
What I find interesting is that these models don't actually hit Apple's Neural Engine; they run on the GPU via Metal. Core ML isn't great for custom runtimes, and Apple hasn't given low-level developer access to the ANE afaik. And then there are memory-bandwidth and dedicated-SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
So far I haven't run into use cases that local LLMs handle convincingly without making me feel like I'm using the very first ChatGPT from 2022: they are limited and quite limiting. I'm curious what use cases the community has found that work for them. The example one user gave in this thread, about their local LLM inventing a Sun Tzu interview, is exactly the kind of limitation I'm talking about. How does one use a local LLM to do something actually useful?
I'm running Hermes Mistral and the very first thing it did was start hallucinating.
I recently started an audio dream journal and want to keep it private. I set up Whisper to transcribe the .wav file and dump it into an Obsidian folder.
The plan was to put a local llm step in to clean up the punctuation and paragraphs. I entered instructions to clean the transcript without changing or adding anything else.
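For reference, the pipeline I'm describing is roughly this (a sketch, not my exact setup; the paths, model name, and the local OpenAI-compatible endpoint such as LM Studio or a llama.cpp server are placeholders):

```python
# Sketch: transcribe a dream-journal recording with Whisper, then ask a
# locally served model (OpenAI-compatible endpoint) to clean up punctuation
# and paragraph breaks only.
import whisper
from openai import OpenAI

AUDIO = "dream-2024-06-01.wav"           # placeholder path
VAULT = "/path/to/ObsidianVault/Dreams"  # placeholder Obsidian folder

transcript = whisper.load_model("base").transcribe(AUDIO)["text"]

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
cleaned = client.chat.completions.create(
    model="local-model",  # whatever model the local server has loaded
    messages=[
        {"role": "system",
         "content": "Fix punctuation and paragraph breaks only. "
                    "Do not add, remove, or change any content."},
        {"role": "user", "content": transcript},
    ],
).choices[0].message.content

with open(f"{VAULT}/dream-2024-06-01.md", "w") as f:
    f.write(cleaned)
```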
Hermes responded by inventing an interview with Sun Tzu about why he wrote The Art of War. When I stopped the process, it apologized and said it had misunderstood what I'd said about Sun Tzu. I never mentioned Sun Tzu or even provided a transcript. Just instructions.
We went around with this for a while before I could even get it to admit the mistake, and it refused to identify why it occurred in the first place.
Having to meticulously check for weird hallucinations will be far more time-consuming than just doing the editing myself. The same logic applies to a lot of the areas I'd like to have a local llm for. Hopefully they'll get there soon.
I don't think we're anywhere close to running cutting-edge LLMs on our phones or laptops.
What may be around the corner is running great models on a box at home. The AI lives at home. Your thin client talks to it, and maybe runs a smaller AI on-device to balance latency and quality. (This would be a natural direction for Apple to take its Mac Pro line. $10k to $20k for a home LLM device isn't ridiculous.)
I believe local llms are the future. They will only get better. Once we get to the level of even last year's state of the art, I don't see any reason to use chatgpt/anthropic/other.
We don't even need one big model good at everything. Imagine loading a small model from a collection of dozens of models depending on the tasks you have in mind. There is no moat.
Unrelated but I really enjoyed the wavy text effect on “opinions” in the first paragraph
+1 to LM Studio. Helped build a lot of intuition.
Seeing and navigating all the configs helped me build intuition around what my macbook can or cannot do, how things are configured, how they work, etc...
Great way to spend an hour or two.
Every blog post or article about running local LLMs should include something about which hardware was used.
Check out Osaurus - MIT Licensed, native, Apple Silicon–only local LLM server - https://github.com/dinoki-ai/osaurus
Is anyone working on software that lets you run local LLMs in the browser?
In theory, it should be possible, shouldn't it?
The page would hold only the JavaScript that uses WebGL to run the neural net, and offer an "upload" button that the user can click to select a model from their file system. The button would not upload the model to a server; it would just let the JS code read it, convert it for WebGL, and move it into the GPU.
This way, one could download models from HuggingFace, store them locally and use them as needed. Nicely sandboxed and independent of the operating system.
It's a crazy upside-down world where the Mac Studio M3 Ultra 512GB is the reasonable option among the alternatives if you intend to run larger models at usable(ish) speeds.
The use of the word "emergent" is concerning to me. I believe this to be an... exaggeration of the observed effect. Depending on one's perspective and knowledge of the domain, this might seem emergent to some; however, we saw equally interesting developments with more complex Markov chaining, given the sheer lack of computational resources and time back then. What we are observing is just another step up that ladder, another angle for enumerating and picking the next best token in the sequence given the information revealed by the preceding words. Linguistics is all about efficient, lossless data transfer. While it's "cool" and very surprising... I don't believe we should be treating it as somewhere between a spell-checker and a sentient being. People aren't simple heuristic models, and to imply these machines are remotely close is woefully inaccurate and will lead to further confusion and disappointment in the future.
My main concern with running LLMs locally so far is that it absolutely kills your battery if you're constantly inferencing.
I really like On-Device AI on iPhone (also runs on Mac): https://ondevice-ai.app in addition to LM Studio. It has a nice interface, with multiple prompt integration, and a good selection of models. Also the developer is responsive.
As someone who sometimes downloads random models to play around on my 16GB Mac Mini, I like his suggestions of models. I guess these are the best ones for their sizes if you get down to 4 or 5 worth keeping.
DEVONThink 4’s support for local models is great and could possibly contribute to the software’s enduring success for the next 10 years. I’ve found it helpful for summarizing documents and selections of text, but it can do a lot more than that apparently.
https://www.devontechnologies.com/blog/20250513-local-ai-in-...
Oddly, my 2013 Mac Pro (Trashcan) runs LLMs pretty well, mostly because 64GB of old-school RAM is, like, $25.
I think the best models around right now that most people can fit (in some quantization) on an Apple Silicon Mac or a gaming PC would be:
For non-coding: Qwen3-30B-A3B-Instruct-2507 (or the thinking variant, depending on use case)
For coding: Qwen3-Coder-30B-A3B-Instruct
---
If you have a bit more VRAM, GLM-4.5-Air or the full GLM-4.5
I have a MacBook Air M4 with 32 GB. What LM Studio models would you recommend for:
* General Q&A
* Specific to programming - mostly Python and Go.
I forgot the command now, but I did run a command that allowed macOS to allocate maybe 28 GB of RAM to the GPU for use with LLMs.
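For context, here's roughly how I point scripts at whatever model LM Studio has loaded (a sketch assuming LM Studio's local OpenAI-compatible server on its default port; the chosen model and the prompt are just examples):

```python
# Sketch: ask LM Studio's local server which models it has available,
# then send a quick coding question to one of them.
import requests

BASE = "http://localhost:1234/v1"  # LM Studio's default local server

models = requests.get(f"{BASE}/models").json()["data"]
print([m["id"] for m in models])  # identifiers as LM Studio reports them

resp = requests.post(f"{BASE}/chat/completions", json={
    "model": models[0]["id"],  # or pick a specific one from the list
    "messages": [{"role": "user",
                  "content": "Write a Go function that reverses a string."}],
})
print(resp.json()["choices"][0]["message"]["content"])
```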
The really tough spot is finding a good model for your use case. I have a 16GB MacBook and have been paralyzed by the many options. I've settled on a quantized 14B Qwen for now, but I have no idea whether that's a good choice.
What is the best local model for Cursor-style autocomplete/code suggestions? And is there an extension for VS Code which can integrate a local model for such use?
I am still looking for a local image captioner. Any suggestions on which are the three easiest to use?
By far the easiest (open source, Mac) is Pico AI Server with Witsy as a front end:
https://apps.apple.com/us/app/pico-ai-server-llm-vlm-mlx/id6...
Witsy:
https://github.com/nbonamy/witsy
...and you really want at least 48GB of RAM to run >24B models.
The #1 thing they need to do is open up the ANE for developers to access properly.
>I also use them for brain-dumping. I find it hard to keep a journal, because I find it boring, but when you’re pretending to be writing to someone, it’s easier. If you have friends, that’s much better, but some topics are too personal and a friend may not be available at 4 AM. I mostly ignore its responses, because it’s for me to unload, not to listen to a machine spew slop. I suggest you do the same, because we’re anthropomorphization machines and I’d rather not experience AI psychosis. It’s better if you don’t give it a chance to convince you it’s real. I could use a system prompt so it doesn’t follow up with dumb questions (or “YoU’Re AbSoLuTeLy CoRrEcT”s), but I never bothered as I already don’t read it.
Reads like someone starting to get their daily drinks, already using them for "company" and fun, and saying "I'm not an alcoholic, I can quit anytime".
I still don't think macOS is such a great idea
An awful lot of Monday-morning-quarterback CEOs are here running their mouths about what Tim Cook should do or what they would do. Chill out with the extremely confident ignorance. Tim Cook brought Apple to a billion dollars in free cash; he doesn't need to ride the hype train.
Also, let's not forget they are first and foremost designers of hardware, and the arms race is only getting started.
Ollama is another good choice for this purpose. It's essentially a wrapper around llama.cpp that adds easy downloading and management of running instances. It's great! Also works on Linux!
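A minimal sketch of what that looks like from a script (assuming ollama is running on its default port and the model has already been pulled with `ollama pull`; the model name is just an example):

```python
# Sketch: talk to a locally running ollama instance via its HTTP API
# (default port 11434). Assumes the model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",  # example model name
        "prompt": "Summarize why unified memory matters for local LLMs.",
        "stream": False,      # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```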