Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to this and finding more value than just renting tokens from Anthropic of OpenAI?


Idk about everyone else, but I don’t want to rent tokens forever. I want a self hosted model that is completely private and can’t be monitored or adulterated without me knowing. I use both currently, but I am excited at the prospect of maybe not having to in the near to mid future.

I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.


Exactly. Relying on external compute for professional work is a non-starter IMO.


Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.


I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.


Some tasks don’t require SOTA models. For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate and works offline. Also if you can break down certain tasks like JSON healing into small focused coding tasks then local models are useful


> For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate and works offline.

What does better mean here? Does it handle formal vs informal speech? Idiomatic expressions? Regional variances (like American vs British English)? These are areas where Google Translate is weak.

How fast are we talking here (including initial loading times) and what's the impact on your phone battery? Also, what iPhone do you have?

I am really interested in this application hence my questions.


How does that work? Wouldn't it be slow loading the weights into memory every time you launch it?


I'm guessing they're not using it as a word dictionary, but rather translating longer texts where the time to load the model isn't a significant issue.


Is it really better? In which languages?


Yes it is and has been for a very long time, it has been years now. Gemini 1.5 Pro is when LLM translations started significantly outperforming non-LLM machine translation, and that came out over 2 years ago.

Ever since then Google models have been the strongest at translation across the board, so it's no surprise Gemma 4 does well. Gemini 3 Flash is better at translation than any Claude or GPT model. OpenAI models have always been weakest at it, continuing to this day. It's quite interesting how these characteristics have stayed stable over time and many model versions.

I'm primarily talking about non-trivial language pairs, something like English<>Spanish is so "easy" now it's hard to distinguish the strong models.


I've been using gemma4 for translating Mongolian to English. It runs circles around Google Translate for that language pair, it's not even close.


I translate texts between Ukrainian, Russian and English dozens of times daily. The LLM translation is not only better, it's also refineable, you can chat with the AI to make changes to what you meant.


Do you use E2B or E4B?


The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)


The open weights models absolutely compete with flagship labs for most tasks. OpenAI and Anthropic's "cheap tier" models are completely uncompetitive with them for "quality / $" and it's not close. Google is the only one who has remained competitive in the <$5/1M output tier with Flash, and now has an incredibly strong release with Gemma 4.

Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.


I've been largely using Qwen3.5-122b at 6 bit quant locally for some c++/go/python dev lately because it is quite capable as long as I can give it pretty specific asks within the codebase and it will produce code that needs minimal massaging to fit into the project.

I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.


How much VRAM do you need for that?


Not OP, but I ran 122b successfully with normal RAM offloading. You dont need all that much VRAM, which is super expensive. I used 96gb ram + 16gb vram gpu. But it's not very fast in that setup, maybe 15 token per second. Still, you can give it a task and come back later and its done. (Disclaimer: I build that PC before stuff got expensive)


128GB on a mac with unified memory. The model itself takes something like 110 of that and then I have ~16 left over to hold a reasonably sized context and 2 for the OS.

I do have a dedicated machine for it though because I can't run an IDE at the same time as that model.


I squeeze Qwen3.5-122B-A10B at Q6 into 128GB. It's a great model.


Wow what kind of hardware do you have? Mac Studio, dgx spark, strix halo? How fast is it?


Strix Halo, I'm seeing performance inline with these results[0].

I'm interested to investigate the claimed gains from the lemonade-sdk port of Apple MLX inference[1].

[0]https://kyuz0.github.io/amd-strix-halo-toolboxes/

[1]https://github.com/lemonade-sdk/lemonade/issues/1642


I am using it with pi agent and I have stopped renting tokens. Much better for me than Claude Code, on M1 Max 64GB. This model with oMLX is at 16k context PP 919.9 tok/s and TG 54.7 tok/s. You have to manage the context but the better you manage context the more focused the output is. I use it without thinking.


I use local models for asking about personal financial or health data that I want to keep local and private. Or even just whipping up quick and dirty prototypes for whatever I can think of but not seriously enough to spend tokens that I rather use on real projects.


The privacy/data security angle really is important in some regions and industries. Think European privacy laws or customers demanding NDAs. The value of Anthropic and OpenAI is zero for both cases, so easy to beat, despite local models being dumber and slower.


I use LMStudio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also use it with the Zed editor agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities. I often use Qwen or Gemma models for their vision capabilities. For example, I often will finish ML training runs, take a photo of the graphs and visualizations of the run metrics and ask the model to tell me things I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".


It’s easy to find a combination of llama.cpp and a coding tool like OpenCode for these. Asking an LLM for help setting it up can work well if you don’t want to find a guide yourself.

> and finding more value than just renting tokens from Anthropic of OpenAI?

Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.


Oh lord, are the LLMs already replacing LLMs?


I'm using the smaller vision models (Qwen3.5-4B currently) with Frigate, a FOSS self-hosted "AI" NVR. It's good enough at analyzing images to figure out mostly what's happening, and doesn't require the big knowledge base that bigger models have.

Also use a bigger model for summarizing or translating text, which I don't consume in realtime, so doesn't need to be fast. Would be a thing I could use OpenAI's batch APIs for if I did need something higher quality.


I'm using forge code (https://forgecode.dev/) with various local and cloud models and I really like it. MiniMax 2.7 is really great with it, and the new Qwen 3.6 35B A3B feels much stronger, after some testing, than the 3.5 version. Check some harness benchmarks. Forge outperforms Claude Code with Opus by a big margin.


There are really nice GUIs for LLMs - CherryStudio for example, can be used with local or cloud models.

There are also web-UIs - just like the labs ones.

And you can connect coding agents like Codex, Copilot or Pi to local coding agents - the support OpenAI compatible APIs.

It's literally a terminal command to start serving the model locally and you can connect various things to it, like Codex.


I am working on a research project to link churches from their IRS Exempt org BMF entry to their google search result from 10 fetched. Gwen2.5-14b on a 16gb Mac Mini. It works good enough!

It's entertaining to see HN increasingly consider coding harness as the only value a model can provide.


While they can be run locally, and most of the discussion on HN about that, I bet that if you look at total tok/day local usage is a tiny amount compared to total cloud inference even for these models. Most people who do use them locally just do a prompt every now and then.


This is why I'd like to see a lot more focus on batched inference with lower-end hardware. If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so, you don't really need top-of-the-line hardware even for SOTA results.


> If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so

But they can't? The usage pattern is the polar opposite. Most people running these models locally just ask a few questions to it throughout the day. They want the answers now, or at least within a minute.


If you want the answer right now, that alone ups your compute needs to the point where you're probably better off just using a free hosted-AI service. Unless the prompt is trivial enough that it can be answered quickly by a tiny local model.


A strix halo machine or MAC will run at less than 20watts idle. You could leave it running.


That’s a good point. I think I saw Together.ai with that offering, but for some reason just never think to throw random non urgent coding tasks at it overnight


They are okay for vibe coding throw-away projects without spending your Anthrophic/OAI tokens


I was thinking the same thing. My only guess is that they are excited about local models because they can run it cheaper through Open Router ?


always inside claude code, just using ollama, takes 2 seconds




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: