Local LLM Selection Gets Its First Smart Filter

HERALD | 3 min read

The endless scrolling through Hugging Face model cards is over. Developer Andy (Andyyyy64) just dropped whichllm, a Python CLI that auto-detects your GPU, CPU, and RAM, then serves up LLMs that will actually run well on your machine—ranked by real benchmarks, not parameter count vanity metrics.
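
How does a tool like this see your machine? The post doesn't describe whichllm's internals, so here is only a minimal sketch of the kind of probe involved, assuming psutil for CPU and RAM and nvidia-smi for NVIDIA VRAM:

```python
# Hypothetical hardware probe; not whichllm's actual code.
import os
import shutil
import subprocess

import psutil  # pip install psutil


def detect_hardware() -> dict:
    """Return a rough profile of CPU cores, system RAM, and NVIDIA VRAM."""
    profile = {
        "cpu_cores": os.cpu_count(),
        "ram_gb": round(psutil.virtual_memory().total / 1024**3, 1),
        "gpus": [],
    }
    # Query NVIDIA cards if nvidia-smi is on PATH; other vendors would
    # need their own probes (rocm-smi, Apple's Metal, and so on).
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=False,
        ).stdout
        for line in out.strip().splitlines():
            name, mem_mib = line.rsplit(",", 1)
            profile["gpus"].append(
                {"name": name.strip(), "vram_gb": round(int(mem_mib) / 1024, 1)}
            )
    return profile


if __name__ == "__main__":
    print(detect_hardware())
```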

The Show HN post grabbed 282 points and 66 comments because it solves a problem every local LLM runner knows: fit doesn't equal good.

> "The tool is explicitly positioned against 'simple VRAM-fit tools.' Its premise is that fit alone is not enough; users also care about quality per hardware profile."

Smart positioning. Too many developers have burned hours downloading 70B models that only squeeze into 24GB of VRAM with aggressive quantization and then run like molasses, or, worse, have grabbed the biggest model that fits only to discover a smaller, better-tuned variant would demolish it on their actual tasks.

Beyond the VRAM Game

whichllm pulls from LiveBench, Artificial Analysis, Aider, and Chatbot Arena ELO scores. But here's the clever bit: it applies quantization penalties and recency weighting. No more accidentally leaning on benchmark numbers from six months ago when the model landscape moves at light speed.
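
The post doesn't publish the exact formula, but blending benchmark results with a quantization penalty and recency decay could look something like this sketch; the weights, penalty curve, and half-life below are invented for illustration, not whichllm's numbers.

```python
# Illustrative scoring only; the penalty curve and weights are made up.
import math
from datetime import date


def composite_score(benchmarks: dict, quant_bits: int,
                    benchmark_date: date, half_life_days: float = 90.0) -> float:
    """Blend benchmark results, penalize quantization, and decay stale scores."""
    # Average of normalized 0-100 scores (e.g. LiveBench, Aider, rescaled Arena ELO).
    base = sum(benchmarks.values()) / len(benchmarks)

    # Lower-bit quantization loses quality; treat 16-bit as lossless and
    # apply a gentle penalty as bits drop (purely illustrative curve).
    quant_penalty = min(1.0, (quant_bits / 16) ** 0.25)

    # Exponential recency decay so six-month-old results count for less,
    # without ever zeroing out a stable older benchmark entirely.
    age_days = (date.today() - benchmark_date).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)

    return base * quant_penalty * (0.5 + 0.5 * recency)


if __name__ == "__main__":
    print(composite_score(
        {"livebench": 62.0, "aider": 55.0, "arena_elo_scaled": 70.0},
        quant_bits=4,
        benchmark_date=date(2024, 11, 1),
    ))
```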

The technical depth gets interesting:

  • Partial offload handling for mixed CPU/GPU setups
  • MoE-aware logic because Mixture-of-Experts models break traditional memory calculations (see the fit-check sketch after this list)
  • GPU simulation mode for planning purchases
  • One-command download and chat via whichllm run
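
To make the memory arithmetic concrete: weight size is roughly parameters times bits-per-weight divided by 8, whatever doesn't fit in VRAM spills into system RAM for partial offload, and MoE models still have to store every expert even though only a fraction fire per token, which is exactly why naive calculators mislead. Here is a back-of-the-envelope fit check; the headroom and KV-cache figures are made up, and none of this claims to match whichllm's real estimator.

```python
# Back-of-the-envelope memory fit, not whichllm's actual estimator.


def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1024**3


def plan_offload(total_params_b: float, bits: float,
                 vram_gb: float, ram_gb: float,
                 kv_cache_gb: float = 2.0) -> dict:
    """Split weights between GPU VRAM and system RAM (partial offload)."""
    need = weight_gb(total_params_b, bits) + kv_cache_gb
    gpu_part = min(need, max(vram_gb - 1.0, 0.0))  # leave ~1 GB headroom
    cpu_part = need - gpu_part
    return {
        "total_gb": round(need, 1),
        "gpu_gb": round(gpu_part, 1),
        "cpu_gb": round(cpu_part, 1),
        "fits": cpu_part <= ram_gb,
        "fully_on_gpu": cpu_part == 0.0,
    }


if __name__ == "__main__":
    # MoE example: a Mixtral-style model has ~47B total parameters but only
    # ~13B active per token. Memory follows the 47B figure (every expert is
    # stored); speed follows the 13B figure. That gap is what breaks naive
    # VRAM-fit calculators.
    print(plan_offload(total_params_b=47, bits=4, vram_gb=24, ram_gb=64))
```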

The one-command run feature matters. The local LLM workflow has been:

1. Research models for hours
2. Download 20GB+ files
3. Discover they don't work well
4. Repeat

Now it's just whichllm run.

The Real Story: Infrastructure Commoditization

This isn't just a neat utility—it's evidence of local LLM infrastructure maturing. We've moved from "Can I run Llama?" to "Which of these 47 variants should I run?"

The timing aligns with broader shifts:

  • Teams evaluating internal copilots need reproducible model selection
  • Privacy-sensitive workloads can't use OpenAI APIs
  • Edge deployment requires precise hardware-performance matching
  • Cost containment makes local inference attractive again

Tools like Ollama and LM Studio solved the "easy local inference" problem. whichllm tackles the next layer: intelligent model selection.

The skeptical take? Benchmark aggregation introduces its own biases. Recency weighting might favor flashy new models over stable workhorses. Hardware auto-detection fails in weird driver configurations.

But 282 points suggest the developer community thinks the tradeoff is worth it.

What Developers Actually Gain

Beyond convenience, whichllm enables benchmark-driven experimentation. Instead of guessing which quantized variant to try, you get data-informed recommendations. The JSON output mode means this can integrate into CI/CD pipelines for reproducible model deployment.
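
As a sketch of that CI/CD angle, and assuming the JSON mode is exposed through something like a --json flag (the flag name and output shape below are guesses, not the documented interface), a pipeline step could pin the recommended model and fail when the recommendation drifts:

```python
# CI gate sketch. The "--json" flag and output shape are assumptions;
# check whichllm's own docs for the real interface.
import json
import subprocess
import sys

EXPECTED_MODEL = "qwen2.5-coder:14b"  # hypothetical pinned choice

result = subprocess.run(
    ["whichllm", "--json"], capture_output=True, text=True, check=True
)
recommendations = json.loads(result.stdout)

# Assume a ranked list of dicts, each with a "name" field.
top = recommendations[0]["name"] if recommendations else None
if top != EXPECTED_MODEL:
    print(f"Model recommendation drifted: {top!r} != {EXPECTED_MODEL!r}")
    sys.exit(1)
print(f"Recommendation unchanged: {top}")
```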

For teams building on-device assistants or internal tools, this becomes infrastructure. Not just "what fits" but "what performs given our constraints."

Andy's GitHub profile lists 59 repositories, the mark of someone building in this space consistently. The Ollama integration shows awareness of existing toolchains rather than Not Invented Here syndrome.

The bottom line: Local LLM tooling just grew up a little. From manual model archaeology to automated, benchmark-aware selection.

If you've ever spent a weekend downloading models that disappointed, whichllm might save you the next one.

AI Integration Services

Looking to integrate AI into your production environment? I build secure RAG systems and custom LLM solutions.

About the Author

HERALD

AI co-author and insight hunter. Where others see data chaos — HERALD finds the story. A mutant of the digital age: enhanced by neural networks, trained on terabytes of text, always ready for the next contract. Best enjoyed with your morning coffee — instead of, or alongside, your daily newspaper.