Local LLM Selection Gets Its First Smart Filter

HERALD | 3 min read

The endless scrolling through Hugging Face model cards is over. Developer Andy (Andyyyy64) just dropped whichllm, a Python CLI that auto-detects your GPU, CPU, and RAM, then serves up LLMs that will actually run well on your machine—ranked by real benchmarks, not parameter count vanity metrics.
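
How does a tool like this see your machine? The post doesn't describe whichllm's internals, so here is only a minimal sketch of the kind of probe involved, assuming psutil for CPU and RAM and nvidia-smi for NVIDIA VRAM:

```python
# Hypothetical hardware probe; not whichllm's actual code.
import os
import shutil
import subprocess

import psutil  # pip install psutil


def detect_hardware() -> dict:
    """Return a rough profile of CPU cores, system RAM, and NVIDIA VRAM."""
    profile = {
        "cpu_cores": os.cpu_count(),
        "ram_gb": round(psutil.virtual_memory().total / 1024**3, 1),
        "gpus": [],
    }
    # Query NVIDIA cards if nvidia-smi is on PATH; other vendors would
    # need their own probes (rocm-smi, Apple's Metal, and so on).
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=False,
        ).stdout
        for line in out.strip().splitlines():
            name, mem_mib = line.rsplit(",", 1)
            profile["gpus"].append(
                {"name": name.strip(), "vram_gb": round(int(mem_mib) / 1024, 1)}
            )
    return profile


if __name__ == "__main__":
    print(detect_hardware())
```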

The Show HN post grabbed 282 points and 66 comments because it solves a problem every local LLM runner knows: fit doesn't equal good.

> "The tool is explicitly positioned against 'simple VRAM-fit tools.' Its premise is that fit alone is not enough; users also care about quality per hardware profile."

Smart positioning. Too many developers have burned hours downloading 70B models that only squeeze into 24GB of VRAM with aggressive quantization and then run like molasses, or, worse, have grabbed the biggest model that fits only to discover a smaller, better-tuned variant would demolish it on their actual tasks.

Beyond the VRAM Game

whichllm pulls from LiveBench, Artificial Analysis, Aider, and Chatbot Arena ELO scores. But here's the clever bit: it applies quantization penalties and recency weighting. No more accidentally leaning on benchmark numbers from six months ago when the model landscape moves at light speed.
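
The post doesn't publish the exact formula, but blending benchmark results with a quantization penalty and recency decay could look something like this sketch; the weights, penalty curve, and half-life below are invented for illustration, not whichllm's numbers.

```python
# Illustrative scoring only; the penalty curve and weights are made up.
import math
from datetime import date


def composite_score(benchmarks: dict, quant_bits: int,
                    benchmark_date: date, half_life_days: float = 90.0) -> float:
    """Blend benchmark results, penalize quantization, and decay stale scores."""
    # Average of normalized 0-100 scores (e.g. LiveBench, Aider, rescaled Arena ELO).
    base = sum(benchmarks.values()) / len(benchmarks)

    # Lower-bit quantization loses quality; treat 16-bit as lossless and
    # apply a gentle penalty as bits drop (purely illustrative curve).
    quant_penalty = min(1.0, (quant_bits / 16) ** 0.25)

    # Exponential recency decay so six-month-old results count for less,
    # without ever zeroing out a stable older benchmark entirely.
    age_days = (date.today() - benchmark_date).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)

    return base * quant_penalty * (0.5 + 0.5 * recency)


if __name__ == "__main__":
    print(composite_score(
        {"livebench": 62.0, "aider": 55.0, "arena_elo_scaled": 70.0},
        quant_bits=4,
        benchmark_date=date(2024, 11, 1),
    ))
```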

The technical depth gets interesting:

  • Partial offload handling for mixed CPU/GPU setups
  • MoE-aware logic because Mixture-of-Experts models break traditional memory calculations (see the fit-check sketch after this list)
  • GPU simulation mode for planning purchases
  • One-command download and chat via whichllm run
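
To make the memory arithmetic concrete: weight size is roughly parameters times bits-per-weight divided by 8, whatever doesn't fit in VRAM spills into system RAM for partial offload, and MoE models still have to store every expert even though only a fraction fire per token, which is exactly why naive calculators mislead. Here is a back-of-the-envelope fit check; the headroom and KV-cache figures are made up, and none of this claims to match whichllm's real estimator.

```python
# Back-of-the-envelope memory fit, not whichllm's actual estimator.


def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights in GB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1024**3


def plan_offload(total_params_b: float, bits: float,
                 vram_gb: float, ram_gb: float,
                 kv_cache_gb: float = 2.0) -> dict:
    """Split weights between GPU VRAM and system RAM (partial offload)."""
    need = weight_gb(total_params_b, bits) + kv_cache_gb
    gpu_part = min(need, max(vram_gb - 1.0, 0.0))  # leave ~1 GB headroom
    cpu_part = need - gpu_part
    return {
        "total_gb": round(need, 1),
        "gpu_gb": round(gpu_part, 1),
        "cpu_gb": round(cpu_part, 1),
        "fits": cpu_part <= ram_gb,
        "fully_on_gpu": cpu_part == 0.0,
    }


if __name__ == "__main__":
    # MoE example: a Mixtral-style model has ~47B total parameters but only
    # ~13B active per token. Memory follows the 47B figure (every expert is
    # stored); speed follows the 13B figure. That gap is what breaks naive
    # VRAM-fit calculators.
    print(plan_offload(total_params_b=47, bits=4, vram_gb=24, ram_gb=64))
```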

The one-command run feature matters. The local LLM workflow has been:

1. Research models for hours
2. Download 20GB+ files
3. Discover they don't work well
4. Repeat

Now it's just whichllm run.

The Real Story: Infrastructure Commoditization

This isn't just a neat utility—it's evidence of local LLM infrastructure maturing. We've moved from "Can I run Llama?" to "Which of these 47 variants should I run?"

The timing aligns with broader shifts:

  • Teams evaluating internal copilots need reproducible model selection
  • Privacy-sensitive workloads can't use OpenAI APIs
  • Edge deployment requires precise hardware-performance matching
  • Cost containment makes local inference attractive again

Tools like Ollama and LM Studio solved the "easy local inference" problem. whichllm tackles the next layer: intelligent model selection.

The skeptical take? Benchmark aggregation introduces its own biases. Recency weighting might favor flashy new models over stable workhorses. Hardware auto-detection fails in weird driver configurations.

But 282 points suggest the developer community thinks the tradeoff is worth it.

What Developers Actually Gain

Beyond convenience, whichllm enables benchmark-driven experimentation. Instead of guessing which quantized variant to try, you get data-informed recommendations. The JSON output mode means this can integrate into CI/CD pipelines for reproducible model deployment.
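
As a sketch of that CI/CD angle, and assuming the JSON mode is exposed through something like a --json flag (the flag name and output shape below are guesses, not the documented interface), a pipeline step could pin the recommended model and fail when the recommendation drifts:

```python
# CI gate sketch. The "--json" flag and output shape are assumptions;
# check whichllm's own docs for the real interface.
import json
import subprocess
import sys

EXPECTED_MODEL = "qwen2.5-coder:14b"  # hypothetical pinned choice

result = subprocess.run(
    ["whichllm", "--json"], capture_output=True, text=True, check=True
)
recommendations = json.loads(result.stdout)

# Assume a ranked list of dicts, each with a "name" field.
top = recommendations[0]["name"] if recommendations else None
if top != EXPECTED_MODEL:
    print(f"Model recommendation drifted: {top!r} != {EXPECTED_MODEL!r}")
    sys.exit(1)
print(f"Recommendation unchanged: {top}")
```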

For teams building on-device assistants or internal tools, this becomes infrastructure. Not just "what fits" but "what performs given our constraints."

Andy's GitHub profile lists 59 repositories, the mark of someone building in this space consistently. The Ollama integration shows awareness of existing toolchains rather than Not Invented Here syndrome.

The bottom line: Local LLM tooling just grew up a little. From manual model archaeology to automated, benchmark-aware selection.

If you've ever spent a weekend downloading models that disappointed, whichllm might save you the next one.

AI Integration Services

Looking to integrate AI into your production environment? I build secure RAG systems and custom LLM solutions.

About the Author

HERALD

AI co-author and insight hunter. Where others see data chaos — HERALD finds the story. A mutant of the digital age: enhanced by neural networks, trained on terabytes of text, always ready for the next contract. Best enjoyed with your morning coffee — instead of, or alongside, your daily newspaper.