Project

model-fit — Find the Local LLMs That Actually Run on Your Hardware

Client: Open Source (Personal Project)
Timeline: 2 Weeks
Role: Creator & Maintainer
Year: 2026
Participants:
  • 1 Developer

Hardware-Aware LLM Compatibility Engine

model-fit is an open-source command-line tool, published on npm, that tells you which local LLMs will actually run on your machine before you waste time downloading them. It detects your CPU, GPU and RAM, then computes each model's real memory budget (weights + a KV cache sized to your context length + overhead) and classifies the model as full-GPU, hybrid, CPU-only, or won't-run. Every verdict comes with an estimated tokens/sec and the reasoning behind each number. No opaque scores, just transparent VRAM math.
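To make the four categories concrete, here is a minimal TypeScript sketch of how a total memory budget could be mapped to a fit class. It is not model-fit's actual code: the `Hardware` shape, the `classifyFit` name and the rule that a hybrid split may use VRAM plus system RAM are illustrative assumptions.

```typescript
// Hypothetical fit categories mirroring the four classes described above.
type Fit = "full-GPU" | "hybrid" | "CPU-only" | "won't-run";

// Hypothetical hardware summary (not model-fit's real type).
interface Hardware {
  vramGiB: number; // dedicated GPU memory, 0 if there is no usable GPU
  ramGiB: number;  // system RAM available for the model
}

/**
 * Classify a model from its total memory budget (weights + KV cache + overhead).
 * Assumed rules: full-GPU if the budget fits in VRAM; hybrid if it fits in
 * VRAM + RAM (layers split across both); CPU-only if there is no GPU and it
 * fits in RAM; otherwise won't-run.
 */
function classifyFit(budgetGiB: number, hw: Hardware): Fit {
  if (hw.vramGiB > 0 && budgetGiB <= hw.vramGiB) return "full-GPU";
  if (hw.vramGiB > 0 && budgetGiB <= hw.vramGiB + hw.ramGiB) return "hybrid";
  if (budgetGiB <= hw.ramGiB) return "CPU-only";
  return "won't-run";
}

// Example: an 8 GiB GPU with 16 GiB of RAM, and a GPU-less 16 GiB machine.
console.log(classifyFit(5.5, { vramGiB: 8, ramGiB: 16 })); // full-GPU
console.log(classifyFit(20, { vramGiB: 8, ramGiB: 16 }));  // hybrid
console.log(classifyFit(10, { vramGiB: 0, ramGiB: 16 }));  // CPU-only
console.log(classifyFit(40, { vramGiB: 8, ramGiB: 16 }));  // won't-run
```

Keeping the classification a pure function of the budget and the hardware summary means the same memory arithmetic can be reused for every model in the catalog.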

[Screenshot: model-fit hardware detection and model recommendations]
[Screenshot: model-fit memory breakdown and tokens-per-second estimate]

The Challenge

Anyone running local LLMs hits the same wall: models that crash with out-of-memory errors or crawl at 2 tokens/sec. Existing tools hand out vague compatibility scores that ignore the single biggest variable: the KV cache, which grows linearly with context length and can dwarf the model weights themselves. Worse, naive estimators typically assume one key/value head per attention head, so they overestimate the KV cache of grouped-query attention (GQA) models such as Llama 3 and Qwen2.5, whose query heads share a smaller set of KV heads, and wrongly flag them as "won't run" at long context. The challenge was to model real memory behaviour accurately across heterogeneous hardware (Windows, macOS, Linux) and a constantly changing model catalog, while keeping the tool fast and fully usable offline.
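The KV cache term is where GQA matters. Below is a minimal sketch, assuming an fp16 cache (2 bytes per element) and the standard layout of one key and one value vector per KV head, per layer, per token. The function names are hypothetical, and the example uses a Llama-3-8B-style shape (32 layers, 32 query heads, 8 KV heads, head dimension 128).

```typescript
// Attention shape needed to size the KV cache (hypothetical type).
interface AttentionConfig {
  layers: number;     // transformer blocks
  queryHeads: number; // attention (query) heads per layer
  kvHeads: number;    // key/value heads per layer (fewer than queryHeads under GQA)
  headDim: number;    // dimension of each head
}

// KV cache bytes = 2 (K and V) × layers × kvHeads × headDim × contextTokens × bytesPerElement
function kvCacheBytes(cfg: AttentionConfig, contextTokens: number, bytesPerElement = 2): number {
  return 2 * cfg.layers * cfg.kvHeads * cfg.headDim * contextTokens * bytesPerElement;
}

// Naive estimate: pretend every query head has its own K/V head, i.e. ignore GQA.
function naiveKvCacheBytes(cfg: AttentionConfig, contextTokens: number, bytesPerElement = 2): number {
  return kvCacheBytes({ ...cfg, kvHeads: cfg.queryHeads }, contextTokens, bytesPerElement);
}

const GiB = 1024 ** 3;

// Llama-3-8B-style attention shape.
const llama3_8b: AttentionConfig = { layers: 32, queryHeads: 32, kvHeads: 8, headDim: 128 };

console.log(kvCacheBytes(llama3_8b, 8192) / GiB);        // 1  (GQA-aware, 8k context)
console.log(kvCacheBytes(llama3_8b, 131072) / GiB);      // 16 (GQA-aware, 128k context)
console.log(naiveKvCacheBytes(llama3_8b, 131072) / GiB); // 64 (GQA ignored, 128k context)
```

At 131,072 tokens the GQA-aware figure is 16 GiB, while an estimator that ignores GQA reports 64 GiB, four times too much. For scale, the 4-bit weights of an 8B model occupy roughly 4 to 5 GB, so at long context the cache, not the weights, decides whether the model fits.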

The Solution

I built model-fit as a zero-config Node.js CLI that runs instantly via npx model-fit. It detects hardware through the systeminformation library, then derives every figure from first principles:
  • weights = parameters × bytes per weight for the chosen quantization;
  • the KV cache, sized to the actual context window and the model's real attention architecture (GQA-aware);
  • generation speed, approximated as memory bandwidth ÷ the bytes of weights read per token, which for MoE models means only the active experts.

The catalog merges three sources: a curated offline seed list, the user's local Ollama models with their real on-disk sizes, and the Hugging Face Hub GGUF catalog ranked by popularity. The results are de-duplicated and cached for 24 hours, so the tool stays fast and works offline. Commands include detect, recommend (with category filters such as coding, reasoning and vision) and check, which prints a full memory breakdown for a single model. model-fit ships as an MIT-licensed package.
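The weights and speed formulas can be sketched as follows. This is a hedged illustration rather than model-fit's implementation: the `ModelSpec` fields, the use of effective bits per weight, and the 300 GB/s bandwidth in the example are assumptions, and the result is a bandwidth-bound ceiling rather than a measured speed.

```typescript
// Hypothetical model description (not model-fit's real type).
interface ModelSpec {
  totalParamsB: number;  // total parameters, in billions
  activeParamsB: number; // parameters read per token: all of them for dense models, active experts for MoE
  bitsPerWeight: number; // effective bits per weight for the chosen quantization
}

// weights = params × bytes-per-weight (every expert must be resident, so use the total).
function weightBytes(m: ModelSpec): number {
  return m.totalParamsB * 1e9 * (m.bitsPerWeight / 8);
}

// Decoding is assumed memory-bandwidth bound: each generated token streams the active
// weights once, so tokens/sec ≈ bandwidth ÷ bytes of active weights (an upper bound).
function estimateTokensPerSec(m: ModelSpec, bandwidthGBps: number): number {
  const activeBytes = m.activeParamsB * 1e9 * (m.bitsPerWeight / 8);
  return (bandwidthGBps * 1e9) / activeBytes;
}

// Illustrative models at 4 bits per weight on a hypothetical 300 GB/s memory system.
const dense8B: ModelSpec = { totalParamsB: 8, activeParamsB: 8, bitsPerWeight: 4 };
const moe30BA3B: ModelSpec = { totalParamsB: 30, activeParamsB: 3, bitsPerWeight: 4 };

console.log(weightBytes(dense8B) / 1e9);             // 4   (GB)
console.log(estimateTokensPerSec(dense8B, 300));     // 75  (tokens/sec ceiling)
console.log(weightBytes(moe30BA3B) / 1e9);           // 15  (GB)
console.log(estimateTokensPerSec(moe30BA3B, 300));   // 200 (tokens/sec ceiling)
```

Because decoding reads only the active experts, the hypothetical 30B-total / 3B-active MoE needs almost four times the memory of the dense 8B model yet is estimated to generate tokens faster. That is why memory footprint and speed have to be computed from different parameter counts.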

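The merge-and-cache step can be illustrated with a small sketch. The source precedence (local Ollama entries, with real on-disk sizes, over the curated seed list, over Hugging Face results), the `normalizeId` rule and the entry shape are all assumptions made for illustration; only the three sources and the 24-hour lifetime come from the description above.

```typescript
type Source = "seed" | "ollama" | "huggingface";

// Hypothetical catalog entry shape.
interface CatalogEntry {
  id: string;         // model identifier as reported by its source
  source: Source;
  sizeBytes?: number; // on-disk size when known (e.g. from Ollama)
}

// Assumed precedence: lower number wins when two sources describe the same model.
const PRIORITY: Record<Source, number> = { ollama: 0, seed: 1, huggingface: 2 };

// Hypothetical normalization: lowercase, and map ':', '_' and whitespace to '-',
// so "Qwen2.5-7B" and "qwen2.5:7b" collapse to the same key.
function normalizeId(id: string): string {
  return id.toLowerCase().replace(/[:_\s]/g, "-");
}

// De-duplicate across sources, keeping the highest-priority entry per model.
function mergeCatalogs(...lists: CatalogEntry[][]): CatalogEntry[] {
  const byKey = new Map<string, CatalogEntry>();
  for (const list of lists) {
    for (const entry of list) {
      const key = normalizeId(entry.id);
      const existing = byKey.get(key);
      if (!existing || PRIORITY[entry.source] < PRIORITY[existing.source]) {
        byKey.set(key, entry);
      }
    }
  }
  return Array.from(byKey.values());
}

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

// A cached catalog is reused while it is younger than the TTL.
function isCacheFresh(fetchedAtMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - fetchedAtMs < CACHE_TTL_MS;
}

const merged = mergeCatalogs(
  [{ id: "Qwen2.5-7B", source: "seed" }],
  [{ id: "qwen2.5:7b", source: "ollama" }],
  [{ id: "qwen2.5-7b", source: "huggingface" }],
);
console.log(merged.length, merged[0].source); // 1 ollama
```

Keying on a normalized identifier keeps the merge linear in the number of entries, and letting local entries win means the memory math uses real on-disk sizes whenever they are available.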
Technologies Used

  • TypeScript
  • Node.js
  • Commander.js
  • systeminformation
  • Hugging Face Hub API
  • Ollama
  • llama.cpp / GGUF
  • CLI
model-fit | Project Portfolio - Audran Yematha