Three years ago the first local inference test lasted 180 seconds before thermal shutdown. Four hardware generations later, the same system runs Qwen 2.5-Coder 32B in production — a 32-billion-parameter model — on two used GPUs bought for €300 total.
This is not a budget choice. It is the result of an explicit engineering discipline: validate cheap, deploy expensive. You validate the thermal, acoustic, and mechanical structure on the cheapest hardware that reproduces the real workload; only once the system holds twelve continuous hours at full utilization do you move to the production configuration.
The path from K80 testbed to production was not linear. Between the two there were five complete hardware revisions, each resolving problems in one or more areas: thermal dynamics, mechanical vibration stress, driver constraints, cable routing, mounting structure, and internal network configuration.
The fourth generation, today in continuous operation, inherits from the K80 testbed everything that works — custom aluminum CAD design, 3D-printed components, cable routing constraints measured under real load, acoustic profile compatible with conversation at one meter — and adds the multi-Gbps internal network that the 32B model requires to stay responsive.
The physical signature is documented: 8 GPUs at 100% for 12 continuous hours, thermal range 35–52°C, zero drift across the full load window. For reference, a K80 in a standard server runs 75–90°C under the same load. No datacenter. No liquid cooling. Residential environment.
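The pass criterion described here (twelve hours, full utilization, bounded temperature, no drift) can be written as a check over a logged run. What follows is a minimal sketch, not the tooling actually used on this system. The `Sample` record, the `validate_burn_in` function, the 99% utilization floor, and the 1°C drift tolerance are all assumptions made for illustration. Drift is taken here as the difference between the mean temperature of the last hour and that of the first hour, on a log assumed to start after warm-up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    t_s: float       # seconds since the start of the run
    util_pct: float  # GPU utilization in percent
    temp_c: float    # GPU temperature in degrees Celsius


def validate_burn_in(samples, min_hours=12.0, min_util=99.0,
                     max_temp_c=52.0, max_drift_c=1.0, window_s=3600.0):
    """Check one GPU's burn-in log; return (passed, list_of_failure_reasons).

    All thresholds are illustrative defaults, not measured values,
    except max_temp_c, which mirrors the 52 C upper bound quoted in the text.
    """
    if not samples:
        return False, ["no samples"]

    reasons = []
    start, end = samples[0].t_s, samples[-1].t_s

    # 1. The run must cover the full load window.
    duration_h = (end - start) / 3600.0
    if duration_h < min_hours:
        reasons.append(f"run too short: {duration_h:.2f} h")

    # 2. Utilization must never fall below the floor.
    if any(s.util_pct < min_util for s in samples):
        reasons.append("utilization dropped below floor")

    # 3. Peak temperature must stay under the ceiling.
    peak = max(s.temp_c for s in samples)
    if peak > max_temp_c:
        reasons.append(f"peak temperature {peak:.1f} C")

    # 4. Drift: mean of the last window minus mean of the first window.
    first = [s.temp_c for s in samples if s.t_s - start <= window_s]
    last = [s.temp_c for s in samples if end - s.t_s <= window_s]
    drift = mean(last) - mean(first)
    if abs(drift) > max_drift_c:
        reasons.append(f"thermal drift {drift:+.1f} C")

    return not reasons, reasons
```

Run once per GPU, a check like this turns "holds twelve hours" into a yes/no answer with the reasons for any failure attached. The samples themselves could come from any periodic logger; the source of the log is left open here.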
What this means operationally: a 32B-parameter model runs locally for real coding assistance, with latencies usable for interactive sessions, no dependency on an external API, and code and project context that never leave the machine. The marginal cost of a query is electrical, not per-token.
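To make "electrical, not per-token" concrete, the arithmetic is simply energy times tariff. The sketch below uses placeholder figures (500 W average draw, 20 seconds per query, 0.30 EUR per kWh) that are assumptions chosen only to show the calculation; they are not measurements from this system, and `energy_cost_per_query` is a hypothetical helper name.

```python
def energy_cost_per_query(avg_power_w, seconds_per_query, price_per_kwh):
    """Marginal electricity cost of one query, in the currency of price_per_kwh."""
    # Energy in kWh = power in kW multiplied by time in hours.
    energy_kwh = (avg_power_w / 1000.0) * (seconds_per_query / 3600.0)
    return energy_kwh * price_per_kwh


# Placeholder inputs, NOT measured values from the system described here.
cost = energy_cost_per_query(avg_power_w=500.0,
                             seconds_per_query=20.0,
                             price_per_kwh=0.30)
print(f"{cost:.5f} EUR per query")
```

With these placeholder inputs the result is well under a tenth of a cent per query. The real figure depends on the measured power draw and the local tariff.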
The point is not "self-hosting as a hobby." The point is that inference sovereignty is an architectural choice: where the model runs determines where the data feeding it ends up. On a system validated under load for twelve hours, that choice becomes operationally sustainable, no longer just theoretical.
The system keeps evolving. The fifth generation is already on the drafting board.
Full hardware documentation is in the paper "Notes from building a local AI inference system" (PDF, attached to the "Two technical papers" article).