All posts

Run near-Opus AI locally for $40k: one dev's hardware guide

Manaal KhanJuly 4, 2026 at 5:47 AM6 min read
Run near-Opus AI locally for $40k: one dev's hardware guide

Key Takeaways

Run near-Opus AI locally for $40k: one dev's hardware guide
Source: Hacker News: Best
  • A $2,000 setup with dual RTX 3090s can run Qwen3.6-27B and local speech-to-text
  • The $40,000 build uses 4x RTX PRO 6000s for 384GB VRAM, approaching Claude Opus performance
  • Buying last-gen EPYC and DDR4 from eBay cuts base system costs to $5,600 while maximizing VRAM budget

James O'Beirne, a Bitcoin Core developer, has published a detailed guide on GitHub for running state-of-the-art large language models locally. The build costs range from $2,000 for a capable speech-to-text and Qwen setup to roughly $46,000 for a 384GB VRAM machine that he claims approaches Claude Opus performance. The guide surfaced on Hacker News this week, where it drew attention for its contrarian stance: buy last-gen server parts on eBay, spend the savings on GPU memory.

"If Dario and Altman are giving you heartburn (they should be), read on to figure out how to run this new kind of computing locally," O'Beirne writes in the README, referencing Anthropic's and OpenAI's CEOs. The guide is opinionated, explicit about trade-offs, and includes ready-to-run Docker configurations for models like GLM-5.2-594B.

Advertisement

What does the $2,000 entry-level setup include?

For users unwilling to spend five figures, O'Beirne recommends two RTX 3090 GPUs, totaling 48GB of VRAM. This hardware can run Qwen3.6-27B, which he describes as "an awesome model," plus whisper-large-v3 for local speech-to-text. The STT component requires only about 11GB of VRAM on an Nvidia GPU.

The speech-to-text use case is notable. O'Beirne writes that he finds local STT "surprisingly useful" and that he feels comfortable using it, unlike a hosted equivalent. Privacy is the implicit driver: local inference means no audio leaving your network.

How the $40,000 build maximizes VRAM over everything else

The flagship build uses four RTX PRO 6000 Blackwell Workstation cards at approximately $11,500 each, totaling 384GB of VRAM. O'Beirne bought his cards earlier, when prices were lower; current buyers will pay the $46,000 he quotes.

The base system is deliberately modest. He chose a last-generation AMD EPYC Milan 7313P processor, an ASRock Rack ROMED8-2T motherboard, and 128GB of DDR4 ECC RAM purchased on eBay. Total for the non-GPU components: $5,587. The logic is straightforward. PCIe5 and DDR5 systems are "terrifically expensive as of July 2026," while VRAM is where performance actually scales for large models.

To let the four GPUs communicate at wire speed during tensor parallelism, he added a c-payne PCIe Gen4 switch based on the Microchip Switchtec PM40100 chip. The switch sub-BOM, including host adapter and cables, runs about $1,330. Without it, all inter-GPU traffic would route through the PCI root complex, adding latency.

Performance numbers and model recommendations

O'Beirne reports Gen4 line-rate performance of 27.5 GB/s read and 50.4 GB/s write, with sub-microsecond latency between cards. For model serving, he currently recommends GLM-5.2-Int8Mix-NVFP4-REAP-594B on the 4x RTX6kPRO setup, claiming roughly 80 tokens per second at a 460,000-token context window. The repo includes a vLLM Docker Compose configuration.

He also floats an alternative strategy: instead of four RTX PRO 6000s, build a linked 4x DGX Spark cluster for 512GB total VRAM and use the slower, larger brain to orchestrate a faster Qwen3.7-27B for routine tasks. The guide does not price out this alternative in detail.

Advertisement

Configuration quirks that matter

Several "little-known secrets" surface in the guide. BIOS settings require PCIe bifurcation, correct link speed, and ASPM tuning. Kernel parameters need iommu=off or NCCL (Nvidia's multi-GPU communication library) will hang. ACS must be disabled to keep peer-to-peer traffic inside the switch fabric.

Power is a constraint. O'Beirne runs $46,000 worth of silicon on a standard 110V circuit by GPU power limiting. The guide does not specify the exact wattage cap, but the two Super Flower 1700W PSUs suggest headroom exists for higher draw if you have the electrical service.

Why build locally instead of using cloud APIs?

The guide does not include a breakeven analysis, but the economics favor heavy users. A team making thousands of API calls per day to Claude or GPT-4 will recover hardware costs faster than a solo developer running occasional queries. The privacy argument is simpler: local means no prompts, no completions, and no audio leave your premises.

For organizations subject to data residency requirements or handling sensitive intellectual property, local inference sidesteps vendor data-processing agreements entirely. Whether the performance delta to hosted Opus matters depends on your use case.

ℹ️

Logicity's Take

O'Beirne's guide is aimed at developers willing to get their hands dirty, literally. He built a custom wood enclosure for the PCI switch and GPUs. That is not a weekend project for a startup CTO. But the eBay-EPYC strategy is replicable: server hardware deprecates fast, and DDR4 ECC still works fine for feeding GPUs. For teams evaluating local inference, the real question is whether near-Opus performance justifies $40k plus ongoing power and maintenance versus managed alternatives. If you need automation around model serving, tools like [n8n](https://logicity.in/r/n8n) or [Make](https://logicity.in/r/make) can orchestrate local endpoints the same way they handle cloud APIs.

ℹ️

Disclosure

Some links in this post are affiliate links — Logicity earns a commission if you sign up, at no extra cost to you. We only link products we have used or actively recommend.

Frequently Asked Questions

Can I run Claude Opus locally?

No. Claude is closed-source and available only through Anthropic's API. O'Beirne's guide uses open-weight models like GLM-5.2-594B that he claims approach Opus performance.

What GPU should I buy for local LLMs on a budget?

O'Beirne recommends two RTX 3090s for 48GB total VRAM at around $2,000. This setup can run Qwen3.6-27B and whisper-large-v3 for speech-to-text.

Why use eBay parts for a server build?

Server components depreciate quickly when new generations launch. Last-gen EPYC processors and DDR4 ECC RAM cost a fraction of current-gen equivalents while still feeding modern GPUs adequately.

What is a PCIe switch and why does it matter for multi-GPU setups?

A PCIe switch lets GPUs communicate directly at wire speed instead of routing through the CPU's PCI root complex. This reduces latency during tensor parallelism, which large models require.

How much power does a 4x RTX PRO 6000 system draw?

The guide does not specify exact draw, but O'Beirne runs his rig on a 110V circuit using GPU power limiting. The two 1700W PSUs provide headroom for higher-amperage installations.

Also Read
Microsoft builds its own gas plant for 2GW Texas data center

Another look at the infrastructure costs behind running AI at scale

ℹ️

Need Help Implementing This?

Logicity can connect you with infrastructure consultants experienced in GPU server builds and local LLM deployment. Reach out via our contact page.

Source: Hacker News: Best

Advertisement
M

Manaal Khan

Tech & Innovation Writer

Produced with AI assistance and reviewed by the Logicity editorial team. Learn more in our Editorial Policy.

Related Articles