Run near-Opus AI locally for $40k: one dev's hardware guide

Manaal KhanJuly 4, 2026 at 5:47 AM6 min read

Key Takeaways

A $2,000 setup with dual RTX 3090s can run Qwen3.6-27B and local speech-to-text
The $40,000 build uses 4x RTX PRO 6000s for 384GB VRAM, approaching Claude Opus performance
Buying last-gen EPYC and DDR4 from eBay cuts base system costs to $5,600 while maximizing VRAM budget

James O'Beirne, a Bitcoin Core developer, has published a detailed guide on GitHub for running state-of-the-art large language models locally. The build costs range from $2,000 for a capable speech-to-text and Qwen setup to roughly $46,000 for a 384GB VRAM machine that he claims approaches Claude Opus performance. The guide surfaced on Hacker News this week, where it drew attention for its contrarian stance: buy last-gen server parts on eBay, spend the savings on GPU memory.

"If Dario and Altman are giving you heartburn (they should be), read on to figure out how to run this new kind of computing locally," O'Beirne writes in the README, referencing Anthropic's and OpenAI's CEOs. The guide is opinionated, explicit about trade-offs, and includes ready-to-run Docker configurations for models like GLM-5.2-594B.

What does the $2,000 entry-level setup include?

For users unwilling to spend five figures, O'Beirne recommends two RTX 3090 GPUs, totaling 48GB of VRAM. This hardware can run Qwen3.6-27B, which he describes as "an awesome model," plus whisper-large-v3 for local speech-to-text. The STT component requires only about 11GB of VRAM on an Nvidia GPU.

The speech-to-text use case is notable. O'Beirne writes that he finds local STT "surprisingly useful" and that he feels comfortable using it, unlike a hosted equivalent. Privacy is the implicit driver: local inference means no audio leaving your network.

How the $40,000 build maximizes VRAM over everything else

The flagship build uses four RTX PRO 6000 Blackwell Workstation cards at approximately $11,500 each, totaling 384GB of VRAM. O'Beirne bought his cards earlier, when prices were lower; current buyers will pay the $46,000 he quotes.

The base system is deliberately modest. He chose a last-generation AMD EPYC Milan 7313P processor, an ASRock Rack ROMED8-2T motherboard, and 128GB of DDR4 ECC RAM purchased on eBay. Total for the non-GPU components: $5,587. The logic is straightforward. PCIe5 and DDR5 systems are "terrifically expensive as of July 2026," while VRAM is where performance actually scales for large models.

To let the four GPUs communicate at wire speed during tensor parallelism, he added a c-payne PCIe Gen4 switch based on the Microchip Switchtec PM40100 chip. The switch sub-BOM, including host adapter and cables, runs about $1,330. Without it, all inter-GPU traffic would route through the PCI root complex, adding latency.

Performance numbers and model recommendations

O'Beirne reports Gen4 line-rate performance of 27.5 GB/s read and 50.4 GB/s write, with sub-microsecond latency between cards. For model serving, he currently recommends GLM-5.2-Int8Mix-NVFP4-REAP-594B on the 4x RTX6kPRO setup, claiming roughly 80 tokens per second at a 460,000-token context window. The repo includes a vLLM Docker Compose configuration.

He also floats an alternative strategy: instead of four RTX PRO 6000s, build a linked 4x DGX Spark cluster for 512GB total VRAM and use the slower, larger brain to orchestrate a faster Qwen3.7-27B for routine tasks. The guide does not price out this alternative in detail.

Configuration quirks that matter

Several "little-known secrets" surface in the guide. BIOS settings require PCIe bifurcation, correct link speed, and ASPM tuning. Kernel parameters need iommu=off or NCCL (Nvidia's multi-GPU communication library) will hang. ACS must be disabled to keep peer-to-peer traffic inside the switch fabric.

Power is a constraint. O'Beirne runs $46,000 worth of silicon on a standard 110V circuit by GPU power limiting. The guide does not specify the exact wattage cap, but the two Super Flower 1700W PSUs suggest headroom exists for higher draw if you have the electrical service.

Why build locally instead of using cloud APIs?

The guide does not include a breakeven analysis, but the economics favor heavy users. A team making thousands of API calls per day to Claude or GPT-4 will recover hardware costs faster than a solo developer running occasional queries. The privacy argument is simpler: local means no prompts, no completions, and no audio leave your premises.

For organizations subject to data residency requirements or handling sensitive intellectual property, local inference sidesteps vendor data-processing agreements entirely. Whether the performance delta to hosted Opus matters depends on your use case.

ℹ️

Logicity's Take

O'Beirne's guide is aimed at developers willing to get their hands dirty, literally. He built a custom wood enclosure for the PCI switch and GPUs. That is not a weekend project for a startup CTO. But the eBay-EPYC strategy is replicable: server hardware deprecates fast, and DDR4 ECC still works fine for feeding GPUs. For teams evaluating local inference, the real question is whether near-Opus performance justifies $40k plus ongoing power and maintenance versus managed alternatives. If you need automation around model serving, tools like [n8n](https://logicity.in/r/n8n) or [Make](https://logicity.in/r/make) can orchestrate local endpoints the same way they handle cloud APIs.

ℹ️

Disclosure

Some links in this post are affiliate links — Logicity earns a commission if you sign up, at no extra cost to you. We only link products we have used or actively recommend.

Frequently Asked Questions

Can I run Claude Opus locally?

No. Claude is closed-source and available only through Anthropic's API. O'Beirne's guide uses open-weight models like GLM-5.2-594B that he claims approach Opus performance.

What GPU should I buy for local LLMs on a budget?

O'Beirne recommends two RTX 3090s for 48GB total VRAM at around $2,000. This setup can run Qwen3.6-27B and whisper-large-v3 for speech-to-text.

Why use eBay parts for a server build?

Server components depreciate quickly when new generations launch. Last-gen EPYC processors and DDR4 ECC RAM cost a fraction of current-gen equivalents while still feeding modern GPUs adequately.

What is a PCIe switch and why does it matter for multi-GPU setups?

A PCIe switch lets GPUs communicate directly at wire speed instead of routing through the CPU's PCI root complex. This reduces latency during tensor parallelism, which large models require.

How much power does a 4x RTX PRO 6000 system draw?

The guide does not specify exact draw, but O'Beirne runs his rig on a 110V circuit using GPU power limiting. The two 1700W PSUs provide headroom for higher-amperage installations.

Need Help Implementing This?

Logicity can connect you with infrastructure consultants experienced in GPU server builds and local LLM deployment. Reach out via our contact page.

Source: Hacker News: Best

A new technology is set to revolutionize the way AI agents learn and adapt, enabling them to accumulate wisdom and apply it to new situations. This innovation has the potential to significantly boost the reliability of AI agents, especially in complex tasks. By converting raw agent trajectories into reusable guidelines, this tech is poised to transform the AI landscape.

9 Apr 2026

Trending Tech·10 min

The Dark Side of AI: How Bots Are Fueling a Monetized Abuse Ecosystem

A recent analysis of 2.8 million Telegram messages reveals a shocking truth: AI-powered bots are being used to create and sell non-consensual intimate images. These bots can turn ordinary photos into synthetic nude images, and the abuse is being monetized through affiliate programs and subscription-based archives. The researchers behind the study are calling for stricter regulations to combat this growing problem.

9 Apr 2026

Trending Tech·8 min

AI's Secret Sauce: How Journalism Became the Unlikely Ingredient

A recent study reveals that AI chatbots rely heavily on journalistic sources for their quotes, with one in four coming from news outlets. This shocking discovery has significant implications for the media industry and our understanding of AI's information gathering processes. As AI technology continues to evolve, it's essential to consider the role of journalism in shaping its responses.

9 Apr 2026