Lane · Personal · Self-hosted AI

Three ways to run AI. One wins per task.

The hardware in my office can run a 120-billion-parameter model. A $20/month subscription gives me Claude Opus in a browser. An API key lets me embed any model inside my own tools. They're not competing — they're three different answers to three different questions. Pick wrong and you overpay, leak data, or hit a capability wall. Two live demos of the API route are right below; the deeper discussion of when each approach wins is further down.

01 Live chat · API Claude Haiku 4.5 · streaming · password gate

Talk to Claude Haiku 4.5, live.

This chat goes through claude-proxy.php on this server, which holds the API key and counts tokens per visitor. A few exchanges are open to everyone. After that, you'll be asked for a password — the demo budget is shared and I'd rather not have it eaten by one curious visitor. The assistant is loaded with my professional background and a map of this site, so ask it anything — where I've worked, what each page shows, how the solar simulation was built. It'll answer directly.
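The gating logic is small enough to sketch. This is not the actual claude-proxy.php (it's JavaScript rather than PHP, with made-up names and a made-up budget), but the shape is the same: count tokens per visitor, and reopen the gate once a password is supplied.

```javascript
// Minimal sketch of a per-visitor token gate. Hypothetical names and
// budget; the real claude-proxy.php keeps this state server-side in PHP.
const FREE_TOKEN_BUDGET = 4000; // tokens before the password gate kicks in

const usage = new Map(); // visitorId -> tokens consumed so far

function recordUsage(visitorId, tokens) {
  usage.set(visitorId, (usage.get(visitorId) ?? 0) + tokens);
}

function gateStatus(visitorId, hasPassword) {
  const used = usage.get(visitorId) ?? 0;
  if (used < FREE_TOKEN_BUDGET || hasPassword) {
    return { allowed: true, remaining: Math.max(0, FREE_TOKEN_BUDGET - used) };
  }
  return { allowed: false, remaining: 0 };
}
```

The important design point survives the simplification: the counter and the password check live on the server, next to the key, where the browser can't tamper with them.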

Razvan's Portfolio Assistant
claude-haiku-4-5 · via api.anthropic.com
online
assistant claude-haiku-4-5

Hi — I'm Razvan's portfolio assistant, running on Claude Haiku 4.5. I'm tuned to answer any question about Razvan Gheorghies — his work, his projects, his background, what's on this site — honestly and directly, no hedging. If I don't know something, I'll say so. Just ask.

Enter to send · Shift+Enter for new line · Scope: Razvan & the portfolio site only
Conversations are logged for moderation and improvement.
02 Image generation Pollinations · 2 models · free · no key

Two models, same prompt.

Image generation doesn't need a proxy or a gate — Pollinations.ai hosts image models behind no-key URL endpoints. Type a prompt below and the same text goes to FLUX Schnell (Black Forest Labs, fast and stylistic) and Z-Image (Pollinations' native model, different aesthetic). Two outputs in parallel lets you see how much the underlying model matters for the same words. Why only two? Pollinations offers more models, but the others hit IP-level rate limits when fired in bursts — rather than show a grid where four slots fail every time, the demo sticks to what works reliably.
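Those no-key endpoints are just URLs. A sketch of how the demo builds them, assuming Pollinations' `image.pollinations.ai/prompt/` URL format with `model`, `width`, `height`, and `seed` query parameters; the `flux` and `zimage` slugs are my shorthand for the two models, not verified identifiers.

```javascript
// Build a no-key Pollinations image URL. Parameter names follow the
// Pollinations URL API as I understand it -- assumptions, not a spec.
function pollinationsUrl(prompt, model, { width = 768, height = 768, seed } = {}) {
  const base = "https://image.pollinations.ai/prompt/";
  const params = new URLSearchParams({ model, width, height });
  if (seed !== undefined) params.set("seed", seed);
  return base + encodeURIComponent(prompt) + "?" + params.toString();
}

// Same prompt, two models, two parallel <img> sources:
const prompt = "a lighthouse at dusk, oil painting";
const urls = ["flux", "zimage"].map((m) => pollinationsUrl(prompt, m));
```

No key, no proxy: each URL can go straight into an `<img src>`, and the browser fetches the generated image directly.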

Try:
03 Deeper Expand any section below to read more

Why three ways, and when each wins.

The demos above use the route that ships software: API AI. But there are two other routes: running models on your own hardware, and paying a subscription for frontier web access. Each wins different tasks. Click any of the four sections below to read how I think about the tradeoff and what's in my actual local stack.

The discipline Three axes · three answers

Every AI decision is a tradeoff triangle.

The three axes are privacy, cost, and capability. No single way of running AI wins on all three. Local models give absolute privacy and zero per-query cost, but the 120-billion-parameter ceiling of what runs at home still loses to Claude Opus on hard reasoning. Web subscriptions deliver frontier capability for a flat fee, but every keystroke goes to a third-party log. API keys give you frontier capability and the ability to build it into your own products, but you're billed per token and your data flows through the vendor's pipes.

The discipline isn't picking one; it's knowing which to use when. Regulated document review? Local. Quick reasoning lookups during work? Web subscription. A field tool that parses service reports and runs on a technician's laptop? API, because the output has to flow through code. The rest of this page is my actual stack, model by model; the live demos sit at the top of the page.
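For illustration, that routing rule reduces to a few lines. A toy sketch, not a real decision procedure; the property names are invented and real tasks have more nuance.

```javascript
// Toy encoding of the privacy/cost/capability triangle: given a task's
// constraints, pick a route. Rules mirror the examples in the text.
function pickRoute({ sensitiveData = false, needsCode = false, hardReasoning = false }) {
  if (sensitiveData) return "local"; // data must never hit a cloud API
  if (needsCode) return "api";       // output has to flow through code
  if (hardReasoning) return "web";   // flat-fee frontier chat wins
  return "local";                    // default: free and private
}
```

Note the ordering is the whole point: privacy constraints veto everything else, and integration needs veto raw capability.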

Local AI LM Studio · 9 models · 181 GB

The models on my own hardware.

LM Studio runs on a Ryzen 9 9950X3D / 96 GB DDR5 / RTX 5090 desktop, serving nine models across four architecture families, from 3B to 120B parameters. Everything below runs offline, for free, with zero data leaving the machine. The tradeoffs are speed (the 120B model is slower than Claude Haiku) and a capability ceiling (Opus-class reasoning is still out of reach at home). But for sensitive work, for experimentation, and for the category of "this must never hit a cloud API", local is the right answer.

Total models: 9
Disk used: ~181 GB
Largest: 120 B
Smallest: 3 B
Frontend: LM Studio
llama 8B · Q4_K_S
dolphin3.0-llama3.1-8b
dphn
4-bit K-quant 4.7 GB
Uncensored fine-tune of Llama 3.1. Fast, conversational, no safety filters — good for creative writing and prompt exploration.
llama 8B · Q4_K_M
dolphin-2.9-llama3-8b
dphn
4-bit medium K-quant 4.9 GB
Earlier Dolphin generation on Llama 3 base. Kept for A/B comparison against the 3.0 line — fine-tunes age differently than their bases.
phi3 3B · Q4_K_S
phi-3.5-mini-instruct-uncensored
bartowski
4-bit K-quant 2.2 GB
Microsoft's small Phi-3.5 with bartowski's uncensoring. Punches above its weight on reasoning — the smallest model that still gives usable answers.
phi3 3B · Q4_K_S
phi-3-mini-128k-instruct-imatrix-smashed
PrunaAI
iMatrix · 128k ctx 2.2 GB
Phi-3 with a 128 k token context window, PrunaAI's iMatrix compression. For long-document QA — reads a whole service manual in one pass.
qwen2vl 7B · Q4_K_M
qwen/qwen2.5-vl-7b
lmstudio-community
4-bit medium K-quant 6.0 GB
Alibaba's Qwen 2.5 vision-language 7B. Strong multilingual (including Romanian and Danish), good at code and reading images with text.
gpt-oss 20B · MXFP4
openai/gpt-oss-20b
lmstudio-community
MXFP4 quantisation 12.1 GB
OpenAI's open-weight 20B release. MXFP4 is a modern 4-bit floating-point quant — better quality retention than integer quants at the same size.
gpt-oss 120B · MXFP4
openai/gpt-oss-120b
lmstudio-community
MXFP4 quantisation 63.4 GB
The largest local model that fits in 96 GB RAM. GPT-OSS at 120 B is the high-water mark of what's runnable at home — slow but capable, closest to frontier output quality from a local weight.
qwen2vl 72B · Q5_K_M
qwen2.5-vl-72b-instruct-abliterated-deep
nvcto
5-bit medium K-quant 54.9 GB
Qwen 2.5 72B, abliterated (safety-refusal removed) and 5-bit quantised. Near-frontier capability for research and hard-problem work where refusals would block the workflow.
gemma 31B · Q4_K_M
gemma-4-31b
google · ollama
4-bit medium K-quant ~20 GB
Google's Gemma 4 at 31B parameters, planned for addition to the local stack. Google's open-weights family punches above its weight on reasoning with a permissive license — slotting into the daily-driver role between the smaller Phi models and the top-end 72B/120B.
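A quick sanity check on the disk figures above: a quantised model's file size is roughly parameters times effective bits-per-weight, divided by eight. A sketch with approximate bpw values (my estimates; K-quants keep some tensors at higher precision, so expect ~5-15% error against real files):

```javascript
// Rule of thumb for local model file sizes:
//   params * effective bits-per-weight / 8
// The bpw table is approximate, not an official figure.
const BPW = { Q4_K_S: 4.6, Q4_K_M: 4.85, Q5_K_M: 5.7, MXFP4: 4.25 };

function estimateGB(params, quant) {
  return (params * BPW[quant]) / 8 / 1e9; // decimal gigabytes
}

// gpt-oss-120b at MXFP4: ~63.8 GB estimated vs 63.4 GB listed above
const gptOss120 = estimateGB(120e9, "MXFP4");
```

The same arithmetic explains the 96 GB RAM constraint: the 120B file has to fit in memory with room left for context, which is why it's the ceiling of this stack.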
Web AI Subscription frontier

Paying for a chat box, not for code.

This is ChatGPT, Claude.ai, Gemini web, Grok on x.com. You pay a flat monthly fee and get the vendor's best model in a browser. No API key, no code, no integration. What you get is the best reasoning available anywhere for $20 a month — frontier models behind a text box. What you don't get is the ability to build anything on top of it.

Web — flat-fee frontier reasoning
~$20 / month · unlimited-ish
Privacy
Vendor sees everything
Cost
$20 flat · no per-query
Capability
Opus · GPT-5 · Gemini 2.5 Pro

Use it for

  • Daily thinking partner, research, writing
  • Debugging hard problems where capability > privacy
  • One-off tasks that don't justify API setup
  • Exploring what a frontier model can actually do

Don't use it for

  • Regulated data · PII · client IP · health records
  • Anything that needs to run inside your own software
  • Automation — there's nothing to call programmatically
  • Workflows where repeatable output matters (models update silently, outputs drift)
API AI Keys · metered · programmable

When the model has to live inside your code.

An API key lets you call a frontier model from your own software. Tools like my DALUM commissioning report generator, the document extractor on this site, the field-tech chat assistant — none of those work with a web subscription, because they need to run inside code I wrote. You pay per token instead of per month, which is cheaper if you use it rarely and more expensive if you hammer it. The vendor still sees the data, but only the specific tokens your code sends — no browsing history, no account, just a keyed request.

The chat demo in §01 at the top of this page runs through this route. Your questions go to api.anthropic.com via a small PHP proxy on this server. The proxy holds the key so it isn't visible in the page, rate-limits per visitor, and prompts for a password after a few exchanges so the demo budget doesn't evaporate on anyone who wants to chat for an hour.
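Client-side, the chat demo boils down to building an Anthropic Messages API payload and POSTing it at the proxy. A sketch; the field names (`model`, `max_tokens`, `stream`, `messages`) follow Anthropic's Messages API, claude-proxy.php is the real endpoint named above, and everything else is illustrative:

```javascript
// Build the request body the page sends to the proxy. The proxy adds the
// x-api-key header and forwards to api.anthropic.com -- the key never
// reaches the browser.
function buildChatRequest(history, userText) {
  return {
    model: "claude-haiku-4-5",
    max_tokens: 1024,
    stream: true, // proxy relays server-sent events for token-by-token output
    messages: [...history, { role: "user", content: userText }],
  };
}

// In the page, roughly:
//   fetch("/claude-proxy.php", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(buildChatRequest(history, text)),
//   });
```

Because the full message history rides along in every request, the proxy stays stateless about the conversation itself and only has to track token counts.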

API — per-token, programmable
~$3 / million tokens · Haiku 4.5
Privacy
Vendor sees requests only
Cost
Pennies-to-dollars / task
Capability
Full model · composable

Use it for

  • Tools and products — embed the model in software
  • Batch jobs, pipelines, scheduled automation
  • Any task where output must flow through code
  • Apps where end-users don't have their own accounts

Don't use it for

  • Daily personal chat — web subscription is cheaper
  • Regulated data that must not leave your infrastructure
  • Prototyping a single prompt — the web UI is faster
  • Billing surprises — always set monthly budget caps
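On budget caps: per-token billing makes cost a one-line estimate. A sketch using the ~$3 per million tokens figure quoted above as a blended rate (an assumption for illustration; real pricing splits input and output rates):

```javascript
// Back-of-envelope API cost at a blended per-million-token rate.
// The $3/M default is the figure used on this page, not official pricing.
function estimateCostUSD(tokens, usdPerMillion = 3) {
  return (tokens / 1e6) * usdPerMillion;
}

// A 2,000-token chat exchange costs well under a cent:
const perChat = estimateCostUSD(2000); // ~$0.006
```

This is why the demo budget survives casual visitors but still needs a gate: fractions of a cent per exchange add up fast under automation or an hour-long conversation.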