Lane · Personal · Self-hosted AI

Three ways to run AI. One wins per task.

The hardware in my office can run a 120-billion-parameter model. A $20/month subscription gives me Claude Opus in a browser. An API key lets me embed any model inside my own tools. They're not competing — they're three different answers to three different questions. Pick wrong and you overpay, leak data, or hit a capability wall. Two live demos of the API route are right below; the deeper discussion of when each approach wins is further down.

01 Live chat · API Claude Haiku 4.5 · streaming · password gate

Talk to Claude Haiku 4.5, live.

This chat goes through claude-proxy.php on this server, which holds the API key and counts tokens per visitor. A few exchanges are open to everyone. After that, you'll be asked for a password — the demo budget is shared and I'd rather not have it eaten by one curious visitor. The assistant is loaded with my professional background and a map of this site, so ask it anything — where I've worked, what each page shows, how the solar simulation was built. It'll answer directly.
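The gating logic is small enough to sketch. This is not the actual claude-proxy.php (it's JavaScript rather than PHP, with made-up names and a made-up budget), but the shape is the same: count tokens per visitor, and reopen the gate once a password is supplied.

```javascript
// Minimal sketch of a per-visitor token gate. Hypothetical names and
// budget; the real claude-proxy.php keeps this state server-side in PHP.
const FREE_TOKEN_BUDGET = 4000; // tokens before the password gate kicks in

const usage = new Map(); // visitorId -> tokens consumed so far

function recordUsage(visitorId, tokens) {
  usage.set(visitorId, (usage.get(visitorId) ?? 0) + tokens);
}

function gateStatus(visitorId, hasPassword) {
  const used = usage.get(visitorId) ?? 0;
  if (used < FREE_TOKEN_BUDGET || hasPassword) {
    return { allowed: true, remaining: Math.max(0, FREE_TOKEN_BUDGET - used) };
  }
  return { allowed: false, remaining: 0 };
}
```

The important design point survives the simplification: the counter and the password check live on the server, next to the key, where the browser can't tamper with them.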

Razvan's Portfolio Assistant
claude-haiku-4-5 · via api.anthropic.com
online
assistant claude-haiku-4-5

Hi — I'm Razvan's portfolio assistant, running on Claude Haiku 4.5. I'm tuned to answer any question about Razvan Gheorghies — his work, his projects, his background, what's on this site — honestly and directly, no hedging. If I don't know something, I'll say so. Just ask.

Enter to send · Shift+Enter for new line · Scope: Razvan & the portfolio site only
Conversations are logged for moderation and improvement.
02 Image generation Pollinations · 2 models · free · no key

Two models, same prompt.

Image generation doesn't need a proxy or a gate — Pollinations.ai hosts image models behind no-key URL endpoints. Type a prompt below and the same text goes to FLUX Schnell (Black Forest Labs, fast and stylistic) and Z-Image (Pollinations' native model, different aesthetic). Two outputs in parallel lets you see how much the underlying model matters for the same words. Why only two? Pollinations offers more models, but the others hit IP-level rate limits when fired in bursts — rather than show a grid where four slots fail every time, the demo sticks to what works reliably.
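Those no-key endpoints are just URLs. A sketch of how the demo builds them, assuming Pollinations' `image.pollinations.ai/prompt/` URL format with `model`, `width`, `height`, and `seed` query parameters; the `flux` and `zimage` slugs are my shorthand for the two models, not verified identifiers.

```javascript
// Build a no-key Pollinations image URL. Parameter names follow the
// Pollinations URL API as I understand it -- assumptions, not a spec.
function pollinationsUrl(prompt, model, { width = 768, height = 768, seed } = {}) {
  const base = "https://image.pollinations.ai/prompt/";
  const params = new URLSearchParams({ model, width, height });
  if (seed !== undefined) params.set("seed", seed);
  return base + encodeURIComponent(prompt) + "?" + params.toString();
}

// Same prompt, two models, two parallel <img> sources:
const prompt = "a lighthouse at dusk, oil painting";
const urls = ["flux", "zimage"].map((m) => pollinationsUrl(prompt, m));
```

No key, no proxy: each URL can go straight into an `<img src>`, and the browser fetches the generated image directly.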

Try:
03 Deeper Expand any section below to read more

Why three ways, and when each wins.

The demos above use the route that ships software: API AI. But there are two other routes: running models on your own hardware, and paying a subscription for frontier web access. Each wins different tasks. Click any of the four sections below to read how I think about the tradeoff and what's in my actual local stack.

The discipline Three axes · three answers

Every AI decision is a tradeoff triangle.

The three axes are privacy, cost, and capability. No single way of running AI wins on all three. Local models give absolute privacy and zero per-query cost, but the 120-billion-parameter ceiling of what runs at home still loses to Claude Opus on hard reasoning. Web subscriptions deliver frontier capability for a flat fee, but every keystroke goes to a third-party log. API keys give you frontier capability and the ability to build it into your own products, but you're billed per token and your data flows through the vendor's pipes.

The discipline isn't picking one; it's knowing which to use when. Regulated document review? Local. Quick reasoning lookups during work? Web subscription. A field tool that parses service reports and runs on a technician's laptop? API, because the output has to flow through code. The rest of this page is my actual stack, model by model; the live demos sit at the top of the page.
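For illustration, that routing rule reduces to a few lines. A toy sketch, not a real decision procedure; the property names are invented and real tasks have more nuance.

```javascript
// Toy encoding of the privacy/cost/capability triangle: given a task's
// constraints, pick a route. Rules mirror the examples in the text.
function pickRoute({ sensitiveData = false, needsCode = false, hardReasoning = false }) {
  if (sensitiveData) return "local"; // data must never hit a cloud API
  if (needsCode) return "api";       // output has to flow through code
  if (hardReasoning) return "web";   // flat-fee frontier chat wins
  return "local";                    // default: free and private
}
```

Note the ordering is the whole point: privacy constraints veto everything else, and integration needs veto raw capability.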

Local AI LM Studio · 9 models · 181 GB

The models on my own hardware.

LM Studio runs on a Ryzen 9 9950X3D / 96 GB DDR5 / RTX 5090 desktop, serving nine models across four architecture families, from 3B to 120B parameters. Everything below runs offline, for free, with zero data leaving the machine. The tradeoffs are speed (the 120B model is slower than Claude Haiku) and a capability ceiling (Opus-class reasoning is still out of reach at home). But for sensitive work, for experimentation, and for the category of "this must never hit a cloud API", local is the right answer.

Total models: 9
Disk used: ~181 GB
Largest: 120 B
Smallest: 3 B
Frontend: LM Studio
llama 8B · Q4_K_S
dolphin3.0-llama3.1-8b
dphn
4-bit K-quant 4.7 GB
Uncensored fine-tune of Llama 3.1. Fast, conversational, no safety filters — good for creative writing and prompt exploration.
llama 8B · Q4_K_M
dolphin-2.9-llama3-8b
dphn
4-bit medium K-quant 4.9 GB
Earlier Dolphin generation on Llama 3 base. Kept for A/B comparison against the 3.0 line — fine-tunes age differently than their bases.
phi3 3B · Q4_K_S
phi-3.5-mini-instruct-uncensored
bartowski
4-bit K-quant 2.2 GB
Microsoft's small Phi-3.5 with bartowski's uncensoring. Punches above its weight on reasoning — the smallest model that still gives usable answers.
phi3 3B · Q4_K_S
phi-3-mini-128k-instruct-imatrix-smashed
PrunaAI
iMatrix · 128k ctx 2.2 GB
Phi-3 with a 128 k token context window, PrunaAI's iMatrix compression. For long-document QA — reads a whole service manual in one pass.
qwen2vl 7B · Q4_K_M
qwen/qwen2.5-vl-7b
lmstudio-community
4-bit medium K-quant 6.0 GB
Alibaba's Qwen 2.5 vision-language 7B. Strong multilingual (including Romanian and Danish), good at code and reading images with text.
gpt-oss 20B · MXFP4
openai/gpt-oss-20b
lmstudio-community
MXFP4 quantisation 12.1 GB
OpenAI's open-weight 20B release. MXFP4 is a modern 4-bit floating-point quant — better quality retention than integer quants at the same size.
gpt-oss 120B · MXFP4
openai/gpt-oss-120b
lmstudio-community
MXFP4 quantisation 63.4 GB
The largest local model that fits in 96 GB RAM. GPT-OSS at 120 B is the high-water mark of what's runnable at home — slow but capable, closest to frontier output quality from a local weight.
qwen2vl 72B · Q5_K_M
qwen2.5-vl-72b-instruct-abliterated-deep
nvcto
5-bit medium K-quant 54.9 GB
Qwen 2.5 72B, abliterated (safety-refusal removed) and 5-bit quantised. Near-frontier capability for research and hard-problem work where refusals would block the workflow.
gemma 31B · Q4_K_M
gemma-4-31b
google · ollama
4-bit medium K-quant ~20 GB
Google's Gemma 4 at 31B parameters, planned for addition to the local stack. Google's open-weights family punches above its weight on reasoning with a permissive license — slotting into the daily-driver role between the smaller Phi models and the top-end 72B/120B.
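A quick sanity check on the disk figures above: a quantised model's file size is roughly parameters times effective bits-per-weight, divided by eight. A sketch with approximate bpw values (my estimates; K-quants keep some tensors at higher precision, so expect ~5-15% error against real files):

```javascript
// Rule of thumb for local model file sizes:
//   params * effective bits-per-weight / 8
// The bpw table is approximate, not an official figure.
const BPW = { Q4_K_S: 4.6, Q4_K_M: 4.85, Q5_K_M: 5.7, MXFP4: 4.25 };

function estimateGB(params, quant) {
  return (params * BPW[quant]) / 8 / 1e9; // decimal gigabytes
}

// gpt-oss-120b at MXFP4: ~63.8 GB estimated vs 63.4 GB listed above
const gptOss120 = estimateGB(120e9, "MXFP4");
```

The same arithmetic explains the 96 GB RAM constraint: the 120B file has to fit in memory with room left for context, which is why it's the ceiling of this stack.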
Web AI Subscription frontier

Paying for a chat box, not for code.

This is ChatGPT, Claude.ai, Gemini web, Grok on x.com. You pay a flat monthly fee and get the vendor's best model in a browser. No API key, no code, no integration. What you get is the best reasoning available anywhere for $20 a month — frontier models behind a text box. What you don't get is the ability to build anything on top of it.

Web — flat-fee frontier reasoning
~$20 / month · unlimited-ish
Privacy
Vendor sees everything
Cost
$20 flat · no per-query
Capability
Opus · GPT-5 · Gemini 2.5 Pro

Use it for

  • Daily thinking partner, research, writing
  • Debugging hard problems where capability > privacy
  • One-off tasks that don't justify API setup
  • Exploring what a frontier model can actually do

Don't use it for

  • Regulated data · PII · client IP · health records
  • Anything that needs to run inside your own software
  • Automation — there's nothing to call programmatically
  • Workflows where repeatable output matters (models update silently, outputs drift)
API AI Keys · metered · programmable

When the model has to live inside your code.

An API key lets you call a frontier model from your own software. Tools like my DALUM commissioning report generator, the document extractor on this site, the field-tech chat assistant — none of those work with a web subscription, because they need to run inside code I wrote. You pay per token instead of per month, which is cheaper if you use it rarely and more expensive if you hammer it. The vendor still sees the data, but only the specific tokens your code sends — no browsing history, no account, just a keyed request.

The chat demo in §01 at the top of this page runs through this route. Your questions go to api.anthropic.com via a small PHP proxy on this server. The proxy holds the key so it isn't visible in the page, rate-limits per visitor, and prompts for a password after a few exchanges so the demo budget doesn't evaporate on anyone who wants to chat for an hour.
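Client-side, the chat demo boils down to building an Anthropic Messages API payload and POSTing it at the proxy. A sketch; the field names (`model`, `max_tokens`, `stream`, `messages`) follow Anthropic's Messages API, claude-proxy.php is the real endpoint named above, and everything else is illustrative:

```javascript
// Build the request body the page sends to the proxy. The proxy adds the
// x-api-key header and forwards to api.anthropic.com -- the key never
// reaches the browser.
function buildChatRequest(history, userText) {
  return {
    model: "claude-haiku-4-5",
    max_tokens: 1024,
    stream: true, // proxy relays server-sent events for token-by-token output
    messages: [...history, { role: "user", content: userText }],
  };
}

// In the page, roughly:
//   fetch("/claude-proxy.php", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(buildChatRequest(history, text)),
//   });
```

Because the full message history rides along in every request, the proxy stays stateless about the conversation itself and only has to track token counts.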

API — per-token, programmable
~$3 / million tokens · Haiku 4.5
Privacy
Vendor sees requests only
Cost
Pennies-to-dollars / task
Capability
Full model · composable

Use it for

  • Tools and products — embed the model in software
  • Batch jobs, pipelines, scheduled automation
  • Any task where output must flow through code
  • Apps where end-users don't have their own accounts

Don't use it for

  • Daily personal chat — web subscription is cheaper
  • Regulated data that must not leave your infrastructure
  • Prototyping a single prompt — the web UI is faster
  • Billing surprises — always set monthly budget caps
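On budget caps: per-token billing makes cost a one-line estimate. A sketch using the ~$3 per million tokens figure quoted above as a blended rate (an assumption for illustration; real pricing splits input and output rates):

```javascript
// Back-of-envelope API cost at a blended per-million-token rate.
// The $3/M default is the figure used on this page, not official pricing.
function estimateCostUSD(tokens, usdPerMillion = 3) {
  return (tokens / 1e6) * usdPerMillion;
}

// A 2,000-token chat exchange costs well under a cent:
const perChat = estimateCostUSD(2000); // ~$0.006
```

This is why the demo budget survives casual visitors but still needs a gate: fractions of a cent per exchange add up fast under automation or an hour-long conversation.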