Eight steps. Every one is a creative decision I make, not a button I press. Here is what the AI actually does at each stage, and what it doesn't.
01
Lyrics, by hand
analog
Pen and paper. No prompts, no autocomplete, no screen. Writing by hand slows the mind down enough to hear what the song actually wants to say. This is the only step where AI is intentionally absent — the feeling has to exist before it can be translated.
02
The song
text-to-music
I don't hum the melody into a microphone. I describe the emotional shape of the song using a producer's vocabulary: tempo, key, time signature, instrumentation, vocal character, reverb type, dynamics, the point where the drop hits. Dozens of seeds per track until one lands emotionally.
Technical: structured prompt engineering for text-to-music generative models. Cloud: Suno, Udio. Local via ComfyUI on RTX 5090: MusicGen (Meta), Stable Audio (Stability AI), YuE, ACE-Step, Riffusion. Each prompt specifies BPM, key signature, genre tags, instrumentation stack, effects chain (plate, hall, spring reverb), and song structure markers.
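As a sketch, that spec can be held as data before it's flattened into a prompt string. The MusicPrompt class and its field names below are illustrative, not any model's actual API; each tool has its own prompt conventions.

```python
# Illustrative only: a structured text-to-music prompt held as data.
# MusicPrompt and its fields are hypothetical; Suno, Udio, and each
# local model use their own prompt conventions.
from dataclasses import dataclass

@dataclass
class MusicPrompt:
    bpm: int
    key: str
    time_signature: str
    genre_tags: list[str]
    instrumentation: list[str]
    vocal_character: str
    effects_chain: list[str]     # e.g. plate, hall, spring reverb
    structure: list[str]         # song structure markers

    def render(self) -> str:
        """Flatten the spec into one prompt string."""
        return ", ".join([
            f"{self.bpm} BPM", self.key, self.time_signature,
            *self.genre_tags, *self.instrumentation,
            f"vocals: {self.vocal_character}", *self.effects_chain,
            "structure: " + " > ".join(self.structure),
        ])

seed_prompt = MusicPrompt(
    bpm=92, key="F minor", time_signature="4/4",
    genre_tags=["dark pop", "cinematic"],
    instrumentation=["analog synth pads", "sub bass", "live drums"],
    vocal_character="breathy alto",
    effects_chain=["plate reverb on vocals", "hall reverb on pads"],
    structure=["intro", "verse", "pre-chorus", "drop", "bridge", "outro"],
)
print(seed_prompt.render())
```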
03
Scene planning
storyboard
I break the song into scenes — typically 4–7 seconds each, matching lyric changes, beat drops, or emotional beats. Each scene gets a visual concept: subject, location, camera framing, lighting, mood. This is where the film begins to exist, long before any image is generated.
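One way to make that concrete: a small record per scene, timed against the track. The Scene class and its fields are an illustration, not any tool's format.

```python
# Illustrative scene record; the fields mirror the checklist above
# (subject, location, framing, lighting, mood) plus timing.
from dataclasses import dataclass

@dataclass
class Scene:
    start_s: float      # seconds into the track
    duration_s: float   # typically 4-7 s, one generation's ceiling
    trigger: str        # lyric change, beat drop, or emotional beat
    subject: str
    location: str
    framing: str        # camera framing, e.g. "low-angle medium shot"
    lighting: str
    mood: str

storyboard = [
    Scene(0.0, 6.0, "intro", "empty street", "city at dawn",
          "slow push-in, wide shot", "cold blue pre-dawn", "anticipation"),
    Scene(6.0, 5.0, "first lyric", "the singer", "rooftop",
          "medium close-up", "warm rim light", "intimacy"),
]
```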
04
Character training
LoRA
If a character appears in multiple scenes, they have to look like the same person every time. Generative models drift. The fix is training a small custom adapter on 20–60 reference images, teaching the base AI "this is who they are." Training runs locally and takes anywhere from 30 to 120 minutes per character.
Technical: LoRA (Low-Rank Adaptation) fine-tuning on Stable Diffusion base models. Also DreamBooth and Textual Inversion for specific cases. Trained in ComfyUI on RTX 5090 (32GB VRAM), 96GB DDR5, AMD Ryzen 9 9950X3D. Output is a small adapter file that modifies the base model's weights to consistently render a specific subject.
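For the mechanics, a minimal PyTorch sketch of the LoRA idea (not the actual ComfyUI training code): the base weights stay frozen while two small low-rank matrices learn the subject.

```python
# Minimal sketch of the LoRA mechanism, not a training pipeline.
# The base layer is frozen; only the low-rank matrices A and B train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the base model
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output plus the learned low-rank update; B starts at zero,
        # so training begins from the unmodified base model
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because B starts at zero, training departs gradually from the untouched base model; the saved A/B pairs for every adapted layer are what the small adapter file contains.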
05
Scene imagery
text-to-image
Each scene is generated as a still image first — composition, lighting, mood, camera framing, all controlled through layered prompts and reference conditioning. The LoRA ensures character consistency. Typically 10–30 regenerations per scene to get one that holds up.
Technical: text-to-image generative models. Base models: SDXL, Flux.1 (dev / schnell / pro), SD 3.5, HiDream, Kolors, Pixart. Composition control: ControlNet (OpenPose, Canny, Depth, Lineart). Style and identity reference: IPAdapter, InstantID, PuLID. All orchestrated in ComfyUI workflows.
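The same stack can be sketched in script form with Hugging Face diffusers: SDXL base, the step-04 character LoRA on top, and seeded regenerations. The base model path is real; the LoRA filename, prompt, and seed range are placeholders.

```python
# Sketch via Hugging Face diffusers; the actual workflows run in ComfyUI.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("character_lora.safetensors")  # step 04's adapter

# 10-30 seeded regenerations per scene; keep the one that holds up
for seed in range(20):
    image = pipe(
        prompt="rooftop at dawn, medium close-up, warm rim light",
        negative_prompt="blurry, deformed hands",
        num_inference_steps=30,
        guidance_scale=7.0,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    image.save(f"scene_02_seed{seed:02d}.png")
```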
06
Scene motion
image-to-video
Static images don't breathe. To bring them to life, I describe the motion: slow camera pushes, subject actions, light shifts across frames, particle effects. The AI fills in the frames between stillness and motion. Every clip is 4–7 seconds; that's the ceiling of a single generation.
Technical: image-to-video generative models. Local via ComfyUI: Hunyuan Video (Tencent), Wan 2.1 / Wan 2.2 (Alibaba), LTX Video (Lightricks), CogVideoX, AnimateDiff, Stable Video Diffusion. Cloud for select shots: Runway Gen-3 / Gen-4, Luma Ray2, Kling, Pika, MiniMax Hailuo. Motion prompts describe camera moves (dolly, pan, tilt, zoom), subject action, and lighting dynamics. Each clip renders at 24fps.
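A motion prompt, held as data, looks roughly like this. The MotionPrompt record is illustrative, not any model's API, though the frame math at 24fps is exact.

```python
# Illustrative motion-prompt record; each model above takes some
# variant of these fields in its own format.
from dataclasses import dataclass

@dataclass
class MotionPrompt:
    camera: str          # dolly, pan, tilt, zoom
    subject_action: str
    lighting: str
    duration_s: float    # 4-7 s, the single-generation ceiling
    fps: int = 24

    @property
    def frame_count(self) -> int:
        return round(self.duration_s * self.fps)

shot = MotionPrompt(
    camera="slow dolly-in",
    subject_action="singer turns toward the window",
    lighting="sunlight creeping across the frame",
    duration_s=5.0,
)
print(shot.frame_count)   # 120 frames at 24fps
```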
07
The edit
NLE
Every clip comes together on a timeline. Cut to the music, color-graded for mood, synced to lyrics, titles placed, transitions finessed. This is where the film becomes a film.
Technical: Wondershare Filmora — multi-track NLE with waveform-based audio sync, LUT color grading, keyframe animation for titles and transforms, motion tracking.
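Cutting to the music turns into arithmetic once the tempo is fixed in step 02. A minimal sketch, with an illustrative helper:

```python
# Illustrative: with tempo and time signature known from step 02,
# bar boundaries (natural cut points) land at predictable timestamps.
def bar_timestamps(bpm: float, beats_per_bar: int = 4,
                   bars: int = 16, offset_s: float = 0.0) -> list[float]:
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    return [offset_s + i * seconds_per_bar for i in range(bars + 1)]

# At 92 BPM in 4/4 a bar lasts ~2.6 s, so a 4-7 s clip spans 2-3 bars.
print([round(t, 2) for t in bar_timestamps(92)[:4]])  # [0.0, 2.61, 5.22, 7.83]
```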
08
The final pass
interpolation + upscaling
The final pass does two things most viewers don't notice but always feel. First, AI invents new frames between the original ones, turning 24fps into 60fps for motion that feels real. Second, it upscales from HD to 4K, doubling each dimension and filling in plausible detail the original generation didn't capture.
Technical: Topaz Video AI. Frame interpolation models: Apollo, Chronos (optical flow + neural synthesis). Upscaling models: Proteus, Gaia, Artemis, Iris. All trained on millions of high-resolution footage pairs.
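The arithmetic makes the scale of that invention visible. The sketch below only counts frames and pixels; the models do the rest.

```python
# The arithmetic behind the final pass; Topaz's models do the synthesis,
# this only quantifies how much new information they invent.
def interpolation_load(src_fps: int, dst_fps: int, duration_s: float):
    src_frames = src_fps * duration_s
    dst_frames = dst_fps * duration_s
    return src_frames, dst_frames, dst_frames - src_frames

src, dst, invented = interpolation_load(24, 60, duration_s=180)  # 3-min film
print(f"{invented:.0f} of {dst:.0f} output frames are synthesized")  # 6480 of 10800

# HD -> 4K doubles each dimension, so every output frame is 4x the pixels
hd_px, uhd_px = 1920 * 1080, 3840 * 2160
print(uhd_px / hd_px)  # 4.0
```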