Lane · Personal · Audio & Visual

A feeling, translated — not generated.

People think you press a button and AI delivers a music video. That is not what this is. The internet is full of 5–10 second clips for a reason — that's the current technical ceiling of a single generation pass. A 4–5 minute music video is a 30+ step assembly of hundreds of short generations, each reviewed, often rejected, often regenerated.

10s
Max single-pass AI clip length today
30–80h
Typical work per finished music video
<0.5%
Of AI users can deliver full pieces end-to-end
02 · What it actually takes

Two years. A thousand hours. One workstation.

Before the first clip renders, someone had to spend a long time learning which model to reach for, which prompts actually work, which parameters break which workflow, and which hardware can even load the files. The tools improve monthly — the craft doesn't auto-update. You do.

1,000h+
Over 2 years of focused practice to master the stack
From first failed LoRA to finishing a full music video in a single day — the learning curve is measured in hundreds of rejected generations, not hours of tutorials.

And it doesn't run on a laptop.

A standard laptop — even a "gaming laptop" — cannot execute most of this pipeline locally. Video generation models require 24GB+ of GPU memory just to load. Character training needs enterprise-grade compute. Here's what the workstation actually looks like.

GPU · the engine
NVIDIA RTX 5090
32 GB GDDR7 · Blackwell · 21,760 CUDA cores
Runs every model — image generation, LoRA training, video synthesis. Below 24GB VRAM, most current video models refuse to load.
CPU · the conductor
AMD Ryzen 9 9950X3D
16 cores · 32 threads · Zen 5 · 3D V-Cache
Orchestrates the pipeline — preprocessing, data loading, workflow execution — while the GPU does the heavy math.
Memory
96 GB DDR5
6000 MHz+ · dual-channel
Holds model weights and intermediate frames. Video workflows routinely use 60GB+ during a single render.
Storage
NVMe · multi-TB
7,000–12,000 MB/s read · PCIe Gen4/5
Single video model checkpoints are 10–50GB. Fast storage turns minute-long loads into seconds.
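The hardware numbers above reduce to simple arithmetic. As a rough sketch (the function names and figures here are illustrative, not measurements): weights alone take parameter count times bytes per parameter, and load time is checkpoint size over read speed.

```python
def checkpoint_footprint_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate size of model weights alone, on disk or in VRAM.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for 8-bit quantized.
    Activations and intermediate frames add well beyond this.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9

def load_time_s(size_gb: float, read_mb_s: float) -> float:
    """Best-case sequential load time for a checkpoint of size_gb."""
    return size_gb * 1000 / read_mb_s

# A ~13B-parameter video model in fp16 needs ~26 GB for weights alone,
# which is why cards under 24 GB fall short without offloading or quantization.
print(round(checkpoint_footprint_gb(13), 1))   # → 26.0
# The same 26 GB checkpoint: ~52 s from a 500 MB/s SATA SSD,
# ~3 s from a 10,000 MB/s Gen5 NVMe drive.
print(round(load_time_s(26, 500)))             # → 52
print(round(load_time_s(26, 10000)))           # → 3
```

That last pair of numbers is the whole storage argument: switching clips and models dozens of times per session, the difference between 52 seconds and 3 seconds compounds into hours.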
03 · Process — lyrics to final cut

From handwritten line to 4K frame.

Eight steps. Every one is a creative decision I make, not a button I press. Here is what the AI actually does at each stage, and what it doesn't.

01
Lyrics, by hand
analog
Pen and paper. No prompts, no autocomplete, no screen. Writing by hand slows the mind down enough to hear what the song actually wants to say. This is the only step where AI is intentionally absent — the feeling has to exist before it can be translated.
02
The song
text-to-music
I don't hum the melody into a microphone. I describe the emotional shape of the song using a producer's vocabulary — tempo, key, time signature, instrumentation, vocal character, reverb type, dynamics, the point where the drop lands. Dozens of seeds per track until one feels right.
Technical: structured prompt engineering for text-to-music generative models. Cloud: Suno, Udio. Local via ComfyUI on RTX 5090: MusicGen (Meta), Stable Audio (Stability AI), YuE, ACE-Step, Riffusion. Each prompt specifies BPM, key signature, genre tags, instrumentation stack, effects chain (plate, hall, spring reverb), and song structure markers.
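A structured music prompt is really an assembly of named fields. This sketch shows the shape of that assembly; the field names, ordering, and phrasing are my illustration — each model (Suno, MusicGen, ACE-Step, and so on) prefers its own wording.

```python
def music_prompt(bpm, key, genre, instruments, vocals, reverb, structure):
    """Assemble a text-to-music prompt from producer vocabulary.

    Illustrative only: real prompts are tuned per model, and the
    same fields get rephrased many times across seed attempts.
    """
    parts = [
        f"{genre}, {bpm} BPM, {key}",
        "instrumentation: " + ", ".join(instruments),
        f"vocals: {vocals}",
        f"reverb: {reverb}",
        "structure: " + " - ".join(structure),
    ]
    return "; ".join(parts)

prompt = music_prompt(
    bpm=92, key="F# minor", genre="dark synthwave",
    instruments=["analog bass", "gated drums", "pad stack"],
    vocals="breathy female lead", reverb="plate",
    structure=["intro", "verse", "pre-chorus", "drop", "outro"],
)
print(prompt)
```

Keeping the fields separate, rather than writing one free-form sentence, is what makes a seed sweep controlled: you can change exactly one variable between generations and know what caused the difference.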
03
Scene planning
storyboard
I break the song into scenes — typically 4–7 seconds each, matching lyric changes, beat drops, or emotional beats. Each scene gets a visual concept: subject, location, camera framing, lighting, mood. This is where the film begins to exist, long before any image is generated.
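The scene-splitting step has a mechanical core worth making explicit: cuts land on bar lines so every transition hits a downbeat. A minimal sketch, assuming a fixed tempo (in practice lyric changes and emotional beats override the grid):

```python
def scene_boundaries(duration_s: float, bpm: int, beats_per_bar: int = 4,
                     target_s: float = 6.0):
    """Cut a track into scenes of roughly target_s seconds, snapped to
    bar lines so every cut lands on a downbeat. Illustrative sketch."""
    bar_s = beats_per_bar * 60 / bpm            # length of one bar in seconds
    bars_per_scene = max(1, round(target_s / bar_s))
    step = bars_per_scene * bar_s               # actual scene length
    cuts, t = [], 0.0
    while t + step <= duration_s:
        t += step
        cuts.append(round(t, 2))
    return cuts

# 120 BPM in 4/4: one bar is 2 s, so scenes snap to 6 s (3 bars each).
print(scene_boundaries(30, 120))  # → [6.0, 12.0, 18.0, 24.0, 30.0]
```

The 4–7 second target is not aesthetic preference alone — it matches the single-pass ceiling of the video models in step 06, so planning and generation stay aligned.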
04
Character training
LoRA
If a character appears in multiple scenes, they have to look like the same person every time. Generative models drift. The fix is training a small custom adapter on 20–60 reference images, teaching the base AI "this is who they are." Training runs locally and takes anywhere from 30 to 120 minutes per character.
Technical: LoRA — Low-Rank Adaptation — fine-tuning on Stable Diffusion base models. Also DreamBooth and Textual Inversion for specific cases. Trained in ComfyUI on RTX 5090 (32GB VRAM), 96GB DDR5, AMD Ryzen 9950X3D. Output is a small adapter file that modifies the base model's weights to consistently render a specific subject.
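Why is the LoRA adapter file so small? Because low-rank adaptation trains two thin matrices per layer instead of the full weight. The arithmetic, as a sketch (layer size and rank are example values, not the exact models used here):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted layer: W' = W + B @ A,
    where A is (rank x d_in) and B is (d_out x rank). Only A and B are
    trained; the frozen base weight W has d_out * d_in parameters."""
    return rank * d_in + d_out * rank

full = 4096 * 4096                     # one frozen attention projection
adapter = lora_params(4096, 4096, rank=16)
print(full, adapter, f"{adapter / full:.1%}")  # → 16777216 131072 0.8%
```

Training under 1% of each layer's parameters is what makes 30–120 minute runs on one GPU feasible, and why the output is a portable adapter file rather than a multi-gigabyte checkpoint.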
05
Scene imagery
text-to-image
Each scene is generated as a still image first — composition, lighting, mood, camera framing, all controlled through layered prompts and reference conditioning. The LoRA ensures character consistency. Typically 10–30 regenerations per scene to get one that holds up.
Technical: text-to-image generative models. Base models: SDXL, Flux.1 (dev / schnell / pro), SD 3.5, HiDream, Kolors, Pixart. Composition control: ControlNet (OpenPose, Canny, Depth, Lineart). Style and identity reference: IPAdapter, InstantID, PuLID. All orchestrated in ComfyUI workflows.
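The "10–30 regenerations per scene" is a seed sweep: same prompt, many seeds, one survivor. The loop shape, sketched with a stub generator (in the real workflow the scoring function is a human eye, and `generate` is a ComfyUI workflow, not a lambda):

```python
import random

def best_of_seeds(generate, score, n_seeds: int = 20):
    """Run the same prompt under many seeds, keep the strongest candidate.
    Illustrative sketch of the select-and-reject loop, not a real pipeline."""
    best, best_score = None, float("-inf")
    for seed in range(n_seeds):
        image = generate(seed)       # one full generation per seed
        s = score(image)             # in practice: look at it and judge
        if s > best_score:
            best, best_score = image, s
    return best, best_score

# Stub: pretend each seed yields an image file with some quality score.
result, quality = best_of_seeds(
    generate=lambda seed: f"scene_04_seed{seed}.png",
    score=lambda img: random.Random(img).random(),
)
print(result)
```

Fixing the seed is also what makes iteration possible at all: once a composition works, the same seed plus a small prompt change gives a controlled variation instead of a fresh roll of the dice.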
06
Motion
image-to-video
Static images don't breathe. To bring them to life, I describe the motion — slow camera pushes, subject actions, light shifts across frames, particle effects. The AI fills in the frames between stillness and motion. Every clip is 4–7 seconds — that's the ceiling of a single generation.
Technical: image-to-video generative models. Local via ComfyUI: Hunyuan Video (Tencent), Wan 2.1 / Wan 2.2 (Alibaba), LTX Video (Lightricks), CogVideoX, AnimateDiff, Stable Video Diffusion. Cloud for select shots: Runway Gen-3 / Gen-4, Luma Ray2, Kling, Pika, MiniMax Hailuo. Motion prompts describe camera moves (dolly, pan, tilt, zoom), subject action, and lighting dynamics. Each clip renders at 24fps.
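The "hundreds of short generations" claim from the opening is just division. A back-of-envelope sketch with illustrative numbers:

```python
import math

def clips_needed(song_s: float, clip_s: float = 5.0) -> int:
    """Minimum clips to cover a song when each generation caps at
    clip_s seconds — before counting rejects and regenerations."""
    return math.ceil(song_s / clip_s)

minimum = clips_needed(4 * 60 + 30)   # a 4:30 track, 5 s clips
frames = int(5.0 * 24)                # frames per clip at 24fps
print(minimum, frames)                # → 54 120
```

Fifty-four clips is the floor with zero rejects. At the 10–30 regenerations per scene quoted above, the real count lands between roughly 500 and 1,600 generations per finished video.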
07
Edit
Filmora
Every clip comes together on a timeline. Cut to the music, color-graded for mood, synced to lyrics, titles placed, transitions finessed. This is where the film becomes a film.
Technical: Wondershare Filmora — multi-track NLE with waveform-based audio sync, LUT color grading, keyframe animation for titles and transforms, motion tracking.
08
Finish
Topaz Video AI
The final pass does two things most viewers don't notice but always feel. First, AI synthesizes new frames between each pair of original frames — turning 24fps into 60fps for motion that feels real. Second, it doubles the resolution from HD to 4K, filling in plausible detail the original generation didn't capture.
Technical: Topaz Video AI. Frame interpolation models: Apollo, Chronos (optical flow + neural synthesis). Upscaling models: Proteus, Gaia, Artemis, Iris. All trained on millions of high-resolution footage pairs.
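Going from 24fps to 60fps is not an integer ratio, so most output frames fall between two source frames. The timing math, sketched below, is shared by any interpolator; the real models (Apollo, Chronos) warp pixels along optical flow rather than cross-fading, but they place frames on the same grid:

```python
def retime_weights(src_fps: int, dst_fps: int, n_out: int):
    """For each output frame: which two source frames to blend, and the
    blend weight toward the later one. Illustrative timing sketch only."""
    out = []
    for i in range(n_out):
        t = i * src_fps / dst_fps    # position on the source timeline
        a = int(t)                   # earlier source frame index
        w = t - a                    # 0.0 = frame a, 1.0 = frame a + 1
        out.append((a, a + 1, round(w, 2)))
    return out

# 24 → 60 fps: only every fifth output frame aligns with a source frame;
# the rest must be synthesized at fractional positions like 0.4 and 0.8.
print(retime_weights(24, 60, 5))
# → [(0, 1, 0.0), (0, 1, 0.4), (0, 1, 0.8), (1, 2, 0.2), (1, 2, 0.6)]
```

Those fractional positions are why naive blending looks smeared and why flow-based synthesis is needed: the new frame has to show where objects actually were at t = 0.4, not a ghost of two moments at once.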
04 · Media

See it. Hear it.

Featured video below. Audio playlist auto-discovers every track in the library — click any bar to scrub.

Featured work Pisicuța — Eroare în Sistem
Audio · library