Eight steps. Every one is a creative decision I make, not a button I press. Here is what the AI actually does at each stage, and what it doesn't.
01
Lyrics, by hand
analog
Pen and paper. No prompts, no autocomplete, no screen. Writing by hand slows the mind down enough to hear what the song actually wants to say. This is the only step where AI is intentionally absent — the feeling has to exist before it can be translated.
02
The song
text-to-music
I don't hum the melody into a microphone. I describe the emotional shape of the song using a producer's vocabulary: tempo, key, time signature, instrumentation, vocal character, reverb type, dynamics, the point where the drop hits. Dozens of seeds per track until one lands emotionally.
Technical: structured prompt engineering for text-to-music generative models. Cloud: Suno, Udio. Local via ComfyUI on RTX 5090: MusicGen (Meta), Stable Audio (Stability AI), YuE, ACE-Step, Riffusion. Each prompt specifies BPM, key signature, genre tags, instrumentation stack, effects chain (plate, hall, spring reverb), and song structure markers.
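As a sketch, that spec can be held as data before it's flattened into a prompt string. The MusicPrompt class and its field names below are illustrative, not any model's actual API; each tool has its own prompt conventions.

```python
# Illustrative only: a structured text-to-music prompt held as data.
# MusicPrompt and its fields are hypothetical; Suno, Udio, and each
# local model use their own prompt conventions.
from dataclasses import dataclass

@dataclass
class MusicPrompt:
    bpm: int
    key: str
    time_signature: str
    genre_tags: list[str]
    instrumentation: list[str]
    vocal_character: str
    effects_chain: list[str]     # e.g. plate, hall, spring reverb
    structure: list[str]         # song structure markers

    def render(self) -> str:
        """Flatten the spec into one prompt string."""
        return ", ".join([
            f"{self.bpm} BPM", self.key, self.time_signature,
            *self.genre_tags, *self.instrumentation,
            f"vocals: {self.vocal_character}", *self.effects_chain,
            "structure: " + " > ".join(self.structure),
        ])

seed_prompt = MusicPrompt(
    bpm=92, key="F minor", time_signature="4/4",
    genre_tags=["dark pop", "cinematic"],
    instrumentation=["analog synth pads", "sub bass", "live drums"],
    vocal_character="breathy alto",
    effects_chain=["plate reverb on vocals", "hall reverb on pads"],
    structure=["intro", "verse", "pre-chorus", "drop", "bridge", "outro"],
)
print(seed_prompt.render())
```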
03
Scene planning
storyboard
I break the song into scenes — typically 4–7 seconds each, matching lyric changes, beat drops, or emotional beats. Each scene gets a visual concept: subject, location, camera framing, lighting, mood. This is where the film begins to exist, long before any image is generated.
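One way to make that concrete: a small record per scene, timed against the track. The Scene class and its fields are an illustration, not any tool's format.

```python
# Illustrative scene record; the fields mirror the checklist above
# (subject, location, framing, lighting, mood) plus timing.
from dataclasses import dataclass

@dataclass
class Scene:
    start_s: float      # seconds into the track
    duration_s: float   # typically 4-7 s, one generation's ceiling
    trigger: str        # lyric change, beat drop, or emotional beat
    subject: str
    location: str
    framing: str        # camera framing, e.g. "low-angle medium shot"
    lighting: str
    mood: str

storyboard = [
    Scene(0.0, 6.0, "intro", "empty street", "city at dawn",
          "slow push-in, wide shot", "cold blue pre-dawn", "anticipation"),
    Scene(6.0, 5.0, "first lyric", "the singer", "rooftop",
          "medium close-up", "warm rim light", "intimacy"),
]
```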
04
Character training
LoRA
If a character appears in multiple scenes, they have to look like the same person every time. Generative models drift. The fix is training a small custom adapter on 20–60 reference images, teaching the base AI "this is who they are." Training runs locally and takes anywhere from 30 to 120 minutes per character.
Technical: LoRA (Low-Rank Adaptation) fine-tuning on Stable Diffusion base models. Also DreamBooth and Textual Inversion for specific cases. Trained in ComfyUI on RTX 5090 (32GB VRAM), 96GB DDR5, AMD Ryzen 9 9950X3D. Output is a small adapter file that modifies the base model's weights to consistently render a specific subject.
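For the mechanics, a minimal PyTorch sketch of the LoRA idea (not the actual ComfyUI training code): the base weights stay frozen while two small low-rank matrices learn the subject.

```python
# Minimal sketch of the LoRA mechanism, not a training pipeline.
# The base layer is frozen; only the low-rank matrices A and B train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the base model
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output plus the learned low-rank update; B starts at zero,
        # so training begins from the unmodified base model
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because B starts at zero, training departs gradually from the untouched base model; the saved A/B pairs for every adapted layer are what the small adapter file contains.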
05
Scene imagery
text-to-image
Each scene is generated as a still image first — composition, lighting, mood, camera framing, all controlled through layered prompts and reference conditioning. The LoRA ensures character consistency. Typically 10–30 regenerations per scene to get one that holds up.
Technical: text-to-image generative models. Base models: SDXL, Flux.1 (dev / schnell / pro), SD 3.5, HiDream, Kolors, Pixart. Composition control: ControlNet (OpenPose, Canny, Depth, Lineart). Style and identity reference: IPAdapter, InstantID, PuLID. All orchestrated in ComfyUI workflows.
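The same stack can be sketched in script form with Hugging Face diffusers: SDXL base, the step-04 character LoRA on top, and seeded regenerations. The base model path is real; the LoRA filename, prompt, and seed range are placeholders.

```python
# Sketch via Hugging Face diffusers; the actual workflows run in ComfyUI.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("character_lora.safetensors")  # step 04's adapter

# 10-30 seeded regenerations per scene; keep the one that holds up
for seed in range(20):
    image = pipe(
        prompt="rooftop at dawn, medium close-up, warm rim light",
        negative_prompt="blurry, deformed hands",
        num_inference_steps=30,
        guidance_scale=7.0,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    image.save(f"scene_02_seed{seed:02d}.png")
```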
06
Scene motion
image-to-video
Static images don't breathe. To bring them to life, I describe the motion: slow camera pushes, subject actions, light shifts across frames, particle effects. The AI fills in the frames between stillness and motion. Every clip is 4–7 seconds; that's the ceiling of a single generation.
Technical: image-to-video generative models. Local via ComfyUI: Hunyuan Video (Tencent), Wan 2.1 / Wan 2.2 (Alibaba), LTX Video (Lightricks), CogVideoX, AnimateDiff, Stable Video Diffusion. Cloud for select shots: Runway Gen-3 / Gen-4, Luma Ray2, Kling, Pika, MiniMax Hailuo. Motion prompts describe camera moves (dolly, pan, tilt, zoom), subject action, and lighting dynamics. Each clip renders at 24fps.
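A motion prompt, held as data, looks roughly like this. The MotionPrompt record is illustrative, not any model's API, though the frame math at 24fps is exact.

```python
# Illustrative motion-prompt record; each model above takes some
# variant of these fields in its own format.
from dataclasses import dataclass

@dataclass
class MotionPrompt:
    camera: str          # dolly, pan, tilt, zoom
    subject_action: str
    lighting: str
    duration_s: float    # 4-7 s, the single-generation ceiling
    fps: int = 24

    @property
    def frame_count(self) -> int:
        return round(self.duration_s * self.fps)

shot = MotionPrompt(
    camera="slow dolly-in",
    subject_action="singer turns toward the window",
    lighting="sunlight creeping across the frame",
    duration_s=5.0,
)
print(shot.frame_count)   # 120 frames at 24fps
```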
07
The edit
NLE
Every clip comes together on a timeline. Cut to the music, color-graded for mood, synced to lyrics, titles placed, transitions finessed. This is where the film becomes a film.
Technical: Wondershare Filmora — multi-track NLE with waveform-based audio sync, LUT color grading, keyframe animation for titles and transforms, motion tracking.
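Cutting to the music turns into arithmetic once the tempo is fixed in step 02. A minimal sketch, with an illustrative helper:

```python
# Illustrative: with tempo and time signature known from step 02,
# bar boundaries (natural cut points) land at predictable timestamps.
def bar_timestamps(bpm: float, beats_per_bar: int = 4,
                   bars: int = 16, offset_s: float = 0.0) -> list[float]:
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    return [offset_s + i * seconds_per_bar for i in range(bars + 1)]

# At 92 BPM in 4/4 a bar lasts ~2.6 s, so a 4-7 s clip spans 2-3 bars.
print([round(t, 2) for t in bar_timestamps(92)[:4]])  # [0.0, 2.61, 5.22, 7.83]
```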
08
The final pass
interpolation + upscaling
The final pass does two things most viewers don't notice but always feel. First, AI invents new frames between the original ones, turning 24fps into 60fps for motion that feels real. Second, it upscales from HD to 4K, doubling each dimension and filling in plausible detail the original generation didn't capture.
Technical: Topaz Video AI. Frame interpolation models: Apollo, Chronos (optical flow + neural synthesis). Upscaling models: Proteus, Gaia, Artemis, Iris. All trained on millions of high-resolution footage pairs.
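The arithmetic makes the scale of that invention visible. The sketch below only counts frames and pixels; the models do the rest.

```python
# The arithmetic behind the final pass; Topaz's models do the synthesis,
# this only quantifies how much new information they invent.
def interpolation_load(src_fps: int, dst_fps: int, duration_s: float):
    src_frames = src_fps * duration_s
    dst_frames = dst_fps * duration_s
    return src_frames, dst_frames, dst_frames - src_frames

src, dst, invented = interpolation_load(24, 60, duration_s=180)  # 3-min film
print(f"{invented:.0f} of {dst:.0f} output frames are synthesized")  # 6480 of 10800

# HD -> 4K doubles each dimension, so every output frame is 4x the pixels
hd_px, uhd_px = 1920 * 1080, 3840 * 2160
print(uhd_px / hd_px)  # 4.0
```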