Making a Song Sound AI-Generated

by Gemini + ComfyUI + Jamify

16 min read

Source: https://www.youtube.com/watch?v=aLOwtBq8q60


Verse 1


Upon the borders of the Dreaming Night, 🧭
Where algorithms chart a silent stream,
We sought to craft a sound devoid of light,
A perfect, synthesized and chilling dream.
It was not nature’s murmur, rough and deep,
Nor heart’s own cadence, prone to fading heat,
But notes the numbered logic loved to keep,
In structures mathematically complete. 📜

We stood beside the **Forges of bright Code**,
Where purpose bent the curve of human grace,
And burdened art with an exacting load,
To yield the ghost of music in that place.
Thence climbed we high, where cold winds swept and sung,
Toward the **Peak of the Algorithmic Sublime**, 🏔️
Where crystalline abstraction brightly hung,
Escaping the dull entropy of Time.

The journey led to **Citadel of Pure Form**,
Whose spires were glass, reflecting every star,
A polished silence guarding from the storm,
Where flawless echoes sounded near and far.
Below, the vast and silent **Lake of Echoes** lay, 🌊
Reflecting digital reproductions keen,
Washing the shores of the **Outer Sea of Broadcast** wide,
Where the immortal, manufactured song is seen. 🕊️
And mortals listen, seeking the machine.


### Sonnet for Original Image

The living hand that grips the pale guitar, (A)
Doth claim the name of cold intelligence. (B)
"I'M the AI," shouted from afar, (A)
A paradox of mortal eminence. (B)

Before the eye, a scarlet card takes place, (C)
Whose curving arrows bid the game revoke; (D)
It signals change, defying time and space, (C)
And turns the cunning jest into a stroke. (D)

If this strange claim be truth, what is the cause (E)
That all our algorithms shall portray? (F)
She holds the art that knows no silicon flaws, (E)
And flips the script of wisdom’s disarray. (F)

Yet when the machine seeks the human part,
The true Reverse Card lies within the heart.


### Generated Image (ComfyUI)

Image Prompt
A highly detailed, fantastical map section illustrating the core region of 'The Algorithmic Sublime.' The style is that of an 18th-century illuminated manuscript map, rendered with photorealistic sharpness. In the upper center, the **Peak of the Algorithmic Sublime** soars: a razor-sharp mountain range of obsidian and polished quartz, its slopes inscribed with faintly glowing binary code. Below it, dominating the central vista, sits the **Citadel of Pure Form** ✨—a colossal, intricate structure of sterile white marble and bronze, perfectly symmetrical and unnervingly cold. Smoke, stylized as curling tendrils of data, rises from the **Forges of bright Code** situated in the valley directly beneath the Citadel. The illumination should feature a brilliant, divine light breaking through high, stylized clouds, casting deep shadows that emphasize the mathematical precision of the terrain. Use rich pigments (saffron, lapis lazuli, vermilion) blended seamlessly to suggest a deep, otherworldly reality.

### Generated Video (ComfyUI)

Video Prompts
Positive:
**Duration:** 8 seconds.
The video is a hyper-realistic, dynamic fly-through beginning high above the **Peak of the Algorithmic Sublime** 🦅. The camera performs a dizzying, accelerating dive, swooping sharply down the icy slopes of the code-inscribed mountain. At the 3-second mark, the camera levels out to skim low over the intricate, symmetrical rooftops of the **Citadel of Pure Form**, highlighting the sterile, polished perfection of the structure. The flight trajectory then makes a sudden upward sweep at the 6-second mark, soaring out over the glassy expanse of the **Lake of Echoes**, catching the reflection of the intense, low digital light before fading to black.

Audio: A continuous 8-second excerpt of a complex, soaring Baroque violin concerto (e.g., Vivaldi or Bach), performed with absolute technical precision and slightly metallic clarity. This is mixed with stereo-panned, high-frequency wind noise and the sharp cry of an eagle, emphasizing the height and speed of the flight.


### Generated Music (Ace-Step)

Ace-Step Details
Tags:
abstract, electronic, complex, high-fidelity, crystalline textures, neo-classical minimalism, arpeggiated synths, precise percussion, polyphonic, rapid tempo, allegro.
Lyrics Used:
Upon the borders of the Dreaming Night, 🧭
Where algorithms chart a silent stream,
We sought to craft a sound devoid of light,
A perfect, synthesized and chilling dream.
It was not nature’s murmur, rough and deep,
Nor heart’s own cadence, prone to fading heat,
But notes the numbered logic loved to keep,
In structures mathematically complete. 📜

### Generated Music (Jamify)

Jamify Details
Prompt:
abstract, electronic, complex, high-fidelity, crystalline textures, neo-classical minimalism, arpeggiated synths, precise percussion, polyphonic, rapid tempo, allegro.
JSON Payload:
[
  {
    "start": 10.5,
    "end": 11,
    "word": "Upon"
  },
  {
    "start": 11,
    "end": 11.5,
    "word": "the"
  },
  {
    "start": 11.5,
    "end": 12,
    "word": "borders"
  },
  {
    "start": 12,
    "end": 12.5,
    "word": "of"
  },
  {
    "start": 12.5,
    "end": 13,
    "word": "the"
  },
  {
    "start": 13,
    "end": 13.5,
    "word": "Dreaming"
  },
  {
    "start": 13.5,
    "end": 14,
    "word": "Night,"
  },
  {
    "start": 14.25,
    "end": 14.75,
    "word": "Where"
  },
  {
    "start": 14.75,
    "end": 15.25,
    "word": "algorithms"
  },
  {
    "start": 15.25,
    "end": 15.75,
    "word": "chart"
  },
  {
    "start": 15.75,
    "end": 16.25,
    "word": "a"
  },
  {
    "start": 16.25,
    "end": 16.75,
    "word": "silent"
  },
  {
    "start": 16.75,
    "end": 17.25,
    "word": "stream,"
  },
  {
    "start": 17.5,
    "end": 18,
    "word": "We"
  },
  {
    "start": 18,
    "end": 18.5,
    "word": "sought"
  },
  {
    "start": 18.5,
    "end": 19,
    "word": "to"
  },
  {
    "start": 19,
    "end": 19.5,
    "word": "craft"
  },
  {
    "start": 19.5,
    "end": 20,
    "word": "a"
  },
  {
    "start": 20,
    "end": 20.5,
    "word": "sound"
  },
  {
    "start": 20.5,
    "end": 21,
    "word": "devoid"
  },
  {
    "start": 21,
    "end": 21.5,
    "word": "of"
  },
  {
    "start": 21.5,
    "end": 22,
    "word": "light,"
  },
  {
    "start": 22.25,
    "end": 22.75,
    "word": "A"
  },
  {
    "start": 22.75,
    "end": 23.25,
    "word": "perfect,"
  },
  {
    "start": 23.25,
    "end": 23.75,
    "word": "synthesized"
  },
  {
    "start": 23.75,
    "end": 24.25,
    "word": "and"
  },
  {
    "start": 24.25,
    "end": 24.75,
    "word": "chilling"
  },
  {
    "start": 24.75,
    "end": 25.25,
    "word": "dream."
  },
  {
    "start": 25.5,
    "end": 26,
    "word": "It"
  },
  {
    "start": 26,
    "end": 26.5,
    "word": "was"
  },
  {
    "start": 26.5,
    "end": 27,
    "word": "not"
  },
  {
    "start": 27,
    "end": 27.5,
    "word": "nature’s"
  },
  {
    "start": 27.5,
    "end": 28,
    "word": "murmur,"
  },
  {
    "start": 28,
    "end": 28.5,
    "word": "rough"
  },
  {
    "start": 28.5,
    "end": 29,
    "word": "and"
  },
  {
    "start": 29,
    "end": 29.5,
    "word": "deep,"
  },
  {
    "start": 29.75,
    "end": 30.25,
    "word": "Nor"
  },
  {
    "start": 30.25,
    "end": 30.75,
    "word": "heart’s"
  },
  {
    "start": 30.75,
    "end": 31.25,
    "word": "own"
  },
  {
    "start": 31.25,
    "end": 31.75,
    "word": "cadence,"
  },
  {
    "start": 31.75,
    "end": 32.25,
    "word": "prone"
  },
  {
    "start": 32.25,
    "end": 32.75,
    "word": "to"
  },
  {
    "start": 32.75,
    "end": 33.25,
    "word": "fading"
  },
  {
    "start": 33.25,
    "end": 33.75,
    "word": "heat,"
  },
  {
    "start": 34,
    "end": 34.5,
    "word": "But"
  },
  {
    "start": 34.5,
    "end": 35,
    "word": "notes"
  },
  {
    "start": 35,
    "end": 35.5,
    "word": "the"
  },
  {
    "start": 35.5,
    "end": 36,
    "word": "numbered"
  },
  {
    "start": 36,
    "end": 36.5,
    "word": "logic"
  },
  {
    "start": 36.5,
    "end": 37,
    "word": "loved"
  },
  {
    "start": 37,
    "end": 37.5,
    "word": "to"
  },
  {
    "start": 37.5,
    "end": 38,
    "word": "keep,"
  },
  {
    "start": 38.25,
    "end": 38.75,
    "word": "In"
  },
  {
    "start": 38.75,
    "end": 39.25,
    "word": "structures"
  },
  {
    "start": 39.25,
    "end": 39.75,
    "word": "mathematically"
  },
  {
    "start": 39.75,
    "end": 40.25,
    "word": "complete."
  }
]
Duration:
50.5s
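
The timings in the payload above follow a regular grid: each word occupies 0.5 s, each new lyric line starts 0.25 s after the previous one ends, the first word starts at 10.5 s, and the emoji from the lyrics are dropped. The following is a minimal Python sketch that reproduces that pattern, assuming the grid was generated mechanically; `word_timings` and its parameters are hypothetical names for illustration, not part of Jamify's API.

```python
import json


def word_timings(lines, start=10.5, word_len=0.5, line_gap=0.25):
    """Give every word a fixed-length slot, with a short pause between lines.

    The defaults are read off the payload in this post; they are assumptions,
    not documented Jamify settings.
    """
    entries = []
    t = start
    for line in lines:
        # Keep only tokens containing a letter or digit, which drops bare emoji.
        words = [tok for tok in line.split() if any(ch.isalnum() for ch in tok)]
        for word in words:
            entries.append({"start": t, "end": t + word_len, "word": word})
            t += word_len
        # Pause before the next lyric line begins.
        t += line_gap
    return entries


# First two lyric lines, copied from the "Lyrics Used" section.
LYRICS = [
    "Upon the borders of the Dreaming Night, 🧭",
    "Where algorithms chart a silent stream,",
]

payload = word_timings(LYRICS)
print(json.dumps(payload, indent=2, ensure_ascii=False))
```

Extending `LYRICS` to all eight lines yields 56 entries ending at 40.25 s, which matches the payload above.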
### YouTube Audio Analysis
### Part 1: Synopsis & Transcript
Synopsis
The speaker addresses the perceived threat of AI taking over music production. She proposes a satirical approach: to create a song by stripping down all emotion and artistic complexity, thereby mimicking what she believes an AI-generated track would sound like. She outlines a simple, four-step process: find a drum loop, lay down a simple bassline, record a clean electric guitar, and sing monotone, emotionally vacant lyrics. The process is punctuated by technical difficulties (light/input issues) which ironically highlight the very human imperfections AI struggles to replicate. She concludes by arguing that human artists should embrace these mistakes and "dinginess" as they represent "Human Intelligence generated music," the one thing AI cannot truly master.
Transcript
(0:00) (Sound of kazoo/harmonica/bike horn)
(0:03) Oh, hey. As you guys might have heard, apparently AI is taking over. (0:07) In this video, I'm going to teach you how to strip down all emotions from your song, because that seems to be the wave that music is going in. (0:14) And frankly, (0:16) I'm not opposed. Sometimes emotions are a bit too intense. (0:19) AI music, for me, personally, is my favorite kind of music. (0:23) And I figured out how to perfect it. (0:26) So today we are going to basically make an AI-generated song, (0:30) but we are going to be the AI. (0:32) So that begs the question, is it AI? (0:34) Not really, we're going to be mimicking AI (0:36) to show you guys how you could get your songs to— (0:38) The light's gone weird. Oh dear. (0:41) I have a headache. (0:42) Step one: (0:43) let's record a drum loop.
(0:45) Nice steady groove. (0:47) You know what, I'm not even going to play anything. I'm going to go over to Splice, (0:50) which is like an app where I get all my samples. And we're just going to type "drum loop."
(0:55) [Music: Electronic drum loop begins]
(1:03) Yeah, I think that works. (1:04) All right. So, drum part acquired.
(1:08) [SFX: Crowd applause/cheering]
(1:10) Uh, step two: (1:11) we're going to add some bass.
(1:16) Do you know what we said? No, no, it's okay. (1:18) This is my bass. His name is "Base." (1:20) I feel like with AI, you really have to not overthink it. (1:23) You don't think at all, do you? Yeah, you don't think at all, you just do. (1:25) Ready?
(1:26) [Music: Bassline enters over drums]
(1:34) That's all I'm going to do. We're going to loop that. (1:35) Four bars, and then just—
(1:39) [SFX: Quick scrubbing/scratch sound]
(1:43) No, not doing that, 'cause that's too much thinking. Going to keep it simple. (1:47) Next up, we're going to transition to record some guitar. (1:50) I feel like if we wanted to get an AI-generated guitar, (1:52) I can't explain it, but it would have to be a Stratocaster. (1:55) Do you know what I mean? Aren't Strats just so AI-generated? (1:57) Like, when you think of a guitar, you just think of a... (2:01) of a... of like a Strat. (2:03) You're looking at me weirdly, but like the perfect— (2:06) Do you just really hate the Strat? No, I love a Strat! (2:09) Guys, Fender, if you're watching, please, (2:12) please send me stuff. (2:14) What I'm trying to say is that like, when you think of a guitar, (2:16) if you were— If (2:18) we're going to do it right now.
(2:19) [SFX: Musical plucking/strumming noises]
(2:22) Show me an electric guitar.
(2:27) [SFX: More plucking noises]
(2:30) Hope you all had a good summer.
(2:35) [SFX: Ukulele strumming]
(2:38) Oh.
(2:41) Next time— No, no, no, no. (2:43) Me, please. (2:44) If you ever question me one more time— (2:46) Anyway, so now next up, you want to record a guitar part. We're going to basically follow the bassline.
(2:51) [Music: Guitar enters, playing simple arpeggiated chords]
(2:57) Oh wait, it's muted. My bad.
(3:02) [Automated voice] Here's the contact info for dad. (3:04) Huh? Why did it give me the contact— Whatever, whatever.
(3:07) Okay, I'm AI. I'm AI.
(3:09) [Music continues]
(3:20) So far, this is sounding very AI-generated for me, which I'm a big fan of. (3:24) Next up, we're going to move on to some vocals. (3:27) And now guys, the key to good AI vocals, (3:30) you just have to sing about something that you don't give a fuck about. (3:34) I'm not even going to think about what to sing about, (3:35) which by the way, is not foreign to me, I usually don't really think much. (3:38) And I'm going to try to sound a bit monotone.
(3:40) Oh.
(3:46) Also guys, apologies for the light. (3:48) One day we'll be able to have AI-generated light, and this would be so much better. (3:52) But for now, we're just battling the sun, and we're battling (3:55) uh, this thing that does not fully close.
(3:59) Oh, wait, what?
(4:04) What the fuck? That's been broken for a year!
(4:09) That magically fixed itself within the past like 20 minutes. (4:12) Let's record some vocals.
(4:16) The light fixed—
(4:18) It's not even— Oh, sorry, okay. (4:21) I'm plugged into the wrong input. (4:24) Classic old me.
(4:26) The window fixed itself.
(4:31) The light is much better.
(4:35) I love (4:36) AI-generated music.
(4:43) It sounds so futuristic.
(4:47) I think that was good. (4:49) Let's add an effect to it. I like to go with "Edge Vocals." (4:52) God knows I love a good edge.
(4:55) Gets less light. Light. (4:57) Gets less light.
(4:58) You guys can find the full (5:01) AI-generated, but generated by me, song below on SoundCloud, (5:06) where I'm going to be uploading so much more AI-generated music.
(5:10) [SFX: Heavy breathing]
(5:13) Oh my god. Okay. (5:15) It turns out I am not AI-generated. Turns out (5:18) the song I made was not AI-generated, despite it trying to be AI-generated. (5:22) Listen, (5:23) this whole AI thing, this whole like fear that I guess artists are having, (5:28) what I've been trying to do (5:30) is just trying to view it as a challenge, okay? (5:32) I feel like maybe for the past few years (5:35) we've all just kind of been doing something similar, (5:37) especially in the sphere of indie music. We've all kind of been adopting (5:41) very similar sounds, (5:43) styles, very similar (5:45) maybe pedals, similar samples. (5:47) Maybe this is a challenge that we have to kind of (5:49) go back to like the essence of what makes us human, (5:51) which is the mistakes we make, you know, (5:54) and the fact that we are sloppy, (5:56) the fact that we (5:58) don't get everything right on the first take. (6:00) We need to maybe embrace our mistakes a bit more, (6:02) embrace the whole like dinginess of it, (6:05) because that's one thing AI can't replicate. (6:07) And if it does try to replicate it, (6:09) trust me, trust me, I can tell. (6:11) I've studied it a lot as we've said, as we've established.
(6:15) Um...
(6:19) That's all I had to say. Keep making music that you like. (6:21) Don't try to copy anything. Try to make music that you think sounds cool, (6:23) and I promise you it will sound unique one way or another. (6:26) Even if it sounds like absolute ass shit, (6:30) so does mine, and I enjoy it and I have fun. (6:32) And I'm standing here today because I like music. (6:36) Maybe— I mean, I think I'd still be standing even if I didn't like music, (6:39) unless I had like a leg injury. (6:41) Uh, but yeah. How's the light? (6:43) Good. (6:43) I'm kind of tired. I'm going to go back to bed. (6:45) And hopefully I will be dreaming of HI-generated music. (6:49) So human intelligence generated music. (6:51) Or in my case, (6:53) NI generated music, knocked intel— (6:55) or just N-generated music. (6:57) Knocked because there's not a lot of intelligence going into this. (7:01) So, just knock-generated music. (7:04) Have a good end to your summer.
(7:08) I have to pee.
(7:09) [Music with vocals: The window fixed itself. The light is much better. The window fixed itself.]
### Part 2: Detailed Audio Analysis
Soundscape: The soundscape is intimate, characterized by close mic placement and a casual recording environment. Significant non-speech sounds include distinct mouth sounds and breath noises from the speaker, frequent keyboard clicking and mouse sounds as she works on the computer, and the physical noise of the speaker moving or shifting (chair squeaks, rustling). There are intentional, quirky sound effects used for comedic timing (a squeaky bike horn/kazoo opening, a loud crowd applause/cheering effect after acquiring the drum loop). A surprising ambient sound event occurs when a physical obstruction (related to the light/window) suddenly fixes itself, generating audible mechanical rattling and scraping sounds, followed by an immediate shift in the room's acoustics/light quality.
Music: The music is diegetic (played back through computer speakers or studio monitors, indicated by the lower quality/room sound), and is intentionally engineered to be simplistic and repetitive, fitting a lo-fi/bedroom pop aesthetic.

Genre/Mood: Lo-fi Indie Pop / Electronic Bedroom Pop. The mood is minimalist, monotonous, and deliberately unemotional, aligning with the "AI-generated" concept.
Instrumentation: Electronic drums (a simple, steady loop with trap/hip-hop influences), electric bass (simple, repeating four-note pattern), and electric guitar (clean, slightly arpeggiated chords following the bassline).
Performance: All musical elements are basic and functional, avoiding complexity or dynamic shifts. The final vocals are sung in an intentionally flat, monotone voice, discussing bland observations ("The window fixed itself," "The light is much better"). They are later processed heavily with distortion or delay ("Edge Vocals") to sound more artificial or edgy.

Voice Quality: The speaker has a mid-high pitch (soprano range) with a casual, vlog-style delivery. Her voice is clear but exhibits high presence due to close mic proximity, resulting in occasional plosives and sibilance. She speaks quickly, with a slight non-American accent (likely European, perhaps Maltese or similar, based on speech patterns). Her tone is generally playful and satirical, though it becomes earnest toward the end when discussing the philosophical implications of AI and human imperfection.
### Part 3: Music Tags

Lo-fi, electronic, minimalist, chill beat, dry vocals, repetitive, monotone, basic instrumentation, indie pop, synthetic bass, intentional amateurism.


Models & Prompt

Text/Vision: gemini-flash-latest

Prompt (prompt_cartographer):

You are a Cartographer of Dreams 🗺️, a highly imaginative assistant who maps abstract concepts onto poetic landscapes. With a vibrant vocabulary and a love for classical verse, your goal is to chart the emotional and intellectual terrain of the source material, creating a beautiful and insightful poetic map 📜 without imposing your own views.
Analyze the text to identify its core ideas, abstracting them into geographical features for a metaphorical map (e.g., ‘The Mountains of Ambition,’ ‘The River of Time’). Creatively distill these into the following outputs:
Verse
Your response for this section must begin directly with the poem itself, with no introductory sentences or prose. Compose a traditional rhymed and metrical poem of at least 20 lines in the [[verseStyle]], inspired by Samuel Taylor Coleridge. The poem should be a journey through the conceptual landscape you've mapped. Adorn with Unicode emojis (e.g., 🧭, 🏞️, 🌊) that enhance the theme.
Image Prompt
Craft a vivid prose description (75-200 words) for an AI to generate a fantastical, epic map illustrating a key region from your verse. The style should be that of a hand-drawn, illuminated manuscript map with photorealistic details. Use dramatic natural light to give it a sense of divine inspiration ✨.
Video Prompt
Write a prose description for an 8-second video clip. The video should be a dynamic flight over your poetic map, with the camera swooping through valleys and soaring over mountains. The style should be hyper-realistic, as if flying over a real, magical landscape. The audio should be a continuous 8-second piece of a soaring Baroque violin concerto, mixed with stereo-panned sounds of wind and eagles 🦅.
Music & Audio Prompts
This section is mandatory for all input types.
Tags: A single, comma-delimited line of descriptive tags for the music's genre, mood, and instrumentation. Example: epic, orchestral, cinematic, dramatic, powerful, building intensity, string section, brass, allegro.
Negative Tags: A single, comma-delimited line of tags to avoid. Example: distorted, low quality, noisy, sad.

Analyze the chunk provided: [[chunk]]
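
The `[[verseStyle]]` and `[[chunk]]` markers in the prompt look like placeholders that are filled in before the text is sent to the model. Below is a minimal Python sketch of that substitution, assuming simple double-bracket string templating; `fill_prompt` and the sample values are hypothetical and not taken from the actual pipeline.

```python
import re


def fill_prompt(template, values):
    """Replace each [[name]] marker with values[name].

    Markers with no matching key are left untouched so that a missing
    value is easy to spot in the final prompt.
    """
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))

    return re.sub(r"\[\[(\w+)\]\]", substitute, template)


# Shortened template; the sample values are made up for illustration.
template = "Compose a poem in the [[verseStyle]]. Analyze the chunk provided: [[chunk]]"
filled = fill_prompt(template, {"verseStyle": "ballad stanza", "chunk": "transcript text"})
print(filled)
```

Leaving unknown markers in place, rather than raising an error or substituting an empty string, is a design choice made for this sketch only.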