I recently got to know Riffusion! It’s one of the cleverest "hacks" in AI history. Most people think AI music is made by a machine "thinking" about notes, but Riffusion treats music like a picture.
Here is the interesting article I found:
https://techcrunch.com/2023/10/17/ai-generating-music-app-riffusion-
The "Image-to-Audio" Cheat Code: How Riffusion Works
Imagine you have a top-tier artist who is amazing at painting landscapes but knows absolutely nothing about music. If you could somehow turn a song into a painting, that artist could "paint" a new song for you.
That is exactly what Riffusion does.
1. The Core Trick: Spectrograms
The "ML thing" behind Riffusion isn’t actually a music model; it’s Stable Diffusion v1.5—the same AI used to generate images of "a cat in a space suit."
To make this work, the creators (Seth Forsgren and Hayk Martiros) converted audio into spectrograms. A spectrogram is a visual representation of sound:
- X-axis: Time.
- Y-axis: Frequency (pitch).
- Brightness/Color: Amplitude (volume).
By fine-tuning Stable Diffusion on thousands of these "sound images," the AI learned that a "heavy metal guitar" looks like a jagged, dense texture, while a "flute melody" looks like a thin, wavy line.
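To make that conversion concrete, here's a rough sketch of turning an audio clip into a "sound image" with librosa. This is not Riffusion's actual preprocessing code: the file names are hypothetical, and the n_fft / hop_length / n_mels values are generic defaults rather than Riffusion's real settings.

```python
# Minimal sketch: audio clip -> mel spectrogram -> grayscale "sound image".
# File names and STFT/mel parameters are illustrative, not Riffusion's exact settings.
import numpy as np
import librosa
from PIL import Image

y, sr = librosa.load("clip.wav", sr=44100, mono=True)  # hypothetical input clip

# X-axis = time (frames), Y-axis = frequency (mel bins), value = amplitude (power)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, closer to perceived loudness

# Normalize to 0-255 so it can be saved as an image a diffusion model could train on
img = (255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())).astype(np.uint8)
Image.fromarray(img[::-1]).save("spectrogram.png")  # flip so low frequencies sit at the bottom
```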
2. Features: More Than Just Noise
Because it's based on an image model, Riffusion inherited some "superpowers" that traditional music AI struggled with at the time:
- Prompt Interpolation: Since you can "morph" one image into another in Stable Diffusion, Riffusion can morph a "Jazz" prompt into a "Techno" prompt. The result is a seamless audio transition where the saxophone slowly dissolves into a synthesizer (there's a small interpolation sketch after this list).
- Infinite Loops: By ensuring the right edge of the generated image matches the left edge, the AI creates a perfectly seamless loop.
- Lyrics & Vocals: While the original version was better at vibes and beats, the newer Riffusion app uses a separate model to "sing" or "rap" over the generated tracks, making it a full-fledged song maker.
3. The Math: iSTFT and Griffin-Lim
Once the AI "paints" the spectrogram, it’s still just a .jpg file. You can't hear a picture. To turn it back into sound, the system uses:
- Inverse Short-Time Fourier Transform (iSTFT): the mathematical transform that turns frequency/amplitude data back into a vibrating, time-domain waveform.
- Griffin-Lim Algorithm: Because spectrograms usually lose "phase" information (the exact timing of the wave peaks), this algorithm estimates that data so the audio doesn't sound like robotic static (there's a small reconstruction sketch after this list).
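To make that last step concrete, here's a minimal sketch of the reconstruction using librosa's built-in Griffin-Lim. The parameters and file names are illustrative assumptions, and it starts from a real recording whose phase gets thrown away, just to mimic what a generated spectrogram image gives you.

```python
# Minimal sketch: magnitude-only spectrogram -> audio via Griffin-Lim + inverse STFT.
# STFT parameters and file names are illustrative, not Riffusion's exact pipeline.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=44100, mono=True)  # hypothetical input clip

# Forward STFT, then discard the phase to mimic what a generated image gives us
magnitude = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Griffin-Lim iteratively estimates a phase that is consistent with these magnitudes,
# then applies the inverse STFT to get a playable waveform back
y_rebuilt = librosa.griffinlim(magnitude, n_iter=32, hop_length=512, n_fft=2048)

sf.write("rebuilt.wav", y_rebuilt, sr)
```

In the real system the magnitude values would come from the pixels of the generated spectrogram rather than from librosa.stft, but the reconstruction step is the same idea.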
The Verdict
Riffusion is the "MacGyver" of the AI world. It proved that you don't always need a specialized tool; sometimes, if you look at a problem (like audio) from a different angle (like vision), the existing tools are already powerful enough to solve it.