The Last Creative Frontier: Why AI Hasn't Conquered Music Yet — and What That Reveals About Intelligence Itself


Generative AI has begun to work its way into every major form of human creative expression: writing, visual art, music. AI can write a novel, design a website, and generate a photorealistic image from a sentence. But ask it to sit down with you and produce a music track, something that builds, breathes, and lands emotionally, and it falls short in ways that are hard to articulate but immediately felt. We see Nano Banana and Claude Code do things now that either save time or produce beautiful garbage (depending on the creator and their experience or talent), but we haven’t seen nearly as much in music yet.

I’ve been thinking about why, and diving into this led me to think about what current AI actually is, and what it would need to become to cross that line.


The Democratization Pattern

There’s a clear arc to how AI disrupts creative fields. First, it lowers the barrier to entry for people who have ideas but lack deep technical mastery.

Writing came first… natural language is what these models were built on. Then images. Tools like Midjourney and DALL-E meant that someone with strong visual taste and creative direction could produce stunning work without years of Photoshop expertise. The skill shifted from technical execution to creative vision.

Code followed. You no longer need to have memorized syntax or spent years in a particular stack. You describe what you want built, you understand the architecture, you have the vision, and AI handles the rest. The gap between idea and implementation is collapsing.

Music hasn’t really followed. There are impressive demos, generated loops, and AI-assisted stems. The closest things we have right now are text-to-song tools like Suno and Udio — type a prompt, get a produced track back in seconds, vocals included. Worth knowing about, genuinely impressive for what they are. But the experience of sitting down with an idea — a mood, a reference, an emotional arc — and building something fully realized without production mastery? That collaborative workflow doesn’t exist yet at the level the other creative domains have reached.

Suno (AI music generator): suno.com
Udio (AI music generator): udio.com

The Paradox of the Average Voice

Before getting to music specifically, it’s worth talking about something genuinely strange… a paradox that reveals the limits of what AI has accomplished even in the domains where it appears to have succeeded so far.

AI writing models were trained almost exclusively on human-created text. Everything ever written: literature, journalism, academic papers, forum arguments, poetry, instruction manuals, song lyrics. The full range of human expression, compressed into a model that can produce language on demand. And yet you can almost always tell when a model wrote something.

The em dash used as a pivot — exactly like that. The construction “not only X, but Y.” The grand declarative close that announces its own significance. Arguments structured in threes. Transitions that arrive exactly where expected and do exactly what you anticipated. Certain phrases that appear constantly: this changes everything, at its core, it’s worth noting. No single human writer reaches for all of these constantly. But they appear throughout training data as signals of rhetorical competence, so the model deploys them as defaults: the learned grammar of sounding good.

The deeper issue is actually mathematical. Think about what happens if you average every human face: the result, as researchers Langlois and Roggman demonstrated in 1990, is a face rated as more attractive than almost any of the individuals it was built from. Smooth, symmetrical, perfectly proportioned. Also fictional — nobody actually looks like that, because nobody is like that. It belongs to everyone and no one simultaneously. The averaging doesn’t produce something ugly or broken. It produces something optimized to the point of abstraction, technically ideal and somehow untethered from reality.
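To make the averaging intuition concrete, here is a toy numerical sketch. It is not the procedure from the 1990 study; it assumes each "face" is just a list of random feature measurements scattered around a shared template, and every name and number in it is made up for illustration. The point it shows is that the composite lands far closer to the template than a typical individual does.

```python
import math
import random

def make_faces(n_faces, n_features, seed=0):
    """Hypothetical 'faces': feature vectors scattered around a template of zeros."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_faces)]

def mean_face(faces):
    """Average the faces feature by feature."""
    n = len(faces)
    return [sum(column) / n for column in zip(*faces)]

def distance_from_template(face):
    """How far a face sits from the all-zeros template (its total quirkiness)."""
    return math.sqrt(sum(x * x for x in face))

faces = make_faces(n_faces=200, n_features=50)
composite = distance_from_template(mean_face(faces))
typical_individual = sum(distance_from_template(f) for f in faces) / len(faces)

print(f"typical individual: {typical_individual:.2f}, composite: {composite:.2f}")
```

Every individual keeps its quirks; the average cancels them against each other, which is the smoothness described above.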

AI writing does the same thing. It isn’t bad writing. It’s frictionlessly correct writing — every sentence properly structured, every transition signposted, every argument landing exactly where expected. What’s missing isn’t competence. It’s the friction, the asymmetry, the crooked nose. The specific imperfections that prove a real person was here.

Human writing gets its texture from the specificity of a life. The reason a great writer sounds like themselves is that their voice was shaped by everything they have lived. The odd word choice that comes from a particular obsession, the rhythm that breaks where you don’t expect it. Bob Dylan’s lyrics are often nonsense, but give you a feel, a mood — my favorite example is Subterranean Homesick Blues, a string of phrases that don’t tell an overt story but give you a vivid sense of someone’s existence in the 60s. Flaws, idiosyncrasies, and unexpected turns aren’t bugs in great writing. They’re evidence of an actual person.

AI has no particular life of its own. It has everyone’s life averaged into a smooth puree. Technically correct. Subtly nobody.

This matters for music in a way that is probably catastrophic. The musical equivalent of AI writing’s flatness would be every track hitting the drop exactly where expected, every chord resolving cleanly and predictably. Competent. Forgettable. And somehow, despite being built from the entire recorded history of human music, identifiably artificial.


Music Is Math; It Should Be Tractable

Here’s what’s strange: music might actually be more mathematically structured than images or prose.

Tones are frequencies. Rhythms are ratios. Chord progressions follow rules of tension and resolution that have been formally codified for centuries. I remember back in my high school days, in music school, we were taught the chord pattern Bach used and we mechanically created phrases that sounded just like Bach. Simplistic, but the principle is sound: music does have an explicit grammar.
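A small sketch of "tones are frequencies" in practice, assuming twelve-tone equal temperament and the common A4 = 440 Hz reference (other tunings exist, and the function names are mine). Each semitone multiplies the frequency by the twelfth root of two, so every interval is a fixed ratio.

```python
A4_HZ = 440.0   # assumed tuning reference; orchestras and eras vary
A4_MIDI = 69    # MIDI note number for A above middle C

def midi_to_hz(note):
    """Equal-temperament frequency for a MIDI note number."""
    return A4_HZ * 2 ** ((note - A4_MIDI) / 12)

def interval_ratio(semitones):
    """Frequency ratio between two notes this many semitones apart."""
    return 2 ** (semitones / 12)

print(round(midi_to_hz(60), 2))     # middle C -> 261.63
print(round(interval_ratio(7), 4))  # a fifth -> 1.4983, close to the pure 3/2
print(interval_ratio(12))           # an octave -> 2.0
```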

Early neural network researchers have worked on this. Google’s Magenta project spent years treating MIDI as a token sequence — predicting the next note the way a language model predicts the next word. The architecture maps cleanly because music has structure, and AI is good at structure.

Magenta (a research project exploring the role of machine learning in the process of creating art and music): magenta.tensorflow.org
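To show what "predicting the next note the way a language model predicts the next word" means at its very simplest, here is a sketch that counts which pitch tends to follow which. This is a plain bigram table, far cruder than anything a real research system uses, and the melody is a made-up example.

```python
from collections import Counter, defaultdict

def train_bigram(notes):
    """Count, for each MIDI pitch, which pitches follow it."""
    table = defaultdict(Counter)
    for current, following in zip(notes, notes[1:]):
        table[current][following] += 1
    return table

def predict_next(table, note):
    """Return the most frequent successor, or None if the note was never followed."""
    if note not in table:
        return None
    return table[note].most_common(1)[0][0]

# Hypothetical training melody: fragments of a C major scale as MIDI pitches.
melody = [60, 62, 64, 65, 67, 65, 64, 62, 60, 62, 64, 62, 60]
model = train_bigram(melody)

print(predict_next(model, 60))  # -> 62
print(predict_next(model, 72))  # -> None (never seen)
```

The output is grammatical in the same sense my high school Bach phrases were: it follows the statistics of what it was shown and nothing more.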

So why don’t we have AI session partners that allow amateur musicians with vision but no technical skills to “drop dope tracks”?


I Think The Problem Is Time

Images exist in space. You can evaluate a good image all at once (“I don’t know art but I know what I like…”) — take in the composition, the color, the balance, match it in your head with existing styles and score it. Language models process tokens sequentially, but a sentence or even a paragraph can be held in mind as a whole idea.

Music exists in time. It cannot be evaluated all at once. A track is not really a snapshot in the same way — it’s an experience that unfolds. What makes it work isn’t any individual moment but the relationship between moments: the tension that builds over multiple linked phrases and the emotional payoff when a chord finally resolves or the beat finally drops, or the way a motif returns three minutes later transformed by everything that came between (like how Andrew Lloyd Webber seems to pound these into your ear repeatedly in every body of work he has done).

What current AI architectures don’t have is a genuine sense of time passing. I notice this when having a conversation with Claude. It has no idea what day it is, or if what you told it previously was a week ago and expected events should have already transpired, or if it was just seconds ago.

The dominant architecture — the transformer — processes sequences but doesn’t experience them. It sees a context window essentially laid flat, with positional markers to indicate order. It knows that B comes after A, but it doesn’t feel the anticipation of waiting for B, or carry forward the emotional weight of what A meant. It’s closer to reading sheet music spread out on a table than to hearing a performance. Like the Bach example from before, you can generate technically correct music this way, but you cannot reliably generate music that does anything.
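For a sense of what a "positional marker" can look like, here is a sketch of the sinusoidal scheme from the original transformer paper. Many current models use learned or rotary variants instead, and the vector size of 8 is an arbitrary choice for readability. What matters for the argument is that position is a tag attached to each token, and every tagged token is available at once.

```python
import math

def positional_encoding(position, d_model=8):
    """Sinusoidal position vector: alternating sin/cos at decreasing frequencies."""
    vector = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        vector.append(math.sin(angle))
        vector.append(math.cos(angle))
    return vector

# Positions 0 and 3 are just two different tags; nothing is "waited for".
print([round(v, 3) for v in positional_encoding(0)])
print([round(v, 3) for v in positional_encoding(3)])
```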


Hearing vs. Listening

There’s another gap that rarely gets discussed: current AI can hear, but it can’t listen.

Tools like Whisper convert audio to text with impressive accuracy, but that’s treating sound as a delivery mechanism for language. The audio is just a wrapper around words that get tokenized same as typing. What’s missing is an AI that takes sound itself as meaningful input, with the ability to read emphasis and emotion.

Imagine if you could sit down with an agent that has a microphone. You play a four-bar hook on your bass, and the agent doesn’t just transcribe what you said before and after; it receives the musical content directly. It tokenizes the actual sound: pitch, duration, timbre, and phrasing, including notes placed slightly behind the beat, syncopation, and the other details that make a line funky. It processes that input against everything it has learned about harmony, rhythm, structure, and feel.

The tokenization problem here is genuinely interesting. For speech, tokens map to phonemes and words. For music, what’s the token? A frequency event? A harmonic unit like a chord? A timbral fingerprint (this is a bass guitar, played with fingers, with a particular attack)? A structural marker (a question phrase looking for resolution)?

Those are different representations requiring different training. A C major chord isn’t a word. It’s a relationship between frequencies with emotional and harmonic meaning that only makes sense in context of what came before and what comes after.
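One possible answer to "what’s the token?" can be sketched as a note event that keeps both the grid position and how far the player strayed from it. This is a hypothetical representation of my own, not how any of the systems named in this article tokenize audio, and the tempo and grid resolution are assumed defaults.

```python
from dataclasses import dataclass

@dataclass
class NoteToken:
    pitch: int          # MIDI note number
    grid_step: int      # nearest sixteenth-note slot from the start of the phrase
    offset_ms: float    # positive = behind the beat, negative = ahead of it
    duration_ms: float

def tokenize_onset(pitch, onset_ms, duration_ms, bpm=120, steps_per_beat=4):
    """Snap an onset to the nearest grid slot but keep the micro-timing as data."""
    step_ms = 60000 / bpm / steps_per_beat
    grid_step = round(onset_ms / step_ms)
    offset_ms = onset_ms - grid_step * step_ms
    return NoteToken(pitch, grid_step, offset_ms, duration_ms)

# A bass note (E1, MIDI 28) played 20 ms late on the second beat.
laid_back = tokenize_onset(pitch=28, onset_ms=520.0, duration_ms=180.0)
print(laid_back)
```

A plain quantizer would throw the 20 ms away; keeping it is what would let a model tell "in the pocket" from "on the grid."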

The pieces do exist separately. Google’s AudioLM, Meta’s EnCodec, and OpenAI’s GPT-4o all do forms of audio tokenization and processing, but what doesn’t exist yet is an agent that integrates them into a genuine musical conversation.


What the Collaboration Actually Looks Like

Here’s my vision for how this should work.

You sit down with an agent. You play a four-bar riff in 4/4. The agent hears it, analyzes it, and responds: “Strong hook. What if we introduced a 7/8 measure in bar three? It creates temporal surprise without breaking the feel — gives the listener something they didn’t know they were waiting for.” It generates a percussion track with that variation, mixes it, and plays it back.

You say: “I like the direction but the snare is too present. And I want more dissonance in the harmony — but don’t resolve it.”

The agent pulls the snare back. It models your harmonic intent and identifies which note creates some friction versus just sounding straight up wrong… because those are different things, and knowing the difference is taste. It plays back the revision. You go again.

So there is a full creative feedback loop that allows for iteration. Crucially, the agent needs to hold context across the entire loop. It remembers that you liked the 7/8 surprise and you wanted the dissonance sustained, and three exchanges ago you mentioned you were chasing something that felt like early 90s Bristol — slightly behind the beat, heavy bottom, unresolved tension. That accumulated creative context is the collaboration, and what a great producer does in a session.


Musician’s Music

Rush is the clearest example I can think of for why all of this matters.

Neil Peart built entire compositions around temporal dissonance. YYZ opens with the letters Y-Y-Z in Morse code tapped out in 5/4 (5/4 you say?? Who does that?). The grooves shift constantly but feel inevitable once you’ve heard and mentally processed them. Most listeners feel vaguely that the music is exciting and propulsive without being able to say why. A musician hears the architecture, and the appreciation deepens with every listen because there’s always another layer. The Camera Eye, Tom Sawyer, and my very favorite Rush song, Limelight — these tracks reward study. They’re what I call musician’s music. Definitely hard to dance to, but impossible to dismiss.

There are two distinct levels of musical appreciation:

Felt — the track moves you. You don’t know why, and you don’t need to. Emotional, intuitive, very right-brain.

Understood — you hear what they’re doing. The technical craft is seen and appreciated for the knowledge and expertise it took to create.

Rush lives largely in the second category. A casual listener bounces off it, maybe appreciates it because “it rocks!” A musician leans in harder with every pass, picking apart how Peart’s percussion is sometimes musically arrhythmic but plays amazingly off of Lee’s bass syncopation. That gap between emotional impact and technical intelligence is exactly what a capable AI music collaborator would need to bridge.

It would need to generate something that feels right to an untrained ear: momentum, emotional landing, forward motion — but also carry the technical intelligence underneath.


A Lifetime In a Request

The collaborative loop I described depends on something AI currently can’t do: persist.

Every conversation with an AI is a complete lifecycle. The model is initialized with trained capabilities — things it learned before the conversation began. It receives context: prior exchanges, instructions, tools. It processes a request and produces a response. Then, in a meaningful sense, that instance is done. The next request starts from scratch. Nothing that happened in the conversation changes what the model fundamentally is.

Born. Taught. Does something. Dies. Repeat.

This isn’t a criticism — it’s just the architecture. Training is where learning happens. Inference — the actual conversation — is frozen execution. The model isn’t discovering anything. It isn’t growing. There are no stakes. Nothing accumulates.

For most tasks, this is fine. You don’t need a model to grow from helping you draft an email.

But for a music session — for anything that requires genuine temporal intuition and accumulated creative context — it’s a ceiling. The back-and-forth iteration only works if the agent carries the session forward. If each exchange resets, the agent forgets it suggested the 7/8 measure, forgets you said yes but softer, forgets the overall arc you’re building toward. Persistent session memory isn’t just a nice feature in this context. It’s the entire product.
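Here is a sketch of the difference between a call that resets and a session that carries things forward. The `frozen_model` function is a stand-in I made up, not a real API, and replaying notes into each call is only the workaround available today: the session remembers, while the model itself never changes.

```python
from dataclasses import dataclass, field

def frozen_model(context, request):
    """Stand-in for a stateless model call: the reply depends only on its inputs."""
    remembered = "; ".join(context) if context else "nothing"
    return f"Request: {request} | Remembering: {remembered}"

@dataclass
class Session:
    notes: list = field(default_factory=list)

    def ask(self, request):
        """Replay everything said so far, then record the new request."""
        reply = frozen_model(self.notes, request)
        self.notes.append(request)  # the session accumulates; the model does not
        return reply

session = Session()
session.ask("try a 7/8 measure in bar three")
session.ask("yes, but softer")
with_memory = session.ask("keep the dissonance unresolved")
without_memory = frozen_model([], "keep the dissonance unresolved")

print(with_memory)
print(without_memory)
```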

A real producer develops taste over years of listening and making. They carry forward what they’ve heard, what moved them, what failed. That intuition is temporal in nature — it comes from having existed through time, not just processed sequences tagged with position numbers.


What Solving This Could Mean

There’s active research in this direction: continual learning, persistent memory systems, and state space architectures like Mamba that handle temporal state differently than transformers. The problem of catastrophic forgetting, where a model that learns new things degrades what it previously knew, is one of the harder open problems in this area.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arxiv.org
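The forgetting problem can be shown with a deliberately tiny toy: a model with a single weight, trained on one made-up task and then on a conflicting one. Real networks have many weights and forget less totally than this, so treat it as an illustration of the mechanism rather than a measurement of anything.

```python
def train(w, data, lr=0.05, epochs=200):
    """Plain gradient descent on squared error for the one-weight model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def loss(w, data):
    """Mean squared error of y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]   # hypothetical task A: y = 2x
task_b = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]  # hypothetical task B: y = -x

w = train(0.0, task_a)
loss_a_before = loss(w, task_a)   # near zero: task A is learned
w = train(w, task_b)
loss_a_after = loss(w, task_a)    # large: learning B overwrote A

print(f"task A loss before: {loss_a_before:.6f}, after learning B: {loss_a_after:.2f}")
```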

What I find most interesting about the endpoint is that if you solved genuine temporal awareness and persistent learning, you wouldn’t just get a human equivalent. I would expect something stranger and more capable.

Humans have temporal continuity. We live through time, accumulate experience, and develop intuition, but we pay a steep price. We accumulate trauma and damage. Cognitive bias and prejudice bake in from childhood. Motivated reasoning, ego, and survival instinct all distort judgment. As we age, we hit a point where we degrade and eventually lose coherence entirely.

An AI with true temporal awareness would have none of those liabilities. Consider persistent learning without ego distortion and intuition without bias or self-serving interest. A sense of time without the damage time inflicts on our organic minds. Almost limitless knowledge with no degradation from aging.

With this we would begin to see something initially indistinguishable from human intelligence, but as it grew it would be something new and in important ways better.


Back to Music

The reason AI hasn’t democratized music production the way it has images or code really isn’t that music is more complex — it’s that music is the domain that most nakedly requires what AI currently lacks: the experience of time.

Solving that (really solving it, not approximating it with just larger and larger context windows) could really change music production. You wouldn’t need to be a trained producer to build something with real depth. You’d need taste, references, and ideas. The mastery gap closes, the same way it closed for images and code. You could eventually build entire musical mood agents that played appropriate (original?) compositions on demand, for every situation, for every mood, for every setting.

But it could change a lot more than that.

It would mean AI that grows from conversations rather than resetting every request. AI that develops genuine taste over accumulated experience, and maybe shares that experience with humans and other AI. AI that can hold creative context across a session, a project, a career. A genuinely different kind of mind — one that absorbed the full breadth of human creativity but isn’t constrained by the averaging that makes current AI output feel flat. Not the mean of all voices. Something with a voice of its own.

The music problem is a mirror. What’s missing from AI production tools is the same thing that’s missing from AI in general — just much more visible, more felt, harder to paper over with scale and speed. And when we solve it, the result won’t just be a better music tool.

It’ll be a different kind of mind.


This article was developed through a conversation with Claude — which means the very averaging effect it describes had a hand in writing it. The ideas started as a rambling discussion about music production and ended up somewhere neither of us expected. That part, at least, feels human. The em dashes are a known issue.
