Recent advances in Audio-Language Models (ALMs) have boosted performance across musical understanding benchmarks for long-form music fragments. By leveraging prior work in Music Information Retrieval, researchers have found ways to address the scarcity of music-text pairs: LLMs synthesize the outputs of multiple MIR models, existing metadata, and contextual knowledge into richer captions. This raises the question: to what extent does this approach give the model true music intelligence? Is the model aware of the content and intentions of the music, i.e., the artistic image, or does it just pretend to be? Is there a risk that GPT-generated or GPT-enhanced captions introduce hallucinations into the training data, which then lead to a drift in performance? Similar to the debate about whether language models can truly understand the meaning of text, this one asks: how do ALMs listen, and what do they hear?
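The caption-synthesis approach described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not any particular dataset's pipeline: the model names, fields, and example values are all assumptions, and in practice the assembled prompt would be sent to an LLM for rewriting into a fluent caption.

```python
def build_caption_prompt(mir_outputs: dict, metadata: dict) -> str:
    """Fold MIR model predictions and track metadata into one LLM prompt.

    `mir_outputs` maps a (hypothetical) MIR model name to its predicted
    tags; `metadata` holds existing catalog fields for the track.
    """
    lines = ["Write a rich one-paragraph caption for this track."]
    for model, tags in mir_outputs.items():
        lines.append(f"{model} predictions: {', '.join(tags)}")
    for field, value in metadata.items():
        lines.append(f"{field}: {value}")
    return "\n".join(lines)


# Illustrative inputs only; real pipelines use many more models/fields.
prompt = build_caption_prompt(
    {"tempo-estimator": ["120 BPM"], "genre-tagger": ["lo-fi", "jazz"]},
    {"title": "Night Drive", "artist": "Unknown"},
)
print(prompt)
```

The interesting (and risky) part is exactly the step this sketch omits: the LLM interpolating "contextual knowledge" between those tags, which is where hallucinations can slip into the training captions.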
Posted: Monday, February 16, 2026.
02
Clouds, Streams, and Ground (Truths)
Conference notes from UC Berkeley, March 7-8.
Conference, Music Ecosystems
Conference Notes
I recently attended the Clouds, Streams, and Ground (Truths): Developing Methods for Studying Music Ecosystems conference on March 7-8 at UC Berkeley. The event highlighted cutting-edge topics in the field of music, featuring talks on:
Human-AI Co-improvisation: Exploring the use of a physical, joint medium of interaction.
Algorithmic Curation: Generating interpretable music playlists by navigating a musician's social networks.
Copyright Law: Navigating the complex legal challenges surrounding AI-assisted artworks (is a 600-prompt artwork copyrightable?).
Media Evolution: Analyzing the trend in which the once-sole transmitter of music ("vapor") has become "frozen."
Human-Algorithm Interaction: Developing novel methodologies to capture how humans interact with algorithmic agency.
Posted: Monday, March 9, 2026.
03
Generative UIs
Thoughts on dynamic controls, adaptation, and what still feels missing.
UI, Generative AI
A Reflection on Generative UIs
My first reaction to many generative UI ideas is: how is this different from vibe-coding a plot or graph? Visualizing data through code is already something many models can do. What feels new is adding a layer on top of natural-language instructions with interface controls like sliders, drop-down menus, and input boxes, where prompt refinement becomes the core interaction.
It does seem more efficient in some cases, especially when users are asked to try many small changes like legend placement or axis label orientation. But I think those comparisons should be taken with a grain of salt, because these are exactly the kinds of edits that widget-based interfaces are designed to make easy.
Given the relatively deterministic outputs of this kind of system, compared to diffusion models, and the predictability of instructions like "show the legend at 45 degrees," I wonder how much added value dynamic controls provide over vibe-coding. To me, the bigger value in visualization editing is exploring fundamentally different views of the same data, or reacting to goals like "I want the graph to convey X." A compelling direction is turning those high-level goals into smart controls that can increase or decrease aspects of X.
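One way to picture those "smart controls" is as a thin layer that compiles control state back into a refinement prompt for the code-generating model. The sketch below is purely illustrative under my own assumptions (the class names, levels, and phrasing are invented, not from any existing generative-UI system): each control nudges one aspect of the high-level goal, and the current settings are flattened into an instruction string.

```python
from dataclasses import dataclass


@dataclass
class SmartControl:
    """A hypothetical control that raises or lowers one aspect of a goal."""
    aspect: str      # the aspect of the goal this control adjusts
    level: int = 0   # -2 (much less) .. +2 (much more); 0 is neutral

    def to_instruction(self) -> str:
        if self.level == 0:
            return ""
        direction = "emphasize" if self.level > 0 else "de-emphasize"
        return f"{direction} {self.aspect} (strength {abs(self.level)})"


def compile_prompt(goal: str, controls: list[SmartControl]) -> str:
    """Compile the goal plus all non-neutral controls into one refinement prompt."""
    parts = [goal]
    for control in controls:
        instruction = control.to_instruction()
        if instruction:
            parts.append(instruction)
    return "; ".join(parts)


controls = [SmartControl("seasonal trend", 2), SmartControl("outliers", -1)]
refined = compile_prompt("I want the graph to convey growth", controls)
print(refined)
```

The design question is where such aspects come from: here they are hand-written, but the appeal of generative UIs is having the model propose the controls themselves from a goal like "convey X."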
Beyond Responsive Design
The idea that every application should adapt to any device in a way tailored to the individual user feels like a meaningful path toward the future of interfaces. Adapting layout to device constraints already feels mostly solved: browsers and apps resize and reflow content well.
What still feels underdeveloped is adapting to each person's preferred way of interacting with an interface. Outside recommendation systems and content personalization, interfaces themselves rarely reshape around users in real time. I wish my phone or computer would support me in daily tasks with the right controls and UI patterns exactly when I need them. For me, that is the real promise of generative UIs.