With my recent move into the AI+UX space, I'm noticing unresolved issues in generating Web UI with LLMs. I want to revisit these issues as my research progresses.
Hand-eye Coordination
What AI "sees" is not what AI "touches".
- Vision models perceive screenshots. Text models perceive the serialized DOM tree and style sheets. But to generate or edit UI, the model usually has to change the source code.
- Screenshots (binary), DOM (tree), and source code (graph) are orthogonal representations of states, connected by causal relationships:
Source -[compiler/bundler] -> JS/CSS -[interpreter/renderer] -> DOM/CSSOM -[renderer] -> Screens
- How can we expect AI to successfully manipulate the source code while only being able to use the final results as objective/feedback?
- Modern tools and frameworks make it worse. The DOM is littered with garbled CSS-in-JS class names and deeply nested `div`s. The source code can be TypeScript or JSX, which further increases the causal distance from source to target (see the sketch after this list).
- We can manipulate surface-level representations, e.g. edit the DOM or generate a screenshot, but the edit cannot be persisted back to the source code.
- How do we express goals and constraints in the visual space? tldraw/makereal is a solid start.
- How do we engineer "backpropagation" for Web UI Generation?
- How do we provide "direct manipulation" to AI?
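To make the causal distance concrete, here is a minimal sketch of the source-to-DOM gap. The object shapes and the hashing step are my own illustration of how CSS-in-JS tooling typically behaves, not any specific framework's API.

```ts
// Toy illustration: the "source" an AI should edit vs. the DOM it actually observes.
// Hypothetical shapes; the hashed class name mimics typical CSS-in-JS output.
import { createHash } from "node:crypto";

// Source: the representation we ultimately need to change.
const source = {
  component: "Card",
  css: "padding: 16px; border-radius: 8px;",
  jsx: "<div><h2>{title}</h2></div>",
};

// Build step: CSS-in-JS tools commonly hash styles into opaque class names.
const className =
  "css-" + createHash("sha256").update(source.css).digest("hex").slice(0, 8);

// DOM: the serialized tree a text model perceives at runtime.
const dom = `<div class="${className}"><h2>Quarterly Report</h2></div>`;

console.log(dom);
// e.g. <div class="css-1a2b3c4d"><h2>Quarterly Report</h2></div>
// Nothing in this string points back to source.css or the Card component,
// so an edit made at the DOM (or screenshot) level has no mechanical path
// back to the source — the missing "backpropagation".
```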
Unknown Unknowns
Without great documentation, AI can't tell what is possible and must rely on guessing, or worse, hallucinating.
- Using popular libraries with good documentation beats using esoteric or private libraries. A case in point: according to its leaked system prompt, Claude 3.5 Artifacts successfully leveraged shadcn/ui. Material UI and Ant Design are other good choices.
- In any UI component library, not every component is compatible with every other component. There is an implicit "compatibility matrix" that is not documented (sketched after this list). Composition is not necessarily better than inheritance for use by AI.
- Due to the knowledge cutoff, even popular libraries suffer from a knowledge gap when AI needs to use their latest releases.
- Open-ended generation is easy. Following specific instructions is hard. Satisfying all constraints is extremely hard, and maybe NP-hard.
- AI fears the blank canvas too! It's a good idea to turn the generation problem into extrapolation and interpolation problems: we essentially use existing UI as "few-shot" examples to generate UIs that are variations on the same theme (sketched after this list). Photoshop's Generative Fill and Figma AI's "Add relevant content" both take advantage of this pattern.
- What if the elements in the UI can encode their domain and range? In addition to self-documented code, can we have self-documented UI?
- Towards Evolutionary Design. Isn't our DNA a form of self-documentation? Can we distribute the complexity of documentation across the UI elements and use hierarchy for abstraction/compression?
- How do we progressively disclose documentation without overwhelming the AI? In addition to hierarchy as seen in atomic design, would a knowledge graph better represent the relationships between UI elements?
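As a thought experiment for the compatibility matrix and "self-documented UI" bullets above, here is a minimal sketch in which each component declares what context it requires and what it provides. The `ComponentSpec` shape and the example components are hypothetical, not taken from any real library.

```ts
// A speculative sketch of "self-documented UI": each component declares its
// domain (what it must be nested inside) and range (what it offers to children),
// so a checker -- or an LLM -- can read the compatibility matrix instead of guessing.

interface ComponentSpec {
  name: string;
  requires: string[]; // contexts this component must be nested inside
  provides: string[]; // contexts this component makes available to children
}

const specs: ComponentSpec[] = [
  { name: "Form",      requires: [],       provides: ["form"] },
  { name: "TextField", requires: ["form"], provides: [] },
  { name: "Table",     requires: [],       provides: ["row"] },
  { name: "TableCell", requires: ["row"],  provides: [] },
];

// Can `child` be placed somewhere inside `parent`?
function compatible(parent: ComponentSpec, child: ComponentSpec): boolean {
  return child.requires.every((ctx) => parent.provides.includes(ctx));
}

// Derive the implicit compatibility matrix explicitly.
for (const parent of specs) {
  for (const child of specs) {
    if (parent !== child) {
      console.log(`${parent.name} > ${child.name}: ${compatible(parent, child)}`);
    }
  }
}
```

An AI (or a linter in its loop) could consume such a table directly instead of inferring compatibility from undocumented conventions.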
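Separately, a sketch of the extrapolation/interpolation idea: instead of generating from a blank canvas, feed the model sibling components from the same app as few-shot examples and ask for one more variation. The prompt shape and the Card markup are illustrative only; no particular model API is assumed.

```ts
// Turning open-ended generation into interpolation: existing UI as few-shot examples.

const existingCards = [
  `<Card title="Revenue" value="$1.2M" trend="+8%" />`,
  `<Card title="Active users" value="48k" trend="+3%" />`,
];

const request =
  "a card for 'Support tickets', showing the open count and week-over-week change";

const prompt = [
  "Here are existing components from this dashboard:",
  ...existingCards,
  `Generate one more card in the same style: ${request}.`,
].join("\n");

console.log(prompt);
```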
The Time Dimension
Change is the only constant.
- The source code is not a static representation of the desired UI. Abstractions such as "signals" and "hooks" represent how information changes over time. There is no handle for AI to grab onto such change (see the signal sketch after this list).
- The transformation takes place outside of AI's perception:
DOM -[user interaction or time-triggered events] -> Changed DOM
- "Shallow" mock-ups are much easier than "deep" behavioral prototypes. In other words, AI is good at drawing dead fish. But we want to generate UI, not screenshots.
- Figma's prototyping tool has already bottomed out here, manifesting as prototype spaghetti. No-code/low-code app builders and visual programming IDEs like Scratch haven't solved it either. Instead, they either fully encapsulate the behavior into premade components, or kick the can down the road by having the user draw an unwieldy state machine. There is a huge market waiting for a better solution.
- We might benefit from models that can natively perceive videos, instead of keyframes.
- We could diff the UI in both text and visual space to represent change. Or we could simply show the before and after.
- Where do we get training data for the diffs of UI states?
- What is the tradeoff between declarative and imperative representations of change? What can we learn from operation-based vs. state-based CRDTs? What is AI's cognitive load when reasoning over each? A toy comparison follows this list.
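To illustrate why change over time has no static handle, here is a hand-rolled signal in the spirit of (but not copied from) frameworks like SolidJS or Preact Signals. All names here are my own sketch.

```ts
// A minimal signal: the rule for how the UI changes lives in code, not in any snapshot.
type Listener = () => void;

function createSignal<T>(initial: T) {
  let value = initial;
  const listeners = new Set<Listener>();
  return {
    get: () => value,
    set: (next: T) => {
      value = next;
      listeners.forEach((notify) => notify());
    },
    subscribe: (listener: Listener) => listeners.add(listener),
  };
}

// A one-line "component": the rendered markup is a pure function of the signal.
const count = createSignal(0);
count.subscribe(() =>
  console.log(`<button>Clicked ${count.get()} times</button>`)
);

// Each call produces a new DOM "frame". A screenshot or serialized DOM captures
// exactly one frame; the rule that produces the next frame exists only in this code.
count.set(1);
count.set(2);
```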
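And a toy comparison of the two ways to represent a UI change mentioned above: state-based (before/after snapshots) vs. operation-based (an explicit edit list), mirroring the CRDT distinction. The types and the dropdown example are illustrative, not a real diffing library.

```ts
// Two representations of the same change, handed to a model in different shapes.
type Dom = { tag: string; props: Record<string, string>; children: Dom[] };

// State-based: ship both snapshots and let the model infer the delta.
interface StateDiff {
  before: Dom;
  after: Dom;
}

// Operation-based: ship the edits themselves; the model never sees full trees.
type Op =
  | { kind: "setProp"; path: number[]; name: string; value: string }
  | { kind: "insertChild"; path: number[]; index: number; node: Dom }
  | { kind: "removeChild"; path: number[]; index: number };

type OpDiff = Op[];

// Example: a dropdown opening after a click.
const closed: Dom = { tag: "div", props: { class: "menu" }, children: [] };
const open: Dom = {
  tag: "div",
  props: { class: "menu open" },
  children: [{ tag: "ul", props: {}, children: [] }],
};

const asState: StateDiff = { before: closed, after: open };
const asOps: OpDiff = [
  { kind: "setProp", path: [], name: "class", value: "menu open" },
  { kind: "insertChild", path: [], index: 0, node: { tag: "ul", props: {}, children: [] } },
];

console.log(JSON.stringify({ asState, asOps }, null, 2));
```

The state-based form is easier to produce (just snapshot twice) but pushes the reasoning onto the model; the operation-based form is more compact and explicit but requires something to compute or record the operations in the first place.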