BORN TO LEARN

Behaviour is a design axis for embodied self-supervised learning. How a simulated infant moves measurably shapes the body schema it learns.

2026.05 · PAPER · UCL COMP0225

[A] THE QUESTION

A body schema is the sensorimotor, action-oriented internal model an embodied agent maintains of its own geometry, posture, and reach. In humans it is not innate. Infants build it in the first weeks of life out of self-touch, proprioception, and motor babbling.

Most embodied self-supervised learning (SSL) work treats this as a data-and-architecture problem: bigger encoders, more frames, smarter losses. We asked a different question. If behaviour shapes representation in biology, does it do the same under modern SSL objectives?

Concretely: take one simulated infant body, fix the architecture, fix the loss, fix the budget. Vary only the behaviour that generates the training data. Measure what changes.

[B] THE PIPELINE

Three stages, all on the MIMo infant body in MuJoCo, with an 88-D fused proprioceptive and tactile state and an 88-D muscle action space.

Stage 1, behaviour generation. Two recipes produce self-touch on the same body. The first is a per-region mixture of experts (MoE): five SAC specialists, one per body group (head, upper torso, lower torso, legs, arms), composed at inference by switching between specialists. The second is a skill-conditioned QDAC actor: a single SAC actor trained under a quality-diversity objective and conditioned on a continuous five-dimensional rostro-caudal skill prompt. The two recipes sit at opposite ends of one design axis: many small specialists composed by switching, against one larger model with continuous skill control.
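The switching composition of the per-region specialists can be sketched as follows. This is a minimal illustration, not the project's code: the class name SwitchingPolicy, the fixed dwell time, and the uniform random choice of the next specialist are all assumptions, and each specialist is stood in for by a plain callable from observation to action.

```python
import random


class SwitchingPolicy:
    """Compose per-region specialists by switching between them.

    Hypothetical rule: hold one specialist for `dwell_steps` steps,
    then draw the next one uniformly at random.
    """

    def __init__(self, specialists, dwell_steps, seed=0):
        self.specialists = specialists      # dict: region name -> callable(obs) -> action
        self.dwell_steps = dwell_steps
        self.rng = random.Random(seed)
        self.active = None
        self.steps_left = 0

    def act(self, obs):
        # Pick a new specialist whenever the current dwell period has run out.
        if self.steps_left == 0:
            self.active = self.rng.choice(sorted(self.specialists))
            self.steps_left = self.dwell_steps
        self.steps_left -= 1
        return self.active, self.specialists[self.active](obs)
```

Other composition rules (for example a fixed rostro-caudal sweep instead of random draws) would change only the line that chooses the next specialist.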

Stage 2, dataset generation. Roll each policy out under several composition rules. Summarise every dataset with a fixed set of behavioural descriptors: self-touch entropy, self-touch fraction, bimanual fraction, pose vigour, movement coherence, plus descriptor-space coverage for the QDAC sources. 32 dataset sources across structured, random-babbling, and scripted families.
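To make four of these descriptors concrete, here is a hedged sketch with NumPy. The exact definitions used in the project are not given here, so the choices below are assumptions: contact is encoded as one region id per timestep (-1 for no contact), entropy is taken over touched regions in nats, pose vigour uses the mean absolute joint-angle change, and movement coherence uses the top k = 3 principal components.

```python
import numpy as np


def self_touch_fraction(touch_region):
    """Fraction of timesteps with any self-contact (-1 encodes no contact)."""
    touch_region = np.asarray(touch_region)
    return float(np.mean(touch_region >= 0))


def self_touch_entropy(touch_region, n_regions):
    """Shannon entropy (nats) of the distribution over touched regions."""
    touch_region = np.asarray(touch_region)
    touched = touch_region[touch_region >= 0]
    if touched.size == 0:
        return 0.0
    counts = np.bincount(touched, minlength=n_regions).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]                       # drop empty regions to avoid log(0)
    return float(-(p * np.log(p)).sum())


def pose_vigour(joint_angles):
    """Mean absolute per-step joint-angle change; input has shape (T, J)."""
    q = np.asarray(joint_angles, dtype=float)
    return float(np.mean(np.abs(np.diff(q, axis=0))))


def movement_coherence(joint_angles, k=3):
    """Fraction of variance captured by the top-k principal components."""
    q = np.asarray(joint_angles, dtype=float)
    centred = q - q.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variance = singular_values ** 2    # proportional to per-component variance
    total = variance.sum()
    if total == 0:
        return 0.0
    return float(variance[:k].sum() / total)
```

A trajectory in which all joints move along one shared direction scores a coherence near 1 even with k = 1, while independent jitter on every joint spreads variance across components and lowers the score.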

Stage 3, body schema training and evaluation. One body schema model per dataset, trained from scratch under a single fixed recipe (a dual VAE over proprio and touch into an 88-D latent, plus a residual forward model). Architecture, optimiser, budget, seeds held constant. Only the training data changes.

[C] THE EVALUATION SUITE

VAE reconstruction loss is not a body-schema metric. A low-loss latent can be entangled and ungrounded. So we built six probes, each tied to a different property a real body schema should have:

TRC (touch-region correctness): does the latent decode contact onto the right body region?
TPD (two-point discrimination): does it preserve the metric structure of the body surface?
IMR (inverse-model recoverability): can the action that produced a transition be recovered from the latent change?
BFHP (body-frame hand-pose prediction): does the latent localise the effectors in body-centred coordinates?
MIG (mutual-information gap): is the latent organised rather than merely compressed?
RSA (representational similarity analysis): does the latent track the kinematic chain?

Each probe runs on two tiers: in-distribution and a shared leak-free held-out set. The construction is intended to be reused beyond this project.
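As an illustration of the two-tier idea, a region-classification probe in the spirit of TRC could look like the sketch below. It assumes a ridge-regularised linear readout on frozen latents, fitted once and then scored on both tiers; the probe family, the function names, and the ridge value are assumptions, not the project's implementation.

```python
import numpy as np


def _with_bias(z):
    """Append a constant column so the readout has an intercept."""
    z = np.asarray(z, dtype=float)
    return np.hstack([z, np.ones((len(z), 1))])


def fit_linear_probe(z, labels, n_classes, ridge=1e-3):
    """Fit a ridge-regularised linear map from latents to one-hot labels."""
    x = _with_bias(z)
    y = np.eye(n_classes)[np.asarray(labels)]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def probe_accuracy(weights, z, labels):
    """Classification accuracy of the fitted readout on (z, labels)."""
    predictions = np.argmax(_with_bias(z) @ weights, axis=1)
    return float(np.mean(predictions == np.asarray(labels)))


def two_tier_score(z_train, y_train, z_in, y_in, z_held, y_held, n_classes):
    """Fit once on training latents, then report both evaluation tiers."""
    weights = fit_linear_probe(z_train, y_train, n_classes)
    return {
        "in_distribution": probe_accuracy(weights, z_in, y_in),
        "held_out": probe_accuracy(weights, z_held, y_held),
    }
```

The schema model stays frozen throughout, so a difference between the two numbers reflects how well the latent generalises rather than any further training.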

[D] WHAT WE FOUND

Behaviour shapes representation, measurably. Structured exploration beats random and babbling baselines on every load-bearing probe. The schema-level composite ranking puts every MoE schema above every baseline. At the 150-episode budget, the gap between the best and worst behaviour is several standard deviations on every probe in the suite.
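How the composite ranking is built is not spelled out here. One common construction, assumed purely for illustration, is to rank the schemas on each probe and order them by mean rank. The function name and the higher-is-better convention are hypothetical, and ties are broken by input order.

```python
def composite_ranking(scores):
    """Order schemas by mean per-probe rank (rank 1 = best on that probe).

    scores: {schema_name: {probe_name: value}}, with higher values better.
    """
    names = list(scores)
    probes = list(next(iter(scores.values())))
    mean_rank = {name: 0.0 for name in names}
    for probe in probes:
        ordered = sorted(names, key=lambda n: scores[n][probe], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            mean_rank[name] += rank / len(probes)
    return sorted(names, key=lambda n: mean_rank[n])
```

Rank aggregation has the convenient property that probes on different scales contribute equally, at the cost of discarding how large each gap is.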

Structured movement wins; vigorous motion does not. Movement coherence (the fraction of proprioceptive variance captured by the top principal components) is the strongest positive predictor across the suite (Spearman ρ = +0.75 with RSA-joint, +0.70 with MIG-joint, +0.63 with TPD). Pose vigour, the mean per-step joint-angle change, is the strongest negative one. The lever is whether the body moves in a coordinated, low-dimensional way, not how hard it moves.
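For readers who want to reproduce this kind of analysis on their own descriptor and probe tables, Spearman's ρ is the Pearson correlation of the ranks. The sketch below is a plain-Python version with average ranks for ties; the inputs would be one descriptor value and one probe score per dataset source. It is not defined when either input is constant.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter, any monotone relationship between a descriptor and a probe score yields ρ = 1, linear or not.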

Self-touch alone is not enough. A controlled re-evaluation of a single-run external self-touch baseline (Yang & Jamone) sits between babbling and the structured families. Reliable contact is necessary, but the policy also needs broad, varied proprioceptive coverage.

MoE beats QDAC on representation quality, even though QDAC won Stage 1. The skill-conditioned actor produces higher per-region targeting accuracy but reaches body regions at higher motor cost and without postural structure. The MoE schemas top the composite ranking; the QDAC schemas occupy the middle.

Reward-peak does not coincide with controllability-peak. On QDAC, SAC return keeps climbing while every discriminative probe traces an inverted U: representation quality climbs through 1M steps, plateaus near 1.5M, and collapses back to baseline by 3M. Reward-greedy checkpoint selection picks a model 1–2M steps past the representational peak. A controllability-grid selector picks inside the peak window.
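The difference between the two selection rules can be shown with a toy example. The numbers below are made up to mimic the shape described above and are not measurements. The scalar "probe" score is a stand-in for whatever the controllability-grid selector computes, which is not specified here.

```python
def select_checkpoint(checkpoints, key):
    """Return the training step of the checkpoint that maximises `key`."""
    return max(checkpoints, key=lambda c: c[key])["step"]


# Illustrative values only: return rises monotonically, the probe score
# rises, plateaus, and then falls back.
checkpoints = [
    {"step": 500_000, "return": 10.0, "probe": 0.40},
    {"step": 1_000_000, "return": 20.0, "probe": 0.70},
    {"step": 1_500_000, "return": 28.0, "probe": 0.72},
    {"step": 2_000_000, "return": 33.0, "probe": 0.55},
    {"step": 3_000_000, "return": 40.0, "probe": 0.41},
]

reward_greedy_step = select_checkpoint(checkpoints, "return")  # the 3M checkpoint
probe_based_step = select_checkpoint(checkpoints, "probe")     # the 1.5M checkpoint
```

On these toy curves the reward-greedy rule lands 1.5M steps past the probe-selected checkpoint, which is the kind of mismatch the paragraph above describes.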

[E] WHY IT MATTERS

The dominant story in embodied SSL is that progress comes from data scale and architecture. If our reading is right, behavioural composition is a further axis along which embodied representations can be improved. Choosing a structured exploration policy delivered a body schema several standard deviations better than a random or babbling one at the same data budget, with no change to architecture, loss, or compute. Designing how the agent moves is a cheaper lever than scaling the data factory.

First author with Mikhail Rudchenko, Yukun Wang, Payton Liao, Kevin Xie. Supervised by Dr. Lorenzo Jamone, UCL.

UCL · LONDON · 2026.05