S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot
Abstract
We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a per-frame permutation symmetry. Standard history-concatenation set encoders do not explicitly enforce this symmetry; they impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose Per-Frame Deep Sets, which performs permutation-invariant pooling within each history frame before temporal readout; we prove that this encoder is invariant to per-frame permutations and universally approximates continuous policies with the same invariance. A 2×2 ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and Per-Frame Deep Sets reaches the five-sphere stage with 100% no-drop transport in simulation across all five random seeds. We further distill the teacher into a student policy via DAgger, replacing privileged sphere-state observations with a 16×16 Boolean union contact map, yielding a compact and naturally permutation-invariant tactile representation.
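The difference between the two symmetries can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the paper's implementation: the feature map `phi`, the sum pooling, the 2-D sphere states, and all numbers are hypothetical stand-ins for learned networks and real observations. The per-frame encoder pools over spheres inside each frame and keeps the frames in time order; the history-concatenation encoder first concatenates each slot's states across frames and only then pools over slots.

```python
import math

def phi(state):
    """Hypothetical per-sphere feature map (stand-in for a learned MLP)."""
    x, y = state
    return [math.tanh(x + 0.5 * y), math.tanh(x - y), x * x + y * y]

def sum_pool(vectors):
    """Permutation-invariant pooling: elementwise sum over set members."""
    return [sum(column) for column in zip(*vectors)]

def per_frame_encode(history):
    """Pool over spheres inside each frame, keeping frames in time order."""
    pooled = [sum_pool([phi(s) for s in frame]) for frame in history]
    # A temporal readout (e.g. an MLP) would consume this ordered list;
    # here it is simply flattened.
    return [v for frame_vec in pooled for v in frame_vec]

def history_concat_encode(history):
    """Concatenate each slot's states across frames, then pool over slots."""
    num_slots = len(history[0])
    tokens = []
    for k in range(num_slots):
        trajectory = [c for frame in history for c in frame[k]]
        # Hypothetical nonlinear token feature that mixes all frames.
        tokens.append([math.tanh(sum(trajectory))])
    return sum_pool(tokens)

def close(a, b, tol=1e-9):
    """Elementwise comparison with a small floating-point tolerance."""
    return len(a) == len(b) and all(abs(p - q) <= tol for p, q in zip(a, b))

# Two history frames, three spheres, 2-D states (made-up numbers).
history = [
    [(0.0, 0.0), (1.0, 0.0), (0.5, -0.5)],
    [(0.1, 0.0), (0.9, 0.1), (0.4, -0.4)],
]
# Reorder the spheres independently in each frame.
per_frame_shuffled = [
    [history[0][2], history[0][0], history[0][1]],
    [history[1][1], history[1][2], history[1][0]],
]
# Reorder the spheres with the same permutation in every frame.
shared_shuffled = [[f[2], f[0], f[1]] for f in history]

print(close(per_frame_encode(history), per_frame_encode(per_frame_shuffled)))
print(close(history_concat_encode(history), history_concat_encode(shared_shuffled)))
print(close(history_concat_encode(history), history_concat_encode(per_frame_shuffled)))
```

On this toy input the three checks print `True`, `True`, `False`: the per-frame encoder is unchanged by independent per-frame reorderings, while the history-concatenation encoder is unchanged only when the same reordering is applied to every frame.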