Robotics: Science and Systems (RSS) 2026
I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang, Connor Schenck
CLAMP is a 3D pre-training framework for robotic manipulation that learns image and action representations from large-scale simulated robot trajectories via contrastive learning. From RGB-D images and camera extrinsics, it builds a merged point cloud and re-renders multi-view four-channel observations (depth + 3D coordinates), including dynamic wrist views, to give clearer views of target objects for high-precision tasks. The pre-trained encoders, combined with a Diffusion Policy initialized during pre-training, are fine-tuned on a small number of task demonstrations and outperform state-of-the-art baselines across six simulated and five real-world tasks.
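The point-cloud merging and re-rendering step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a pinhole camera with a known intrinsics matrix `K` (the abstract mentions only extrinsics), 4x4 camera-to-world extrinsics, and a simple nearest-point z-buffer. The function names `unproject` and `render_four_channel` are hypothetical.

```python
import numpy as np


def unproject(depth, K, cam_to_world):
    """Back-project a depth image (H, W) into world-frame points (N, 3).

    Assumes a pinhole model; pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)[valid]
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    # Row-vector form of p_world = cam_to_world @ p_cam.
    return (pts_h @ cam_to_world.T)[:, :3]


def render_four_channel(points_world, K, cam_to_world, height, width):
    """Render a merged cloud into an (H, W, 4) image: depth + world XYZ.

    Pixels that no point projects to are left at zero.
    """
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_h = np.concatenate(
        [points_world, np.ones((len(points_world), 1))], axis=1
    )
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]

    # Keep only points in front of the virtual camera.
    front = pts_cam[:, 2] > 1e-6
    pts_cam, pts_w = pts_cam[front], points_world[front]
    z = pts_cam[:, 2]
    u = np.round(K[0, 0] * pts_cam[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[:, 1] / z + K[1, 2]).astype(int)

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, pts_w = u[inside], v[inside], z[inside], pts_w[inside]

    # Z-buffer: sort by pixel index, then by depth, and keep the
    # first (nearest) point that lands on each pixel.
    flat = v * width + u
    order = np.lexsort((z, flat))
    flat, z, pts_w = flat[order], z[order], pts_w[order]
    _, first = np.unique(flat, return_index=True)

    image = np.zeros((height * width, 4))
    image[flat[first], 0] = z[first]
    image[flat[first], 1:] = pts_w[first]
    return image.reshape(height, width, 4)
```

Under these assumptions, a merged cloud is the concatenation of the per-camera results, for example `np.concatenate([unproject(d, K, T) for d, T in views])` where `views` is a hypothetical list of depth/extrinsics pairs. A dynamic wrist view would then correspond to calling `render_four_channel` with the wrist camera's current pose at each step.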