Human2Bot: Learning Zero-Shot Reward Functions for Robotic Manipulation from Human Demonstrations

Zhejiang University, China
*Corresponding Author

H2B: Zero-shot Human-to-robot Task Transfer in Simulation and Real-World Environments

Abstract

Developing effective reward functions is crucial for robot learning, as they guide behavior and facilitate adaptation to human-like tasks. We present Human2Bot (H2B), which learns a generalized multi-task reward function that can be used zero-shot to execute unknown tasks in unseen environments. H2B is a newly designed task similarity estimation model trained on a large dataset of human videos. The model determines whether two videos from different environments depict the same task. At test time, it serves as a reward function, evaluating how closely a robot's execution matches the human demonstration. While previous approaches require robot-specific data to learn reward functions or policies, our method learns without any robot datasets. To generalize to robotic environments, we incorporate a domain augmentation process that generates synthetic videos with varied visual appearances resembling simulation environments, alongside a multi-scale inter-frame attention mechanism that aligns human and robot task understanding. Finally, H2B is integrated with Visual Model Predictive Control (VMPC) to perform manipulation tasks in simulation and on the xARM6 robot in real-world settings. Trained solely on human data, our approach outperforms previous methods in both simulated and real-world environments, eliminating the need for privileged robot datasets.

Introduction

Human demonstration and robot learning

Left: A human demonstration is given (e.g., closing a drawer). Right: The robot learns to deduce the task through its interactions with the environment by executing actions and recording observations. H2B evaluates each sequence of observations based on its similarity to the human demonstration and provides a reward based on the robot's performance, guiding it to accomplish the task like the human.

Proposed Method

Vision Dynamics Profiler architecture

The Vision Dynamics Profiler (VDP) takes conditionally augmented video frames as input to produce frame-level features through a pre-trained encoder. The feature reduction and multi-scale inter-frame attention layers further process these features to output a representation for each video. Frame-to-frame cosine similarity is then calculated between the two videos. The Similarity Fusion Network (SFN) processes the similarity vector through a series of 1D-convolution layers to generate a similarity score.
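The scoring pipeline described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function names, the `(T, D)` per-frame feature shape, and the single averaging kernel standing in for the SFN's learned 1D-convolution layers are illustrative, not the paper's implementation.

```python
import numpy as np

def frame_cosine_similarity(feats_a, feats_b):
    """Frame-to-frame cosine similarity between two videos.

    feats_a, feats_b: (T, D) arrays of per-frame features from the
    encoder (hypothetical shapes). Returns a (T,) similarity vector,
    comparing frame t of video A with frame t of video B.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def fuse_similarity(sim_vec, kernel):
    """Collapse the similarity vector into a scalar score.

    A single 1D convolution plus mean-pooling stands in for the SFN's
    series of learned 1D-conv layers (an assumption for illustration).
    """
    conv = np.convolve(sim_vec, kernel, mode="valid")
    return conv.mean()
```

Two identical videos yield a similarity vector of ones and hence the maximum fused score, which is the sanity check one would expect of such a reward model.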

Task execution with iCEM

During final task execution, we use a single human demonstration of the task and robot trajectories sampled from the environment as input. We evaluate all trajectories with H2B to determine their similarity to the demonstration and assign rewards accordingly. iCEM then optimizes its sampling distribution based on the top-K trajectories by reward, generating trajectories in the next iteration that more closely mimic the demonstrated task.
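The top-K update described above follows the standard cross-entropy-method pattern. Below is a minimal sketch of one such update step; the function name, sample counts, and Gaussian sampling distribution are assumptions for illustration, and real iCEM additionally uses colored noise and iteration-dependent sample decay, which are omitted here.

```python
import numpy as np

def icem_step(mean, std, reward_fn, n_samples=64, top_k=8, rng=None):
    """One CEM-style update: sample trajectories, score them with the
    reward function (H2B in the paper), and refit to the elites.

    mean, std: (horizon, act_dim) parameters of the Gaussian sampling
    distribution over action trajectories (hypothetical shapes).
    reward_fn: maps one trajectory to a scalar reward.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Sample candidate action trajectories around the current distribution.
    samples = mean + std * rng.standard_normal((n_samples,) + mean.shape)
    rewards = np.array([reward_fn(traj) for traj in samples])
    # Keep the top-K trajectories by reward and refit the distribution.
    elites = samples[np.argsort(rewards)[-top_k:]]
    return elites.mean(axis=0), elites.std(axis=0)
```

Iterating this step concentrates the sampling distribution on high-reward trajectories, i.e., those that H2B judges most similar to the demonstration.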

Experimental Setup and Results

Training and inference setup

Training (left): The agent learns a reward function from numerous human videos encompassing various tasks and environments. Inference (right): Evaluation is conducted across diverse simulated environments and real-robot scenarios, both involving tabletop settings with various interactive objects.

Training data diversity results

Training Data Diversity

Success rates for three target tasks, showing the impact of increasing the number of non-target training tasks on generalization. The dotted line represents the average success rate.

Ablation study results

Task success rates

We studied the impact of removing the domain augmentation G, the Similarity Fusion Network (SFN), and the multi-scale inter-frame attention mechanism fa on robot task success rates. Error bars indicate the standard deviation.

BibTeX

Will be updated soon.