Developing effective reward functions is crucial for robot learning, as they guide behavior and facilitate adaptation to human-like tasks. We present Human2Bot (H2B), which learns a generalized multi-task reward function that can be used zero-shot to execute unknown tasks in unseen environments. H2B is a newly designed task similarity estimation model trained on a large dataset of human videos. The model determines whether two videos from different environments represent the same task. At test time, the model serves as a reward function, evaluating how closely a robot’s execution matches the human demonstration. While previous approaches require robot-specific data to learn reward functions or policies, our method learns without any robot data. To achieve generalization in robotic environments, we incorporate a domain augmentation process that generates synthetic videos with varied visual appearances resembling simulation environments, alongside a multi-scale inter-frame attention mechanism that aligns human and robot task understanding. Finally, H2B is integrated with Visual Model Predictive Control (VMPC) to perform manipulation tasks in simulation and on the xARM6 robot in real-world settings. Trained solely on human data, our approach outperforms previous methods in simulated and real-world environments, eliminating the need for privileged robot datasets.
Left: A human demonstration is given (e.g., closing a drawer). Right: The robot learns to deduce the task through its interactions with the environment by executing actions and recording observations. H2B evaluates each sequence of observations based on its similarity to the human demonstration and provides a reward reflecting the robot’s performance, guiding it to accomplish the task like the human.
The Vision Dynamics Profiler (VDP) takes conditionally augmented video frames as input to produce frame-level features through a pre-trained encoder. The feature reduction and multi-scale inter-frame attention layers further process these features to output a representation for each video. Frame-to-frame cosine similarity is then calculated between the two videos. The Similarity Fusion Network (SFN) processes the similarity vector through a series of 1D-convolution layers to generate a similarity score.
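The similarity computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the pre-trained encoder, feature reduction, and multi-scale inter-frame attention layers are abstracted away, and we assume each video has already been mapped to per-frame features of shape (T, D). The learned SFN is stood in for by a single fixed 1D smoothing kernel rather than a stack of learned 1D-convolution layers.

```python
import numpy as np

def frame_similarity(feats_a, feats_b):
    """Frame-to-frame cosine similarity between two videos.

    feats_a, feats_b: (T, D) per-frame representations, assumed to be
    the output of the attention layers (abstracted away in this sketch).
    Returns a length-T vector of per-frame cosine similarities.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def similarity_score(sim_vec, kernel_size=3):
    """Hypothetical stand-in for the Similarity Fusion Network:
    one fixed averaging 1D convolution followed by mean pooling
    (the real SFN applies a series of learned 1D-conv layers)."""
    kernel = np.ones(kernel_size) / kernel_size
    smoothed = np.convolve(sim_vec, kernel, mode="valid")
    return float(smoothed.mean())
```

For two identical feature sequences the similarity vector is all ones and the fused score is 1.0; dissimilar executions yield lower scores, which is what makes the output usable as a reward.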
During final task execution, we use a single human demonstration of the task and sample robot trajectories from the environment as input. We evaluate all these trajectories with H2B to determine their similarity to the demonstration and assign rewards accordingly. iCEM then updates its sampling parameters based on the top-K trajectories by reward, so that trajectories generated in the next iteration more closely mimic the demonstrated task.
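The sample-score-refit loop can be illustrated with a generic cross-entropy-method planner. This is a hedged sketch, not the paper's iCEM (which additionally uses colored-noise sampling and elite reuse): `reward_fn` stands in for H2B scoring of rolled-out observation sequences, and the toy objective below is purely hypothetical.

```python
import numpy as np

def cem_plan(reward_fn, horizon=5, act_dim=2, pop=64, elites=8,
             iters=10, seed=0):
    """Generic CEM planning loop: sample action sequences, score them,
    and refit the sampling distribution to the top-K (elite) samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))   # sampling distribution mean
    std = np.ones((horizon, act_dim))     # sampling distribution std
    for _ in range(iters):
        # Sample a population of candidate action sequences.
        samples = rng.normal(mean, std, size=(pop, horizon, act_dim))
        # Score each candidate (in H2B, this would be the similarity
        # between the rollout's observations and the demonstration).
        rewards = np.array([reward_fn(s) for s in samples])
        # Refit mean/std to the top-K trajectories by reward.
        top = samples[np.argsort(rewards)[-elites:]]
        mean, std = top.mean(axis=0), top.std(axis=0) + 1e-6
    return mean

# Hypothetical toy objective: reach a target displacement.
target = np.array([1.0, -0.5])
reward = lambda seq: -float(np.linalg.norm(seq.sum(axis=0) - target))
plan = cem_plan(reward)
```

Each iteration concentrates the sampling distribution around high-reward trajectories, which is the mechanism that steers the robot toward executions H2B scores as similar to the demonstration.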
Training (left): The agent learns a reward function from numerous human videos encompassing various tasks and environments. Inference (right): Evaluation is conducted across diverse simulated environments and real-robot scenarios, both involving tabletop settings with various interactive objects.
Success rate for three target tasks, showing the impact of increasing non-target training tasks on generalization. The dotted line represents the average success rate.
We studied the impact on robot task success rates of removing the domain augmentation module G, the Similarity Fusion Network (SFN), and the multi-scale inter-frame attention mechanism f_a. The error bars indicate the standard deviation.