Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba, NeurIPS 2017

Remember the sampling approaches used for approximate inference in Bayesian Networks, how the rejection sampling is super expensive since it wastes lot of samples and we try to capitalize on those samples by providing weights in importance sampling. This paper proposes something similar.

In standard RL setting, with sparse reward there can be a long time before the Q-values propagate from the goal state to individual states and even when they do because of sparsity they might not be adequate to differentiate between different states. Popular solution for this problem is to use reward shaping functions but even they have some unforeseen consequences.

This paper highlights that even though the current trajectory $T_i = $ did not reach the achieved goal state $g_i$, it reached the state $s_{t_i}$ and hence if the goal state was $s_{t_i}$ this would have been a useful trajectory. With the advent of goal-conditioned policy learning, policies $\pi$ are no longer learnt for a single goal, rather goal state $g$ is taken as input along with state and action. i.e instead of $\pi:S \times A \rightarrow [0,1]$, goal-conditioned policies are $\pi:S \times A \times S_G \rightarrow [0,1]$. So, we can engineer different goal for trajectories which do not reach their pre-determined goal and add it to the replay buffer with engineered goal state as well as the pre-determined goal. This increases the buffer size, providing more transition samples to learn.

Critique

Although the idea of the using the existing sampled trajectory by engineering the goal seems useful, it is not clear if there exists a principled approach to do this and how is it better than engineering reward-shaping functions?