Planning
In the previous chapter of the book, Monte Carlo and Temporal Difference methods were introduced, in which real experience is used to learn a model that predicts the return. Because the value function or policy is trained directly on state transitions and rewards observed in the real world, this is called Direct Reinforcement Learning, or Model-free Learning. An alternative is to add an environment model that emulates the environment's state transitions and rewards. The emulated transitions can then also be used to train the prediction model for return. This is called Planning, or Model-based Learning.
Train on Randomly Sampled Past Observations
Since we have a model for state transitions, does that mean the prediction model for return is trained solely on emulated state transitions? The answer is no. Training usually alternates: each state transition observed in the real world is used for one update, followed by n updates on emulated state transitions. The starting points of these n emulated transitions are picked at random from previously observed states.
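The alternation between one real update and n emulated updates can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes a tabular Q-learning update, a deterministic environment whose model is simply a dictionary of remembered outcomes, and hypothetical names (dyna_q_update, Q, model, alpha, gamma) that do not come from the text above.

```python
import random
from collections import defaultdict


def dyna_q_update(Q, model, s, a, r, s_next, actions,
                  n=5, alpha=0.1, gamma=0.95, rng=random):
    """One update from a real transition, then n updates from emulated ones.

    Assumptions (not from the original text): tabular Q-values stored in a
    dict keyed by (state, action), and a deterministic environment so the
    model can store a single (reward, next_state) per (state, action).
    """
    # Direct RL: update from the real transition.
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    # Model learning: remember what the real environment did.
    model[(s, a)] = (r, s_next)

    # Planning: replay n randomly chosen, previously observed pairs,
    # using the model instead of the real environment.
    seen = list(model.keys())
    for _ in range(n):
        ps, pa = rng.choice(seen)
        pr, ps_next = model[(ps, pa)]
        best = max(Q[(ps_next, b)] for b in actions)
        Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])


# Usage: a single real transition (state 0, action 0, reward 1, next state 1).
Q = defaultdict(float)
model = {}
dyna_q_update(Q, model, s=0, a=0, r=1.0, s_next=1, actions=[0, 1], n=3)
```

With only one pair observed so far, all n emulated updates replay that same pair; as more real experience accumulates, the random sampling spreads over everything the model has seen.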
Prioritized Sweeping
Note that most observed state transitions have zero reward and produce only a very small change in the estimated return. Training on these transitions is inefficient, so picking state transitions uniformly at random is not practical in a large state space. An improvement is to keep the real-experience (state, action) pairs in a priority queue ordered by error, so that the pairs with the largest errors are at the front and are trained on first.
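A priority queue of this kind might look like the sketch below. It makes several assumptions that are not stated in the text above: the error is taken to be the absolute one-step TD error under a deterministic model, pairs whose error is below a small threshold theta are not queued, and after each update the predecessors of the updated state are re-evaluated and queued (a common formulation of prioritized sweeping). All names (prioritized_sweeping, predecessors, theta) are hypothetical.

```python
import heapq
from collections import defaultdict


def prioritized_sweeping(Q, model, predecessors, start, actions,
                         n=5, alpha=0.1, gamma=0.95, theta=1e-4):
    """Run up to n model-based updates, largest error first.

    model:        dict (state, action) -> (reward, next_state), deterministic
    predecessors: dict state -> iterable of (state, action) pairs leading to it
    start:        the (state, action) pair just experienced in the real world
    States and actions are assumed to be comparable (e.g. ints) so that heap
    entries with equal priority can still be ordered.
    """
    def td_error(s, a):
        r, s_next = model[(s, a)]
        best = max(Q[(s_next, b)] for b in actions)
        return r + gamma * best - Q[(s, a)]

    queue = []  # min-heap, so priorities are stored negated
    p = abs(td_error(*start))
    if p > theta:
        heapq.heappush(queue, (-p, start))

    updates = 0
    while queue and updates < n:
        _, (s, a) = heapq.heappop(queue)
        # Recompute the error at pop time in case Q changed since the push.
        Q[(s, a)] += alpha * td_error(s, a)
        updates += 1
        # The value of s changed, so pairs leading into s may now be wrong.
        for ps, pa in predecessors.get(s, ()):
            p = abs(td_error(ps, pa))
            if p > theta:
                heapq.heappush(queue, (-p, (ps, pa)))
    return updates


# Usage: a chain 0 -> 1 -> 2 where only the last step is rewarded.
Q = defaultdict(float)
model = {(0, 0): (0.0, 1), (1, 0): (1.0, 2)}
predecessors = {1: [(0, 0)], 2: [(1, 0)]}
done = prioritized_sweeping(Q, model, predecessors, start=(1, 0),
                            actions=[0], alpha=1.0, gamma=0.9)
```

In this small chain the rewarded pair is updated first and its predecessor second, so the reward information travels backward in two updates instead of waiting for random sampling to hit the right pairs.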