Reinforcement learning (RL) is widely regarded as the most successful framework for training an intelligent agent. However, there are many difficulties that still need rethinking.

Exploration vs Exploitation

The trade-off between exploration and exploitation is the fundamental problem of reinforcement learning, and within this trade-off knowledge plays the most important role. The better we can infer the true dynamics from finite observations, the less exploration we need. In one extreme case the agent fully observes all explicit and implicit variables, and the reinforcement learning problem reduces to a supervised learning problem. In the other extreme the agent has no observations at all, and the problem reduces to a pure Monte Carlo method.
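
To make the trade-off concrete, here is a minimal sketch of epsilon-greedy action selection on a toy multi-armed bandit (the arm means, the value of epsilon, and the step count are illustrative assumptions, not from the discussion above): the more accurate the agent's estimates become, the less it gains from further exploration.

```python
import numpy as np

# Toy multi-armed bandit with epsilon-greedy exploration.
# Arm means, epsilon, and step count are illustrative assumptions.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden reward of each arm
estimates = np.zeros(3)                  # the agent's current knowledge
counts = np.zeros(3)
epsilon = 0.1                            # fraction of steps spent exploring

for step in range(1000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))        # explore: try a random arm
    else:
        action = int(np.argmax(estimates))   # exploit: use current knowledge
    reward = rng.normal(true_means[action], 0.1)
    counts[action] += 1
    # Incremental mean update: better estimates reduce the value of exploring.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # should approach true_means
```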

So where does knowledge come from?

In the standard reinforcement learning setting, reward is the criterion for judging an agent's policy, and the value, i.e. the cumulative reward, is the long-term objective to maximize. Hence, manually shaped rewards encode the agent's prior knowledge. In most cases, RL experiments are run on games such as those in OpenAI Gym, where the reward is exactly the score the agent receives, so reward shaping seems natural and simple. In the real world, however, it is genuinely hard to determine the reward because so many factors are involved. Take opening a door as an example: the task is achieved only when the door is open. But is there really no reward when we reach out our hand, grasp the handle and pull the door? Since the whole process is continuous, the reward must be defined over a continuous state space. In fact, at each stage the agent should tend toward a particular direction, and a single task reward cannot guide it appropriately. In other words, reward is a general model, but in many cases it is not expressive enough to represent knowledge.
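
To illustrate the door example, here is a hedged sketch contrasting a sparse task reward with a hand-shaped reward; the state fields and weights are hypothetical, chosen only to show how quickly shaping turns into guesswork in a continuous task.

```python
# Sparse reward: fires only when the task is complete.
# Shaped reward: tries to encode prior knowledge about each stage
# (reach, grasp, pull). All fields and weights are hypothetical.

def sparse_reward(state):
    # Reward only at the very end of the task.
    return 1.0 if state["door_angle"] > 0.5 else 0.0

def shaped_reward(state):
    # Hand-crafted stage bonuses: reaching, grasping, pulling.
    r = 0.0
    r -= 0.1 * state["hand_to_handle_dist"]   # encourage reaching the handle
    r += 0.2 if state["grasping"] else 0.0    # encourage grasping
    r += 0.5 * state["door_angle"]            # encourage pulling the door open
    return r
```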

More general cases

In more general cases, reward is too difficult to design. In imitation learning, where the expert's behaviour is the knowledge, the reward is often hard to recover, and inverse reinforcement learning requires solving an MDP repeatedly in its inner loop. Ignoring the reward and learning a policy directly from the expert's policy may be a more direct route. In hierarchical reinforcement learning, the task is composed of several subtasks, and it is hard to design a composition of rewards that guides the agent through each step. In sparse-reward settings, rewards are so rarely received that the agent usually falls into a local minimum.
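
As one concrete instance of "ignoring the reward", here is a minimal behavioural-cloning sketch, an assumption on my part rather than a method proposed above, that fits a policy directly to expert state-action pairs; the network size and the random placeholder data are illustrative.

```python
import torch
import torch.nn as nn

# Behavioural cloning: fit a policy to expert state-action pairs,
# with no reward involved. State dimension, action count, and the
# random "expert" data are placeholders.
state_dim, n_actions = 4, 2
expert_states = torch.randn(256, state_dim)           # placeholder expert states
expert_actions = torch.randint(0, n_actions, (256,))  # placeholder expert actions

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    logits = policy(expert_states)
    loss = loss_fn(logits, expert_actions)  # match the expert, no reward signal
    optim.zero_grad()
    loss.backward()
    optim.step()
```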

What’s more, when the reward depends on time, things get worse. Time is always an important factor for intelligence, yet reinforcement learning models it only through the discount factor. With no explicit time variable, the agent can easily end up wandering inefficiently for a long time.
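
For reference, the only place time enters the standard formulation is the discount factor gamma in the return; the short sketch below shows that a delayed reward is merely scaled down by powers of gamma rather than tied to any explicit time variable (the reward sequence and gamma are illustrative).

```python
# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# The reward sequence and gamma are illustrative.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 0.0, 1.0]))  # a late reward is only scaled down,
                                                # not linked to an explicit notion of time
```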

Conclusion

Reinforcement learning tries to strike a trade-off between exploration and exploitation, in which knowledge plays the main role. Reward is a successful model, especially for games, but it does not perform well enough in many real-world cases. Other models of knowledge need rethinking.