DQN: Applying RL in DNNs

When we think about models of learning, reinforcement learning is an important theme. Maze experiments with rodents have already shown us that such a system exists in animals, and recent work such as the DQN has pushed the idea further in AI as well. To build a capable in silico model of behavior, we need to equip it with something like reinforcement learning.

In this context, “Human-level control through deep reinforcement learning” by Mnih et al. (http://www.nature.com/doifinder/10.1038/nature14236) describes a better method. It is about how agents optimize their responses to the environment. When an agent faces a real-world task, it must extract useful information from a high-dimensional sensory space. Animals seem to solve this problem by combining reinforcement learning with a hierarchical system for processing sensory information. Before the DQN, reinforcement learning models had only limited success, and that success was confined to low-dimensional state spaces. Here, the authors build a deep neural network, the Deep Q-Network (DQN), to optimize policies using information from a high-dimensional sensory space, and they demonstrate its effectiveness through the scores the DQN obtains playing Atari games, where it performs almost as well as a human player.

Why are we talking about the DQN in the first place? Because it was a model that combined reinforcement learning with deep neural networks, and ANNs are already known to be very good at handling high-dimensional data.

If you are asking why deep learning, the familiar plot of performance against the amount of data is enough to answer that. For conventional learning algorithms, performance stops improving beyond a certain amount of data; with deep neural networks, it keeps increasing. Better learning implies more accurate prediction.

CNNs, which are widely used in computational neuroscience, are inspired by the feed-forward model of information processing in the visual cortex proposed by Hubel and Wiesel. They can exploit local spatial correlations and are robust to translations of the observed field.

For our agent, the information is a sequence of images. The agent analyzes it and takes the actions that maximize future reward.
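To make this concrete, here is a minimal sketch of such a convolutional Q-network. I am using PyTorch purely for illustration (the original work used a different Torch implementation), and while the layer sizes follow the architecture reported in the paper, treat the exact numbers as assumptions rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of preprocessed frames to one Q-value per action."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 assumes 84x84 input frames
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))  # scale pixel values to [0, 1]
```

The output layer has one unit per possible game action, so a single forward pass gives the estimated value of every action for the current state.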

More formally, we use the deep convolutional neural net to approximate the optimal action-value function Q*(s, a). This is the maximum expected sum of future rewards, each discounted by a factor gamma per time step (which can be thought of as a penalty for delay in collecting the reward), achievable by following some behavioral policy π(a|s), where a is the action the agent takes given the observation s. Before going further, we need to understand the problem with using a neural network to approximate such a function. There are correlations present in the sequence of observations, so small updates to Q can change the selected action, and hence the data the agent sees, significantly. Because the network is a nonlinear function approximator, the learning process can become unstable or even diverge, which ultimately produces a large gap between the action values it outputs and the target values.
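As a small illustration of the quantity being approximated, the sketch below just computes a discounted return for a fixed reward sequence; Q*(s, a) is the maximum expected value of this sum over policies, starting from state s and action a. The function name and the gamma value here are my own, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards r_t, each weighted by gamma**t (illustrative helper)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: later rewards count for less than immediate ones.
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.99**2 = 2.9701
```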

The way the authors solved this issue is actually interesting. They used a variant of Q-learning with two important ingredients. The first is an experience replay mechanism, inspired by biological systems. The agent stores each discovered transition [state, action, reward, next_state] in a memory, to be sampled from later, so that learning is logically separated from gaining experience. During training, random minibatches of these stored transitions are fed into the action-value update; the randomization removes the correlations between consecutive observations, and the procedure fits naturally with stochastic gradient descent. Entirely separating the two processes would not give us a better model, though, since what we ultimately want is a model that learns from its actions and uses that knowledge to optimize future actions, so that Q*(s, a) keeps improving. We can interleave them however we like: at the nth step, the agent can learn from a randomly selected few of the n−1 transitions preceding it. It turns out this gives more efficient learning from previous experience. Why? Because in real-world settings gaining new experience can be costly, so running multiple times through the same set of experiences is more efficient.
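A minimal replay memory along these lines might look like the following sketch (the class and method names are mine, and the capacity and batch size are illustrative rather than the paper's settings).

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state, done) transitions for later sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive observations.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```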

Secondly, they used an update mechanism that adjusts the action values Q towards target values which are themselves only updated periodically; this reduces the correlation with the targets. Together, these two points make the model better than its predecessors, which needed much larger datasets and had to iterate the learning process many times, making them computationally expensive.
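In code, this second ingredient is usually realized as a separate "target" copy of the Q-network whose weights are overwritten only every C steps. A rough sketch, reusing the QNetwork class from the earlier snippet (the value of C and the helper function are my own choices):

```python
import copy

C = 10_000  # number of updates between target refreshes (illustrative value)

policy_net = QNetwork(n_actions=4)        # network being trained
target_net = copy.deepcopy(policy_net)    # frozen copy used to compute targets
target_net.eval()

def maybe_sync_target(step: int) -> None:
    """Copy the current parameters into the target network every C steps."""
    if step % C == 0:
        target_net.load_state_dict(policy_net.state_dict())
```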

They parametrize the approximate value function as Q(s, a; 𝜃i), where 𝜃i are the parameters, or simply the weights, of the Q-network at iteration i. As explained above, the experience at time step t, et = (st, at, rt, st+1), is stored in a dataset Dt = {e1, …, et}. During learning, Q-learning updates are applied to minibatches drawn uniformly at random from Dt, using at iteration i the loss function:

Li(𝜃i) = E(s,a,r,s′)∼U(D) [ ( r + γ maxa′ Q(s′, a′; 𝜃i−) − Q(s, a; 𝜃i) )² ]

Here r + γ maxa′ Q(s′, a′; 𝜃i−), with the next-state Q value multiplied by gamma, is the target at iteration i, computed with the target parameters 𝜃i−. These target parameters are refreshed with the current Q-network parameters only every C steps and are held fixed between individual updates.
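Putting the pieces together, one training step on a sampled minibatch could look like the sketch below, reusing the policy_net, target_net and ReplayMemory from the earlier snippets. The optimizer choice, hyperparameters and tensor handling are my assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.RMSprop(policy_net.parameters(), lr=2.5e-4)

def train_step(memory, gamma=0.99, batch_size=32):
    """One minibatch update of the Q-network against the frozen target network."""
    states, actions, rewards, next_states, dones = memory.sample(batch_size)

    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    next_states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states])
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a; theta_i) for the actions that were actually taken.
    q_values = policy_net(states).gather(1, actions).squeeze(1)

    # Target r + gamma * max_a' Q(s', a'; theta_i^-); no bootstrapping past terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Squared error as in the loss above (in practice the error term is often clipped).
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```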

But more importantly, we need to validate that this works. For that, the authors used the Atari 2600 platform, which offers a diverse set of tasks, and played the games with the same architecture we have already described. The data is high-dimensional in the sense that it is 210×160 colour video at 60 Hz. The model turned out to train a large neural net efficiently, as we can see from the scores and the Q values obtained at each training epoch.
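For completeness, the raw frames are preprocessed before being fed to the network: each 210×160 colour frame is reduced to a small grayscale image and recent frames are stacked so the network can see motion. The sketch below follows that general recipe, but the helper names, the use of OpenCV and the exact sizes are my own assumptions.

```python
import numpy as np
import cv2

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Convert a 210x160 RGB Atari frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.uint8)

def stack_frames(frames) -> np.ndarray:
    """Stack the last few preprocessed frames into the network's input tensor."""
    return np.stack([preprocess(f) for f in frames], axis=0)  # shape: (n_frames, 84, 84)
```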

Figure: scores and average Q values per training epoch. Adapted from Mnih et al.

So, from these results, it is fair to conclude that the DQN succeeds in implementing reinforcement learning within a deep neural network, taking us a step further towards a better in silico model of behavior.
