Why We Don’t Ask “Where” in the First Place

Most of us would notice that we tend to make decisions based on previous experiences rather than on a fresh analysis of the situation in front of us. One may think this is not ideal, that we are trapped in the past. That is, we do not think about a problem in a complicated way in the first place unless we collide with a failure. If Hitler had considered that the Russian winter would be devastating for his troops and planned the attack accordingly, the world would have turned out very differently; he did not consider all that complexity. Similarly, if we usually thought about all the possible states we could end up in by performing a certain action from the current state, we could have done many things better. But why isn’t that the case? Why do we think in a more complex way only after we have failed? Why do we think so passively when we are involved in a task that repeats over time? This seems like a basic question, but it is powerful enough to determine the nature of events at a future point in time.

But what if this is actually due to the way our brain handles the situation? Here, the arbitration mechanism between the model-free and model-based learning systems in the brain gives a concise view.

The model-free system relies on previous experiences and is driven by the reward prediction error (RPE). Signals of the model-free system are found in the posterior lateral striatum[1-7]; here, cached values are obtained from past trial-and-error experience. Model-based (deliberative) system signals, on the other hand, are found in regions of the prefrontal cortex and anterior striatum[1-7]. In contrast to the model-free system, the model-based agent uses an internal representation of the environment to compute decisions at the time of choice[8-9].

So the role of an arbitrator is evident: it has to estimate the reliability of the two systems and assign control in a given situation. The process thus involves three steps: learning, then reliability estimation, and finally a competition between the two systems, based on those reliability estimates, for control over behavior.

An agent needs feedback for learning, and that feedback generally comes as prediction errors. The RPE and the state prediction error (SPE) are the two main feedback signals used here.

In this process, the agent takes an action At at time point t, and the surroundings it interacts with are called the environment. The environment provides a reward and a new state based on the agent’s action. Setting aside the full mathematical formulation, we can turn to the model presented in the 2014 paper by Sang Wan Lee and colleagues.
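To make the setup concrete, here is a minimal sketch in Python of this agent-environment loop, using a made-up toy environment rather than the task from the paper: at each step the agent emits an action, and the environment returns a reward and a new state.

```python
import random

# A minimal sketch (not the task from the paper) of the agent-environment loop:
# at each time step t the agent emits an action A_t, and the environment
# returns a reward and a new state.

def environment_step(state, action):
    """Toy environment: the transition and reward rules are made up for illustration."""
    next_state = (state + action) % 3          # hypothetical transition rule
    reward = 1.0 if next_state == 2 else 0.0   # hypothetical reward rule
    return next_state, reward

state = 0
for t in range(5):
    action = random.choice([0, 1])             # a placeholder policy
    next_state, reward = environment_step(state, action)
    print(f"t={t}: s={state}, a={action}, r={reward}, s'={next_state}")
    state = next_state
```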

It is intuitive that the model-free system is mediated by the RPE, whereas the model-based system relies on the SPE. The RPE calculation is based on the immediate reward and requires much less cognitive power, whereas the SPE calculation needs estimates of the state-action-state transition probabilities. If the SPE is high, the model-based system’s internal model is not a good representation of the external world; if the RPE is high, the model-free system is less reliable. In the paper, the SPE-based reliability is computed with a bottom-up Bayesian approach, estimating the probability that the SPE is zero, much like a t-statistic: reliability (RelMB) is defined as the ratio of the mean prediction to the uncertainty in that prediction. A similar Bayesian approach could be used for the model-free system, but a much simpler estimate is plausible in that case.
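As a rough illustration of the two error signals, the sketch below uses a standard temporal-difference RPE and treats the SPE as one minus the probability the learned transition model assigned to the state that actually occurred. The learning rate, discount factor, state space, and the simple reliability index are my own assumptions; the paper’s actual Bayesian reliability computation is more involved.

```python
import numpy as np

# Rough sketch of the two prediction errors (illustrative, not the paper's exact model).
# RPE: temporal-difference error used by the model-free system.
# SPE: how surprising the observed transition is under the learned transition model.

alpha, gamma = 0.2, 0.9                     # assumed learning rate / discount factor
Q_MF = np.zeros((3, 2))                     # cached (state, action) values
T = np.full((3, 2, 3), 1.0 / 3.0)           # estimated transition probabilities

def model_free_update(s, a, r, s_next):
    rpe = r + gamma * Q_MF[s_next].max() - Q_MF[s, a]
    Q_MF[s, a] += alpha * rpe
    return rpe

def model_based_update(s, a, s_next):
    spe = 1.0 - T[s, a, s_next]                        # surprise: 1 - P(observed next state)
    T[s, a] += alpha * (np.eye(3)[s_next] - T[s, a])   # move estimate toward observation
    return spe

# A crude reliability index: the smaller the recent average |error|, the more reliable.
def reliability(recent_errors):
    return 1.0 - np.mean(np.abs(recent_errors))
```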

Once the reliabilities are calculated, the two systems compete with each other to set a probability PMB that determines the degree to which the model-based system dominates the model-free one. The control of each system over behavior is therefore dynamically weighted. Given the complexity of the model-based system, it is fair to assume that the arbitrator also considers this trade-off when choosing one system over the other. But an experiment is needed to verify these intuitions. For that, they used a Markovian decision task and recorded fMRI while the subjects performed it. There were two kinds of trials. In the specific goal trials, subjects were told before the trial started that only one color of token is redeemable, namely when the color of the collecting basket matches that of the coin; the basket color, in other words, defines the goal state. These trials are thought to give more weight to the SPE, so the model-based system should be more involved in behavior. In the flexible goal trials, by contrast, coins of any color collected by the subjects are redeemable. Here behavior can be guided by the immediate reward, so the RPE would be mostly used and the model-free system would predominantly determine behavior.
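The dynamics of PMB can be pictured as a two-way flow between the systems, with flow rates that grow with each system’s reliability. The sketch below is a simplified toy version of that idea; the particular rate functions and parameters are invented, not the ones fitted in the paper.

```python
import numpy as np

# Simplified sketch of dynamic arbitration: P_MB drifts up when the model-based
# system is reliable and down when the model-free system is reliable.
# The rate function below is a placeholder, not the one fitted in the paper.

def transition_rate(reliability, amplitude=0.2, steepness=10.0):
    # higher reliability -> faster transition toward that system
    return amplitude / (1.0 + np.exp(-steepness * (reliability - 0.5)))

def update_p_mb(p_mb, rel_mb, rel_mf):
    to_mb = transition_rate(rel_mb) * (1.0 - p_mb)   # MF -> MB flow
    to_mf = transition_rate(rel_mf) * p_mb           # MB -> MF flow
    return np.clip(p_mb + to_mb - to_mf, 0.0, 1.0)

p_mb = 0.5
for _ in range(20):
    p_mb = update_p_mb(p_mb, rel_mb=0.8, rel_mf=0.3)  # MB currently more reliable
print(round(p_mb, 3))  # P_MB drifts toward the model-based system
```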

Randomly generated fractal images represent each state, and the reward obtained from a colored coin is determined by the state the agent is in. Evidently, in the specific goal trials the reward contingencies change over time: the goal state that was valued changed from trial to trial. Since the color of the collecting box has to match that of the coin, the RPE is expected to be relatively high in these trials, and the model-based system is therefore the more reliable one. In the flexible goal trials, on the other hand, there is no need to predict a particular future state accurately, since a coin of any color yields a monetary reward, and the model-free system can take control.

They also manipulated the state transition probabilities: there were trials with high state transition uncertainty (p = (0.5, 0.5)) and others in which the uncertainty was low (p = (0.9, 0.1)). It is important to note that the chains of states have a finite length. These changes in the state transition probabilities primarily affect the SPE: if the transition uncertainty is low, a low SPE is expected, and vice versa, so high uncertainty should favor model-free learning. Effectively there were four kinds of trials, since both the specific goal and flexible goal trials included high and low state transition uncertainty conditions. The dynamically weighted model gives the Q value as:

Q(s,a) = QMB(s,a)·PMB + QMF(s,a)·(1 − PMB) = QMB(s,a)·PMB + QMF(s,a)·PMF, where PMF = 1 − PMB, QMB is the Q value of the model-based system, and QMF is that of the model-free system. The Q function specifies how good it is for the agent to take action (a) in a state (s) under a policy π.
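A minimal sketch of this weighting, paired with a softmax choice rule (the temperature and the example values are arbitrary and not taken from the paper), might look like this:

```python
import numpy as np

# Sketch of the dynamically weighted value: the arbitrator mixes the two value
# estimates with weight P_MB, and the choice is then made from the mixed values
# (here via a softmax; parameters are illustrative).

def mixed_q(q_mb, q_mf, p_mb):
    return p_mb * q_mb + (1.0 - p_mb) * q_mf

def softmax_choice(q_values, temperature=1.0):
    prefs = np.exp(np.asarray(q_values) / temperature)
    probs = prefs / prefs.sum()
    return np.random.choice(len(q_values), p=probs), probs

q = mixed_q(q_mb=np.array([0.2, 0.8]), q_mf=np.array([0.6, 0.1]), p_mb=0.7)
action, probs = softmax_choice(q)
print(q, probs)
```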

All of these experimental manipulations, then, serve to study brain activity during the task with fMRI, to see which functions are carried out by different brain regions and which of them are involved in the functions of the arbitrator.

Interestingly, the mean reward values show that state transition uncertainty affects the specific goal trials more than the flexible goal trials.

Using a generalized linear regression model, statistically significant effects were found for both goal type and uncertainty, and the effect of state transition uncertainty was larger in specific goal trials than in flexible goal trials. It is therefore fair to say that state transition uncertainty affects the two systems differently, with the model-based system affected the most.
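A regression of this kind could be sketched roughly as follows, using an ordinary least squares stand-in from statsmodels and an invented trial-level data frame; the column names and numbers are purely illustrative, not the paper’s data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data: one row per trial, with the goal type
# (specific vs flexible), the transition-uncertainty condition, and the reward.
# The model includes both main effects and their interaction.

df = pd.DataFrame({
    "reward":      [10, 20, 0, 40, 10, 0, 20, 40, 0, 10, 20, 0],
    "goal_type":   ["specific", "specific", "specific", "flexible", "flexible", "flexible"] * 2,
    "uncertainty": ["high"] * 6 + ["low"] * 6,
})

model = smf.ols("reward ~ C(goal_type) * C(uncertainty)", data=df).fit()
print(model.summary())
```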

They built six different models of the arbitrator to see which one best accounts for the behavioral data, that is, which of these models would generate behavioral output matching the observed data. A model that dynamically weights the two systems outperformed the other models, based on BIC measures. This model had a dynamic threshold, implying that behavior tends to move from model-based to model-free control because of the cognitive effort associated with the model-based system. It was also observed that a model using an absolute RPE computation for the model-free system did better than one using a Bayesian framework[11].
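For reference, BIC penalizes the maximized log-likelihood by the number of free parameters, and the model with the lowest BIC wins. A small sketch with invented log-likelihoods and parameter counts (not the paper’s numbers):

```python
import numpy as np

# BIC = k * ln(n) - 2 * ln(L): lower is better.

def bic(log_likelihood, n_params, n_obs):
    return n_params * np.log(n_obs) - 2.0 * log_likelihood

candidates = {
    "dynamic arbitration (abs RPE)":      bic(-1180.0, 6, 2000),
    "dynamic arbitration (Bayesian MF)":  bic(-1195.0, 7, 2000),
    "model-based only":                   bic(-1260.0, 3, 2000),
    "model-free only":                    bic(-1300.0, 3, 2000),
}
best = min(candidates, key=candidates.get)
print(best, round(candidates[best], 1))
```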

The generated data were then compared with the observed data to see, trial by trial, how well the probability of choosing the correct action predicted by the model matched what was actually observed. It was also important to check whether the arbitrator predicts choice behavior correctly; to do this, chunks of observations in which one system was predicted to be in control were compared with the corresponding observed choices. Under model-based control, behavior is expected to be more flexible when reward contingencies change, since the agent also takes the state transition probabilities into account.
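One simple way to express this check, assuming hypothetical per-trial choice probabilities from a fitted model and the actions actually taken:

```python
import numpy as np

# Illustrative data only: for each trial, take the probability the fitted model
# assigned to the action the participant actually chose, and compare it with chance.

model_choice_probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])  # per-trial P(action)
observed_actions   = np.array([0, 1, 0, 1])                                       # actions actually taken

p_observed = model_choice_probs[np.arange(len(observed_actions)), observed_actions]
print("mean P(observed action) under the model:", p_observed.mean())  # chance would be 0.5 here
```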

The analysis indeed shows that when the model predicts behavior to be under the control of the model-based system, the behavior is best explained by the model-based system, and vice versa.

But we need to look further into the neural mechanism, that is, at the neural correlates. First, they had to show that their results are consistent with previous ones, and they were: SPE signals were found in the dlPFC and the intraparietal sulcus, and RPE signals in the ventral striatum. They used the uncertainty that the SPE is zero as a measure of reliability, and these signals were negatively correlated with activity in multiple brain areas, including the dmPFC. Signals of the absolute RPE estimate were found in the caudate nucleus. A region of the ilPFC was found to correlate with both signals, and activity in these areas was better explained by the best model than by the second-best model. The ACC was found to compute the difference between the reliabilities of the two systems (RelMB − RelMF).

Further, they looked at the regions computing the value signals QMB and QMF. The omPFC and parts of the ACC were found to correlate with QMB, while for QMF it was the SMA, dlPFC, and dmPFC. When looking for correlations with the value evaluation of both the model-based and model-free signals, a correlation with the SMA and dmPFC was found. But as we have already seen, to compute value in a weighted manner the two have to be integrated, and such an integrated signal was observed in the vmPFC.

The next step was to characterize the interaction between the regions involved in computing and comparing the reliabilities of the two systems. PPI analysis was used to estimate context-dependent changes in effective connectivity between brain regions. It identifies regions whose activity depends on an interaction between psychological factors (the task) and physiological factors (the time course of a seed region of interest); in other words, it measures the statistical dependence between two regions of the brain. If two regions are interacting, the level of correlation between them should change with context, so a linear regression model can be used. Significant negative coupling was found between the ilPFC and regions of the left posterior and mid putamen, and a negative coupling between the vmPFC and the right posterior putamen was also found. It appears that the arbitrator works predominantly by acting on the model-free system. There was also strong negative modulation, by PMB, of the coupling between the posterior putamen and the vmPFC, which supports the hypothesis that model-free value signals are transmitted to the vmPFC to be combined with model-based values as a precursor to generating choices.

As expected, participants responded more slowly in specific goal trials than in flexible goal trials, and there was a positive correlation between reaction time (RT) and the probability that behavior would be under model-based control, consistent with the greater cognitive effort it requires. Even after including RT as a covariate, the results did not change, suggesting that RT does not explain the fMRI findings.
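For readers unfamiliar with PPI, the analysis essentially comes down to regressing a target region’s time course on the seed time course, the psychological regressor, and their product; a negative weight on the product term corresponds to negative modulation of the coupling. The sketch below runs that regression on synthetic data; the signals, region names, and coefficients are all invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

# Synthetic illustration of a PPI regression: the interaction term (seed signal
# multiplied by the centered psychological regressor, here P_MB) tests whether
# the coupling between seed and target changes with the task context.

seed = rng.standard_normal(n)                              # seed region time course (e.g. ilPFC)
p_mb = rng.uniform(0, 1, n)                                # psychological regressor (e.g. P_MB)
ppi = seed * (p_mb - p_mb.mean())                          # interaction (centered)
target = 0.5 * seed - 0.8 * ppi + rng.standard_normal(n)   # simulated target region

X = sm.add_constant(np.column_stack([seed, p_mb, ppi]))
fit = sm.OLS(target, X).fit()
print(fit.params)   # a negative weight on the PPI term = negative modulation of coupling
```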

It is also possible that the vmPFC is responsible for integrating the model-based and model-free systems. In conclusion, we can see a role for the posterior parts of the putamen in habit learning, probably in computing value signals used by model-free reinforcement learning. It was also found that, for the model-free system, a simple model that keeps track of an absolute RPE estimate is more consistent with the data than a Bayesian framework. In short, the greater the accumulated prediction error, the less reliable the system.

Another interesting point is that they could not find any direct interaction between the arbitrator and the regions carrying the model-based signals. This suggests a possible bias of the arbitrator towards the model-free system, since it is less cognitively expensive: it is preferred until it starts giving poor predictions. There should then be a system that downregulates the value signals computed by the model-free system, and the ilPFC was found to do this. The region of the medial frontal cortex putatively involved in comparing the reliabilities of the two systems is the rostral cingulate cortex. It is possible that the FPC and the inferior prefrontal cortex play different roles in implementing the arbitration process, given the putative locus of the FPC at the top of the frontal hierarchy. The role of the FPC in the reliability competition is consistent with a couple of earlier studies[20,21]. The anterior lateral and polar prefrontal cortices may serve a general role in computing estimates of the reliability of different control strategies.

It is important to note that it is entirely feasible that variables other than reliability feed into the arbitration process, such as the time available to render a decision or the amount of cognitive resources available at a given point in time. The arbitration mechanism involves competition on each choice (model-based versus model-free) while fostering collaboration in the transitions across trials (model-based to model-free, or model-free to model-based). The competition corresponds to the reliability computation, whereas the collaboration corresponds to the dynamics of arbitration (PMB). The RPEs that the model-free system experiences in these trials are the consequence of choices based on a mixture of the model-based and model-free values[22].

Thus, we have just seen how the arbitration mechanism works and what the best model to explain such a system would look like. It is evident that the brain itself relies on the model-free system, with its lower cognitive effort, unless the accumulated prediction error exceeds a limit and that system is no longer reliable. Even then it does not rely solely on one system; rather, it weights both of them dynamically. It is therefore understandable why people may not think in a more complicated way at the first step, and why we become so passive when we do the same thing over and over; people sometimes like doing things that require less cognitive effort. So we cannot really blame Hitler for failing against Moscow, because at the first step the brain does not think about the transition to a future state, the failure, given that the current state is winter in Russia.

So, in shaping the events that unfold, the way the brain handles things matters a lot, and this is a typical example of that.

References:

[1] Balleine and Dickinson, 1998

[2] Balleine and O’Doherty, 2010

[3] Graybiel, 2008

[4] Tricomi et al., 2009

[5] Valentin et al., 2007

[6] de Wit et al., 2009

[7] Yin et al., 2004, 2005

[8] Daw et al., 2005

[9] Doya et al., 2002

[11] Daw, N.D., Niv, Y., and Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711

[12] Hampton et al., 2006

[13] Wunderlich et al., 2012; O’Doherty, 2011

[14] Tanaka et al., 2008

[15] Valentin et al., 2007

[16] de Wit et al., 2009

[17] Aron et al., 2003, 2004

[18] Garavan et al., 1999

[19] Tanji and Hoshi, 2008

[20] Badre et al., 2012

[21] Boorman et al., 2009
