Author response: The interpretation of computational model parameters depends on the context

Abstract

Reinforcement Learning (RL) models have revolutionized the cognitive and brain sciences, promising to explain behavior from simple conditioning to complex problem solving, to shed light on developmental and individual differences, and to anchor cognitive processes in specific brain mechanisms. However, the RL literature increasingly reveals contradictory results, which might cast doubt on these claims. We hypothesized that many contradictions arise from two commonly held assumptions about computational model parameters that are actually often invalid: that parameters generalize between contexts (e.g. tasks, models) and that they capture interpretable (i.e. unique, distinctive) neurocognitive processes. To test this, we asked 291 participants aged 8–30 years to complete three learning tasks in one experimental session, and fitted RL models to each. We found that some parameters (exploration / decision noise) showed significant generalization: they followed similar developmental trajectories, and were reciprocally predictive between tasks. Still, generalization was significantly below the methodological ceiling. Furthermore, other parameters (learning rates, forgetting) did not show evidence of generalization, and sometimes even showed opposite developmental trajectories. Interpretability was low for all parameters. We conclude that the systematic study of context factors (e.g. reward stochasticity; task volatility) will be necessary to enhance the generalizability and interpretability of computational cognitive models.
Editor's evaluation

This study adopts a within-participant approach to address two important questions in the field of human reinforcement learning: to what extent do estimated computational model parameters generalize across different tasks and can their meaning be interpreted in the same way in different task contexts? The authors find that inferred parameters show moderate to little generalizability across tasks, and that their interpretation strongly depends on task context.

https://doi.org/10.7554/eLife.75474.sa0

Introduction

In recent decades, cognitive neuroscience has made breakthroughs in computational modeling, demonstrating that reinforcement learning (RL) models can explain foundational aspects of human thought and behavior. RL models can explain not only simple cognitive processes such as stimulus-outcome and stimulus-response learning (Schultz et al., 1997; O'Doherty et al., 2004; Gläscher et al., 2009), but also highly complex processes, including goal-directed, temporally extended behavior (Ribas-Fernandes et al., 2011; Daw et al., 2011), meta-learning (Wang et al., 2018), and abstract problem solving requiring hierarchical thinking (Eckstein and Collins, 2020; Botvinick, 2012; Collins and Koechlin, 2012; Werchan et al., 2016). Underlining their centrality in the study of human cognition, RL models have been applied across the lifespan (van den Bos et al., 2018; Bolenz et al., 2017; Nussenbaum and Hartley, 2019), and in both healthy participants and those experiencing psychiatric illness (Huys et al., 2016; Adams et al., 2016; Hauser et al., 2019; Ahn and Busemeyer, 2016; Deserno et al., 2013).
RL models are of particular interest because they also promise a close link to brain function: A specialized network of brain regions, including the basal ganglia and prefrontal cortex, implement computations that mirror specific components of RL algorithms, including action values and reward prediction errors (Frank and Claus, 2006; Niv, 2009; Lee et al., 2012; O'Doherty et al., 2015; Glimcher, 2011; Garrison et al., 2013; Dayan and Niv, 2008). In sum, RL, explaining behavior ranging from simple conditioning to complex problem solving, appropriate for diverse human (and nonhuman) populations, based on a compelling theoretical foundation (Sutton and Barto, 2017), and with strong ties to brain function, has seen a surge in published studies since its introduction (Palminteri et al., 2017), and emerged as a powerful and potentially unifying modeling framework for cognitive and neural processing. Computational modeling enables researchers to condense rich behavioral datasets into simple, falsifiable models (e.g. RL) and fitted model parameters (e.g. learning rate, decision temperature) (van den Bos et al., 2018; Palminteri et al., 2017; Daw, 2011; Wilson and Collins, 2019; Guest and Martin, 2021; Blohm et al., 2020). These models and parameters are often interpreted as a reflection of (or 'window into') cognitive and/or neural processes, with the ability to dissect these processes into specific, unique components, and to measure participants' inherent characteristics along these components. For example, RL models have been praised for their ability to separate the decision making process into value updating and choice selection stages, allowing for the separate investigation of each dimension. Hereby, RL models infer person-specific parameters for each dimension (e.g. learning rate and decision noise), seemingly providing a direct measure of individuals' inherent characteristics. 
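The separation into a value updating stage and a choice selection stage can be illustrated with a minimal sketch. It assumes a generic delta-rule learner with softmax choice, which is a common textbook form and not necessarily the exact model fitted in this study; the function names, the initial value of 0.5, and the reward probabilities in the usage note are illustrative assumptions.

```python
import math
import random


def softmax_choice_probs(q_values, beta):
    """Choice selection: softmax over action values.

    beta is the inverse decision temperature (higher = less noisy choices).
    """
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def update_value(q, reward, alpha):
    """Value updating: move q toward the reward by a fraction alpha
    (the learning rate) of the reward prediction error."""
    return q + alpha * (reward - q)


def simulate(n_trials, alpha, beta, reward_probs, seed=0):
    """Simulate one hypothetical agent on a bandit with the given reward probabilities."""
    rng = random.Random(seed)
    q = [0.5] * len(reward_probs)  # assumed initial values
    choices = []
    for _ in range(n_trials):
        probs = softmax_choice_probs(q, beta)
        action = rng.choices(range(len(q)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        q[action] = update_value(q[action], reward, alpha)
        choices.append(action)
    return q, choices
```

For instance, `simulate(200, alpha=0.3, beta=5.0, reward_probs=[0.3, 0.7])` returns the final action values and the sequence of choices. In this sketch, alpha shapes only the updating stage and beta only the selection stage, which is the decomposition that the interpretability assumption relies on.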
Crucially, many current research practices are firmly based on these (often implicit) assumptions, which give rise to the expectation that parameters have a task- and model-independent interpretation and will seamlessly generalize between studies. However, there is growing—though indirect—evidence that these assumptions might not (or not always) be valid. The following section lays out existing evidence in favor and in opposition of model generalizability and interpretability. Building on our previous opinion piece, which—based on a review of published studies—argued that there is less evidence for model generalizability and interpretability than expected based on current research practices (Eckstein et al., 2021), this study seeks to directly address the matter empirically. Many current research practices are implicitly based on the interpretability and generalizability of computational model parameters (despite the fact that many researchers explicitly distance themselves from them). For our purposes, we define a model variable (e.g. fitted parameter) as generalizable if it is consistent across uses, such that a person would be characterized with the same values independent of the specific model or task used to estimate the variable. Generalizability is a consequence of the assumption that parameters are intrinsic to participants rather than task dependent (e.g. a high learning rate is a personal characteristic that might reflect an individual's unique brain structure). One example of our implicit assumptions about generalizability is the fact that we often directly compare model parameters between studies—for example, comparing our findings related to learning rate parameters to a previous study's findings related to learning rate parameters. Note that such a comparison is only valid if parameters capture the same underlying constructs across studies, tasks, and model variations, that is, if parameters generalize. 
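One way to make this definition of generalizability concrete is to ask whether the same participants are ordered similarly on a parameter across two tasks. The sketch below computes a rank correlation for that purpose using only the standard library. The choice of Spearman's correlation and the example values in the usage note are assumptions made for illustration, not a description of the analysis reported here.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

With hypothetical learning rates for four participants, `spearman([0.1, 0.2, 0.3, 0.4], [0.2, 0.5, 0.6, 0.9])` is 1 up to floating-point error, because the ordering of participants is identical in both tasks even though the absolute values differ. A rank-based measure of this kind addresses relative rather than absolute agreement between tasks.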
The literature has implicitly equated parameters in this way in review articles (Huys et al., 2016; Adams et al., 2016; Hauser et al., 2019; Frank and Claus, 2006; Niv, 2009; Lee et al., 2012; O'Doherty et al., 2015; Glimcher, 2011; Dayan and Niv, 2008), meta-analyses (Garrison et al., 2013; Yaple and Yu, 2019; Liu et al., 2016), and also most empirical papers, by relating parameter-specific findings across studies. We also implicitly evoke parameter generalizability when we study task-independent empirical parameter priors (Gershman, 2016), or task-independent parameter relationships (e.g. interplay between different kinds of learning rates [Harada, 2020]), because we presuppose that parameter settings are inherent to participants, rather than task specific. We define a model variable as interpretable if it isolates specific and unique cognitive elements, and/or is implemented in separable and unique neural substrates. Interpretability follows from the assumption that the decomposition of behavior into model parameters 'carves cognition at its joints', and provides fundamental, meaningful, and factual components (e.g. separating value updating from decision making). We implicitly invoke interpretability when we tie model variables to neural substrates in a task-general way (e.g. reward prediction errors to dopamine function [Schultz and Dickinson, 2000]), or when we use parameters as markers of psychiatric conditions in a model-independent way (e.g. working-memory deficits in schizophrenia [Collins et al., 2014]). Interpretability is also required when we relate abstract parameters to aspects of real-world decision making (Heinz et al., 2017), and generally, when we assume that model variables are particularly 'theoretically meaningful' (Huys et al., 2016). 
However, in the midst of the growing application of computational modeling of behavior, the focus has also shifted toward inconsistencies and apparent contradictions in the emerging literature, which are becoming apparent in cognitive (Nassar and Frank, 2016), developmental (Nussenbaum and Hartley, 2019; Javadi et al., 2014; Blakemore and Robbins, 2012; DePasque and Galván, 2017), clinical (Adams et al., 2016; Hauser et al., 2019; Ahn and Busemeyer, 2016; Deserno et al., 2013), and neuroscience studies (Garrison et al., 2013; Yaple and Yu, 2019; Liu et al., 2016; Mohebi et al., 2019), and have recently become the focus of targeted investigations (Robinson and Chase, 2017; Weidinger et al., 2019; Brown et al., 2020; Pratt et al., 2021). For example, some developmental studies have shown that learning rates increased with age (Master et al., 2020; Davidow et al., 2016), whereas others have shown that they decrease (Decker et al., 2015). Yet others have reported U-shaped trajectories with either peaks (Rosenbaum et al., 2020) or troughs (Eckstein et al., 2022) during adolescence, or stability within this age range (Palminteri et al., 2016) (for a comprehensive review, see Nussenbaum and Hartley, 2019; for specific examples, see Nassar and Frank, 2016). This is just one striking example of inconsistencies in the cognitive modeling literature, and many more exist (Eckstein et al., 2022). These inconsistencies could signify that computational modeling is fundamentally flawed or inappropriate to answer our research questions. Alternatively, inconsistencies could signify that the method is valid, but our current implementations are inappropriate (Palminteri et al., 2017; Uttal, 1990; Webb, 2001; Navarro, 2019; Yarkoni, 2020; Wilson and Collins, 2019). 
However, we hypothesize that inconsistencies can also arise for a third reason: Even if both method and implementation are appropriate, inconsistencies like the ones above are expected—and not a sign of failure—if implicit assumptions of generalizability and interpretability are not always valid. For example, model parameters might be more context-dependent and less person-specific than we often appreciate (Nussenbaum and Hartley, 2019; Nassar and Frank, 2016; Yaple and Yu, 2019; Behrens et al., 2007; McGuire et al., 2014). To illustrate this point, the current project began as an investigation into the development of learning in adolescence, with the aim of combining the insights of three different learning tasks to gain a more complete understanding of the underlying mechanisms. However, even though each task individually showed strong and interesting developmental patterns in terms of model parameters (Master et al., 2020; Eckstein et al., 2022; Xia et al., 2021), these patterns were very different—and even contradictory—across tasks. This implied that specific model parameters (e.g. learning rate) did not necessarily isolate specific cognitive processes (e.g. value updating) and consistently measure individuals on these processes, but that they captured different processes depending on the learning context of the task (lack of generalizability). In addition, the processes identified by one parameter were not necessarily distinct from the cognitive processes (e.g. decision making) identified by other parameters (e.g. decision temperature), but could overlap between parameters (lack of interpretability). In a nutshell, the 'same' parameters seemed to measure something different in each task. 
The goal of the current project was to assess these patterns formally: We determined the degree to which parameters generalized between three different RL tasks, investigated whether parameters were interpretable as unique and specific processes, and provide initial evidence for context factors that potentially modulate generalizability and interpretability of model parameters, including feedback stochasticity, task volatility, and memory demands. To this aim, we compared the same individuals' RL parameters, fit to different learning tasks in a single study, in a developmental dataset (291 participants, ages 8–30 years). Using a developmental dataset had several advantages: It provided large between-participant variance and hence better coverage of the parameter space, and allowed us to specifically target outstanding discrepancies in the developmental psychology literature (Nussenbaum and Hartley, 2019). The three learning tasks we used varied on several common dimensions, including feedback stochasticity, task volatility, and memory demands (Figure 1E), and have been used previously to study RL processes (Davidow et al., 2016; Collins and Frank, 2012; Javadi et al., 2014; Master et al., 2020; Eckstein et al., 2022; Xia et al., 2021). However, like many tasks in the literature, the tasks likely also engaged other cognitive processes besides RL, such as working memory and reasoning. The within-participant design of our study allowed us to test directly whether the same participants showed the same parameters across tasks (generalizability), and the combination of multiple tasks shed light on which cognitive processes the same parameters captured in each task (interpretability). We extensively compared and validated all RL models (Palminteri et al., 2017; Wilson and Collins, 2019; Lee, 2011) and have reported each task's unique developmental results separately (Master et al., 2020; Eckstein et al., 2022; Xia et al., 2021). 
Figure 1. Overview of the experimental paradigm. (A) Participant sample. Left: Number of participants in each age group, broken up by sex (self-reported). Age groups were determined by within-sex age quartiles for participants between 8–17 years (see Eckstein et al., 2022 for details) and 5-year bins for adults. Right: Number of participants whose data were excluded because they failed to reach performance criteria in at least one task. (B) Task A procedure ('Butterfly task'). Participants saw one of four butterflies on each trial and selected one of two flowers in response, via button press on a game controller. Each butterfly had a stable preference for one flower throughout the task, but rewards were delivered stochastically (70% for correct responses, 30% for incorrect). For details, see section 'Task design' and the original publication (Xia et al., 2021). (C) Task B procedure ('Stochastic Reversal'). Participants saw two boxes on each trial and selected one with the goal of finding gold coins. At each point in time, one box was correct and had a high (75%) probability of delivering a coin, whereas the other was incorrect (0%). At unpredictable intervals, the correct box switched sides. For details, see section 'Task design' and Eckstein et al., 2022. (D) Task C procedure ('Reinforcement learning-working memory'). Participants saw one stimulus on each trial and selected one of three buttons (A1-A3) in response. All correct and no incorrect responses were rewarded. The task contained blocks of 2–5 stimuli, determining its 'set size'. The task was designed to disentangle set size-sensitive working memory processes from set size-insensitive RL processes. For details, see section 'Task design' and Master et al., 2020. (E) Pairwise similarities in terms of experimental design between tasks A (Xia et al., 2021), B (Eckstein et al., 2022), and C (Master et al., 2020).
Similarities are shown on the arrows connecting two tasks; the lack of a feature implies a difference. E.g., a 'Stable set size' on tasks A and B implies an unstable set size in task C. Overall, task A shared more similarities with tasks B and C than these shared with each other. (F) Summary of the computational models for each task (for details, see section 'Computational models' and original publications). Each row shows one model, columns show model parameters. 'Y' (yes) indicates that a parameter is present in a given model, '—' indicates that a parameter is not present. '1/β' and 'ϵ' refer to exploration / noise parameters; α+ (α-) to learning rate for positive (negative) outcomes; 'Persist. P' to persistence; 'WM pars.' to working memory parameters.

Our results show a striking lack of generalizability and interpretability for some tasks and parameters, but convincing generalizability for others. This reveals an urgent need for future research to address the role of context factors in computational modeling, and reveals the necessity of taking context factors into account when interpreting and generalizing results. It also suggests that some prior discrepancies are likely explained by differences in context.

Results

This section gives a brief overview of the experimental tasks (Figure 1B–D) and computational models (Figure 1F; also see sections 'Task Design', 'Computational Models', and 'Appendix 2'; for details, refer to original publications [Master et al., 2020; Eckstein et al., 2022; Xia et al., 2021]). We then show our main findings on parameter generalizability (section 'Part I: parameter generalizability') and interpretability (section 'Part II: parameter interpretability'). All three tasks are learning tasks and have been previously well-captured by RL models, yet with differences in parameterization (Javadi et al., 2014; Davidow et al., 2016; Collins and Frank, 2012).
In our study as well, the best-fitting RL models differed between tasks, containing some parameters that were the same across tasks, and some that were task-specific (Figure 1F). Thus, our setup provides a realistic reflection of the diversity of computational models in the literature. Task A required participants to learn the correct associations between each of four stimuli (butterflies) and two responses (flowers) based on probabilistic feedback (Figure 1B). The best-fitting model contained three free parameters: learning rate from positive outcomes α+, inverse decision temperature β, and forgetting F. It also contained one fixed parameter: learning rate from negative outcomes α-=0 (Xia et al., 2021). Task B required participants to adapt to unexpected switches in the action-outcome contingencies of a simple bandit task (only one of two boxes contained a gold coin at any time) based on semi-probabilistic feedback (Figure 1C). The best-fitting RL model contained four free parameters: α+, α-, β, and choice persistence p (Eckstein et al., 2022). Task C required learning of stimulus-response associations like task A, but over several task blocks with varying numbers of stimuli, and using deterministic feedback (Figure 1D). The best model for this task combined RL and working memory mechanisms, containing RL parameters α+ and α-; working memory parameters capacity K, forgetting F, and noise ϵ; and mixture parameter ρ, which determined the relative weights of RL and working memory (Master et al., 2020; Collins and Frank, 2012). The Markov decision process (MDP) framework provides a common language to describe learning tasks like ours, by breaking them down into states, actions, and reward functions. Appendix 2 summarizes the tasks in this way and highlights major differences. 
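The way parameters such as α+, α-, and F enter a model can be sketched as follows. This is a generic form under stated assumptions and not the fitted models themselves, whose exact equations are given in the original publications: the learning rate is selected by outcome, forgetting is assumed to decay all values toward an initial value `q_init`, and forgetting is assumed to be applied before the update of the chosen action. All function and variable names are hypothetical.

```python
def asymmetric_update(q, reward, alpha_pos, alpha_neg):
    """Update one action value with separate learning rates for
    positive (reward > 0) and negative (reward == 0) outcomes."""
    alpha = alpha_pos if reward > 0 else alpha_neg
    return q + alpha * (reward - q)


def decay_toward_initial(q_values, forgetting, q_init):
    """Forgetting (assumed form): pull every value toward q_init by a fraction F."""
    return [(1.0 - forgetting) * q + forgetting * q_init for q in q_values]


def trial_update(q_table, stimulus, action, reward,
                 alpha_pos, alpha_neg, forgetting, q_init=0.5):
    """One trial of learning; returns a new table and leaves the input unchanged.

    q_table maps each stimulus (the MDP state) to a list of action values.
    """
    # Assumed order: forgetting first, then learning on the chosen action
    new_table = {
        s: decay_toward_initial(values, forgetting, q_init)
        for s, values in q_table.items()
    }
    new_table[stimulus][action] = asymmetric_update(
        new_table[stimulus][action], reward, alpha_pos, alpha_neg
    )
    return new_table
```

Setting `alpha_neg=0.0` mimics the fixed α-=0 reported for task A: an unrewarded choice then leaves the chosen value unchanged apart from forgetting. Setting `forgetting=0.0` removes decay, which corresponds to a model without a forgetting parameter.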
We employed rigorous model fitting, comparison, and validation to obtain the best-fitting models presented here (see Appendix 4 and Palminteri et al., 2017; Daw, 2011; Wilson and Collins, 2019; Lee, 2011): For each task, we compared a large number of competing models, based on different parameterizations and cognitive mechanisms, and selected the best one based on quantitative model comparison as as the models' to participants' behavior in We also used hierarchical methods for model and comparison to obtain the most parameter 2011; Brown et al., 2020). publications provide on the set of models compared and the that the models presented here are the best-fitting ones for each task (Master et al., 2020; Eckstein et al., 2022; Xia et al., 2021), an important for the that individual parameters are This validation for each dataset that parameter discrepancies between tasks are not to a lack of modeling and can provide information about parameter generalizability and interpretability. we that no model is I: parameter generalizability Crucially, the parameter inconsistencies in previous literature could be by differences between studies (e.g. modeling research Our within-participant design us to these out by whether the same participants show different parameter values when using different tasks; this finding would be strong evidence for the hypothesized lack of parameter To assess this, we determined whether participants showed similar parameter values across tasks, and then whether tasks showed similar parameter age trajectories. 
in parameter values We used of variance to test for task on parameter values (Figure showed significant task we followed up with to compare each of tasks, using the Figure 2 Download asset Open asset Generalizability of parameter values and of parameter age trajectories / parameters (A) parameters over age for all three tasks values differed significantly between tasks; show the of the main of task on parameters of the participants in each age (for see Figure the for of the (B) Summary of the main results of arrows connecting tasks are from Figure and task large connecting parameters between tasks show test significant task differences in would the lack of All were that parameter values differed between each of tasks. (C) age trajectories, that is, parameters over Age trajectories similarities that are by differences in or in values show significant of task on age trajectories (D) Summary of the main results of C. connecting parameters between tasks show of models each parameter from the parameter in a different task significant and a lack In to parameter age trajectories were predictive in several for tasks with more similarities and A and compared to tasks with and Learning rates α+ and were across tasks that they separate were very low in task C in task A was fixed at but high in task B for comparison, see Decision noise was high in task B but low in tasks A and C in 1 because its values were not to to the different see section 'Computational was significantly in task C than A Task B was best fit 1 of parameter values from task an showed a significant task we followed up with / For all parameters, parameter values hence differed between tasks. This shows that the three tasks significantly different of learning rates, decision and forgetting for the same participants (Figure these parameter differences specific task Learning rates and were in task switches required updating and high of forgetting was in task which the memory demands. 
Using models that for age of to similar results Appendix parameter differences However, comparing parameters in terms of their values has because it the role of relative variance between participants, which participants' relationships to each and might be an important of parameters. To test whether parameters generalized in rather than we parameters between each of tasks, using both α+ and parameters were significantly between tasks A and B as as between tasks A and C. were between tasks B and C. This suggests that both α+ and generalized in terms of the relationships they captured between this generalization was only between tasks A and B or A and potentially to the fact that task A was more similar to tasks B and C than these were to each other (Figure also see section of Appendix 3 shows the between all of in the dataset parameters and behavioral Note that noise parameters generalized between tasks A and C differences in parameterization in the of choice 1B). age trajectories This is in its to account for an of variance in our This that apparent parameter generalization could be by a common on rather than underlying To address this, we on parameter age trajectories, to differences between tasks that are potentially (e.g. and patterns that are potentially more (e.g. of participants' values relative to each Age trajectories were by each parameter within each task (Figure To test for differences, was used to parameters of all tasks from two age and and task or A better fit of this model compared to the one task indicates that task characteristics age trajectories. In this we followed up with models comparing individual of tasks. For α-, the model showed a significantly better an of task on age showed fundamentally different trajectories in task B compared to C task A, was In task by a U-shaped of but in task it increased by an Figure The fact that these patterns are of each other was in the significant terms of the model we previously reported a

How to cite this publication

Maria K. Eckstein, Sarah L. Master, Liyu Xia, Ronald E Dahl, Linda Wilbrecht, Anne Collins (2022). Author response: The interpretation of computational model parameters depends on the context. DOI: https://doi.org/10.7554/elife.75474.sa2


Publication Details

Type: Preprint

Year: 2022

DOI: https://doi.org/10.7554/elife.75474.sa2
