Dataset Information

BATCH POLICY LEARNING IN AVERAGE REWARD MARKOV DECISION PROCESSES.

ABSTRACT: We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
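As a rough illustration of the two quantities the abstract names, the sketch below computes the long-run average reward of a stochastic policy in a small tabular MDP with known dynamics, and the regret of one policy relative to the best policy in a class. This is not the doubly robust estimator from the paper, which works from batch data without known dynamics. The toy transition array `P`, reward table `R`, the two-policy class, and the helper `average_reward` are all hypothetical and chosen only for this example.

```python
import numpy as np

def average_reward(P, R, policy):
    """Long-run average reward of a stochastic policy in a tabular MDP.

    P[a, s, t]   : probability of moving from state s to state t under action a
    R[s, a]      : expected reward for taking action a in state s
    policy[s, a] : probability of choosing action a in state s
    Assumes the Markov chain induced by the policy has a unique
    stationary distribution.
    """
    # State-to-state transition matrix under the policy.
    P_pi = np.einsum("sa,ast->st", policy, P)
    # Expected one-step reward in each state under the policy.
    r_pi = (policy * R).sum(axis=1)
    # Stationary distribution d: solve d @ P_pi = d together with sum(d) = 1.
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(d @ r_pi)

# Hypothetical toy MDP: 2 states, 2 actions; reward 1 is earned only in state 1.
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # action 0: mostly stay in the current state
    [[0.2, 0.8], [0.8, 0.2]],   # action 1: mostly switch state
])
R = np.array([[0.0, 0.0], [1.0, 1.0]])

# A policy that switches out of state 0 and stays in state 1.
target_policy = np.array([[0.0, 1.0], [1.0, 0.0]])
# A policy that picks either action with equal probability.
uniform_policy = np.full((2, 2), 0.5)

# Regret of the uniform policy within this two-policy class.
policy_class = [target_policy, uniform_policy]
best = max(average_reward(P, R, pi) for pi in policy_class)
regret = best - average_reward(P, R, uniform_policy)
print(average_reward(P, R, target_policy), regret)
```

In the batch setting described in the abstract, `P` and `R` are unknown, so the average reward of each candidate policy has to be estimated from logged trajectories rather than computed exactly as here.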

SUBMITTER: Liao P

PROVIDER: S-EPMC10072865 | biostudies-literature | 2022 Dec

REPOSITORIES: biostudies-literature

Publications

BATCH POLICY LEARNING IN AVERAGE REWARD MARKOV DECISION PROCESSES.

Liao Peng, Qi Zhengling, Wan Runzhe, Klasnja Predrag, Murphy Susan A

Annals of Statistics, 2022-12-21, issue 6

PMID: 37022318

Similar Datasets

Learning to maximize reward rate: a model based on semi-Markov decision processes.

Project description: When animals have to make a number of decisions during a limited time interval, they face a fundamental problem: how much time they should spend on each decision in order to achieve the maximum possible total outcome. Deliberating more on one decision usually leads to a better outcome, but less time remains for other decisions. In the framework of sequential sampling models, the question is how animals learn to set their decision threshold such that the total expected outcome achieved during a limited time is maximized. The aim of this paper is to provide a theoretical framework for answering this question. To this end, we consider an experimental design in which each trial can come from one of several possible "conditions." A condition specifies the difficulty of the trial, the reward, the penalty and so on. We show that to maximize the expected reward during a limited time, the subject should set a separate value of decision threshold for each condition. We propose a model of learning the optimal value of decision thresholds based on the theory of semi-Markov decision processes (SMDP). In our model, the experimental environment is modeled as an SMDP with each "condition" being a "state" and the value of decision thresholds being the "actions" taken in those states. The problem of finding the optimal decision thresholds is then cast as the stochastic optimal control problem of taking actions in each state in the corresponding SMDP such that the average reward rate is maximized. Our model utilizes a biologically plausible learning algorithm to solve this problem. The simulation results show that at the beginning of learning the model chooses high values of decision threshold, which lead to sub-optimal performance. With experience, however, the model learns to lower the value of decision thresholds until it finally finds the optimal values.

| S-EPMC4033239 | biostudies-literature

Multi-Objective Markov Decision Processes for Data-Driven Decision Support.

Project description: We present new methodology based on Multi-Objective Markov Decision Processes for developing sequential decision support systems from data. Our approach uses sequential decision-making data to provide support that is useful to many different decision-makers, each with different, potentially time-varying preferences. To accomplish this, we develop an extension of fitted-Q iteration for multiple objectives that computes policies for all scalarization functions, i.e. preference functions, simultaneously from continuous-state, finite-horizon data. We identify and address several conceptual and computational challenges along the way, and we introduce a new solution concept that is appropriate when different actions have similar expected outcomes. Finally, we demonstrate an application of our method using data from the Clinical Antipsychotic Trials of Intervention Effectiveness and show that our approach offers decision-makers increased choice through a larger class of optimal policies.

| S-EPMC5179144 | biostudies-literature

Composition of web services using Markov decision processes and dynamic programming.

Project description: We propose a Markov decision process model for solving the Web service composition (WSC) problem. Iterative policy evaluation, value iteration, and policy iteration algorithms are used to experimentally validate our approach, with artificial and real data. The experimental results show the reliability of the model and the methods employed, with policy iteration being the best one in terms of the minimum number of iterations needed to estimate an optimal policy, with the highest Quality of Service attributes. Our experimental work shows that the solution of a WSC problem involving a set of 100,000 individual Web services, with a valid composition requiring the selection of 1,000 services from the available set, can be computed in the worst case in less than 200 seconds, using an Intel Core i5 computer with 6 GB RAM. Moreover, a real WSC problem involving only 7 individual Web services requires less than 0.08 seconds, using the same computational power. Finally, a comparison with two popular reinforcement learning algorithms, SARSA and Q-learning, shows that these algorithms require one or two orders of magnitude more time than policy iteration, iterative policy evaluation, and value iteration to handle WSC problems of the same complexity.

| S-EPMC4385667 | biostudies-other

Comparative effectiveness research on patients with acute ischemic stroke using Markov decision processes.

Project description:BACKGROUND: Several methodological issues with non-randomized comparative clinical studies have been raised, one of which is whether the methods used can adequately identify uncertainties that evolve dynamically with time in real-world systems. The objective of this study is to compare the effectiveness of different combinations of Traditional Chinese Medicine (TCM) treatments and combinations of TCM and Western medicine interventions in patients with acute ischemic stroke (AIS) by using Markov decision process (MDP) theory. MDP theory appears to be a promising new method for use in comparative effectiveness research. METHODS: The electronic health records (EHR) of patients with AIS hospitalized at the 2nd Affiliated Hospital of Guangzhou University of Chinese Medicine between May 2005 and July 2008 were collected. Each record was portioned into two "state-action-reward" stages divided by three time points: the first, third, and last day of hospital stay. We used the well-developed optimality technique in MDP theory with the finite horizon criterion to make the dynamic comparison of different treatment combinations. RESULTS: A total of 1504 records with a primary diagnosis of AIS were identified. Only states with more than 10 (including 10) patients' information were included, which gave 960 records to be enrolled in the MDP model. Optimal combinations were obtained for 30 types of patient condition. CONCLUSION: MDP theory makes it possible to dynamically compare the effectiveness of different combinations of treatments. However, the optimal interventions obtained by the MDP theory here require further validation in clinical practice. Further exploratory studies with MDP theory in other areas in which complex interventions are common would be worthwhile.

| S-EPMC3348070 | biostudies-literature

Hidden Parameter Markov Decision Processes: A Semiparametric Regression Approach for Discovering Latent Task Parametrizations.

Project description:Control applications often feature tasks with similar, but not identical, dynamics. We introduce the Hidden Parameter Markov Decision Process (HiP-MDP), a framework that parametrizes a family of related dynamical systems with a low-dimensional set of latent factors, and introduce a semiparametric regression approach for learning its structure from data. We show that a learned HiP-MDP rapidly identifies the dynamics of new task instances in several settings, flexibly adapting to task variation.

| S-EPMC5466173 | biostudies-literature

Exchangeable Markov multi-state survival processes.

Project description: We consider exchangeable Markov multi-state survival processes, which are temporal processes taking values over a state-space S, with at least one absorbing failure state b∈S, that satisfy the natural invariance properties of exchangeability and consistency under subsampling. The set of processes contains many well-known examples from health and epidemiology including survival, illness-death, competing risk, and comorbidity processes. Here, an extension leads to recurrent event processes. We characterize exchangeable Markov multi-state survival processes in both discrete and continuous time. Statistical considerations impose natural constraints on the space of models appropriate for applied work. In particular, we describe constraints arising from the notion of composable systems. We end with an application to irregularly sampled and potentially censored multi-state survival data, developing a Markov chain Monte Carlo algorithm for inference.

| S-EPMC8547617 | biostudies-literature

Neurotensin in reward processes.

Project description:Neurotensin (NTS) is a neuropeptide neurotransmitter expressed in the central and peripheral nervous systems. Many studies over the years have revealed a number of roles for this neuropeptide in body temperature regulation, feeding, analgesia, ethanol sensitivity, psychosis, substance use, and pain. This review provides a general survey of the role of neurotensin with a focus on modalities that we believe to be particularly relevant to the study of reward. We focus on NTS signaling in the ventral tegmental area, nucleus accumbens, lateral hypothalamus, bed nucleus of the stria terminalis, and central amygdala. Studies on the role of NTS outside of the ventral tegmental area are still in their relative infancy, yet they reveal a complex role for neurotensinergic signaling in reward-related behaviors that merits further study. This article is part of the special issue on 'Neuropeptides'.

| S-EPMC7238864 | biostudies-literature

Psychiatric symptoms influence reward-seeking and loss-avoidance decision-making through common and distinct computational processes.

Project description: Aim: Psychiatric symptoms are often accompanied by impairments in decision-making to attain rewards and avoid losses. However, due to the complex nature of mental disorders (e.g., high comorbidity), symptoms that are specifically associated with deficits in decision-making remain unidentified. Furthermore, the influence of psychiatric symptoms on computations underpinning reward-seeking and loss-avoidance decision-making remains elusive. Here, we aim to address these issues by leveraging a large-scale online experiment and computational modeling. Methods: In the online experiment, we recruited 1900 non-diagnostic participants from the general population. They performed either a reward-seeking or loss-avoidance decision-making task, and subsequently completed questionnaires about psychiatric symptoms. Results: We found that one trans-diagnostic dimension of psychiatric symptoms related to compulsive behavior and intrusive thought (CIT) was negatively correlated with overall decision-making performance in both the reward-seeking and loss-avoidance tasks. A deeper analysis further revealed that, in both tasks, the CIT psychiatric dimension was associated with lower preference for the options that recently led to better outcomes (i.e. reward or no-loss). On the other hand, in the reward-seeking task only, the CIT dimension was associated with lower preference for recently unchosen options. Conclusion: These findings suggest that psychiatric symptoms influence the two types of decision-making, reward-seeking and loss-avoidance, through both common and distinct computational processes.

| S-EPMC8457174 | biostudies-literature

Psychedelics Reopen the Social Reward Learning Critical Period

Project description:Psychedelics are a broad class of drugs defined by their ability to induce an altered state of consciousness. These drugs have been used for millennia in both spiritual and medicinal contexts, and a number of recent clinical successes have spurred a renewed interest in developing psychedelic therapies. Nevertheless, a unifying mechanism that can account for these shared phenomenological and therapeutic properties remains unknown. Here we demonstrate in mice that the ability to reopen the social reward learning critical period is a shared property across psychedelics. Interestingly, the time course of critical period reopening is proportional to the duration of acute subjective effects reported in humans. Furthermore, the ability to reinstate social reward learning in adulthood is paralleled by metaplastic restoration of oxytocin mediated long-term depression (OT-LTD) in the Nucleus Accumbens (NAc). Finally, identification of differentially expressed genes in the ‘open state’ versus ‘closed state’, provides evidence that reorganization of the extracellular matrix (ECM) is a common downstream mechanism underlying psychedelic mediated critical period reopening. Together these results have significant implications for the implementation of psychedelics in clinical practice, as well as the design of novel compounds for the treatment of neuropsychiatric disease.

2023-06-14 | GSE230679 | GEO

The entropy rate of Linear Additive Markov Processes.

Project description: This work derives a theoretical value for the entropy of a Linear Additive Markov Process (LAMP), an expressive but simple model able to generate sequences with a given autocorrelation structure. Our research establishes that the theoretical entropy rate of a LAMP model is equivalent to the theoretical entropy rate of the underlying first-order Markov Chain. The LAMP model captures complex relationships and long-range dependencies in data with similar expressibility to a higher-order Markov process. While a higher-order Markov process has a polynomial parameter space, a LAMP model is characterised only by a probability distribution and the transition matrix of an underlying first-order Markov Chain. This surprising result can be explained by the information balance between the additional structure imposed by the next state distribution of the LAMP model, and the additional randomness of each new transition. Understanding the entropy of the LAMP model provides a tool to model complex dependencies in data while retaining useful theoretical results. To emphasise the practical applications, we use the LAMP model to estimate the entropy rate of the LastFM, BrightKite, Wikispeedia and Reuters-21578 datasets. We compare estimates calculated using frequency probability estimates, a first-order Markov model and the LAMP model, also considering two approaches to ensure the transition matrix is irreducible. In most cases the LAMP entropy rates are lower than those of the alternatives, suggesting that the LAMP model is better at accommodating structural dependencies in the processes, achieving a more accurate estimate of the true entropy.

| S-EPMC10997120 | biostudies-literature
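The entropy rate of a first-order Markov chain, which the LAMP description above refers to, has a standard closed form: H = -Σ_i π_i Σ_j P_ij log2 P_ij, where π is the stationary distribution of the transition matrix P. The sketch below evaluates that formula. It does not implement the LAMP model itself, and the function names and example matrices are hypothetical, chosen only for illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi of a row-stochastic matrix P.

    Solves pi @ P = pi together with sum(pi) = 1. Assumes the
    stationary distribution is unique (e.g. P is irreducible).
    """
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def entropy_rate(P):
    """Entropy rate, in bits per step, of a first-order Markov chain."""
    P = np.asarray(P, dtype=float)
    pi = stationary_distribution(P)
    # Use the convention 0 * log(0) = 0 for impossible transitions.
    logP = np.zeros_like(P)
    mask = P > 0
    logP[mask] = np.log2(P[mask])
    return float(-(pi[:, None] * P * logP).sum())

# A chain that picks the next state by a fair coin flip.
fair_chain = np.array([[0.5, 0.5], [0.5, 0.5]])
# A chain that deterministically alternates between two states.
alternating_chain = np.array([[0.0, 1.0], [1.0, 0.0]])

print(entropy_rate(fair_chain), entropy_rate(alternating_chain))
```

The fair-coin chain carries one bit of new information per step, while the alternating chain is fully predictable and has entropy rate zero.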
