Dataset Information

Humanoids Learning to Walk: A Natural CPG-Actor-Critic Architecture.

ABSTRACT: The identification of learning mechanisms for locomotion has been the subject of much research for some time, but many challenges remain. Dynamic systems theory (DST) offers a novel approach to humanoid learning through environmental interaction. Reinforcement learning (RL) has offered a promising method to adaptively link the dynamic system to the environment it interacts with via a reward-based value system. In this paper, we propose a model that integrates the above perspectives and apply it to the case of a humanoid (NAO) robot learning to walk, an ability that emerges from its value-based interaction with the environment. In the model, a simplified central pattern generator (CPG) architecture inspired by neuroscientific research and DST is integrated with an actor-critic approach to RL (cpg-actor-critic). In the cpg-actor-critic architecture, least-square-temporal-difference based learning converges to the optimal solution quickly by using natural gradient learning and balancing exploration and exploitation. Furthermore, rather than using a traditional (designer-specified) reward, it uses a dynamic value function as a stability indicator that adapts to the environment. The results obtained are analyzed using a novel DST-based embodied cognition approach. Learning to walk, from this perspective, is a process of integrating levels of sensorimotor activity and value.
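
To illustrate what a central pattern generator does in general terms, the sketch below couples two phase oscillators so that they settle into antiphase, as a left and a right leg would during walking. It is not the model from the paper: the function names, the coupling form, and all parameter values are assumptions chosen for illustration only.

```python
import math


def simulate_cpg(steps=2000, dt=0.01, freq_hz=1.0, coupling=2.0):
    """Integrate two coupled phase oscillators with forward Euler.

    All names and parameter values here are hypothetical; they are not
    taken from the paper.
    """
    omega = 2.0 * math.pi * freq_hz  # shared intrinsic frequency (rad/s)
    left, right = 0.0, 0.5           # arbitrary initial phases (radians)
    for _ in range(steps):
        # Each oscillator is pulled toward a phase offset of pi from the other.
        d_left = omega + coupling * math.sin(right - left - math.pi)
        d_right = omega + coupling * math.sin(left - right - math.pi)
        left += d_left * dt
        right += d_right * dt
    return left, right


def phase_difference(left, right):
    """Phase difference wrapped into [0, 2*pi)."""
    return (right - left) % (2.0 * math.pi)


def joint_targets(left, right, amplitude=0.3):
    """Map phases to rhythmic joint targets (illustrative units)."""
    return amplitude * math.sin(left), amplitude * math.sin(right)


left, right = simulate_cpg()
print(phase_difference(left, right))
```

With this coupling, the in-phase state is unstable and the antiphase state is stable, so the printed phase difference approaches pi. In an architecture of the kind the abstract describes, a learned actor would adjust such oscillator parameters; that part is not shown here.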

SUBMITTER: Li C

PROVIDER: S-EPMC3619089 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Humanoids Learning to Walk: A Natural CPG-Actor-Critic Architecture.

Li C, Lowe R, Ziemke T

Frontiers in Neurorobotics, 2013-04-08

PMID: 23675345

Similar Datasets

Project description: The human masticatory system is a complex functional unit characterized by a multitude of skeletal components, muscles, soft tissues, and teeth. Muscle activation dynamics cannot be directly measured on live human subjects due to ethical, safety, and accessibility limitations. Therefore, estimation of muscle activations and their resultant forces is a longstanding and active area of research. Reinforcement learning (RL) is an adaptive learning strategy that is inspired by behavioral psychology and enables an agent to learn the dynamics of an unknown system via policy-driven explorations. The RL framework is a well-formulated closed-loop system where high-capacity neural networks are trained with the feedback mechanism of rewards to learn relatively complex actuation patterns. In this work, we build on a deep RL algorithm, known as the Soft Actor-Critic, to learn the inverse dynamics of a simulated masticatory system, i.e., learn the activation patterns that drive the jaw to its desired location. The outcome of the proposed training procedure is a parametric neural model which acts as the brain of the biomechanical system. We demonstrate the model's ability to navigate the feasible three-dimensional (3D) envelope of motion with sub-millimeter accuracies. We also introduce a performance analysis platform consisting of a set of quantitative metrics to assess the functionalities of a given simulated masticatory system. This platform assesses the range of motion, metabolic efficiency, the agility of motion, the symmetry of activations, and the accuracy of reaching the desired target positions. We demonstrate how the model learns more metabolically efficient policies by integrating a force regularization term in the RL reward. We also demonstrate the inverse correlation between the metabolic efficiency of the models and their agility and range of motion.
The presented masticatory model and the proposed RL training mechanism are valuable tools for the analysis of mastication and other biomechanical systems. We see this framework's potential in facilitating the functional analyses aspects of surgical treatment planning and predicting the rehabilitation performance in post-operative subjects.

Project description: Dysfunctional response inhibition, mediated by the striatum and its connections, is thought to underlie the clinical manifestations of obsessive-compulsive disorder (OCD). However, the exact neural mechanisms remain controversial. In this study, we undertook a novel approach by positing that a) inhibition is a dynamic construct inherently susceptible to numerous failures, which require error-processing, and b) the actor-critic framework of reinforcement learning can integrate neural patterns of inhibition and error-processing in OCD with their behavioural correlates. We invited 19 adults with OCD and 21 age-matched healthy controls to perform an fMRI-adjusted stop-signal task. Then, we extracted brain activation and connectivity values regarding distinct task phases in the "actor" and "critic" regions, here corresponding to the caudate's head and dorsal putamen, and midbrain's nuclei (ventral tegmental area and substantia nigra). During response preparation phases of the inhibitory process, individuals with OCD exhibited decreased functional connectivity between the "critic" structures and frontal regions involved in cognitive and executive control. Activity analysis revealed task-related hyperactivation in the midbrain alongside error-processing-specific hyperactivation in the striatum, which was correlated with excessive behavioural slowness, also found in the clinical group. Finally, we identified a remarkable opponency between activity in the ventral tegmental area and caudate leading to direct increases and indirect decreases in symptom severity. We propose a unique "actor-critic"-based domain- and timing-dependent neural profile in OCD, reflecting "harm-avoidant" styles for response suppression, and influencing symptom severity.
The dichotomy of hypoconnectivity and hyperactivation in the "critic" along with the opponent relationship between the "actor" and the "critic" in determining symptom severity suggests the implication of neural adaptation mechanisms in OCD with potential relevance for neurobiologically-driven therapies.

Project description: Introduction: The value approximation bias is known to lead to suboptimal policies or catastrophic overestimation bias accumulation that prevent the agent from making the right decisions between exploration and exploitation. Algorithms have been proposed to mitigate the above contradiction. However, we still lack an understanding of how the value bias impacts performance and a method for efficient exploration while keeping stable updates. This study aims to clarify the effect of the value bias and improve the reinforcement learning algorithms to enhance sample efficiency. Methods: This study designs a simple episodic tabular MDP to research value underestimation and overestimation in actor-critic methods. This study proposes a unified framework called Realistic Actor-Critic (RAC), which employs Universal Value Function Approximators (UVFA) to simultaneously learn policies with different value confidence bounds with the same neural network, each with a different under-overestimation trade-off. Results: This study highlights that agents could over-explore low-value states due to an inflexible under-overestimation trade-off in the fixed hyperparameters setting, which is a particular form of the exploration-exploitation dilemma. RAC performs directed exploration without over-exploration using the upper bounds while still avoiding overestimation using the lower bounds. Through carefully designed experiments, this study empirically verifies that RAC achieves 10x sample efficiency and 25% performance improvement compared to Soft Actor-Critic in the most challenging Humanoid environment.
All the source code is available at https://github.com/ihuhuhu/RAC. Discussion: This research not only provides valuable insights for research on the exploration-exploitation trade-off by studying the frequency with which policies access low-value states under the guidance of different value confidence bounds, but also proposes a new unified framework that can be combined with current actor-critic methods to improve sample efficiency in the continuous control domain.
