Project description:The combination of reinforcement learning with deep learning is a promising approach to tackle important sequential decision-making problems that are currently intractable. One obstacle to overcome is the amount of data needed by learning systems of this type. In this article, we propose to address this issue through a divide-and-conquer approach. We argue that complex decision problems can be naturally decomposed into multiple tasks that unfold in sequence or in parallel. By associating each task with a reward function, this problem decomposition can be seamlessly accommodated within the standard reinforcement-learning formalism. The specific way we do so is through a generalization of two fundamental operations in reinforcement learning: policy improvement and policy evaluation. The generalized version of these operations allows one to leverage the solution of some tasks to speed up the solution of others. If the reward function of a task can be well approximated as a linear combination of the reward functions of tasks previously solved, we can reduce a reinforcement-learning problem to a simpler linear-regression problem. When this is not the case, the agent can still exploit the task solutions by using them to interact with and learn about the environment. Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.
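The two ideas in this description lend themselves to a compact illustration. The sketch below is a minimal toy example, not the authors' implementation: it fits the new task's reward as a linear combination of previously solved tasks' rewards by least squares, then selects actions by generalized policy improvement, i.e., acting greedily with respect to the maximum over the old tasks' Q-functions (all arrays here are random stand-ins).

```python
# Minimal sketch: reward of a new task approximated as a linear combination of
# previously solved tasks' rewards, followed by generalized policy improvement.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_tasks, n_states, n_actions = 200, 3, 10, 4

# Rewards of previously solved tasks evaluated on sampled transitions.
phi = rng.normal(size=(n_samples, n_tasks))
true_w = np.array([0.5, -1.0, 2.0])
r_new = phi @ true_w + 0.01 * rng.normal(size=n_samples)  # new task's reward samples

# Reduce the RL problem to linear regression: solve for the combination weights.
w, *_ = np.linalg.lstsq(phi, r_new, rcond=None)

# Generalized policy improvement: act greedily w.r.t. the best of the old Q-functions.
Q_old = rng.normal(size=(n_tasks, n_states, n_actions))  # stand-ins for learned Q_i
def gpi_action(state):
    return int(np.argmax(Q_old[:, state, :].max(axis=0)))

print("recovered weights:", np.round(w, 2), " action in state 0:", gpi_action(0))
```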
Project description:Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: https://github.com/wabbajack1/TAPD.
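A minimal sketch of the two ingredients named above, under illustrative assumptions rather than the TAPD codebase: an intrinsic reward computed as the prediction error against a fixed random target network (one common way to implement task-agnostic, novelty-seeking exploration), and a distillation loss that transfers the exploratory teacher policy into a student network.

```python
# Minimal sketch (illustrative, not the TAPD implementation): task-agnostic phase
# with an RND-style intrinsic reward, followed by policy distillation.
import torch, torch.nn as nn

obs_dim, act_dim = 8, 4
target = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 16))
predictor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 16))
for p in target.parameters():
    p.requires_grad_(False)  # the random target stays fixed

def intrinsic_reward(obs):
    # Novel states are poorly predicted, so they yield a high intrinsic reward.
    return ((predictor(obs) - target(obs)) ** 2).mean(dim=-1)

# Distillation: a student policy matches the exploratory teacher's action distribution.
teacher = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
student = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
obs = torch.randn(64, obs_dim)
distill_loss = nn.functional.kl_div(
    torch.log_softmax(student(obs), dim=-1),
    torch.softmax(teacher(obs), dim=-1),
    reduction="batchmean",
)
print(float(intrinsic_reward(obs).mean()), float(distill_loss))
```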
Project description:Generating novel valid molecules is a difficult task, because exploring the vast chemical space has traditionally relied on the intuition of experienced chemists. In recent years, deep learning models have helped accelerate this process. These advanced models can also help identify suitable molecules for disease treatment. In this paper, we propose Taiga, a transformer-based architecture for the generation of molecules with desired properties. Using a two-stage approach, we first treat the problem as a language modeling task of predicting the next token, using SMILES strings. Then, we use reinforcement learning to optimize molecular properties such as QED. This approach allows our model to learn the underlying rules of chemistry and more easily optimize for molecules with desired properties. Our evaluation of Taiga, which was performed with multiple datasets and tasks, shows that Taiga is comparable to, or even outperforms, state-of-the-art baselines for molecule optimization, with improvements in QED ranging from 2 to over 20 percent. The improvement was demonstrated on datasets containing both lead molecules and random molecules. We also show that with its two stages, Taiga is capable of generating molecules with higher biological property scores than the same model without reinforcement learning.
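A minimal sketch of the reward used in the second (reinforcement-learning) stage, assuming RDKit for property computation; this is an illustration, not the Taiga codebase. Generated SMILES strings are scored by QED, invalid molecules receive zero reward, and a REINFORCE-style update would then weight each sequence's log-likelihood by its baseline-subtracted reward.

```python
# Minimal sketch (assumes RDKit; not the Taiga codebase): scoring generated SMILES
# with QED as the RL reward, with invalid molecules receiving zero reward.
from rdkit import Chem
from rdkit.Chem.QED import qed

def reward(smiles: str) -> float:
    mol = Chem.MolFromSmiles(smiles)
    return qed(mol) if mol is not None else 0.0  # invalid SMILES get no reward

# A REINFORCE-style update would weight the log-likelihood of each generated
# sequence by (reward - baseline); here we only show the scoring step.
for s in ["CCO", "c1ccccc1O", "not-a-molecule"]:
    print(s, round(reward(s), 3))
```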
Project description:From the computational point of view, musculoskeletal control is the problem of controlling a high-degree-of-freedom, dynamic multi-body system that is driven by redundant muscle units. A critical challenge in controlling skeletal joints with antagonistic muscle pairs is finding methods robust enough to address this ill-posed nonlinear problem. To address this computational problem, we implemented a twofold optimization and learning framework specialized in addressing the redundancies in muscle control. In the first part, we used model predictive control to obtain energy-efficient skeletal trajectories that mimic human movements. In the second part, we used deep reinforcement learning to obtain the sequence of stimuli to be given to the muscles in order to reproduce these skeletal trajectories through muscle control. We observed that the desired muscle stimuli can only be constructed efficiently by integrating the state and control input in a closed-loop setting, which resembles the proprioceptive integration in spinal cord circuits. In this work, we showed how a variety of reference trajectories can be obtained with optimal control and how these reference trajectories are mapped to musculoskeletal control with deep reinforcement learning. From the characteristics of human arm movement to an obstacle-avoidance experiment, our simulation results confirm the capabilities of our optimization and learning framework for a variety of dynamic movement trajectories. In summary, the proposed framework offers a pipeline to complement the lack of motion-capture recordings of human movement and to study the activation range of muscles needed to replicate a specific trajectory of interest. Using the trajectories from optimal control as reference signals for the reinforcement learning implementation allowed us to obtain optimal and human-like behaviour of the musculoskeletal system, providing a framework for studying human movement in in-silico experiments. The framework can also support studies of upper-arm rehabilitation with assistive robots, since movement recordings of healthy subjects can serve as references for the control architecture of assistive robotics designed to compensate for behavioural deficiencies. Hence, the framework opens the possibility of replicating or complementing labour-intensive, time-consuming, and costly experiments with human subjects in the fields of movement studies and digital twins for rehabilitation.
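A minimal sketch of how the deep-reinforcement-learning stage could be framed, under illustrative assumptions rather than the paper's implementation: the reward tracks the MPC-generated reference trajectory while penalizing stimulation effort, and the observation closes the loop by combining joint state, the reference, and the previous stimulus.

```python
# Minimal sketch (illustrative assumptions): tracking reward for muscle control and
# a closed-loop observation that includes the previous stimulus.
import numpy as np

def tracking_reward(q, q_ref, stim, effort_weight=1e-2):
    # q, q_ref: joint angles; stim: muscle excitations in [0, 1]
    return -np.sum((q - q_ref) ** 2) - effort_weight * np.sum(stim ** 2)

def observation(q, dq, q_ref, prev_stim):
    # Closed-loop state: joint angles and velocities, reference, previous control input.
    return np.concatenate([q, dq, q_ref, prev_stim])

q = np.array([0.10, 0.40]); dq = np.zeros(2)
q_ref = np.array([0.12, 0.38]); stim = np.array([0.2, 0.05, 0.3, 0.0])
print(tracking_reward(q, q_ref, stim), observation(q, dq, q_ref, stim).shape)
```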
Project description:We present a general, two-stage reinforcement learning approach that uses a single demonstration generated by trajectory optimization to create robust policies that can be deployed on real robots without any additional training. The demonstration is used in the first stage as a starting point to facilitate initial exploration. In the second stage, the relevant task reward is optimized directly and a policy robust to environment uncertainties is computed. We demonstrate and examine in detail the performance and robustness of our approach on highly dynamic hopping and bounding tasks on a quadruped robot.
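A minimal sketch of the two-stage structure described above, with assumed reward shaping and randomization ranges that are purely illustrative: stage one rewards staying close to the demonstration to guide exploration, while stage two optimizes the task reward directly under randomized dynamics to obtain robustness.

```python
# Minimal sketch (assumed structure, not the authors' code): demonstration-guided
# stage one, then direct task-reward optimization under randomized dynamics.
import numpy as np

def stage_one_reward(state, demo_state, task_reward):
    # Penalize deviation from the single trajectory-optimization demonstration.
    return task_reward - 0.5 * np.sum((state - demo_state) ** 2)

def stage_two_env_params(rng):
    # Randomized dynamics parameters (illustrative names and ranges).
    return {"ground_friction": rng.uniform(0.4, 1.0),
            "payload_mass": rng.uniform(0.0, 2.0)}

rng = np.random.default_rng(0)
print(stage_one_reward(np.ones(3), np.zeros(3), 1.0), stage_two_env_params(rng))
```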
Project description:This study investigated the trajectory-planning problem of a six-axis robotic arm based on deep reinforcement learning. Taking into account several characteristics of robot motion, a multi-objective optimization approach was proposed, motivated by deep reinforcement learning and optimal planning. The optimal trajectory was considered with respect to multiple objectives, aiming to jointly optimize accuracy, energy consumption, and smoothness. The multiple objectives were integrated into the reinforcement learning environment to achieve the desired trajectory. Based on forward and inverse kinematics, the joint angles and Cartesian coordinates were used as the input parameters, while the joint angle estimation served as the output. To enable the agent to rapidly find more efficient solutions, a decaying-episode mechanism was employed throughout the training process. The distribution of the trajectory points was improved in terms of uniformity and smoothness, which greatly contributed to the optimization of the robotic arm's trajectory. The proposed method demonstrated its effectiveness in comparison with the RRT algorithm, as evidenced by simulations and physical experiments.
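A minimal sketch of how the multiple objectives could be folded into a single scalar reward, with illustrative weights and terms rather than the paper's exact formulation: the reward penalizes end-effector tracking error, a proxy for energy consumption, and non-smooth joint motion.

```python
# Minimal sketch (illustrative weights): multi-objective reward combining accuracy,
# energy consumption, and smoothness for arm trajectory planning.
import numpy as np

def reward(ee_pos, target, joint_vel, joint_acc, w=(1.0, 0.01, 0.001)):
    accuracy = -np.linalg.norm(ee_pos - target)   # end-effector tracking error
    energy   = -np.sum(joint_vel ** 2)            # proxy for energy consumption
    smooth   = -np.sum(joint_acc ** 2)            # penalize jerky joint motion
    return w[0] * accuracy + w[1] * energy + w[2] * smooth

print(reward(np.array([0.30, 0.10, 0.50]), np.array([0.32, 0.10, 0.48]),
             np.zeros(6), np.zeros(6)))
```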
Project description:In recent years, evidence from animal models and studies in humans has accumulated for the role of cardiovascular exercise (CE) in improving motor performance and learning. Both CE and motor learning may induce highly dynamic structural and functional brain changes, but how both processes interact to boost learning is presently unclear. Here, we hypothesized that subjects receiving CE would show a different pattern of learning-related brain plasticity compared to non-CE controls, which in turn is associated with improved motor learning. To address this issue, we paired CE and motor learning sequentially in a randomized controlled trial with healthy human participants. Specifically, we compared the effects of a 2-week CE intervention against a non-CE control group on subsequent learning of a challenging dynamic balancing task (DBT) over 6 consecutive weeks. Structural and functional MRI measurements were conducted at regular 2-week intervals to investigate dynamic brain changes during the experiment. The trajectory of learning-related changes in white matter microstructure beneath parieto-occipital and primary sensorimotor areas of the right hemisphere differed between the CE and non-CE groups, and these changes correlated with improved learning in the CE group. While group differences in sensorimotor white matter were already present immediately after CE and persisted during DBT learning, parieto-occipital effects gradually emerged during motor learning. Finally, we found that spontaneous neural activity at rest in gray matter spatially adjacent to the white matter findings was also altered, indicating a meaningful link between structural and functional plasticity. Collectively, these findings may lead to a better understanding of the neural mechanisms mediating the CE-learning link within the brain.
Project description:Background: Sepsis is one of the most life-threatening medical conditions. Therefore, many clinical trials have been conducted to identify optimal treatment strategies for sepsis. However, finding reliable strategies remains challenging due to limited-scale clinical tests. Here we tried to extract the optimal sepsis treatment policy from accumulated treatment records. Methods: In this study, with our modified deep reinforcement learning algorithm, we stably generated a patient treatment artificial intelligence model. As training data, 16,744 distinct admissions in tertiary hospitals were used, and the model was tested with separate datasets. Model performance was tested by t-test and visualization of estimated survival rates. We also analyzed model behavior using a confusion matrix, feature-importance extraction with a random forest decision tree, and a comparison of treatment behaviors to understand how our treatment model achieves high performance. Results: Here we show that our treatment model's policy achieves a significantly higher estimated survival rate (up to 10.03%). We also show that our model's vasopressor treatment was quite different from that of physicians. We identify blood urea nitrogen, age, sequential organ failure assessment score, and shock index as the factors on which our model and physicians differ most when treating sepsis patients. Conclusions: Our results demonstrate that the patient treatment model can extract a potentially optimal sepsis treatment policy. We also extract core information about sepsis treatment by analyzing its policy. These results may not apply directly in clinical settings because they were only tested on a database. However, they are expected to serve as important guidelines for further research.
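As an illustration only (toy values; not the study's model, data, or action definitions), a value-based offline-RL setup of this kind typically discretizes vasopressor and fluid doses into bins and updates a value function from recorded transitions:

```python
# Minimal sketch (illustrative only): discretized treatment actions and an offline
# value update from a recorded transition.
import numpy as np

n_vaso_bins, n_fluid_bins = 5, 5
n_actions = n_vaso_bins * n_fluid_bins   # joint vasopressor/fluid dose action space
n_states = 100                           # clustered patient states (assumed)
Q = np.zeros((n_states, n_actions))
gamma, lr = 0.99, 0.1

def offline_q_update(s, a, r, s_next, done):
    # Standard one-step value target computed from logged data, no interaction needed.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += lr * (target - Q[s, a])

# One synthetic transition: the reward would reflect the estimated survival outcome.
offline_q_update(s=3, a=7, r=0.0, s_next=12, done=False)
print(Q[3, 7])
```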
Project description:Recent studies have shown that combining Transformer and conditional strategies to deal with offline reinforcement learning can bring better results. However, in a conventional reinforcement learning scenario the agent receives observations one frame at a time in their natural chronological order, whereas a Transformer receives a series of observations at each step. As a result, individual features cannot be extracted efficiently to support accurate decisions, and it remains difficult to generalize to data outside the training distribution. We focus on the few-shot learning capability of pre-trained models and combine it with prompt learning to enhance real-time policy adjustment. By sampling task-specific information from the offline dataset as trajectory prompts, task information is encoded to help the pre-trained model quickly understand the task characteristics and, through its sequence-generation paradigm, quickly adapt to downstream tasks. To capture the dependencies in the sequence more accurately, we also divide the state information in the input trajectory into fixed-size blocks, extract features from each sub-block separately, and finally encode the whole sequence with the GPT model to generate decisions more accurately. Experiments show that the proposed method achieves better performance than the baseline method on related tasks, generalizes better to new environments and tasks, and effectively improves the stability and accuracy of agent decision making.
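A minimal sketch with assumed shapes (not the paper's architecture) of the two mechanisms described above: a trajectory prompt sampled from the offline dataset is encoded and prepended to the input, and the state sequence is split into fixed-size sub-blocks whose features are extracted before the whole sequence is passed to a Transformer layer standing in for the GPT decoder.

```python
# Minimal sketch (assumed shapes): prompt tokens prepended to sub-block features
# before a Transformer layer that stands in for the GPT decoder.
import torch, torch.nn as nn

d_model, block_size, state_dim = 64, 4, 16
block_encoder = nn.Linear(block_size * state_dim, d_model)   # per-sub-block features
prompt_encoder = nn.Linear(state_dim, d_model)                # trajectory-prompt tokens

states = torch.randn(1, 32, state_dim)    # full state sequence from the trajectory
prompt = torch.randn(1, 8, state_dim)     # few-shot trajectory prompt from the dataset

blocks = states.reshape(1, 32 // block_size, block_size * state_dim)
tokens = torch.cat([prompt_encoder(prompt), block_encoder(blocks)], dim=1)

gpt_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
print(gpt_layer(tokens).shape)  # sequence of features used to predict actions
```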
Project description:Diffusion in alloys is an important class of atomic processes. However, atomistic simulations of diffusion in chemically complex solids are confronted with the timescale problem: the accessible simulation time is usually far shorter than that of experimental interest. In this work, long-timescale simulation methods are developed using reinforcement learning (RL) that extend simulation capability to match the duration of experimental interest. Two special limits, the RL transition kinetics simulator (TKS) and the RL low-energy states sampler (LSS), are implemented and explained in detail, while the meaning of general RL is also discussed. As a testbed, hydrogen diffusivity is computed using RL TKS in pure metals and a medium-entropy alloy, CrCoNi, and compared with experiments. The algorithm can produce counter-intuitive hydrogen-vacancy cooperative motion. We also demonstrate that RL LSS can accelerate the sampling of low-energy configurations compared to the Metropolis-Hastings algorithm, using hydrogen migration to the copper (111) surface as an example.
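For context on the baseline mentioned in the last sentence, below is a minimal sketch of Metropolis-Hastings sampling on a toy energy function (illustrative only; not the paper's interatomic potentials or configuration space): candidate configurations are accepted with probability weighted by the Boltzmann factor, which is the procedure RL LSS is reported to accelerate.

```python
# Minimal sketch (toy energy model): Metropolis-Hastings sampling of low-energy
# configurations, the baseline against which RL LSS is compared.
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # Toy stand-in for a configuration's potential energy (minimum at x = 1).
    return np.sum((x - 1.0) ** 2)

x = np.zeros(3)
kT = 0.1
for _ in range(5000):
    proposal = x + 0.1 * rng.normal(size=x.shape)
    # Accept with probability min(1, exp(-dE/kT)); lower-energy moves are favored.
    if rng.random() < np.exp(-(energy(proposal) - energy(x)) / kT):
        x = proposal
print(np.round(x, 2), round(energy(x), 4))
```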