Many tasks, such as hitting a baseball, swinging a fly swatter, or catching a football, can be viewed as one-time motions: the goal is to control the timing and the parameters of a single action so as to achieve an optimal result. For many one-time motion problems the optimal policy is difficult to obtain by solving an explicit model, so model-free reinforcement learning is well suited to them. However, although reinforcement learning has developed rapidly, there is currently no general algorithm architecture for one-time motion problems. We decompose the one-time motion problem into an action-timing subproblem and an action-parameter subproblem and construct a suitable reinforcement learning method for each. We design a combination mechanism that lets the two modules learn simultaneously by passing estimated values between them while interacting with the environment: REINFORCE combined with DPG handles continuous action-parameter spaces, and REINFORCE combined with Q-learning handles discrete action-parameter spaces. To test the algorithm, we designed and implemented an aircraft bombing simulation environment. The results show that the algorithm converges quickly and stably and is robust to different time steps and observation errors.
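To make the two-module architecture concrete, the following is a minimal sketch of one possible REINFORCE + DPG combination for a continuous action-parameter space. It is not the authors' implementation: the abstract does not specify the coupling details, so the linear function approximators, the logistic "fire now?" timing policy, and the use of the parameter module's Q-estimate as the return signal for the timing update are all illustrative assumptions, and the class name OneTimeMotionAgent is hypothetical.

```python
# Minimal sketch (not the authors' code) of a REINFORCE timing module coupled
# with a DPG parameter module, as described in the abstract.
# Assumptions: linear approximators, Bernoulli "fire/wait" timing policy,
# and the critic's value estimate passed to the timing module as its return.
import numpy as np

class OneTimeMotionAgent:
    def __init__(self, state_dim, param_dim, lr=1e-2):
        self.theta = np.zeros(state_dim)            # timing policy weights (REINFORCE)
        self.W = np.zeros((param_dim, state_dim))   # deterministic parameter policy mu(s) = W s (DPG)
        self.w_s = np.zeros(state_dim)              # critic weights for the state
        self.w_u = np.zeros(param_dim)              # critic weights for the action parameters
        self.lr = lr

    def fire_prob(self, s):
        # Probability of triggering the one-time action in state s.
        return 1.0 / (1.0 + np.exp(-self.theta @ s))

    def params(self, s):
        # Continuous action parameters chosen at the firing moment.
        return self.W @ s

    def q_value(self, s, u):
        # Linear critic Q(s, u); this estimate is what gets passed to the timing module.
        return self.w_s @ s + self.w_u @ u

    def update(self, trajectory, s_fire, u, reward):
        # Critic: regress Q(s_fire, u) toward the terminal reward of the one-time action.
        td = reward - self.q_value(s_fire, u)
        self.w_s += self.lr * td * s_fire
        self.w_u += self.lr * td * u

        # DPG: ascend grad_W mu(s) * grad_u Q(s, u); with this critic grad_u Q = w_u.
        self.W += self.lr * np.outer(self.w_u, s_fire)

        # REINFORCE on every fire/wait decision, using the critic's estimate as the return.
        G = self.q_value(s_fire, self.params(s_fire))
        for s, fired in trajectory:                 # trajectory: list of (state, fired?) pairs
            p = self.fire_prob(s)
            grad_logp = (1.0 - p) * s if fired else -p * s   # grad of log Bernoulli likelihood
            self.theta += self.lr * grad_logp * G
```

In an episode under these assumptions, the agent samples the fire/wait decision from fire_prob at every step, calls params once at the firing moment, receives a single terminal reward, and then calls update with the visited states; the discrete-parameter variant would replace the DPG module with a Q-learning table or network over candidate parameter values.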
Published in: Machine Learning Research (Volume 5, Issue 1)
DOI: 10.11648/j.mlr.20200501.12
Page(s): 10-17
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: © The Author(s), 2020. Published by Science Publishing Group
Keywords: One-time Motion, Reinforcement Learning, Motion Control
APA Style
Boxuan Fan, Guiming Chen, Hongtao Lin. (2020). Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning. Machine Learning Research, 5(1), 10-17. https://doi.org/10.11648/j.mlr.20200501.12
ACS Style
Boxuan Fan; Guiming Chen; Hongtao Lin. Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning. Mach. Learn. Res. 2020, 5(1), 10-17. doi: 10.11648/j.mlr.20200501.12
AMA Style
Boxuan Fan, Guiming Chen, Hongtao Lin. Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning. Mach Learn Res. 2020;5(1):10-17. doi: 10.11648/j.mlr.20200501.12
BibTeX
@article{10.11648/j.mlr.20200501.12,
  author  = {Boxuan Fan and Guiming Chen and Hongtao Lin},
  title   = {Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning},
  journal = {Machine Learning Research},
  volume  = {5},
  number  = {1},
  pages   = {10-17},
  year    = {2020},
  doi     = {10.11648/j.mlr.20200501.12},
  url     = {https://doi.org/10.11648/j.mlr.20200501.12},
  eprint  = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.mlr.20200501.12}
}
RIS
TY  - JOUR
T1  - Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning
AU  - Boxuan Fan
AU  - Guiming Chen
AU  - Hongtao Lin
Y1  - 2020/03/24
PY  - 2020
N1  - https://doi.org/10.11648/j.mlr.20200501.12
DO  - 10.11648/j.mlr.20200501.12
T2  - Machine Learning Research
JF  - Machine Learning Research
JO  - Machine Learning Research
SP  - 10
EP  - 17
VL  - 5
IS  - 1
PB  - Science Publishing Group
SN  - 2637-5680
UR  - https://doi.org/10.11648/j.mlr.20200501.12
ER  -