Policy Iteration Algorithm for Nonzero-Sum Games with Unknown Models

Journal of Northeastern University (Natural Science), 2015, Vol. 36, Issue (3): 318-322. DOI: 10.12068/j.issn.1005-3026.2015.03.004

YANG Ming¹, LUO Yan-hong¹, WANG Yi-he²

1. School of Information Science & Engineering, Northeastern University, Shenyang 110819, China; 2. Economic Technology Institute, State Grid Liaoning Electric Power Co., Ltd., Shenyang 110000, China.

Received: 2014-01-08 Revised: 2014-01-08 Online: 2015-03-15 Published: 2014-11-07
Corresponding author: YANG Ming
About the author: YANG Ming (1987-), male, from Linyi, Shandong; Ph.D. candidate at Northeastern University.
Supported by: National Natural Science Foundation of China (61104010); Specialized Research Fund for the Doctoral Program of Higher Education (20110042120032).

Abstract

An online integral policy iteration algorithm is proposed for solving two-player nonzero-sum differential games whose nonlinear continuous-time dynamics are unknown. Exploration signals are added to the control and disturbance policies so that the algorithm does not require model information, which yields a model-free approximate dynamic programming (ADP) method for nonzero-sum games. The algorithm updates the value functions, the control policy, and the disturbance policy simultaneously, and it produces converged policy weights. In the implementation, four neural networks approximate the two value functions, the control policy, and the disturbance policy, and the least squares method is used to estimate the unknown network parameters. A simulation example demonstrates the effectiveness of the algorithm.

Key words: adaptive dynamic programming, nonzero-sum games, policy iteration, neural networks, optimal control
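The least-squares step named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it has no exploration signals, no neural networks, and no policy improvement. It only shows how, for fixed policies, the weight of each player's value function can be fitted from trajectory and cost data through the integral relation V(x(t)) - V(x(t+T)) = ∫ r dτ, without using the system model in the fit. The scalar linear system, the gains `k1` and `k2`, the two cost functions, the quadratic basis x², and the function name `evaluate_value_weight` are all assumptions made for illustration.

```python
import numpy as np


def evaluate_value_weight(x, costs, dt, steps_per_interval):
    """Least-squares fit of w in V(x) ~= w * x**2 from sampled data.

    Relies on V(x(t)) - V(x(t+T)) = integral of the running cost over
    [t, t+T], so only measured states and costs are needed.
    """
    n = steps_per_interval
    starts = np.arange(0, len(x) - n, n)  # first sample of each interval
    phi = x ** 2                          # assumed quadratic basis
    # Regressor: basis difference across each interval.
    A = (phi[starts] - phi[starts + n]).reshape(-1, 1)
    # Target: trapezoidal integral of the running cost on each interval.
    b = np.array([
        dt * (0.5 * costs[i] + costs[i + 1:i + n].sum() + 0.5 * costs[i + n])
        for i in starts
    ])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(w[0])


# Hypothetical data source: dx/dt = a*x + b1*u1 + b2*u2 with fixed
# linear policies u1 = -k1*x and u2 = -k2*x. These parameters are used
# only to generate the trajectory, never inside the fit itself.
a, b1, b2 = -1.0, 1.0, 0.5
k1, k2 = 1.0, 0.4
c = -(a - b1 * k1 - b2 * k2)     # closed-loop decay rate (positive)

dt = 1e-3
t = np.arange(2001) * dt
x = 2.0 * np.exp(-c * t)         # closed-form closed-loop trajectory
u1, u2 = -k1 * x, -k2 * x

# Each player has its own running cost (illustrative choices).
r1 = 2.0 * x ** 2 + u1 ** 2
r2 = 1.0 * x ** 2 + u2 ** 2

w1 = evaluate_value_weight(x, r1, dt, steps_per_interval=100)
w2 = evaluate_value_weight(x, r2, dt, steps_per_interval=100)
print(w1, w2)
```

For this toy setup the fitted weights can be checked against the closed-form values (2 + k1²)/(2c) and (1 + k2²)/(2c), which follow from integrating each quadratic cost along the exponentially decaying trajectory.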

CLC number: TP183

YANG Ming, LUO Yan-hong, WANG Yi-he. Policy Iteration Algorithm for Nonzero-Sum Games with Unknown Models[J]. Journal of Northeastern University (Natural Science), 2015, 36(3): 318-322.

