Resource Adaptation Scheme for Beam-Hopping Satellite System Based on MASAC Maximum Entropy Reinforcement Learning

doi:10.12068/j.issn.1005-3026.2025.20230252

Abstract:

Diverse traffic demands from communication terminals in beam-hopping satellite systems cause a mismatch between satellite-to-ground resource supply and demand, and machine-type terminals have limited energy for uplink transmission. To address both challenges, a resource adaptation scheme based on multi-agent soft actor-critic (MASAC) maximum entropy reinforcement learning is proposed. First, a two-stage transmission system model is constructed, and the synergy between beam hopping and non-orthogonal multiple access (NOMA) is studied in the context of the satellite-to-ground supply-demand mismatch. An energy harvesting mechanism is also introduced to optimize the relationship between energy harvesting and signal transmission at the terminal devices. On this basis, the uplink and downlink transmission processes are integrated into a multi-objective optimization problem covering beam-hopping pattern selection, time slot allocation, and rate and power control. The problem is solved with the MASAC algorithm to obtain an optimal joint control strategy. Experimental results show that the proposed scheme allocates resources effectively to match satellite-to-ground supply and demand, and that it meets the signal transmission requirements of energy-constrained machine-type terminals. Compared with the benchmark algorithms, the proposed algorithm exhibits superior performance.

Key words: beam-hopping satellite, non-orthogonal multiple access (NOMA), energy harvesting, resource adaptation, deep reinforcement learning
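
For readers unfamiliar with the term, the "maximum entropy" objective that soft actor-critic methods optimize can be written in its generic single-agent form as below. This is the standard textbook formulation, not the paper's own equation: the abstract does not give the reward function, the state and action definitions, or the multi-agent decomposition, so the symbols r, \alpha, and \rho_\pi here are illustrative assumptions.

```latex
% Generic maximum entropy RL objective (standard SAC form; not taken from the paper).
% r(s_t, a_t): reward; \alpha > 0: entropy temperature; \rho_\pi: state-action
% distribution induced by policy \pi.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}
         \left[ \log \pi(a \mid s) \right]
```

The entropy term rewards policies that stay stochastic, which encourages exploration. Setting \alpha = 0 recovers the usual expected-return objective.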

CLC number:

TP 915

Yi-xuan WANG, Jun LIU. Resource Adaptation Scheme for Beam-Hopping Satellite System Based on MASAC Maximum Entropy Reinforcement Learning[J]. Journal of Northeastern University (Natural Science), 2025, 46(2): 9-17.

Figures/Tables (9)

Fig.1 Overall framework of the resource adaptation scheme for the beam-hopping satellite system

Fig.2 Two-stage transmission system model

Fig.3 Energy harvesting and signal transmission process

Fig.4 MASAC-based deep reinforcement learning framework

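Table 1 Main simulation parameters

Simulation parameter	Value
Satellite orbit altitude/km	1 000
Downlink operating frequency/GHz	20
System bandwidth/MHz	500
Number of satellite beams	5
Total number of service cells	30
Maximum satellite antenna gain/dBi	52
User receive antenna gain/dBi	21
Total satellite onboard power/W	2 000
Primary user transmit power/dBm	30
Noise power density/(dBm·Hz^-1)	-174
Time slot length/ms	2
Maximum capacity of energy storage unit/J	0.6
Energy harvesting efficiency coefficient	0.7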
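
To give a feel for the scale of the Table 1 values, the following Python sketch computes a rough downlink link budget from them. It is a back-of-the-envelope illustration only, not the paper's channel model. It assumes free-space propagation, a user directly below the satellite, an equal split of the onboard power across the 5 beams, the full 500 MHz per beam, and no atmospheric or pointing losses. None of these assumptions is stated on this page.

```python
import math

# Values taken from Table 1.
ALTITUDE_KM = 1000.0           # satellite orbit altitude
FREQ_GHZ = 20.0                # downlink operating frequency
BANDWIDTH_HZ = 500e6           # system bandwidth
NUM_BEAMS = 5                  # number of satellite beams
SAT_GAIN_DBI = 52.0            # maximum satellite antenna gain
USER_GAIN_DBI = 21.0           # user receive antenna gain
TOTAL_POWER_W = 2000.0         # total satellite onboard power
NOISE_DENSITY_DBM_HZ = -174.0  # noise power density


def watts_to_dbm(power_w: float) -> float:
    """Convert a power in watts to dBm."""
    return 10.0 * math.log10(power_w * 1000.0)


def noise_power_dbm(density_dbm_hz: float, bandwidth_hz: float) -> float:
    """Total noise power over the given bandwidth."""
    return density_dbm_hz + 10.0 * math.log10(bandwidth_hz)


def free_space_path_loss_db(distance_km: float, freq_ghz: float) -> float:
    """Free-space path loss with distance in km and frequency in GHz."""
    return 20.0 * math.log10(distance_km) + 20.0 * math.log10(freq_ghz) + 92.45


def downlink_snr_db() -> float:
    """Rough per-beam SNR. ASSUMPTIONS: equal power split across beams,
    nadir user, full bandwidth per beam, no losses other than free space."""
    beam_power_dbm = watts_to_dbm(TOTAL_POWER_W / NUM_BEAMS)
    path_loss_db = free_space_path_loss_db(ALTITUDE_KM, FREQ_GHZ)
    rx_power_dbm = beam_power_dbm + SAT_GAIN_DBI + USER_GAIN_DBI - path_loss_db
    return rx_power_dbm - noise_power_dbm(NOISE_DENSITY_DBM_HZ, BANDWIDTH_HZ)


if __name__ == "__main__":
    print(f"Noise power: {noise_power_dbm(NOISE_DENSITY_DBM_HZ, BANDWIDTH_HZ):.2f} dBm")
    print(f"Path loss:   {free_space_path_loss_db(ALTITUDE_KM, FREQ_GHZ):.2f} dB")
    print(f"Rough SNR:   {downlink_snr_db():.2f} dB")
```

Under these simplifying assumptions the noise floor is roughly -87 dBm and the free-space loss roughly 178 dB. A real beam-hopping system would see lower SNR at the beam edge and under inter-beam interference.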

Table 2 MASAC algorithm parameter settings

MASAC algorithm parameter	Value
Training episodes	400
Training steps	100
Replay buffer capacity	1 000
Discount factor γ	0.9
Learning rate	0.001
Batch size	32
Optimizer	Adam
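
As a minimal sketch of how the Table 2 settings might enter a soft actor-critic update, the Python snippet below stores them in a configuration object and computes the standard soft Bellman target for one transition. The class and function names are hypothetical. The entropy temperature `alpha` is not listed in Table 2 and is an assumed placeholder, as is the reading of "training steps" as steps per episode. The paper's actual networks and multi-agent update rule are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasacConfig:
    """Hyperparameters from Table 2 (names are illustrative)."""
    episodes: int = 400           # training episodes
    steps: int = 100              # training steps (ASSUMED to be per episode)
    replay_capacity: int = 1000   # replay buffer capacity
    gamma: float = 0.9            # discount factor
    learning_rate: float = 0.001  # learning rate
    batch_size: int = 32          # batch size
    optimizer: str = "Adam"       # optimizer
    alpha: float = 0.2            # ASSUMPTION: entropy temperature, not in Table 2


def soft_q_target(reward: float, q1_next: float, q2_next: float,
                  log_prob_next: float, cfg: MasacConfig,
                  done: bool = False) -> float:
    """Standard SAC critic target for one transition:
    y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).
    The next-state terms are dropped when the episode has ended."""
    if done:
        return reward
    soft_value = min(q1_next, q2_next) - cfg.alpha * log_prob_next
    return reward + cfg.gamma * soft_value


if __name__ == "__main__":
    cfg = MasacConfig()
    # Toy numbers chosen only to exercise the formula.
    print(soft_q_target(reward=1.0, q1_next=2.0, q2_next=3.0,
                        log_prob_next=-0.5, cfg=cfg))
```

Taking the minimum of two critic estimates is the usual clipped double-Q device for reducing overestimation. The `-alpha * log_prob_next` term is where the entropy bonus enters the target.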

Fig.5 Comparison of the satellite-to-ground supply-demand traffic relationship with different algorithms: (a) supply and demand traffic under the MASAC algorithm; (b) supply and demand traffic under the MADDPG algorithm; (c) supply and demand traffic under the Random algorithm.

Fig.6 Relationship between the average throughput of the SU and the radiated power of the PU

Fig.7 Convergence performance of the three algorithms

