Original paper: Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning

Paper translation & notes: [Paper notes] Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning

Code: https://github.com/le-liang/MARLspectrumSharingV2X

Visio flowcharts used in this post (drawn by the author; corrections are welcome): https://download.csdn.net/download/m0_37495408/12353933


Usage

(from the original author's GitHub README)

  • To train the multi-agent RL model: main_marl_train.py + Environment_marl.py + replay_memory.py
  • To train the benchmark single-agent RL model: main_sarl_train.py + Environment_marl.py + replay_memory.py
  • To test all models in the same environment: main_test.py + Environment_marl_test.py + replay_memory.py + '/model'
    • Figures 3 and 4 in the paper can be reproduced directly by running "main_test.py". Change the V2V payload size via "self.demand_size" in "Environment_marl_test.py".
    • Figure 5 can only be obtained from the returns recorded during training.
    • Figures 6-7 show the performance of an arbitrary episode (one in which the random baseline fails while the MARL transmission succeeds). In fact, most such episodes exhibit some interesting behavior indicating multi-agent cooperation; interpretation is left to the reader.
    • The "testing" mode in "main_marl_train.py" is not recommended.

Basic class definitions

Environment_marl.py defines the four basic classes of the framework: V2Vchannels, V2Ichannels, Vehicle, and Environ. Environ has by far the most methods; Vehicle has no methods, only a few attributes; the other two classes each have two methods (computing path loss and shadow fading, respectively).

Vehicle

Initialization takes three arguments: start position, start direction, and velocity. Internally it defines two lists, neighbors and destinations, which store the neighbors and the V2V receivers respectively (numerically the two are identical here, since each V2V link's receiver is defined to be a neighbor).

class Vehicle:
    # Vehicle simulator: include all the information for a vehicle
    def __init__(self, start_position, start_direction, velocity):
        self.position = start_position
        self.direction = start_direction
        self.velocity = velocity
        self.neighbors = []
        self.destinations = []

The meaning of destinations can be seen from the code below:

    def renew_neighbor(self):  # this method belongs to class Environ
        """ Determine the neighbors of each vehicles """
        for i in range(len(self.vehicles)):
            self.vehicles[i].neighbors = []
            self.vehicles[i].actions = []
        z = np.array([[complex(c.position[0], c.position[1]) for c in self.vehicles]])
        Distance = abs(z.T - z)
        for i in range(len(self.vehicles)):
            sort_idx = np.argsort(Distance[:, i])
            for j in range(self.n_neighbor):
                self.vehicles[i].neighbors.append(sort_idx[j + 1])
            destination = self.vehicles[i].neighbors
            self.vehicles[i].destinations = destination

V2Vchannels

Internal parameters: the BS and MS heights are both set to 1.5 m and the shadowing std to 3 dB, both from TR 36.885 Table A.1.4-1; the carrier frequency is 2 GHz.

class V2Vchannels:
    # Simulator of the V2V Channels
    def __init__(self):
        self.t = 0
        self.h_bs = 1.5
        self.h_ms = 1.5
        self.fc = 2
        self.decorrelation_distance = 10
        self.shadow_std = 3

It contains two methods:

Path loss

    def get_path_loss(self, position_A, position_B):
        d1 = abs(position_A[0] - position_B[0])
        d2 = abs(position_A[1] - position_B[1])
        d = math.hypot(d1, d2) + 0.001  # sqrt(x*x + y*y)
        # effective breakpoint distance
        d_bp = 4 * (self.h_bs - 1) * (self.h_ms - 1) * self.fc * (10 ** 9) / (3 * 10 ** 8)

        def PL_Los(d):
            if d <= 3:
                return 22.7 * np.log10(3) + 41 + 20 * np.log10(self.fc / 5)
            else:
                if d < d_bp:
                    return 22.7 * np.log10(d) + 41 + 20 * np.log10(self.fc / 5)
                else:
                    return 40.0 * np.log10(d) + 9.45 - 17.3 * np.log10(self.h_bs) - 17.3 * np.log10(self.h_ms) + 2.7 * np.log10(self.fc / 5)

        def PL_NLos(d_a, d_b):
            n_j = max(2.8 - 0.0024 * d_b, 1.84)
            return PL_Los(d_a) + 20 - 12.5 * n_j + 10 * n_j * np.log10(d_b) + 3 * np.log10(self.fc / 5)

        if min(d1, d2) < 7:
            PL = PL_Los(d)
        else:
            PL = min(PL_NLos(d1, d2), PL_NLos(d2, d1))
        return PL  # + self.shadow_std * np.random.normal()

Note: the code above uses the stochastic process model (see [2], p. 328).

The path loss uses the Manhattan-grid-layout LOS model (WINNER II B1):

$PL_{\mathrm{LOS}}(d) = 22.7\log_{10}(d) + 41 + 20\log_{10}(f_c/5)$, for $3\,\mathrm{m} \le d < d'_{BP}$

and:

$PL_{\mathrm{LOS}}(d) = 40.0\log_{10}(d) + 9.45 - 17.3\log_{10}(h_{BS}) - 17.3\log_{10}(h_{MS}) + 2.7\log_{10}(f_c/5)$, for $d \ge d'_{BP}$

Here $f_c$ is in GHz, the coefficients 22.7 and 40.0 correspond to path-loss exponents $n_1 = 2.27$ before and $n_2 = 4.0$ after the breakpoint, and $d'_{BP}$ is the effective breakpoint distance, computed in the code as d_bp $= 4\,(h_{BS}-1)(h_{MS}-1)\,f_c \cdot 10^9 / (3\cdot 10^8)$.

Manhattan-grid-layout NLOS model:

$PL_{\mathrm{NLOS}}(d_1, d_2) = PL_{\mathrm{LOS}}(d_1) + 20 - 12.5\,n_j + 10\,n_j\log_{10}(d_2) + 3\log_{10}(f_c/5)$, with $n_j = \max(2.8 - 0.0024\,d_2,\ 1.84)$

where $d_1$ and $d_2$ are the distances along the two perpendicular streets (PL_NLos(d_a, d_b) in the code).

The min(...) in the second half of the code is described on p. 344 of [2]: it estimates the path loss under the assumption that the receiver may lie on the perpendicular street.

The formulas in the code come from IST-4-027756 WINNER II D1.1.2 V1.2.

Its parameter table (B1 scenario) matches the constants used in the code exactly.

Shadow fading update

    def get_shadowing(self, delta_distance, shadowing):
        return np.exp(-1 * (delta_distance / self.decorrelation_distance)) * shadowing \
               + math.sqrt(1 - np.exp(-2 * (delta_distance / self.decorrelation_distance))) * np.random.normal(0, 3)  # standard dev is 3 db

This update formula comes from [1], the paragraph following the A.1.4 channel-model table.
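
Written out as a formula (this restates the code above; $d_{\mathrm{corr}}$ is decorrelation_distance = 10 m, and $\sigma$ = 3 dB for V2V, 8 dB for V2I):

$$ S_{k+1} = e^{-\Delta d / d_{\mathrm{corr}}}\, S_k \;+\; \sqrt{1 - e^{-2\Delta d / d_{\mathrm{corr}}}}\; n_k, \qquad n_k \sim \mathcal{N}(0, \sigma^2), $$

where $\Delta d$ is the distance moved since the previous slow-fading update, so the log-normal shadowing evolves as a first-order autoregressive (Gauss-Markov) process.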

V2Ichannels

It contains the same two methods as V2Vchannels, but the path-loss calculation no longer distinguishes LOS from NLOS.

    def get_path_loss(self, position_A):
        d1 = abs(position_A[0] - self.BS_position[0])
        d2 = abs(position_A[1] - self.BS_position[1])
        distance = math.hypot(d1, d2)
        return 128.1 + 37.6 * np.log10(math.sqrt(distance ** 2 + (self.h_bs - self.h_ms) ** 2) / 1000)  # + self.shadow_std * np.random.normal()

    def get_shadowing(self, delta_distance, shadowing):
        nVeh = len(shadowing)
        self.R = np.sqrt(0.5 * np.ones([nVeh, nVeh]) + 0.5 * np.identity(nVeh))
        return np.multiply(np.exp(-1 * (delta_distance / self.Decorrelation_distance)), shadowing) \
               + np.sqrt(1 - np.exp(-2 * (delta_distance / self.Decorrelation_distance))) * np.random.normal(0, 8, nVeh)

Both methods implement [1], Table A.1.4-2, and the explanatory text that follows it.

Environ

[Figure: data-flow diagram of env]

Initialization takes four lists (the coordinates of the down/up/left/right lanes): down_lane, up_lane, left_lane, right_lane; the map width and height; and the numbers of vehicles and neighbors. Besides these it holds many other parameters, as shown below:

class Environ:
    def __init__(self, down_lane, up_lane, left_lane, right_lane, width, height, n_veh, n_neighbor):
        self.V2Vchannels = V2Vchannels()
        self.V2Ichannels = V2Ichannels()
        self.vehicles = []

        self.demand = []
        self.V2V_Shadowing = []
        self.V2I_Shadowing = []
        self.delta_distance = []
        self.V2V_channels_abs = []
        self.V2I_channels_abs = []

        self.V2I_power_dB = 23  # dBm
        self.V2V_power_dB_List = [23, 15, 5, -100]  # the power levels
        self.V2I_power = 10 ** (self.V2I_power_dB)
        self.sig2_dB = -114
        self.bsAntGain = 8
        self.bsNoiseFigure = 5
        self.vehAntGain = 3
        self.vehNoiseFigure = 9
        self.sig2 = 10 ** (self.sig2_dB / 10)

        self.n_RB = n_veh
        self.n_Veh = n_veh
        self.n_neighbor = n_neighbor
        self.time_fast = 0.001
        self.time_slow = 0.1  # update slow fading/vehicle position every 100 ms
        self.bandwidth = int(1e6)  # bandwidth per RB, 1 MHz
        # self.bandwidth = 1500
        self.demand_size = int((4 * 190 + 300) * 8 * 2)  # V2V payload: 1060 Bytes every 100 ms
        # self.demand_size = 20

        self.V2V_Interference_all = np.zeros((self.n_Veh, self.n_neighbor, self.n_RB)) + self.sig2

Adding vehicles: there are two methods, add_new_vehicles (takes a start position, direction, and velocity) and add_new_vehicles_by_number(n). The latter is interesting: it takes a single argument n, but it does not add n vehicles; it adds 4n vehicles, one per driving direction, at random positions (a rough sketch follows below).
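
A minimal sketch of the idea, paraphrasing the repository (the velocity range and the assumption that the lane lists are stored on self are illustrative here, since the quoted __init__ above omits them):

    def add_new_vehicles_by_number(self, n):
        # each call adds one vehicle per driving direction, i.e. 4*n vehicles in total
        for _ in range(n):
            ind = np.random.randint(0, len(self.down_lanes))  # pick a random lane index
            speed = np.random.randint(10, 15)                 # assumed velocity range, m/s
            self.add_new_vehicles([self.down_lanes[ind], np.random.randint(0, self.height)], 'd', speed)
            self.add_new_vehicles([self.up_lanes[ind], np.random.randint(0, self.height)], 'u', speed)
            self.add_new_vehicles([np.random.randint(0, self.width), self.left_lanes[ind]], 'l', speed)
            self.add_new_vehicles([np.random.randint(0, self.width), self.right_lanes[ind]], 'r', speed)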

Updating vehicle positions: renew_positions() (no arguments) iterates over all vehicles and moves each one according to its direction and velocity; when a vehicle reaches an intersection it turns with a certain probability, and when it reaches the map boundary it is redirected so that it stays inside the map (an illustrative sketch follows below).
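
The real renew_positions is long because it enumerates every direction/turn combination; the fragment below is only an illustrative condensation (the helper name step_vehicle, the turn_prob value, and the uniform re-choice of direction are assumptions, not the repository code):

import random

def step_vehicle(position, direction, velocity, time_slow=0.1, turn_prob=0.4):
    """Advance one vehicle by one slow-fading slot (illustrative sketch only)."""
    dx, dy = {'u': (0, 1), 'd': (0, -1), 'l': (-1, 0), 'r': (1, 0)}[direction]
    step = velocity * time_slow                     # distance travelled in 100 ms
    new_position = [position[0] + dx * step, position[1] + dy * step]
    # In the real method a turn is only considered when an intersection is actually
    # crossed during this step (the position is then snapped onto the chosen lane),
    # and at the map boundary the direction is changed so the vehicle stays on the map.
    if random.random() < turn_prob:
        direction = random.choice(['u', 'd', 'l', 'r'])
    return new_position, direction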

Updating neighbors: renew_neighbor(self), already described in the Vehicle section.

Updating the channel: renew_channel(self). This defines a very important quantity, channels_abs, the sum of path loss and shadow fading (it contains the information of all vehicles).

    def renew_channel(self):
        """ Renew slow fading channel """
        self.V2V_pathloss = np.zeros((len(self.vehicles), len(self.vehicles))) + 50 * np.identity(len(self.vehicles))
        self.V2I_pathloss = np.zeros((len(self.vehicles)))
        self.V2V_channels_abs = np.zeros((len(self.vehicles), len(self.vehicles)))
        self.V2I_channels_abs = np.zeros((len(self.vehicles)))
        for i in range(len(self.vehicles)):
            for j in range(i + 1, len(self.vehicles)):
                self.V2V_Shadowing[j][i] = self.V2V_Shadowing[i][j] = self.V2Vchannels.get_shadowing(self.delta_distance[i] + self.delta_distance[j], self.V2V_Shadowing[i][j])
                self.V2V_pathloss[j, i] = self.V2V_pathloss[i][j] = self.V2Vchannels.get_path_loss(self.vehicles[i].position, self.vehicles[j].position)
        self.V2V_channels_abs = self.V2V_pathloss + self.V2V_Shadowing

        self.V2I_Shadowing = self.V2Ichannels.get_shadowing(self.delta_distance, self.V2I_Shadowing)
        for i in range(len(self.vehicles)):
            self.V2I_pathloss[i] = self.V2Ichannels.get_path_loss(self.vehicles[i].position)
        self.V2I_channels_abs = self.V2I_pathloss + self.V2I_Shadowing

Updating the fast-fading channel: renew_channels_fastfading(self). Its value is channels_abs minus a random fast-fading term; before the subtraction, channels_abs is expanded by one extra dimension whose size is the number of RBs.

    def renew_channels_fastfading(self):
        """ Renew fast fading channel """
        # 1 2, 3 4 --> 1 1 2 2 3 3 4 4: copy each element once per RB
        V2V_channels_with_fastfading = np.repeat(self.V2V_channels_abs[:, :, np.newaxis], self.n_RB, axis=2)
        # A - 20 log10(|CN(0,1)| / sqrt(2))
        self.V2V_channels_with_fastfading = V2V_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2V_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2V_channels_with_fastfading.shape)) / math.sqrt(2))
        # 1 2, 3 4 --> 1 1 2 2, 3 3 4 4
        V2I_channels_with_fastfading = np.repeat(self.V2I_channels_abs[:, np.newaxis], self.n_RB, axis=1)
        self.V2I_channels_with_fastfading = V2I_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2I_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2I_channels_with_fastfading.shape)) / math.sqrt(2))
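
Spelling the subtraction out (this restates the code; channels_abs = PL + shadowing is a loss in dB):

$$ L^{\mathrm{fast}}_{ij,k} = PL_{ij} + S_{ij} - 20\log_{10}\lvert h_{ij,k}\rvert, \qquad h_{ij,k} = \frac{x + \mathrm{j}\,y}{\sqrt{2}},\quad x, y \sim \mathcal{N}(0,1), $$

so $\lvert h\rvert$ is Rayleigh-distributed with unit average power, drawn independently for each link and each RB $k$, while the slow-fading part is simply repeated across the RB dimension.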

Computing the reward: Compute_Performance_Reward_Train(self, actions_power). The input here is very important: it is the RL action, defined in main_marl_train.py, a three-dimensional array. Described as (layer, row, column): one layer per vehicle, one row per neighbor, and two columns holding the RB choice (the RB index) and the power choice (also an index, into V2V_power_dB_List), as shown below:

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

The concrete computation steps are:

  1. Extract the RB choice and the power choice from the action.
  2. Compute the V2I channel capacity V2I_Rate  # the return value has length n_RB, but it really means the number of V2I links, since #V2I links = #RBs
  3. Compute the V2V channel capacity V2V_Rate  # each entry of the return value corresponds to one V2V link; all V2V rates are returned
    1. For each RB, find from actions the vehicles sharing that RB
    2. Compute the capacity in two steps: V2I interference to V2V, then V2V-to-V2V interference
  4. Compute the remaining demand and the remaining portion of time_limit (see the update formula after this list)
  5. Generate the reward (reward_elements = V2V_Rate/10, and entries whose demand has reached 0 are set to 1)
  6. Set active_links to 0 according to the remaining demand (this is one of only two places where active_links is modified; the other sets all of active_links to 1)
    1. Where active_links is set to 1:

      1. in env.py, in new_random_game (this function appears once at the very beginning of *train.py)
      2. at the start of each episode in *train.py, active_links is set to all ones directly
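
For step 4, the bookkeeping in the code amounts to ($W$ = bandwidth = 1 MHz per RB, $\Delta t$ = time_fast = 1 ms):

$$ B_{ij} \leftarrow \max\!\big(0,\ B_{ij} - R^{V2V}_{ij}\,\Delta t\,W\big), \qquad T_{ij} \leftarrow T_{ij} - \Delta t, $$

where $B_{ij}$ starts at demand_size bits and $T_{ij}$ at time_slow = 100 ms for every V2V link at the beginning of an episode.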

The code is as follows:

    def Compute_Performance_Reward_Train(self, actions_power):
        actions = actions_power[:, :, 0]  # the channel_selection_part
        power_selection = actions_power[:, :, 1]  # power selection

        # ------------ Compute V2I rate --------------------
        V2I_Rate = np.zeros(self.n_RB)
        V2I_Interference = np.zeros(self.n_RB)  # V2I interference
        for i in range(len(self.vehicles)):
            for j in range(self.n_neighbor):
                if not self.active_links[i, j]:
                    continue
                V2I_Interference[actions[i][j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]] - self.V2I_channels_with_fastfading[i, actions[i, j]]
                                                           + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        self.V2I_Interference = V2I_Interference + self.sig2
        V2I_Signals = 10 ** ((self.V2I_power_dB - self.V2I_channels_with_fastfading.diagonal() + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        V2I_Rate = np.log2(1 + np.divide(V2I_Signals, self.V2I_Interference))  # V2I channel capacity

        # ------------ Compute V2V rate -------------------------
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor))
        V2V_Signal = np.zeros((len(self.vehicles), self.n_neighbor))
        actions[(np.logical_not(self.active_links))] = -1  # inactive links will not transmit regardless of selected power levels
        for i in range(self.n_RB):  # scanning all bands
            indexes = np.argwhere(actions == i)  # find spectrum-sharing V2Vs
            for j in range(len(indexes)):
                receiver_j = self.vehicles[indexes[j, 0]].destinations[indexes[j, 1]]
                V2V_Signal[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                   - self.V2V_channels_with_fastfading[indexes[j][0], receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                # V2I links interference to V2V links
                V2V_Interference[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i, receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                #  V2V interference
                for k in range(j + 1, len(indexes)):  # spectrum-sharing V2Vs
                    receiver_k = self.vehicles[indexes[k][0]].destinations[indexes[k][1]]
                    V2V_Interference[indexes[j, 0], indexes[j, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[k, 0], indexes[k, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[k][0]][receiver_j][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                    V2V_Interference[indexes[k, 0], indexes[k, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[j][0]][receiver_k][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference = V2V_Interference + self.sig2
        V2V_Rate = np.log2(1 + np.divide(V2V_Signal, self.V2V_Interference))

        self.demand -= V2V_Rate * self.time_fast * self.bandwidth
        self.demand[self.demand < 0] = 0  # eliminate negative demands

        self.individual_time_limit -= self.time_fast

        reward_elements = V2V_Rate / 10
        reward_elements[self.demand <= 0] = 1

        self.active_links[np.multiply(self.active_links, self.demand <= 0)] = 0  # transmission finished, turned to "inactive"

        return V2I_Rate, V2V_Rate, reward_elements

Note: three values are returned here, and the last one is not the final reward; the final reward is a weighted combination of these values.

Training step: act_for_training(self, actions). It takes actions, computes the final reward via Compute_Performance_Reward_Train, and returns it:

    def act_for_training(self, actions):
        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)

        lambdda = 0.
        reward = lambdda * np.sum(V2I_Rate) / (self.n_Veh * 10) + (1 - lambdda) * np.sum(reward_elements) / (self.n_Veh * self.n_neighbor)

        return reward

Testing step: act_for_testing(self, actions). Much like the above, it also calls Compute_Performance_Reward_Train, but it returns V2I_rate, V2V_success, V2V_rate.

    def act_for_testing(self, actions):
        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
        V2V_success = 1 - np.sum(self.active_links) / (self.n_Veh * self.n_neighbor)  # V2V success rates

        return V2I_Rate, V2V_success, V2V_Rate

The three quantities above are the final results produced by a single step within an episode, as can be seen in the testing part of main_marl_train.py; part of the code is shown below:

        for test_step in range(n_step_per_episode):
            # trained models
            action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

            action_temp = action_all_testing.copy()
            V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
            V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps

            rate_marl[idx_episode, test_step, :, :] = V2V_rate
            demand_marl[idx_episode, test_step+1, :, :] = env.demand

Computing interference: Compute_Interference(self, actions) accumulates V2V_Interference_all via +=:

    def Compute_Interference(self, actions):
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor, self.n_RB)) + self.sig2

        channel_selection = actions.copy()[:, :, 0]  # column 0 of every layer: chosen RB
        power_selection = actions.copy()[:, :, 1]    # column 1 of every layer: power level
        channel_selection[np.logical_not(self.active_links)] = -1  # mark inactive links as -1

        # interference from V2I links
        for i in range(self.n_RB):
            for k in range(len(self.vehicles)):
                for m in range(len(channel_selection[k, :])):
                    V2V_Interference[k, m, i] += 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)

        # interference from peer V2V links
        for i in range(len(self.vehicles)):
            for j in range(len(channel_selection[i, :])):
                for k in range(len(self.vehicles)):
                    for m in range(len(channel_selection[k, :])):
                        # if i == k or channel_selection[i,j] >= 0:
                        if i == k and j == m or channel_selection[i, j] < 0:
                            continue
                        V2V_Interference[k, m, channel_selection[i, j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]]
                                                                                   - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][channel_selection[i, j]] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference_all = 10 * np.log10(V2V_Interference)

It is used by get_state in main_marl_train.py to build the V2V_interference part of the state:

def get_state(env, idx=(0, 0), ind_episode=1., epsi=0.02):
    """ Get state from the environment """
    # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel information (PL + shadowing),
    # remaining time, and remaining payload

    # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
    V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10) / 35

    # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
    V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10) / 35

    V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60

    V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
    V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80) / 60.0

    load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
    time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])

    # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    # all quantities of interest: V2V_fast, V2I_fast, V2V_interference, V2I_abs, V2V_abs

Some readers may be confused at this point: why is V2V_Interference computed again? Wasn't it computed before? Yes: computing V2V_rate also requires computing V2V_Interference. As far as I can tell, that one is computed per RB allocation, whereas this one simply iterates over every vehicle to fill the per-agent, per-RB interference observed in the state.

ReplayMemory

This part comes from replay_memory.py. It is short and defines only one class, ReplayMemory. Note that every agent has its own memory, as can be seen in class Agent in main_marl_train.py:

class Agent(object):
    def __init__(self, memory_entry_size):
        self.discount = 1
        self.double_q = True
        self.memory_entry_size = memory_entry_size
        self.memory = ReplayMemory(self.memory_entry_size)

Initialization: it takes the size of one memory entry, entry_size (the capacity itself is hard-coded as memory_size = 200000):

class ReplayMemory:
    def __init__(self, entry_size):
        self.entry_size = entry_size
        self.memory_size = 200000
        self.actions = np.empty(self.memory_size, dtype=np.uint8)
        self.rewards = np.empty(self.memory_size, dtype=np.float64)
        self.prestate = np.empty((self.memory_size, self.entry_size), dtype=np.float16)
        self.poststate = np.empty((self.memory_size, self.entry_size), dtype=np.float16)
        self.batch_size = 2000
        self.count = 0
        self.current = 0

Adding a transition: add(self, prestate, poststate, reward, action). As the parameters show, each entry consists of (previous state, next state, reward, action):

    def add(self, prestate, poststate, reward, action):
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.prestate[self.current] = prestate
        self.poststate[self.current] = poststate
        self.count = max(self.count, self.current + 1)
        self.current = (self.current + 1) % self.memory_size

Each agent records its own transition at every time step. The use of add can be seen in the Training part of main_marl_train.py, in the code below. Above this for loop there is another for loop over episodes, so at every step of every episode a transition is added for every agent (last line).

        for i_step in range(n_step_per_episode):  # the range is 0.1/0.001 = 100 steps
            time_step = i_episode*n_step_per_episode + i_step  # global step index
            state_old_all = []
            action_all = []
            action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    state_old_all.append(state)
                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)

                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level

            # All agents take actions simultaneously, obtain shared reward, and update the environment.
            action_temp = action_all_training.copy()
            train_reward = env.act_for_training(action_temp)
            record_reward[time_step] = train_reward

            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = state_old_all[n_neighbor * i + j]
                    action = action_all[n_neighbor * i + j]
                    state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

Sampling: sample(self). After many calls to add, each agent has accumulated many transitions; during training, batch_size of them are drawn at a time (or all of them, if fewer have been stored):

    def sample(self):
        if self.count < self.batch_size:
            indexes = range(0, self.count)
        else:
            indexes = random.sample(range(0, self.count), self.batch_size)
        prestate = self.prestate[indexes]
        poststate = self.poststate[indexes]
        actions = self.actions[indexes]
        rewards = self.rewards[indexes]
        return prestate, poststate, actions, rewards

Main code: main_marl_train.py

Defining class Agent: Agent(object). Its constructor takes the memory entry size and holds a few algorithm parameters; note that memory is implemented by the ReplayMemory class discussed above.

class Agent(object):
    def __init__(self, memory_entry_size):
        self.discount = 1
        self.double_q = True
        self.memory_entry_size = memory_entry_size
        self.memory = ReplayMemory(self.memory_entry_size)

Parameter initialization: this is written directly in the script, not wrapped in a function. It roughly includes the map attributes (lane/intersection coordinates, overall map size), the numbers of vehicles, neighbors, RBs and episodes, and a few algorithm parameters:

Regarding the map parameters up_lanes / down_lanes / left_lanes / right_lanes: the system model follows the urban case of 3GPP TR 36.885. Each street has four lanes (two in each direction), each lane is 3.5 m wide, the road-grid size is defined by the distance between yellow center lines, 433 m × 250 m, and the whole area is 1299 m × 750 m. The simulation scales this down by a factor of 2 (which is why width and height are divided by 2), and this shows up in the lane parameters as the i / 2.0 in the list comprehensions.

Take up_lanes as an example. A lane is 3.5 m wide, and a vehicle treated as a point moves along the middle of its lane, so the 3.5 in the brackets after in has to be divided by 2; the second entry, 3.5 + 3.5/2, is the middle of the second lane in the same direction; the third entry, 250 + 3.5/2, is the first same-direction lane across the building block, and so on. (The resulting values are listed after the parameter code below.)

up_lanes = [i/2.0 for i in [3.5/2, 3.5 + 3.5/2, 250+3.5/2, 250+3.5+3.5/2, 500+3.5/2, 500+3.5+3.5/2]]
down_lanes = [i/2.0 for i in [250-3.5-3.5/2, 250-3.5/2, 500-3.5-3.5/2, 500-3.5/2, 750-3.5-3.5/2, 750-3.5/2]]
left_lanes = [i/2.0 for i in [3.5/2, 3.5/2 + 3.5, 433+3.5/2, 433+3.5+3.5/2, 866+3.5/2, 866+3.5+3.5/2]]
right_lanes = [i/2.0 for i in [433-3.5-3.5/2, 433-3.5/2, 866-3.5-3.5/2, 866-3.5/2, 1299-3.5-3.5/2, 1299-3.5/2]]

width = 750/2
height = 1298/2

IS_TRAIN = 1
IS_TEST = 1-IS_TRAIN

label = 'marl_model'

n_veh = 4
n_neighbor = 1
n_RB = n_veh

env = Environment_marl.Environ(down_lanes, up_lanes, left_lanes, right_lanes, width, height, n_veh, n_neighbor)
env.new_random_game()  # initialize parameters in env

# n_episode = 3000
n_episode = 600
n_step_per_episode = int(env.time_slow/env.time_fast)  # slow = 0.1, fast = 0.001
epsi_final = 0.02
epsi_anneal_length = int(0.8*n_episode)
mini_batch_step = n_step_per_episode
target_update_step = n_step_per_episode*4

n_episode_test = 100  # test episodes
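
As a quick check of the lane geometry described above, evaluating the first list gives

up_lanes = [0.875, 2.625, 125.875, 127.625, 250.875, 252.625]

i.e. the centres of the two same-direction lanes of each of the three vertical streets, already scaled down by the factor of 2.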

Getting the state: get_state(env, idx=(0,0), ind_episode=1., epsi=0.02). The input is env (the environment) together with the link index idx; the output includes:

  1. V2V_fast: (PL + shadowing) minus the random fast-fading term (see "Basic class definitions -- Environ -- Updating the fast-fading channel" above)
  2. V2I_fast: same as above
  3. V2V_interference (see "Basic class definitions -- Environ -- Computing interference" above)
  4. V2I_abs (PL + shadowing)
  5. V2V_abs (PL + shadowing)
plus the remaining time, the remaining load, and (ind_episode, epsi), which are appended at the end of the state vector.

Note that V2I_abs in the code is shifted by -80 and divided by 60; the author's explanation from the GitHub discussion board is quoted here:

This is to roughly normalize DQN inputs for the ease of training. The numbers are obtained from several trial runs

def get_state(env, idx=(0, 0), ind_episode=1., epsi=0.02):
    """ Get state from the environment """
    # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel information (PL + shadowing),
    # remaining time, and remaining payload

    # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
    V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10) / 35

    # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
    V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10) / 35

    V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60

    V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
    V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80) / 60.0

    load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
    time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])

    # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))

Defining the NN:

with g.as_default():
    # ============== Training network ========================
    x = tf.placeholder(tf.float32, [None, n_input])  # input

    w_1 = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
    w_2 = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
    w_3 = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
    w_4 = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
    b_1 = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
    b_2 = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
    b_3 = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
    b_4 = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, w_1), b_1))
    layer_1_b = tf.layers.batch_normalization(layer_1)
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1_b, w_2), b_2))
    layer_2_b = tf.layers.batch_normalization(layer_2)
    layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2_b, w_3), b_3))
    layer_3_b = tf.layers.batch_normalization(layer_3)
    y = tf.nn.relu(tf.add(tf.matmul(layer_3_b, w_4), b_4))
    g_q_action = tf.argmax(y, axis=1)

    # compute loss
    g_target_q_t = tf.placeholder(tf.float32, None, name="target_value")
    g_action = tf.placeholder(tf.int32, None, name='g_action')
    action_one_hot = tf.one_hot(g_action, n_output, 1.0, 0.0, name='action_one_hot')
    q_acted = tf.reduce_sum(y * action_one_hot, reduction_indices=1, name='q_acted')

    g_loss = tf.reduce_mean(tf.square(g_target_q_t - q_acted), name='g_loss')  # squared TD error
    optim = tf.train.RMSPropOptimizer(learning_rate=0.001, momentum=0.95, epsilon=0.01).minimize(g_loss)  # gradient descent

    # ==================== Prediction network ========================
    x_p = tf.placeholder(tf.float32, [None, n_input])  # input

    w_1_p = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
    w_2_p = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
    w_3_p = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
    w_4_p = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
    b_1_p = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
    b_2_p = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
    b_3_p = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
    b_4_p = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

    layer_1_p = tf.nn.relu(tf.add(tf.matmul(x_p, w_1_p), b_1_p))
    layer_1_p_b = tf.layers.batch_normalization(layer_1_p)
    layer_2_p = tf.nn.relu(tf.add(tf.matmul(layer_1_p_b, w_2_p), b_2_p))
    layer_2_p_b = tf.layers.batch_normalization(layer_2_p)
    layer_3_p = tf.nn.relu(tf.add(tf.matmul(layer_2_p_b, w_3_p), b_3_p))
    layer_3_p_b = tf.layers.batch_normalization(layer_3_p)
    y_p = tf.nn.relu(tf.add(tf.matmul(layer_3_p_b, w_4_p), b_4_p))

    g_target_q_idx = tf.placeholder('int32', [None, None], 'output_idx')  # input: a list of shape (n, 2)
    target_q_with_idx = tf.gather_nd(y_p, g_target_q_idx)  # gather the indexed entries of y_p

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

Only the overall structure is described here; for the details see the "Sampling and computing the loss" part below, which explains the network structure together with the algorithm.

Overall there are three parts: the training network, the loss computation, and the prediction (target) network, denoted N1, N2, and N3. N1 and N3 have exactly the same structure; they are the DQN networks of the algorithm and output Q values. The difference is that N1 is updated at every training iteration, whereas N3 is only updated once in a while. N2 takes N1's output, computes the loss, and drives N1's iterative update.

Prediction: predict(sess, s_t, ep, test_ep=False). This function drives the NN to generate an action:

def predict(sess, s_t, ep, test_ep=False):
    n_power_levels = len(env.V2V_power_dB_List)
    if np.random.rand() < ep and not test_ep:
        pred_action = np.random.randint(n_RB * n_power_levels)
    else:
        pred_action = sess.run(g_q_action, feed_dict={x: [s_t]})[0]
    return pred_action

The action here is a single int, but it encodes both the RB and the power level; it appears in both the Training and Testing parts later in the script and is decoded as follows:

                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)

                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level
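
A quick worked example of the encoding, using the default n_RB = 4 and the four entries of V2V_power_dB_List (so action ranges over 0..15):

action = 7
rb = action % n_RB            # 7 % 4  = 3 -> the fourth RB
power_level = action // n_RB  # 7 // 4 = 1 -> V2V_power_dB_List[1] = 15 dBm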

Sampling and computing the loss: q_learning_mini_batch(current_agent, current_sess). It takes a single agent and uses the memory's sample method mentioned above; double Q-learning is also selected here.

def q_learning_mini_batch(current_agent, current_sess):
    """ Training a sampled mini-batch """
    batch_s_t, batch_s_t_plus_1, batch_action, batch_reward = current_agent.memory.sample()

    if current_agent.double_q:  # double q-learning
        pred_action = current_sess.run(g_q_action, feed_dict={x: batch_s_t_plus_1})
        q_t_plus_1 = current_sess.run(target_q_with_idx, {x_p: batch_s_t_plus_1, g_target_q_idx: [[idx, pred_a] for idx, pred_a in enumerate(pred_action)]})
        batch_target_q_t = current_agent.discount * q_t_plus_1 + batch_reward
    else:
        q_t_plus_1 = current_sess.run(y_p, {x_p: batch_s_t_plus_1})
        max_q_t_plus_1 = np.max(q_t_plus_1, axis=1)
        batch_target_q_t = current_agent.discount * max_q_t_plus_1 + batch_reward

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})
    return loss_val

Addendum (Apr. 23): this function should be read together with the NN structure; I find it a bit involved. As the surface reading suggests, the if distinguishes between plain DQN and double Q-learning. Note that both branches only compute the target-network part; feeding the network at the top left of the algorithm diagram and its iterative update are done by the last statement:

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})

This code is easier to understand when matched against the figures in the blog post referenced below; the algorithm diagrams and code flowcharts are attached here (the code flowcharts were drawn by the author in Visio without following a standard notation; corrections are welcome).

Plain DQN

[Figure: how plain DQN iteratively updates the policy (from Hung-yi Lee's slides)]

[Figure: data flow of plain DQN in this code (red labels are the actual arguments)]

Double DQN

[Figure: illustration of the Double DQN algorithm (from the linked blog post)]

[Figure: data flow of double DQN in this code]

The difference from plain DQN lies in how the target is formed: plain DQN builds the target directly from the prediction network (the "predict / updated once in a while" box in the figure) followed by a max, whereas double DQN cascades the training network and the prediction network: the training network picks the maximizing action and the prediction network evaluates it.
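
In formulas (this matches q_learning_mini_batch above; $\gamma$ is current_agent.discount, set to 1 in this code, $\theta$ are the training-network weights and $\theta^-$ the prediction/target-network weights):

plain DQN target: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$

double DQN target: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\big)$

The if branch computes exactly these two targets, and the final sess.run then minimizes $(y - Q(s, a; \theta))^2$ with respect to $\theta$.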

Training

for each episode:

  • Determine epsi from the episode index (annealed downward, then held at epsi_final)
  • Every 100 episodes, update the positions, neighbors, slow fading, and fast fading.
  • Initialize demand, time_limit, and active_links (all ones)
  • for each step of the episode:
    • Initialize state_old_all, action_all, action_all_training

    • for loop over every link:
      • get that link's state [for a single link]
      • get the action via predict (it encodes both the RB and the power) [for a single link]

      • store it into action_all_training = [vehicle, neighbor, RB/power] [collecting the single link's choice]

    • get the shared reward via act_for_training [this is over all links]. In the SARL version the reward computation moves inside the per-link loop above; everything else is the same.

    • append the reward to record_reward

    • update the fast fading

    • compute the interference from the actions

    • for loop over every link:

      • compute the new state

      • add (state_old, state_new, train_reward, action) to this agent's memory [so each memory entry is per link]

      • every mini_batch_step new states: compute the loss via q_learning_mini_batch

      • every target_update_step: update the target Q network

record_reward = np.zeros([n_episode*n_step_per_episode, 1])
record_loss = []
if IS_TRAIN:
    for i_episode in range(n_episode):
        print("-------------------------")
        print('Episode:', i_episode)
        if i_episode < epsi_anneal_length:
            epsi = 1 - i_episode * (1 - epsi_final) / (epsi_anneal_length - 1)  # epsilon decreases over each episode
        else:
            epsi = epsi_final

        # every 100 episodes: update positions, neighbors, slow fading, and fast fading
        if i_episode % 100 == 0:
            env.renew_positions()            # update vehicle position
            env.renew_neighbor()
            env.renew_channel()              # update channel slow fading
            env.renew_channels_fastfading()  # update channel fast fading

        env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        for i_step in range(n_step_per_episode):  # the range is 0.1/0.001 = 100 steps
            time_step = i_episode*n_step_per_episode + i_step  # global step index
            state_old_all = []
            action_all = []
            action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    state_old_all.append(state)
                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)

                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level

            # All agents take actions simultaneously, obtain shared reward, and update the environment.
            action_temp = action_all_training.copy()
            train_reward = env.act_for_training(action_temp)
            record_reward[time_step] = train_reward

            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = state_old_all[n_neighbor * i + j]
                    action = action_all[n_neighbor * i + j]
                    state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

                    # training this agent
                    if time_step % mini_batch_step == mini_batch_step-1:
                        loss_val_batch = q_learning_mini_batch(agents[i*n_neighbor+j], sesses[i*n_neighbor+j])
                        record_loss.append(loss_val_batch)
                        if i == 0 and j == 0:
                            print('step:', time_step, 'agent', i*n_neighbor+j, 'loss', loss_val_batch)
                    if time_step % target_update_step == target_update_step-1:
                        update_target_q_network(sesses[i*n_neighbor+j])
                        if i == 0 and j == 0:
                            print('Update target Q network...')

    print('Training Done. Saving models...')
    for i in range(n_veh):
        for j in range(n_neighbor):
            model_path = label + '/agent_' + str(i * n_neighbor + j)
            save_models(sesses[i * n_neighbor + j], model_path)

    current_dir = os.path.dirname(os.path.realpath(__file__))
    reward_path = os.path.join(current_dir, "model/" + label + '/reward.mat')
    scipy.io.savemat(reward_path, {'reward': record_reward})

    record_loss = np.asarray(record_loss).reshape((-1, n_veh*n_neighbor))
    loss_path = os.path.join(current_dir, "model/" + label + '/train_loss.mat')
    scipy.io.savemat(loss_path, {'train_loss': record_loss})

Testing

First, the models obtained from training are loaded.

for each episode:

  • Update the positions, neighbors, slow fading, and fast fading.
  • Initialize demand, time_limit, and active_links (all ones)
  • for each step of the episode:
    • Initialize state_old_all, action_all, action_all_testing

    • get the action via predict (it encodes both the RB and the power)

    • build action_all_testing = [vehicle, neighbor, RB/power] from the action

    • get V2I_rate, V2V_success, V2V_rate via act_for_testing

    • sum V2I_rate and append it to V2I_rate_per_episode

    • append V2V_rate to rate_marl

    • update demand

if IS_TEST:
    print("\nRestoring the model...")
    for i in range(n_veh):
        for j in range(n_neighbor):
            model_path = label + '/agent_' + str(i * n_neighbor + j)
            load_models(sesses[i * n_neighbor + j], model_path)

    V2I_rate_list = []
    V2V_success_list = []
    V2I_rate_list_rand = []
    V2V_success_list_rand = []
    rate_marl = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    rate_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    demand_marl = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
    demand_rand = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
    power_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])

    for idx_episode in range(n_episode_test):
        print('----- Episode', idx_episode, '-----')

        env.renew_positions()
        env.renew_neighbor()
        env.renew_channel()
        env.renew_channels_fastfading()

        env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        env.demand_rand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit_rand = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links_rand = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        V2I_rate_per_episode = []
        V2I_rate_per_episode_rand = []

        for test_step in range(n_step_per_episode):
            # trained models
            action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

            action_temp = action_all_testing.copy()
            V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
            V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps

            rate_marl[idx_episode, test_step, :, :] = V2V_rate
            demand_marl[idx_episode, test_step+1, :, :] = env.demand

            # random baseline
            action_rand = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            action_rand[:, :, 0] = np.random.randint(0, n_RB, [n_veh, n_neighbor])  # band
            action_rand[:, :, 1] = np.random.randint(0, len(env.V2V_power_dB_List), [n_veh, n_neighbor])  # power

            V2I_rate_rand, V2V_success_rand, V2V_rate_rand = env.act_for_testing_rand(action_rand)
            V2I_rate_per_episode_rand.append(np.sum(V2I_rate_rand))  # sum V2I rate in bps
            rate_rand[idx_episode, test_step, :, :] = V2V_rate_rand
            demand_rand[idx_episode, test_step+1, :, :] = env.demand_rand

            for i in range(n_veh):
                for j in range(n_neighbor):
                    power_rand[idx_episode, test_step, i, j] = env.V2V_power_dB_List[int(action_rand[i, j, 1])]

            # update the environment and compute interference
            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            if test_step == n_step_per_episode - 1:
                V2V_success_list.append(V2V_success)
                V2V_success_list_rand.append(V2V_success_rand)

        V2I_rate_list.append(np.mean(V2I_rate_per_episode))
        V2I_rate_list_rand.append(np.mean(V2I_rate_per_episode_rand))

        print(round(np.average(V2I_rate_per_episode), 2), 'rand', round(np.average(V2I_rate_per_episode_rand), 2))
        print(V2V_success_list[idx_episode], 'rand', V2V_success_list_rand[idx_episode])

References

[1] 3GPP TR 36.885 technical report

[2] "5G Mobile Communication Technology" (《5G移动通信技术》)
