Playing Super Mario Bros with Deep Reinforcement Learning

2021-08-19 12:17

Introduction

In this article, you will learn how to play Super Mario Bros using a Deep Q-Network and a Double Deep Q-Network (with code!).

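Super Mario Bros is a famous game developed and published by Nintendo in the 1980s. It is one of those classic titles that has needed no introduction for years. It is a 2D side-scrolling game in which the player controls the main character, Mario. Gameplay consists of moving Mario from left to right, surviving the enemies, collecting coins, and reaching the flag to clear the level. Ultimately, Mario needs to rescue the princess. The levels differ in their reward systems, coins, enemies, pits, and completion times.

The game environment is taken from OpenAI Gym and uses a Nintendo Entertainment System (NES) Python emulator. In this article, I show how to implement reinforcement learning with the Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) algorithms using the PyTorch library, and compare their performance. The experiments run with each algorithm are then evaluated.

Data Understanding and Preprocessing

The original observation space of Super Mario Bros is a 240 x 256 x 3 RGB image. The action space is 256, meaning that 256 different actions can be taken. To speed up training of our model, we used gym wrapper functions to apply several transformations to the original environment:

- Repeat each of the agent's actions over 4 frames and reduce the frame size, so that each state in the environment is 4 x 84 x 84 x 1 (a list of 4 consecutive 84 x 84 grayscale frames)
- Normalize pixel values to the range 0 to 1
- Reduce the number of actions to 5 (right only), 7 (simple movement), or 12 (complex movement)

Theoretical Results

Initially, I wanted to run an experiment with Q-learning, which uses a 2D array to store the value of every possible state-action pair. In this environment, however, I realized that Q-learning cannot be applied, because the Q-table that would have to be stored is far too large to be feasible.

This project therefore uses the DQN algorithm as the baseline model. DQN uses Q-learning to learn the best action to take in a given state, and a deep neural network to estimate the Q-value function.

The deep neural network I used is a 3-layer convolutional neural network followed by two fully connected linear layers, with one output per possible action. The network plays the same role as the Q-table in the Q-learning algorithm. The loss function is the Huber loss, also called the smooth mean absolute error, applied to the Q-values. Huber loss combines MSE and MAE: it is quadratic for small errors and linear for large ones. The optimizer is Adam.

However, the DQN network suffers from overestimation.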
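As a rough illustration of the Huber loss described above (a minimal sketch, not the article's training code), the function below is quadratic for errors up to δ and linear beyond it. The names `huber` and `mean_huber` and the sample errors are made up for illustration. With δ = 1, the per-element formula matches PyTorch's `nn.SmoothL1Loss` at its default `beta=1.0`.

```python
def huber(error, delta=1.0):
    """Huber loss for a single prediction error (hypothetical helper)."""
    abs_err = abs(error)
    if abs_err <= delta:
        # Small errors: behaves like half the squared error (MSE-like)
        return 0.5 * error ** 2
    # Large errors: grows linearly, like the absolute error (MAE-like)
    return delta * (abs_err - 0.5 * delta)


def mean_huber(errors, delta=1.0):
    """Average Huber loss over a batch of errors."""
    return sum(huber(e, delta) for e in errors) / len(errors)


if __name__ == "__main__":
    # Illustrative errors only: one below delta, one at delta, one above
    for e in (0.5, 1.0, 3.0):
        print(e, huber(e))
```

Because the penalty grows only linearly for large errors, a single badly estimated Q-value pulls on the weights less than it would under a pure squared error.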

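Figure 1: Illustration of how the DQN network overestimates

As Figure 1 shows, there are two main causes of overestimation. The first is the maximization used to compute the target value. Suppose the true action values are x(a_1), ..., x(a_n), and the noisy estimates made by DQN are Q(s, a_1; w), ..., Q(s, a_n; w). Mathematically, if the estimates are unbiased,

$$\mathbb{E}\left[\max_{i} Q(s, a_i; w)\right] \ge \max_{i} x(a_i),$$

so the network overestimates the true Q-value. The second cause is that the overestimated Q-values are in turn used to update the Q-network's weights through backpropagation, which makes the overestimation worse.

The main drawback of overestimation is that DQN overestimates non-uniformly. Intuitively, the more often a particular state-action pair appears in the replay buffer, the more that pair is overestimated.

To obtain more accurate Q-values, we apply a DDQN network to our problem and compare the results with the earlier DQN network. To mitigate the overestimation caused by maximization, DDQN uses 2 Q-networks: one to obtain actions and the other to update the weights through backpropagation. The DDQN Q-learning update equation is:

$$Q^{*}(s_t, a_t) \leftarrow r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$$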
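The overestimation argument can be checked with a small simulation. This is only a sketch under simplifying assumptions that are not from the article: every action has a true value of 0, and each estimate is that value plus independent Gaussian noise. It compares taking the max of one noisy estimate with the double-estimator idea from Double Q-learning, where one estimate picks the action and an independent second estimate scores it. This is an idealization: in the article's code the second network is a periodic copy of the first, not an independent estimator. The function name and parameters are made up for illustration.

```python
import random


def simulate_max_bias(n_actions=5, noise_std=1.0, trials=20000, seed=0):
    """Return (average single-estimator max, average double-estimator value).

    Assumption for illustration: the true value of every action is 0,
    so the true maximum is also 0.
    """
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same (all-zero) action values
        q_a = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        q_b = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]

        # Single estimator: select and evaluate with the same noisy values
        single_total += max(q_a)

        # Double estimator: select with q_a, evaluate with q_b
        best_action = max(range(n_actions), key=q_a.__getitem__)
        double_total += q_b[best_action]
    return single_total / trials, double_total / trials


if __name__ == "__main__":
    single, double = simulate_max_bias()
    print("single-estimator average:", single)
    print("double-estimator average:", double)
```

With these settings the first average comes out clearly positive even though the true maximum is 0, while the second stays close to 0.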

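$Q^{*}$ is used to update the weights, and $\hat{Q}$ is used to obtain the action for a given state. $\hat{Q}$ simply copies the values of $Q^{*}$ every n steps.

Experimental Results

Five experiments were conducted with the 2 algorithms, DQN and DDQN, across the agent's different movement sets: complex movement, simple movement, and right-only movement.

The parameters were set as follows:

- Observation space: 4 x 84 x 84 x 1
- Action space: 12 (complex movement), 7 (simple movement), or 5 (right-only movement)
- Loss function: Huber loss, δ = 1
- Optimizer: Adam, lr = 0.00025, betas = (0.9, 0.999)
- Batch size = 64
- Dropout = 0.2
- gamma = 0.9
- Maximum memory size for experience replay = 30000
- Epsilon-greedy: exploration decay = 0.99, exploration min = 0.05

At the start of exploration, the exploration rate is at its maximum of 1 and the agent takes random actions. After each action, the rate is multiplied by the exploration decay until it reaches the exploration minimum of 0.05.

Experiment 1

The first experiment compares the DDQN and DQN algorithms on the agent's complex movement set.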

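Experiment 2

The second experiment compares the DDQN and DQN algorithms on the agent's simple movement set.

Experiment 3

The third experiment compares the DDQN and DQN algorithms on the agent's right-only movement set.

The results of these 3 experiments show that, in every case, DQN's performance at episode 10,000 is roughly the same as DDQN's performance at episode 2,000. We can therefore conclude that the DDQN network helps mitigate the overestimation problem caused by the DQN network.

Further experiments were run with DDQN and DQN on the 3 different movement sets.

Experiment 4

The fourth experiment uses the DDQN algorithm on all 3 movement sets.

Experiment 5

The fifth experiment uses the DQN algorithm on all 3 movement sets.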

From the 2 experiments above, we can conclude that the network trains better on the action space that only allows the agent right-only movement.

Code

import torch
import torch.nn as nn
import random
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from tqdm import tqdm
import pickle
from gym_super_mario_bros.actions import RIGHT_ONLY, SIMPLE_MOVEMENT, COMPLEX_MOVEMENT
import gym
import numpy as np
import collections
import cv2
import matplotlib.pyplot as plt
# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline
import time
import pylab as pl
from IPython import display


class MaxAndSkipEnv(gym.Wrapper):
    """
    Each action of the agent is repeated over skip frames
    return only every `skip`-th frame
    """

    def __init__(self, env=None, skip=4):
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init to first obs"""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs


class MarioRescale84x84(gym.ObservationWrapper):
    """
    Downsamples/Rescales each frame to size 84x84 with greyscale
    """

    def __init__(self, env=None):
        super(MarioRescale84x84, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)

    def observation(self, obs):
        return MarioRescale84x84.process(obs)

    @staticmethod
    def process(frame):
        if frame.size == 240 * 256 * 3:
            img = np.reshape(frame, [240, 256, 3]).astype(np.float32)
        else:
            assert False, "Unknown resolution."
        # convert RGB to greyscale using luminance weights
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
        resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)


class ImageToPyTorch(gym.ObservationWrapper):
    """
    Each frame is moved from height x width x channel order to the
    channel-first order that PyTorch expects
    """

    def __init__(self, env):
        super(ImageToPyTorch, self).__init__(env)
        old_shape = self.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0,
                                                shape=(old_shape[-1], old_shape[0], old_shape[1]),
                                                dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)


class BufferWrapper(gym.ObservationWrapper):
    """
    Stacks the most recent n_steps frames into a single observation
    """

    def __init__(self, env, n_steps, dtype=np.float32):
        super(BufferWrapper, self).__init__(env)
        self.dtype = dtype
        old_space = env.observation_space
        self.observation_space = gym.spaces.Box(old_space.low.repeat(n_steps, axis=0),
                                                old_space.high.repeat(n_steps, axis=0), dtype=dtype)

    def reset(self):
        self.buffer = np.zeros_like(self.observation_space.low, dtype=self.dtype)
        return self.observation(self.env.reset())

    def observation(self, observation):
        # Drop the oldest frame and append the newest one
        self.buffer[:-1] = self.buffer[1:]
        self.buffer[-1] = observation
        return self.buffer


class PixelNormalization(gym.ObservationWrapper):
    """
    Normalize pixel values in frame --> 0 to 1
    """

    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0


def create_mario_env(env):
    env = MaxAndSkipEnv(env)
    env = MarioRescale84x84(env)
    env = ImageToPyTorch(env)
    env = BufferWrapper(env, 4)
    env = PixelNormalization(env)
    return JoypadSpace(env, SIMPLE_MOVEMENT)


class DQNSolver(nn.Module):
    """
    Convolutional Neural Net with 3 conv layers and two linear layers
    """

    def __init__(self, input_shape, n_actions):
        super(DQNSolver, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        # Pass a dummy input through the conv stack to find the flattened size
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)


class DQNAgent:

    def __init__(self, state_space, action_space, max_memory_size, batch_size, gamma, lr,
                 dropout, exploration_max, exploration_min, exploration_decay, double_dqn, pretrained):
        # NOTE: `dropout` is accepted here but is not used by DQNSolver

        # Define DQN Layers
        self.state_space = state_space
        self.action_space = action_space
        self.double_dqn = double_dqn
        self.pretrained = pretrained
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

        # Double DQN network
        if self.double_dqn:
            self.local_net = DQNSolver(state_space, action_space).to(self.device)
            self.target_net = DQNSolver(state_space, action_space).to(self.device)

            if self.pretrained:
                self.local_net.load_state_dict(torch.load("DQN1.pt", map_location=torch.device(self.device)))
                self.target_net.load_state_dict(torch.load("DQN2.pt", map_location=torch.device(self.device)))

            self.optimizer = torch.optim.Adam(self.local_net.parameters(), lr=lr)
            self.copy = 5000  # Copy the local model weights into the target network every 5000 steps
            self.step = 0
        # DQN network
        else:
            self.dqn = DQNSolver(state_space, action_space).to(self.device)

            if self.pretrained:
                self.dqn.load_state_dict(torch.load("DQN.pt", map_location=torch.device(self.device)))
            self.optimizer = torch.optim.Adam(self.dqn.parameters(), lr=lr)

        # Create memory
        self.max_memory_size = max_memory_size
        if self.pretrained:
            self.STATE_MEM = torch.load("STATE_MEM.pt")
            self.ACTION_MEM = torch.load("ACTION_MEM.pt")
            self.REWARD_MEM = torch.load("REWARD_MEM.pt")
            self.STATE2_MEM = torch.load("STATE2_MEM.pt")
            self.DONE_MEM = torch.load("DONE_MEM.pt")
            with open("ending_position.pkl", 'rb') as f:
                self.ending_position = pickle.load(f)
            with open("num_in_queue.pkl", 'rb') as f:
                self.num_in_queue = pickle.load(f)
        else:
            self.STATE_MEM = torch.zeros(max_memory_size, *self.state_space)
            self.ACTION_MEM = torch.zeros(max_memory_size, 1)
            self.REWARD_MEM = torch.zeros(max_memory_size, 1)
            self.STATE2_MEM = torch.zeros(max_memory_size, *self.state_space)
            self.DONE_MEM = torch.zeros(max_memory_size, 1)
            self.ending_position = 0
            self.num_in_queue = 0

        self.memory_sample_size = batch_size

        # Learning parameters
        self.gamma = gamma
        self.l1 = nn.SmoothL1Loss().to(self.device)  # Also known as Huber loss
        self.exploration_max = exploration_max
        self.exploration_rate = exploration_max
        self.exploration_min = exploration_min
        self.exploration_decay = exploration_decay

    def remember(self, state, action, reward, state2, done):
        """Store the experiences in a buffer to use later"""
        self.STATE_MEM[self.ending_position] = state.float()
        self.ACTION_MEM[self.ending_position] = action.float()
        self.REWARD_MEM[self.ending_position] = reward.float()
        self.STATE2_MEM[self.ending_position] = state2.float()
        self.DONE_MEM[self.ending_position] = done.float()
        self.ending_position = (self.ending_position + 1) % self.max_memory_size  # FIFO tensor
        self.num_in_queue = min(self.num_in_queue + 1, self.max_memory_size)

    def batch_experiences(self):
        """Randomly sample 'batch size' experiences"""
        idx = random.choices(range(self.num_in_queue), k=self.memory_sample_size)
        STATE = self.STATE_MEM[idx]
        ACTION = self.ACTION_MEM[idx]
        REWARD = self.REWARD_MEM[idx]
        STATE2 = self.STATE2_MEM[idx]
        DONE = self.DONE_MEM[idx]
        return STATE, ACTION, REWARD, STATE2, DONE

    def act(self, state):
        """Epsilon-greedy action"""
        if self.double_dqn:
            self.step += 1
        if random.random() < self.exploration_rate:
            return torch.tensor([[random.randrange(self.action_space)]])
        if self.double_dqn:
            # Local net is used for the policy
            return torch.argmax(self.local_net(state.to(self.device))).unsqueeze(0).unsqueeze(0).cpu()
        else:
            return torch.argmax(self.dqn(state.to(self.device))).unsqueeze(0).unsqueeze(0).cpu()

    def copy_model(self):
        """Copy local net weights into target net for DDQN network"""
        self.target_net.load_state_dict(self.local_net.state_dict())

    def experience_replay(self):
        """Use the double Q-update or Q-update equations to update the network weights"""
        if self.double_dqn and self.step % self.copy == 0:
            self.copy_model()

        if self.memory_sample_size > self.num_in_queue:
            return

        # Sample a batch of experiences
        STATE, ACTION, REWARD, STATE2, DONE = self.batch_experiences()
        STATE = STATE.to(self.device)
        ACTION = ACTION.to(self.device)
        REWARD = REWARD.to(self.device)
        STATE2 = STATE2.to(self.device)
        DONE = DONE.to(self.device)

        self.optimizer.zero_grad()
        if self.double_dqn:
            # Double Q-Learning target is Q*(S, A) <- r + γ max_a Q_target(S', a)
            target = REWARD + torch.mul((self.gamma * self.target_net(STATE2).max(1).values.unsqueeze(1)), 1 - DONE)
            current = self.local_net(STATE).gather(1, ACTION.long())  # Local net approximation of Q-value
        else:
            # Q-Learning target is Q*(S, A) <- r + γ max_a Q(S', a)
            target = REWARD + torch.mul((self.gamma * self.dqn(STATE2).max(1).values.unsqueeze(1)), 1 - DONE)

            current = self.dqn(STATE).gather(1, ACTION.long())

        loss = self.l1(current, target)
        loss.backward()  # Backpropagate the error to compute gradients
        self.optimizer.step()  # Update the network weights

        self.exploration_rate *= self.exploration_decay

        # Makes sure that exploration rate is always at least 'exploration min'
        self.exploration_rate = max(self.exploration_rate, self.exploration_min)



def show_state(env, ep=0, info=""):
    """While testing show the mario playing environment"""
    plt.figure(3)
    plt.clf()
    plt.imshow(env.render(mode='rgb_array'))
    plt.title("Episode: %d %s" % (ep, info))
    plt.axis('off')
    display.clear_output(wait=True)
    display.display(plt.gcf())



def run(training_mode, pretrained, double_dqn, num_episodes=1000, exploration_max=1):

    env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
    env = create_mario_env(env)  # Wraps the environment so that frames are grayscale
    observation_space = env.observation_space.shape
    action_space = env.action_space.n
    agent = DQNAgent(state_space=observation_space,
                     action_space=action_space,
                     max_memory_size=30000,
                     batch_size=32,
                     gamma=0.90,
                     lr=0.00025,
                     dropout=0.2,
                     exploration_max=1.0,
                     exploration_min=0.02,
                     exploration_decay=0.99,
                     double_dqn=double_dqn,
                     pretrained=pretrained)

    # Restart the environment for each episode
    env.reset()

    total_rewards = []
    if training_mode and pretrained:
        with open("total_rewards.pkl", 'rb') as f:
            total_rewards = pickle.load(f)

    for ep_num in tqdm(range(num_episodes)):
        state = env.reset()
        state = torch.Tensor([state])
        total_reward = 0
        steps = 0
        while True:
            if not training_mode:
                show_state(env, ep_num)
            action = agent.act(state)
            steps += 1

            state_next, reward, terminal, info = env.step(int(action[0]))
            total_reward += reward
            state_next = torch.Tensor([state_next])
            reward = torch.tensor([reward]).unsqueeze(0)

            terminal = torch.tensor([int(terminal)]).unsqueeze(0)

            if training_mode:
                agent.remember(state, action, reward, state_next, terminal)
                agent.experience_replay()

            state = state_next
            if terminal:
                break

        total_rewards.append(total_reward)

        # Report progress every 100 episodes
        if ep_num != 0 and ep_num % 100 == 0:
            print("Episode {} score = {}, average score = {}".format(ep_num + 1, total_rewards[-1], np.mean(total_rewards)))

    print("Episode {} score = {}, average score = {}".format(ep_num + 1, total_rewards[-1], np.mean(total_rewards)))

    # Save the trained memory so that we can continue from where we stop using 'pretrained' = True
    if training_mode:
        with open("ending_position.pkl", "wb") as f:
            pickle.dump(agent.ending_position, f)
        with open("num_in_queue.pkl", "wb") as f:
            pickle.dump(agent.num_in_queue, f)
        with open("total_rewards.pkl", "wb") as f:
            pickle.dump(total_rewards, f)
        if agent.double_dqn:
            torch.save(agent.local_net.state_dict(), "DQN1.pt")
            torch.save(agent.target_net.state_dict(), "DQN2.pt")
        else:
            torch.save(agent.dqn.state_dict(), "DQN.pt")
        torch.save(agent.STATE_MEM, "STATE_MEM.pt")
        torch.save(agent.ACTION_MEM, "ACTION_MEM.pt")
        torch.save(agent.REWARD_MEM, "REWARD_MEM.pt")
        torch.save(agent.STATE2_MEM, "STATE2_MEM.pt")
        torch.save(agent.DONE_MEM, "DONE_MEM.pt")

    env.close()
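
The epsilon-greedy schedule in the listing can be examined in isolation. In `experience_replay`, the exploration rate is multiplied by the decay factor once per completed replay update and then clamped to the minimum. The sketch below replays just that arithmetic, using the decay of 0.99 together with the two minimum values that appear in this article (0.02 in `run`, 0.05 in the parameter list). The helper name `steps_until_floor` is made up for illustration and is not part of the article's code.

```python
def steps_until_floor(start, decay, floor):
    """Count multiplicative decay steps until the rate is clamped at `floor`.

    Mirrors the two lines at the end of experience_replay:
        rate *= decay
        rate = max(rate, floor)
    """
    rate = start
    steps = 0
    while rate > floor:
        rate = max(rate * decay, floor)
        steps += 1
    return steps


if __name__ == "__main__":
    for floor in (0.02, 0.05):
        n = steps_until_floor(start=1.0, decay=0.99, floor=floor)
        print("floor", floor, "reached after", n, "replay updates")
```

Under these settings the rate reaches its minimum within a few hundred replay updates, so the agent spends only a short time in the mostly random phase. A decay factor closer to 1 would stretch that phase out.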