5. Building an RNN Model with Keras for IMDb Sentiment Analysis in Python
A while ago everyone in our lab got a copy of the book "Deep Learning". This article combines that bible, various material from blogs, and my own understanding and practice into a summary of RNNs (recurrent neural networks).
An RNN (Recurrent Neural Network) is a class of neural networks for processing sequence data. First we need to be clear about what sequence data is. Quoting the Baidu Baike entry: time-series data are data collected at different points in time, reflecting how some object or phenomenon changes in state or degree over time. That is the definition of time-series data; of course the index need not be time (a sequence of words, for example), but in any case sequence data share one property: later elements depend on earlier ones.
Before introducing RNNs, let's review what an ordinary neural network is: every element is a computation node and every edge carries a weight. A simple example: a vector x is fed in and the network outputs, say, (0.5, 0.7); comparing with the true labels (0, 1), we compute the loss, differentiate it with respect to the weights, multiply by the learning rate, and update the earlier weights. The relation between w1 and its updated value w1' is just a small displacement; after the update the output becomes, say, (0.2, 0.9), still a bit off, so we keep updating until the output matches the true y (descending little by little, i.e. gradient descent). That is the general procedure of an ordinary neural network. For the detailed BP (back-propagation) algorithm, see:
https://blog.csdn.net/weixin_39441762/article/details/80446692
Now notice one thing: in an ordinary neural network, x1, x2, x3 are processed independently and have no notion of order; swapping the positions of the inputs leaves the output unchanged. So we ask: are some features of the information expressed only in the temporal order, and can we extract those sequential relationships to represent the original data better?
What if we want to handle text or speech? Then we need to treat the input as a continuous whole, so we add a feedback loop to the network (see the figure below). This loop takes the information output at one time step and feeds it back as input at the next time step. For example, take a passage of text and split it into words, then feed the words into the network one at a time: the first word x0 goes in and produces h0, and the hidden layer also leaves a trace behind that is passed on to the next time step. When the second word x1 arrives, it is combined with what the previous word left behind to make a joint decision; the output continues and is passed on to x2, and so on. In this way a sequence of speech or text is fed in one element at a time and judged as a whole. At its core this is still a BP neural network; the difference is precisely this feedback loop (the remembered previous output plus the current input) that helps with the decision.
In other words, after x1 has been processed it leaves a small residual trace behind; when x2 arrives, its learning is influenced by x1's residue, when x3 arrives it is influenced by the residues of x1 and x2, and so on. This is what makes the RNN's result differ from that of an ordinary neural network.
With this rough picture in mind, let's look at the forward computation.
In their treatment of time, RNNs and HMMs are somewhat alike (both are influenced by earlier data); the RNN adds a self-loop (an extra memory store). Consider the propagation below. In the equation for s_t, the first term is the input quantity and the second term is the memory carried over from the previous hidden state; f stands for some activation function, usually tanh. The equation for o_t gives the output produced from the current memory.
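(The two equations were images in the original post; a reconstruction in the post's own notation, where U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weights and b, c are the biases, gives the standard RNN forward equations:)

s_t = f(U x_t + W s_{t-1} + b), \qquad f \text{ usually } \tanh
o_t = \mathrm{softmax}(V s_t + c)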
If the number of time steps is T = 3, i.e. the sentence contains 3 words in total, the words enter the network one after another as inputs, which is equivalent to unrolling one neuron into 3 copies. s is what we call the memory, because it records all the information that came before.
Once the RNN is unrolled, forward propagation is just one pass through the time steps in order, and back-propagation passes the accumulated loss back from the last time step; in essence this is the same as training an ordinary neural network.
That is, through round after round of gradient-descent iterations we obtain suitable RNN parameters U, W, V, b, c. Because the back-propagation runs backwards through time, RNN back-propagation is also called BPTT (back-propagation through time). BPTT does differ from ordinary back-propagation in one important respect: U, W, V, b, c are shared across all positions in the sequence, so during back-propagation we are updating the same parameters.
As shown in the figure, the nodes of the computation graph include U, V, W, b, c as well as the sequence of nodes indexed by t. For each node N we need to compute its gradient recursively, based on the gradients of the nodes that come after it.
We start at the end of the sequence and compute backwards. At the final time step T, s_T has only o_T as a descendant, so its gradient is simple.
We can then iterate backwards through time from t = T-1 down to t = 1. Note that for t < T, s_t has two descendants, o_t and s_{t+1}, so its gradient is the sum of the contributions from both.
From these we obtain the gradients with respect to the parameter nodes as follows:
Here we take the loss L to be the negative log-likelihood of yt given x1, x2, ..., xt, choose tanh as the hidden-layer activation and softmax as the output-layer activation; the gradients obtained during back-propagation are then the right-hand sides of the equations above.
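(Those equations were also images in the original post. A reconstruction in LaTeX, following the standard BPTT derivation in the Deep Learning book, Section 10.2.2, with s in place of the book's h; \hat{y}_t is the softmax output and \mathbf{1}_{y_t} the one-hot true label:)

\nabla_{o_t} L = \hat{y}_t - \mathbf{1}_{y_t}
\nabla_{s_T} L = V^\top \nabla_{o_T} L
\nabla_{s_t} L = W^\top \mathrm{diag}(1 - s_{t+1}^2)\,\nabla_{s_{t+1}} L + V^\top \nabla_{o_t} L \qquad (t < T)
\nabla_{c} L = \sum_t \nabla_{o_t} L, \qquad \nabla_{b} L = \sum_t \mathrm{diag}(1 - s_t^2)\,\nabla_{s_t} L
\nabla_{V} L = \sum_t (\nabla_{o_t} L)\, s_t^\top, \qquad \nabla_{W} L = \sum_t \mathrm{diag}(1 - s_t^2)\,(\nabla_{s_t} L)\, s_{t-1}^\top
\nabla_{U} L = \sum_t \mathrm{diag}(1 - s_t^2)\,(\nabla_{s_t} L)\, x_t^\top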
Note: every time step can output the hidden state s at that time t as well as the loss at that moment. But the overall output o of the RNN here is taken at the last time step; only that is the complete final result. So when back-propagating, the loss we use is the loss at the last time step.
Here is a figure to help you understand the recurrent neural network:
Take a natural-language sentence such as "今天 天气 很 好" ("the weather is very good today"). After the word vectors are generated, the words are fed into the RNN one by one in the time order indicated by the arrows in the figure. Suppose the word vectors are 8-dimensional, e.g. "很" = (0,1,0,0,0,0,0,0), and the hidden layer has 6 neurons. Then "很" is the input at the time step after "天气", and the hidden state at "很" is influenced both by the input for "很" and by the hidden state produced at "天气". For back-propagation through the three red circles in the figure, the parameters we update number 6*6 + 6*8 + 6 = 90, i.e. the weights and biases (the output weights are not counted here). As mentioned above, U, W, V, b, c are shared across all positions in the sequence, so we do not multiply this by 4.
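A minimal numpy sketch of this example, assuming 8-dimensional one-hot word vectors and 6 hidden units (the word indices are made up for illustration); it also confirms the 90-parameter count:

import numpy as np

input_dim, hidden_dim = 8, 6
U = np.random.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights, 6*8 = 48
W = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights, 6*6 = 36
b = np.zeros(hidden_dim)                           # hidden bias, 6  ->  48 + 36 + 6 = 90

def one_hot(i):
    v = np.zeros(input_dim)
    v[i] = 1.0
    return v

sentence = [one_hot(3), one_hot(5), one_hot(1), one_hot(0)]  # "今天 天气 很 好"

s = np.zeros(hidden_dim)                 # initial memory s_0
for x_t in sentence:
    s = np.tanh(U @ x_t + W @ s + b)     # s_t = tanh(U x_t + W s_{t-1} + b); U, W, b are shared
print(s)                                 # final hidden state after the whole sentence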
---------------------------------------------------------------------------------------------------------------------------------
The pipeline is as follows:
Step 1, a piece of text:
Step 2, Chinese word segmentation:
Step 3, map the words to dictionary indices:
sentence1: 1 34 21 98 10 23 9 23
sentence2: 17 12 21 12 8 10 13 79 31 44 923
……
Step 4, pad every sentence to a fixed length
sentence1: 1 34 21 98 10 23 9 23 0 0 0 0 0
sentence2: 17 12 21 12 8 10 13 79 31 44 923 0
Step 5, convert each word-index sequence into a word embedding sequence (explained further below)
sentence1: an embedding matrix in which each column is one word vector (the word-vector dimension is up to you); the number of columns is fixed to the time_step length.
sentence2:
……
Step 6, feed it to the RNN as input
Suppose the RNN's number of time steps is fixed at L; a single run of the RNN then processes the whole sentence from start to finish.
Step 7, get the output
As the figure shows, every time_step can output the hidden state s at that time t; but the overall RNN output o is taken at the last time step t = L, and only that is the complete final result.
Step 8, further processing
Depending on whether the task is classification or regression, the output can be processed further, for example passed to softmax & cross_entropy for classification. (A code sketch of Steps 3-5 follows below.)
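A minimal sketch of Steps 3-5 with the Keras preprocessing utilities also used later in this post (the toy sentences and the num_words value are just for illustration):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

texts = ["this movie is great", "this movie is really bad"]  # toy corpus
token = Tokenizer(num_words=100)                 # keep the 100 most frequent words
token.fit_on_texts(texts)                        # Step 3: build the word -> index dictionary
seqs = token.texts_to_sequences(texts)           # Step 3: text -> index sequences
padded = sequence.pad_sequences(seqs, maxlen=8)  # Step 4: pad/truncate to a fixed length
print(padded)
# Step 5 (the word embedding) is usually done by an Embedding layer inside the model,
# so these padded index sequences are what actually get fed to the network.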
---------------------------------------------------------------------------------------------------------------------------------
A quick aside: an embedding layer is usually added.
With one-hot encoding, every word is its own dimension and the words are mutually independent. But treating every word as unrelated to every other clearly does not match reality; we know that a great many words are related, for example:
Semantics: girl and woman are used for different ages, but both refer to a female.
Number: word and words differ only in singular vs. plural.
Tense: buy and bought both express "to buy", just at different times.
With a one-hot representation, none of these properties is captured.
The embedding layer also reduces dimensionality. The vectors fed into a network are often very high-dimensional one-hot vectors, e.g. 8000-dimensional with a single index set to 1 and every other position 0, i.e. extremely sparse (high-dimensional sparse vectors). An embedding finds a mapping, or function, that produces a representation in a new space: words are mapped to low-dimensional continuous vectors, so the computation can be carried out in, say, a 100-dimensional space (low-dimensional dense vectors). A nice property is that the distance between the word vectors for "woman" and "man" is roughly equal to the distance between "aunt" and "uncle".
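A minimal Keras sketch of such a layer, assuming the 8000-word vocabulary and 100-dimensional embedding from the example above (input_length=12 is an arbitrary illustrative sentence length):

from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
# input_dim: vocabulary size (8000 one-hot dimensions)
# output_dim: size of each dense word vector (100)
# input_length: number of time steps per padded sentence (12, illustrative)
model.add(Embedding(input_dim=8000, output_dim=100, input_length=12))
# input shape:  (batch, 12) integer word indices
# output shape: (batch, 12, 100) dense word vectors
model.summary()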
Now that we have a rough picture of how an RNN works, let's bring up a classic problem of neural networks: the gradient problem.
How do we tell whether gradient explosion is happening?
Gradient explosion during training comes with some subtle signals, such as:
The model fails to learn from the training data (e.g. the loss does not improve).
The model is unstable, with large changes in the loss from update to update.
During training, the model's loss becomes NaN.
Solutions:
Gradient clipping. This is aimed mainly at gradient explosion: set a clipping threshold, and when updating, if a gradient exceeds the threshold, force it back into that range. This prevents gradient explosion.
Weight regularization. The most common forms are L1 and L2 regularization, and every deep-learning framework provides an API for them.
ReLU, LeakyReLU, ELU and similar activation functions.
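A minimal Keras sketch of the first two remedies; the threshold 1.0 and factor 0.01 are illustrative values, not recommendations:

from keras import regularizers
from keras.layers import Dense
from keras.optimizers import Adam

# gradient clipping: clip the norm of every gradient to at most 1.0
optimizer = Adam(clipnorm=1.0)   # or clipvalue=0.5 to clip each component

# L2 weight regularization on a layer's weights
layer = Dense(64, activation='relu',
              kernel_regularizer=regularizers.l2(0.01))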
Note: in fact, in deep neural networks the vanishing gradient appears more often than the exploding one.
As we saw above, an RNN carries memory with it. Take a next-word prediction example: "这顿饭真好" -> "吃" ("this meal is really good" -> "to eat", i.e. tasty). Clearly the first five characters are enough to guess the next one. With only the single character "好" (good) there is no way to know whether it is 好吃 (tasty), 好玩 (fun), or good at something else; only when the memory carries the five preceding characters "这顿饭真…" does the word "饭" (meal) tell us that "好" should be followed by "吃". This is why an RNN with memory does so well at text generation.
But there is another problem: the memory has to reach back far enough to cover the relevant context, and only a sufficiently long memory can store enough history. Suppose someone chats with you all day, tells you a joke in the morning, and in the afternoon asks, "hey, what did the pangolin say?" Can you answer? What if they ask several days later? Clearly not. If the RNN cannot remember the state s from that far back, its processing ability here is effectively zero, because it no longer knows the context the question relies on. In other words, just like BP networks, RNNs have a vanishing-gradient problem and find it hard to learn long-term dependencies: the influence keeps decaying as it propagates through time. The first input has a large influence on the decision, by the second step the influence is smaller, by the third smaller still, and after five or six steps of propagation it has essentially no effect on the decision.
When computing the gradients, the matrices involved contain fairly small values, and multiplying many such matrices makes the gradient shrink at an exponential rate until, after a few steps, it vanishes completely. The gradients from distant time steps become 0, so those time steps contribute nothing to learning, and long-distance dependencies cannot be learned. The vanishing-gradient problem appears not only in RNNs but also in deep feed-forward networks.
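A minimal numpy sketch of this effect, assuming a random recurrent weight matrix with small entries and a tanh derivative of roughly 0.5 at every step; the back-propagated gradient norm collapses toward zero within a handful of time steps:

import numpy as np

np.random.seed(0)
hidden_dim = 6
W = np.random.randn(hidden_dim, hidden_dim) * 0.1  # small recurrent weights
grad = np.ones(hidden_dim)                          # gradient arriving at the last time step

for t in range(20):
    grad = W.T @ (0.5 * grad)   # one step of BPTT: multiply by W^T and the tanh derivative
    if t in (0, 4, 9, 19):
        print("step %2d: gradient norm = %.2e" % (t + 1, np.linalg.norm(grad)))
# early time steps receive almost no gradient, so they stop contributing to the updates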
This is where a powerful companion comes in: the LSTM.
I will explain how the LSTM works in the next post.
The IMDb dataset contains 50,000 movie reviews, split into 25,000 training and 25,000 test items, each labeled as a "positive" or "negative" review. We want to build a model that, after being trained on a large number of reviews, can predict whether a given review is positive or negative.
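Keras also ships a pre-tokenized copy of this dataset; the following one-liner is an alternative to the file-reading code below (it is not what this post uses, which works from the raw aclImdb files):

from keras.datasets import imdb
# each review is already a sequence of word indices; keep the 3800 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=3800)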
##### Data preparation
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import re

# remove HTML tags such as <br /> from the raw review text
re_tag = re.compile(r'<[^>]+>')
def rm_tags(text):
    return re_tag.sub('', text)

import os
def read_files(filetype):
    # read all positive and negative reviews under data/aclImdb/{train,test}
    path = "data/aclImdb/"
    file_list = []
    positive_path = path + filetype + "/pos/"
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]
    negative_path = path + filetype + "/neg/"
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))
    # the first 12500 files are positive (label 1), the remaining 12500 negative (label 0)
    all_labels = ([1] * 12500 + [0] * 12500)
    all_texts = []
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_files("train")
y_test, test_text = read_files("test")

# build a dictionary from the 3800 most frequent words in the training reviews
token = Tokenizer(num_words=3800)
token.fit_on_texts(train_text)

# convert reviews to word-index sequences, then pad/truncate them to length 380
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)
##### Build the model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN

model = Sequential()
# embedding layer: 3800-word dictionary -> 32-dimensional word vectors, 380 time steps
model.add(Embedding(output_dim=32,
                    input_dim=3800,
                    input_length=380))
model.add(Dropout(0.35))
# SimpleRNN layer with 16 hidden units
model.add(SimpleRNN(units=16))
# fully connected hidden layer, then a sigmoid output for the binary sentiment label
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))
model.summary()
##### Train the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
train_history = model.fit(x_train, y_train, batch_size=100,
                          epochs=10, verbose=2,
                          validation_split=0.2)

import matplotlib.pyplot as plt
def show_train_history(train_history, train, validation):
    # plot a training metric against its validation counterpart, epoch by epoch
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'acc', 'val_acc')
show_train_history(train_history, 'loss', 'val_loss')
##### Evaluate the model's accuracy
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]
##### Predicted probabilities
probility = model.predict(x_test)
probility[:10]
for p in probility[12500:12510]:
    print(p)
##### Predicted classes
predict = model.predict_classes(x_test)
predict[:10]
predict.shape
# flatten the (25000, 1) prediction array into a 1-D array of 25000 labels
predict_classes = predict.reshape(25000)
predict_classes
##### Inspect the prediction results
SentimentDict = {1: 'positive', 0: 'negative'}
def display_test_Sentiment(i):
    # print the i-th test review together with its true label and the prediction
    print(test_text[i])
    print('true label:', SentimentDict[y_test[i]],
          'prediction:', SentimentDict[predict_classes[i]])
display_test_Sentiment(2)
'''
Note: the following is program output (this line excluded)
As a recreational golfer with some knowledge of the sport's history, I was pleased with Disney's sensitivity to the issues of class in golf in the early twentieth century. The movie depicted well the psychological battles that Harry Vardon fought within himself, from his childhood trauma of being evicted to his own inability to break that glass ceiling that prevents him from being accepted as an equal in English golf society. Likewise, the young Ouimet goes through his own class struggles, being a mere caddie in the eyes of the upper crust Americans who scoff at his attempts to rise above his standing. What I loved best, however, is how this theme of class is manifested in the characters of Ouimet's parents. His father is a working-class drone who sees the value of hard work but is intimidated by the upper class; his mother, however, recognizes her son's talent and desire and encourages him to pursue his dream of competing against those who think he is inferior.Finally, the golf scenes are well photographed. Although the course used in the movie was not the actual site of the historical tournament, the little liberties taken by Disney do not detract from the beauty of the film. There's one little Disney moment at the pool table; otherwise, the viewer does not really think Disney. The ending, as in "Miracle," is not some Disney creation, but one that only human history could have written.
true label: positive prediction: positive
'''
display_test_Sentiment(3)
'''
Note: the following is program output (this line excluded)
I saw this film in a sneak preview, and it is delightful. The cinematography is unusually creative, the acting is good, and the story is fabulous. If this movie does not do well, it won't be because it doesn't deserve to. Before this film, I didn't realize how charming Shia Lebouf could be. He does a marvelous, self-contained, job as the lead. There's something incredibly sweet about him, and it makes the movie even better. The other actors do a good job as well, and the film contains moments of really high suspense, more than one might expect from a movie about golf. Sports movies are a dime a dozen, but this one stands out. This is one I'd recommend to anyone.
true label: positive prediction: positive
'''
predict_classes[12500:12510]
'''
Note: the following is program output (this line excluded)
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
'''
display_test_Sentiment(12502)
'''
Note: the following is program output (this line excluded)
First of all I hate those moronic rappers, who could'nt act if they had a gun pressed against their foreheads. All they do is curse and shoot each other and acting like cliché'e version of gangsters.The movie doesn't take more than five minutes to explain what is going on before we're already at the warehouse There is not a single sympathetic character in this movie, except for the homeless guy, who is also the only one with half a brain.Bill Paxton and William Sadler are both hill billies and Sadlers character is just as much a villain as the gangsters. I did'nt like him right from the start.The movie is filled with pointless violence and Walter Hills specialty: people falling through windows with glass flying everywhere. There is pretty much no plot and it is a big problem when you root for no-one. Everybody dies, except from Paxton and the homeless guy and everybody get what they deserve.The only two black people that can act is the homeless guy and the junkie but they're actors by profession, not annoying ugly brain dead rappers.Stay away from this crap and watch 48 hours 1 and 2 instead. At lest they have characters you care about, a sense of humor and nothing but real actors in the cast.
true label: negative prediction: negative
'''
# Predict a new review
input_text='''
I can't vote because I have not watched this movie yet. I've been wanting to watch this movie since the time they announced making it which is about 2 years ago (!)
I was planning to go with the family to see the anticipated movie but my nieces had school exams at the opening time so we all decided to wait for the next weekend. I was utterly shocked to learn yesterday that they pulled the movie from the Kuwaiti theaters "temporarily" so that the outrageous censorship system can remove some unwanted scenes.
The controversial gay "moment" according to my online research is barely there, so I can't find any logical reason for all the fuss that's been going on. And it was bad enough when fanatics and haters tried (in vain) to kill the movie with low ratings and negative reviews even before it was in the cinemas and I'm pretty sure most of those trolls never got the chance to watch the movie at that time.
Based on the trailers, I think the movie is very promising and entertaining and you can't simply overlook the tremendous efforts made to bring this beloved tale to life. To knock down hundreds of people's obvious hard work with unprofessional critique and negative reviews just for the sake of hatred is unfathomable. I hope people won't judge a movie before having the experience of watching it in the first place.
Impatiently waiting for the Kuwaiti cinemas to bring back the movie...
'''
input_seq = token.texts_to_sequences([input_text])
pad_input_seq = sequence.pad_sequences(input_seq, maxlen=380)
predict_result = model.predict_classes(pad_input_seq)
SentimentDict[predict_result[0][0]]
'''
'negative'
'''
def predict_review(input_text):
    # tokenize, pad, and classify a single new review, then print the predicted sentiment
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq = sequence.pad_sequences(input_seq, maxlen=380)
    predict_result = model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])
predict_review('''
As a fan of the original Disney film (Personally I feel it's their masterpiece) I was taken aback to the fact that a new version was in the making. Still excited I had high hopes for the film. Most of was shattered in the first 10 minutes. Campy acting with badly performed singing starts off a long journey holding hands with some of the worst CGI Hollywood have managed to but to screen in ages.
A film that is over 50% GCI, should focus on making that part believable, unfortunately for this film, it's far from that. It looks like the original film was ripped apart frame by frame and the beautiful hand-painted drawings have been replaced with digital caricatures. Besides CGI that is bad, it's mostly creepy. As the little teacup boy will give me nightmares for several nights to come. Emma Watson plays the same character as she always does, with very little acting effort and very little conviction as Belle. Although I can see why she was cast in the film based on merits, she is far from the right choice for the role. Dan Stevens does alright under as some motion captured dead-eyed Beast, but his performance feels flat as well. Luke Evans makes for a great pompous Gaston, but a character that has little depth doesn't really make for a great viewing experience. Josh Gad is a great comic relief just like the original movie's LeFou. Other than that, none of the cast stands out enough for me to remember them. Human or CHI creature. I was just bored through out the whole experience. And for a project costing $160 000 000, I can see why the PR department is pushing it so hard because they really need to get some cash back on this pile of wet stinky CGI-fur!
All and all, I might be bias from really loving Disney's first adaptation. That for me marks the high-point of all their work, perfectly combining the skills of their animators along with some CGI in a majestic blend. This film however is more like the bucket you wash off your paintbrush in, it has all the same colors, but muddled with water and to thin to make a captivating story from. The film is quite frankly not worth your time, you would be better off watching the original one more time.
''')
'''
Note: the following is program output (this line excluded)
1/1 [==============================] - 0s
positive
'''
predict_review('''
The original Beauty and the Beast was my favorite cartoon as a kid but it did have major plot holes. Why had no one else ever seen the castle or knew where it was? Didn't anyone miss the people who were cursed? All of that gets an explanation when the enchantress places her curse in the beginning. Why did Belle and her Father move to a small town? Her mother died and the father thought it as best to leave. I love the new songs and added lyrics to the originals. I like the way the cgi beast looks (just the face is CGi). I think Emma Watson is a perfect Belle who is outspoken, fearless, and different. The set design is perfect for the era in France.
I know a lot of people disagree but I found this remake with all its changes to be more enchanting, beautiful, and complete than the original 1991 movie. To each his own but I think everyone should see it for themselves.
''')
'''
Note: the following is program output (this line excluded)
1/1 [==============================] - 0s
positive
'''