使用numpy实现bert模型,使用hugging face 或pytorch训练模型,保存参数为numpy格式,然后使用numpy加载模型推理,可在树莓派上运行
阅读原文时间:2023年08月28日阅读:1

之前分别用numpy实现了mlp,cnn,lstm,这次搞一个大一点的模型bert,纯numpy实现,最重要的是可在树莓派上或其他不能安装pytorch的板子上运行,推理数据

本次模型是随便在hugging face上找的一个新闻评论的模型,7分类

看这些模型参数,这并不重要,模型占硬盘空间都要400+M

bert.embeddings.word_embeddings.weight torch.Size([21128, 768])
bert.embeddings.position_embeddings.weight torch.Size([512, 768])
bert.embeddings.token_type_embeddings.weight torch.Size([2, 768])
bert.embeddings.LayerNorm.weight torch.Size([768])
bert.embeddings.LayerNorm.bias torch.Size([768])
bert.encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.0.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.0.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.0.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.0.output.dense.bias torch.Size([768])
bert.encoder.layer.0.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.0.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.1.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.query.bias torch.Size([768])
bert.encoder.layer.1.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.key.bias torch.Size([768])
bert.encoder.layer.1.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.self.value.bias torch.Size([768])
bert.encoder.layer.1.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.1.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.1.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.1.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.1.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.1.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.1.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.1.output.dense.bias torch.Size([768])
bert.encoder.layer.1.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.1.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.2.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.self.query.bias torch.Size([768])
bert.encoder.layer.2.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.self.key.bias torch.Size([768])
bert.encoder.layer.2.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.self.value.bias torch.Size([768])
bert.encoder.layer.2.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.2.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.2.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.2.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.2.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.2.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.2.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.2.output.dense.bias torch.Size([768])
bert.encoder.layer.2.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.2.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.3.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.self.query.bias torch.Size([768])
bert.encoder.layer.3.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.self.key.bias torch.Size([768])
bert.encoder.layer.3.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.self.value.bias torch.Size([768])
bert.encoder.layer.3.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.3.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.3.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.3.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.3.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.3.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.3.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.3.output.dense.bias torch.Size([768])
bert.encoder.layer.3.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.3.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.4.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.self.query.bias torch.Size([768])
bert.encoder.layer.4.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.self.key.bias torch.Size([768])
bert.encoder.layer.4.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.self.value.bias torch.Size([768])
bert.encoder.layer.4.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.4.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.4.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.4.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.4.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.4.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.4.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.4.output.dense.bias torch.Size([768])
bert.encoder.layer.4.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.4.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.5.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.self.query.bias torch.Size([768])
bert.encoder.layer.5.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.self.key.bias torch.Size([768])
bert.encoder.layer.5.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.self.value.bias torch.Size([768])
bert.encoder.layer.5.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.5.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.5.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.5.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.5.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.5.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.5.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.5.output.dense.bias torch.Size([768])
bert.encoder.layer.5.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.5.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.6.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.self.query.bias torch.Size([768])
bert.encoder.layer.6.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.self.key.bias torch.Size([768])
bert.encoder.layer.6.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.self.value.bias torch.Size([768])
bert.encoder.layer.6.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.6.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.6.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.6.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.6.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.6.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.6.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.6.output.dense.bias torch.Size([768])
bert.encoder.layer.6.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.6.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.7.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.self.query.bias torch.Size([768])
bert.encoder.layer.7.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.self.key.bias torch.Size([768])
bert.encoder.layer.7.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.self.value.bias torch.Size([768])
bert.encoder.layer.7.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.7.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.7.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.7.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.7.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.7.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.7.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.7.output.dense.bias torch.Size([768])
bert.encoder.layer.7.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.7.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.8.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.self.query.bias torch.Size([768])
bert.encoder.layer.8.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.self.key.bias torch.Size([768])
bert.encoder.layer.8.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.self.value.bias torch.Size([768])
bert.encoder.layer.8.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.8.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.8.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.8.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.8.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.8.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.8.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.8.output.dense.bias torch.Size([768])
bert.encoder.layer.8.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.8.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.9.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.self.query.bias torch.Size([768])
bert.encoder.layer.9.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.self.key.bias torch.Size([768])
bert.encoder.layer.9.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.self.value.bias torch.Size([768])
bert.encoder.layer.9.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.9.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.9.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.9.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.9.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.9.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.9.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.9.output.dense.bias torch.Size([768])
bert.encoder.layer.9.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.9.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.10.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.self.query.bias torch.Size([768])
bert.encoder.layer.10.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.self.key.bias torch.Size([768])
bert.encoder.layer.10.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.self.value.bias torch.Size([768])
bert.encoder.layer.10.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.10.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.10.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.10.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.10.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.10.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.10.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.10.output.dense.bias torch.Size([768])
bert.encoder.layer.10.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.10.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.11.attention.self.query.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.self.query.bias torch.Size([768])
bert.encoder.layer.11.attention.self.key.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.self.key.bias torch.Size([768])
bert.encoder.layer.11.attention.self.value.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.self.value.bias torch.Size([768])
bert.encoder.layer.11.attention.output.dense.weight torch.Size([768, 768])
bert.encoder.layer.11.attention.output.dense.bias torch.Size([768])
bert.encoder.layer.11.attention.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.11.attention.output.LayerNorm.bias torch.Size([768])
bert.encoder.layer.11.intermediate.dense.weight torch.Size([3072, 768])
bert.encoder.layer.11.intermediate.dense.bias torch.Size([3072])
bert.encoder.layer.11.output.dense.weight torch.Size([768, 3072])
bert.encoder.layer.11.output.dense.bias torch.Size([768])
bert.encoder.layer.11.output.LayerNorm.weight torch.Size([768])
bert.encoder.layer.11.output.LayerNorm.bias torch.Size([768])
bert.pooler.dense.weight torch.Size([768, 768])
bert.pooler.dense.bias torch.Size([768])
classifier.weight torch.Size([7, 768])
classifier.bias torch.Size([7])

为了实现numpy的bert模型,踩了两天的坑,一步步对比huggingface源码实现的,真的太难了~~~

这是使用numpy实现的bert代码,分数上和huggingface有稍微的一点点区别,可能是模型太大,保存的模型参数误差累计造成的!

看下面的代码真的有利于直接了解bert模型结构,各种细节简单又到位,自己都服自己,研究这个东西~~~

import numpy as np

def word_embedding(input_ids, word_embeddings):
return word_embeddings[input_ids]

def position_embedding(position_ids, position_embeddings):
return position_embeddings[position_ids]

def token_type_embedding(token_type_ids, token_type_embeddings):
return token_type_embeddings[token_type_ids]

def softmax(x, axis=None):
# e_x = np.exp(x).astype(np.float32) #
e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
sum_ex = np.sum(e_x, axis=axis,keepdims=True).astype(np.float32)
return e_x / sum_ex

def scaled_dot_product_attention(Q, K, V, mask=None):
d_k = Q.shape[-1]
scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
if mask is not None:
scores = np.where(mask, scores, np.full_like(scores, -np.inf))
attention_weights = softmax(scores, axis=-1)
# print(attention_weights)
# print(np.sum(attention_weights,axis=-1))
output = np.matmul(attention_weights, V)
return output, attention_weights

def multihead_attention(input, num_heads,W_Q,B_Q,W_K,B_K,W_V,B_V,W_O,B_O):

q = np.matmul(input, W\_Q.T)+B\_Q  
k = np.matmul(input, W\_K.T)+B\_K  
v = np.matmul(input, W\_V.T)+B\_V

# 分割输入为多个头  
q = np.split(q, num\_heads, axis=-1)  
k = np.split(k, num\_heads, axis=-1)  
v = np.split(v, num\_heads, axis=-1)

outputs = \[\]  
for q\_,k\_,v\_ in zip(q,k,v):  
    output, attention\_weights = scaled\_dot\_product\_attention(q\_, k\_, v\_)  
    outputs.append(output)  
outputs = np.concatenate(outputs, axis=-1)  
outputs = np.matmul(outputs, W\_O.T)+B\_O  
return outputs

def layer_normalization(x, weight, bias, eps=1e-12):
mean = np.mean(x, axis=-1, keepdims=True)
variance = np.var(x, axis=-1, keepdims=True)
std = np.sqrt(variance + eps)
normalized_x = (x - mean) / std
output = weight * normalized_x + bias
return output

def feed_forward_layer(inputs, weight, bias, activation='relu'):
linear_output = np.matmul(inputs,weight) + bias

if activation == 'relu':  
    activated\_output = np.maximum(0, linear\_output)  # ReLU激活函数  
elif activation == 'gelu':  
    activated\_output = 0.5 \* linear\_output \* (1 + np.tanh(np.sqrt(2 / np.pi) \* (linear\_output + 0.044715 \* np.power(linear\_output, 3))))  # GELU激活函数

elif activation == "tanh" :  
    activated\_output = np.tanh(linear\_output)  
else:  
    activated\_output = linear\_output  # 无激活函数

return activated\_output

def residual_connection(inputs, residual):
# 残差连接
residual_output = inputs + residual
return residual_output

def tokenize_sentence(sentence, vocab_file = 'vocab.txt'):
with open(vocab_file, 'r', encoding='utf-8') as f:
vocab = f.readlines()
vocab = [i.strip() for i in vocab]
# print(len(vocab))

tokenized\_sentence = \['\[CLS\]'\] + list(sentence) + \["\[SEP\]"\] # 在句子开头添加\[cls\]  
token\_ids = \[vocab.index(token) for token in tokenized\_sentence\]

return token\_ids

加载保存的模型数据

model_data = np.load('bert_model_params.npz')

word_embeddings = model_data["bert.embeddings.word_embeddings.weight"]
position_embeddings = model_data["bert.embeddings.position_embeddings.weight"]
token_type_embeddings = model_data["bert.embeddings.token_type_embeddings.weight"]
def model_input(sentence):
token_ids = tokenize_sentence(sentence)
input_ids = np.array(token_ids) # 输入的词汇id
word_embedded = word_embedding(input_ids, word_embeddings)

position\_ids = np.array(range(len(input\_ids)))  # 位置id  
# 位置嵌入矩阵,形状为 (max\_position, embedding\_size)  
position\_embedded = position\_embedding(position\_ids, position\_embeddings)

token\_type\_ids = np.array(\[0\]\*len(input\_ids))  # 片段类型id  
# 片段类型嵌入矩阵,形状为 (num\_token\_types, embedding\_size)  
token\_type\_embedded = token\_type\_embedding(token\_type\_ids, token\_type\_embeddings)

embedding\_output = np.expand\_dims(word\_embedded + position\_embedded + token\_type\_embedded, axis=0)

return embedding\_output

def bert(input,num_heads):

ebd\_LayerNorm\_weight = model\_data\['bert.embeddings.LayerNorm.weight'\]  
ebd\_LayerNorm\_bias = model\_data\['bert.embeddings.LayerNorm.bias'\]  
input = layer\_normalization(input,ebd\_LayerNorm\_weight,ebd\_LayerNorm\_bias)     #这里和模型输出一致

for i in range(12):  
    # 调用多头自注意力函数  
    W\_Q = model\_data\['bert.encoder.layer.{}.attention.self.query.weight'.format(i)\]  
    B\_Q = model\_data\['bert.encoder.layer.{}.attention.self.query.bias'.format(i)\]  
    W\_K = model\_data\['bert.encoder.layer.{}.attention.self.key.weight'.format(i)\]  
    B\_K = model\_data\['bert.encoder.layer.{}.attention.self.key.bias'.format(i)\]  
    W\_V = model\_data\['bert.encoder.layer.{}.attention.self.value.weight'.format(i)\]  
    B\_V = model\_data\['bert.encoder.layer.{}.attention.self.value.bias'.format(i)\]  
    W\_O = model\_data\['bert.encoder.layer.{}.attention.output.dense.weight'.format(i)\]  
    B\_O = model\_data\['bert.encoder.layer.{}.attention.output.dense.bias'.format(i)\]  
    attention\_output\_LayerNorm\_weight = model\_data\['bert.encoder.layer.{}.attention.output.LayerNorm.weight'.format(i)\]  
    attention\_output\_LayerNorm\_bias = model\_data\['bert.encoder.layer.{}.attention.output.LayerNorm.bias'.format(i)\]  
    intermediate\_weight = model\_data\['bert.encoder.layer.{}.intermediate.dense.weight'.format(i)\]  
    intermediate\_bias = model\_data\['bert.encoder.layer.{}.intermediate.dense.bias'.format(i)\]  
    dense\_weight = model\_data\['bert.encoder.layer.{}.output.dense.weight'.format(i)\]  
    dense\_bias = model\_data\['bert.encoder.layer.{}.output.dense.bias'.format(i)\]  
    output\_LayerNorm\_weight = model\_data\['bert.encoder.layer.{}.output.LayerNorm.weight'.format(i)\]  
    output\_LayerNorm\_bias = model\_data\['bert.encoder.layer.{}.output.LayerNorm.bias'.format(i)\]

    output = multihead\_attention(input, num\_heads,W\_Q,B\_Q,W\_K,B\_K,W\_V,B\_V,W\_O,B\_O)  
    output = residual\_connection(input,output)  
    output1 = layer\_normalization(output,attention\_output\_LayerNorm\_weight,attention\_output\_LayerNorm\_bias)    #这里和模型输出一致

    output = feed\_forward\_layer(output1, intermediate\_weight.T, intermediate\_bias, activation='gelu')  
    output = feed\_forward\_layer(output, dense\_weight.T, dense\_bias, activation='')  
    output = residual\_connection(output1,output)  
    output2 = layer\_normalization(output,output\_LayerNorm\_weight,output\_LayerNorm\_bias)   #一致

    input = output2

bert\_pooler\_dense\_weight = model\_data\['bert.pooler.dense.weight'\]  
bert\_pooler\_dense\_bias = model\_data\['bert.pooler.dense.bias'\]  
output = feed\_forward\_layer(output2, bert\_pooler\_dense\_weight.T, bert\_pooler\_dense\_bias, activation='tanh')    #一致  
return output

for i in model_data:

# print(i)

print(i,model_data[i].shape)

id2label = {0: 'mainland China politics', 1: 'Hong Kong - Macau politics', 2: 'International news', 3: 'financial news', 4: 'culture', 5: 'entertainment', 6: 'sports'}
if __name__ == "__main__":

# 示例用法  
sentence = '马拉松决赛'  
# print(model\_input(sentence).shape)  
output = bert(model\_input(sentence),num\_heads=12)  
# print(output)  
classifier\_weight = model\_data\['classifier.weight'\]  
classifier\_bias = model\_data\['classifier.bias'\]  
output = feed\_forward\_layer(output\[:,0,:\], classifier\_weight.T, classifier\_bias, activation='')  
# print(output)  
output = softmax(output,axis=-1)  
label\_id = np.argmax(output,axis=-1)  
label\_score = output\[0\]\[label\_id\]  
print(id2label\[label\_id\[0\]\],label\_score)

这是hugging face上找的一个别人训练好的模型,roberta模型作新闻7分类,并且保存模型结构为numpy格式,为了上面的代码加载

import numpy as np
from transformers import AutoModelForSequenceClassification,AutoTokenizer,pipeline
model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
text_classification = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(text_classification("马拉松决赛"))

print(model)

打印BERT模型的权重维度

for name, param in model.named_parameters():
print(name, param.data.shape)

# # 保存模型参数为NumPy格式

model_params = {name: param.data.cpu().numpy() for name, param in model.named_parameters()}
np.savez('bert_model_params.npz', **model_params)

model_params

对比两个结果:

hugging face:[{'label': 'sports', 'score': 0.9929242134094238}]

numpy:sports [0.9928773]