色8888久久久久久影院伊人,1区2区3区av在线

概述#

算能BM1684X芯片已經(jīng)實現(xiàn)ChatGLM2-6B的C++代碼部署，代碼實現(xiàn)鏈接：https://github.com/sophgo/ChatGLM2-TPU。

本文總結(jié)部署該模型過程中的一些技術(shù)點。首先是ChatGLM2-6B的整體運行流程，然后介紹如何將該動態(tài)網(wǎng)路轉(zhuǎn)換成靜態(tài)網(wǎng)絡(luò)形式，接著介紹如何將該模型導(dǎo)出成ONNX。

最后如何將ONNX使用TPU-MLIR編譯器實現(xiàn)網(wǎng)絡(luò)的編譯以及用C++代碼編寫應(yīng)用程序，可以直接看源碼就可以理解，這里不做介紹。

ChatGLM2-6b運行流程#

如圖該網(wǎng)絡(luò)基本上可以分為5個階段：

將句子通過分詞器(使用google的sentencepiece）轉(zhuǎn)換成tokens，如圖中的<1x17 xi32>的數(shù)據(jù)。注意tokens里面64790, 64792是起始符號。

通過WordEmbedding將tokens轉(zhuǎn)換成詞向量，得到的結(jié)果是<1x17x4096 xf32>的數(shù)據(jù)。

通過Tranformer進行神經(jīng)網(wǎng)絡(luò)推理，推理結(jié)果是<1x17x4096 xf32>，答案在最后的詞向量上，所以做一個額外的slice操作，得到<1x1x4096 xf32>。這里Transformer網(wǎng)絡(luò)是由28個Block組成，每個Block的核心是一個Attention運算，并輸出kv cache給下一輪Transform做為輸入。

經(jīng)過LmHead操作生成<1x1 xi32>的結(jié)果，也就是輸出的Token。LmHead的組成如圖所示。

Token經(jīng)過分詞器轉(zhuǎn)成詞語，且傳遞給下一輪推理，進入第一階段。直到token == EOS_ID結(jié)束。

轉(zhuǎn)成靜態(tài)網(wǎng)絡(luò)#

ChatGLM2-6B從前面的描述中，可以看到有兩處是動態(tài)的，一是因句子的長短不同，Transformer的輸入Shape有所有不同；二是每一輪Transformer生成的kv cache會逐步增長。為了方便部署，根據(jù)網(wǎng)絡(luò)特點轉(zhuǎn)換成靜態(tài)網(wǎng)絡(luò)。轉(zhuǎn)換后的運行流程如下：

從圖中可以看到句子不論長短，轉(zhuǎn)換后的tokens始終是<1x512x i32>，kv cache的數(shù)據(jù)也始終是<512x1x2x128x f32>。

這里介紹最關(guān)鍵的幾點：

將原始tokens輸入尾部補0，從<1x17x i32>轉(zhuǎn)換成<1x512x i32>。

將position_ids從GlmBlock中提取出來，并固定長度為<1x512x i32>，也是尾部補0，本例中的數(shù)值為[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,0,0,0,0,...0]，用于位置編碼。因為在原始網(wǎng)絡(luò)中這部分是變長，提取出來后就可以做成定長。

將attention_mask從GlmBlock中提取出來，并固定長度為<1x512x512x f32>，注意無效部分全部補1，因為它之后會有masked_fill操作將mask為1的部分全部配置為-inf。然后經(jīng)過Softmax使超出部分全部清0，保留有效部分，從而保證最終結(jié)果與原始結(jié)果一致。如下圖，為說明問題，Attention做了簡化。

第一輪Transformer后，kv_chache有效部分是[0:17]，我們將該部分移到末尾[512-17:]，并頭部清0。因為kv cache的累加發(fā)生在尾部。從第二輪開始累加后做Slice操作去掉頭部1個單位，取[1:]，這樣就保證了kv cache始終保持在512。同時attention mask也要是從尾部實際token len長度的0，頭部全部置1。

導(dǎo)出ONNX#

將該網(wǎng)絡(luò)分為4塊：WorkEmbedding，GlmBlock，GlmBlockCache，LmHead。這里分別介紹這四塊是如何導(dǎo)出的。

導(dǎo)出前，先要指定python路徑，如下：

1	export PYTHONPATH=/workspace/chatglm2-6b:$PYTHONPATH

需要先加載原始ChatGLM2-6B，如下代碼：

CHATGLM2_PATH = "/workspace/chatglm2-6b"

origin_model = AutoModel.from_pretrained(CHATGLM2_PATH,
                                         trust_remote_code=True).float()
origin_model.eval()
transformer = origin_model.transformer
MAX_LEN = transformer.seq_length
for param in origin_model.parameters():
    param.requires_grad = False
num_layers = transformer.encoder.num_layers
layers = transformer.encoder.layers

WorkEmbedding#

直接使用原模型中的word_embeddings，構(gòu)建成獨立網(wǎng)絡(luò)，導(dǎo)出即可

class Embedding(torch.nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, input_ids):
        return transformer.embedding.word_embeddings(input_ids)

def convert_embedding():
    model = Embedding()
    torch.onnx.export(model, (torch.tensor([0, 1, 2, 3])),
                      f'./tmp/embedding.onnx',
                      verbose=False,
                      input_names=['input_ids'],
                      output_names=['input_embed'],
                      dynamic_axes={"input_ids": {0: "length"}},
                      do_constant_folding=True,
                      opset_version=15)

GlmBlock#

需要將transformer.rotary_pos_emb和transformer.encoder.layers組合構(gòu)建獨立網(wǎng)路，導(dǎo)出。因為有28個block，所以需要導(dǎo)出28個ONNX模型。這里的position_ids和attention_mask作為外部輸入，前面有介紹。其實這28個Block是可以組合成一個模型，但是這樣導(dǎo)致onnx權(quán)重過大(F16約12GB)，導(dǎo)出麻煩，部署也麻煩，所以單個導(dǎo)出。

class GlmBlock(torch.nn.Module):

    def __init__(self, layer_id):
        super().__init__()
        self.layer = layers[layer_id]

    def forward(self, hidden_states, position_ids, attention_mask):
        rotary_pos_emb = transformer.rotary_pos_emb(MAX_LEN)[position_ids]
        rotary_pos_emb = rotary_pos_emb.transpose(0, 1).contiguous()
        hidden_states, past_kv = self.layer(hidden_states, attention_mask,
                                            rotary_pos_emb=rotary_pos_emb)
        return hidden_states, past_kv

def convert_glm_block(layer_id):
    model = GlmBlock(layer_id)
    torch.onnx.export(
        model, (hidden_states, position_ids, attention_mask),
        f'./tmp/glm_block_{layer_id}.onnx',
        verbose=False,
        input_names=['input_states', 'position_ids', 'attention_mask'],
        output_names=['hidden_states', 'past_k', 'past_v'],
        do_constant_folding=True,
        opset_version=15)

GlmBlockCache#

與`GlmBlock是類似的，但是需要額外的kv cache參數(shù)。注意這里最后會把頭部1個單位去除掉。

class GlmBlockCache(torch.nn.Module):
    def __init__(self, layer_id):
        super().__init__()
        self.layer = layers[layer_id]

    def forward(self, hidden_states, position_ids, attention_mask, past_k, past_v):
        rotary_pos_emb = transformer.rotary_pos_emb(MAX_LEN)[position_ids]
        rotary_pos_emb = rotary_pos_emb.transpose(0, 1).contiguous()
        hidden_states, past_kv = self.layer(hidden_states, attention_mask,
                                            kv_cache=(past_k, past_v),
                                            rotary_pos_emb=rotary_pos_emb)
        past_k, past_v = past_kv
        return hidden_states, past_k[1:], past_v[1:]

def convert_glm_block_cache(layer_id):
    model = GlmBlockCache(layer_id)
    torch.onnx.export(
        model, (hidden_states, position_ids, attention_mask, past_k, past_v),
        f'./tmp/glm_block_cache_{layer_id}.onnx',
        verbose=False,
        input_names=['input_states', 'position_ids', 'attention_mask', 'history_k', 'history_v'],
        output_names=['hidden_states', 'past_k', 'past_v'],
        do_constant_folding=True,
        opset_version=15)

LmHead#

這里取m_logits后使用topk，其實也是可以用argmax，看芯片實現(xiàn)哪一種效率高。

class LmHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, hidden_states):
        hidden_states = transformer.encoder.final_layernorm(hidden_states)
        m_logits = transformer.output_layer(hidden_states)
        _, token = torch.topk(m_logits, 1)
        return token
def convert_lm_head():
    model = LmHead()
    input = torch.randn(1, 4096)
    torch.onnx.export(model, (input), f'./tmp/lm_head.onnx', verbose=False,
                      input_names=['hidden_states'],
                      output_names=['token'],
                      do_constant_folding=True,
                      opset_version=15)

部署#

上述轉(zhuǎn)完ONNX模型后都已經(jīng)是靜態(tài)網(wǎng)絡(luò)，通過TPU-MLIR，可以很容易的轉(zhuǎn)換成F16的模型。但是特別要注意的是RmsNorm需要用F32。之后就可以按照執(zhí)行邏輯編寫C++代碼。演示效果如下：

審核編輯：湯梓紅

聲明：本文內(nèi)容及配圖由入駐作者撰寫或者入駐合作網(wǎng)站授權(quán)轉(zhuǎn)載。文章觀點僅代表作者本人，不代表電子發(fā)燒友網(wǎng)立場。文章及其配圖僅供工程師學(xué)習(xí)之用，如有內(nèi)容侵權(quán)或者其他違規(guī)問題，請聯(lián)系本站處理。舉報投訴