
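After the previous chapters on tuning Triton on NVIDIA GPUs, this chapter looks at one of Triton's third_party libraries, triton-shared, which exists to let Triton support additional backends. The project address is given below. It is already part of Triton's main branch and is officially supported as a third_party component; when cloning Triton, passing the recursive flag is enough to make triton-shared available.

What is triton-shared?

The official implementation of triton-shared lives in the following GitHub repo:

GitHub - microsoft/triton-shared: Shared Middle-Layer for Triton Compilation (github.com/microsoft/triton-shared)

The official description of triton-shared is as follows: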

A shared middle-layer for the Triton Compiler.

Currently the middle layer is not complete but has enough functionality to demonstrate how it can work. The general idea is that Triton IR is lowered into an MLIR core dialect to allow it to be both shared across Triton targets as well as allow back-ends to be shared with other languages.

The basic intended architecture looks like this:

[Triton IR] -> [Middle Layer] -> [HW specific IR]

The middle-layer uses MLIR's Linalg and Tensor Dialects for operations on Triton block values. Operations on Triton pointers use the Memref Dialect.

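triton-shared is essentially a glue-like middle layer. Its design makes it easier for a programming language or compiler to connect to different downstream hardware ecosystems. Triton itself already implements the two most common GPU backends, NVIDIA and AMD, so if a third-party vendor wants to reuse Triton's frontend to build a compilation flow for its own chip, triton-shared plays a decisive role.

The figure below shows the vision that the Triton codebase hopes to support. The vertical branch in the middle is the optimization path for NVIDIA GPUs. The Triton DSL written by the user is first translated into a Python AST, and the AST is then converted to the Triton dialect. From this step on, the user's handwritten code has officially entered the MLIR ecosystem. The Triton dialect is then lowered further to the Triton GPU dialect, and from the Triton GPU dialect onward the flow follows fairly standard LLVM code generation: LLVM IR is lowered to PTX and then to SASS, which finally runs on the NVIDIA GPU. Compared with other compiler frameworks such as TVM, this codegen path is more aggressive: it bypasses the nvcc compiler entirely, which makes the whole process transparent and opens up more possibilities for performance optimization.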

[Figure: the lowering paths that the Triton codebase aims to support]

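triton-shared mainly covers the rightmost branch. As anyone familiar with MLIR knows, the Linalg dialect is a very important dialect in that branch: it can feed many different backends, and in the compilation and optimization stages of several mainstream backends Linalg serves as the main dialect for converting and connecting between upstream and downstream dialects.

Installing triton-shared

Installing triton-shared is also quite simple. First clone the whole Triton main branch with the recursive flag, then use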

export TRITON_CODEGEN_TRITON_SHARED=1

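to indicate that the triton-shared third-party library should be used when building the whole Triton project. After that, follow the README of the official Triton repo step by step. As for LLVM, I use an LLVM that I compiled manually at a specific commit id: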

LLVM commit id: b1115f8ccefb380824a9d997622cc84fc0d84a89
Triton commit id: 1c2d2405bf04dca2de140bccd65480c3d02d995e

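Why pick these two fixed commit ids? The reason is simple: my earlier development work on Triton and LLVM was all based on these two ids, so all of my later tutorials and examples will mainly use them as well. If you do not know how to build Triton from scratch, you can refer to my earlier tutorial:

科研敗犬丶: OpenAI/Triton MLIR Chapter 0: Building from Source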
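The installation steps above can be collected into a small Python helper. This is only a sketch under stated assumptions: the repository URL and the editable `pip install` from the `python/` directory are not given in this article and are assumed here, `build_plan` is a hypothetical name, and pointing the build at a manually compiled LLVM is left out entirely. The function only assembles the environment and commands; it does not run anything.

```python
import os

# Commit ids quoted in the article.
LLVM_COMMIT = "b1115f8ccefb380824a9d997622cc84fc0d84a89"
TRITON_COMMIT = "1c2d2405bf04dca2de140bccd65480c3d02d995e"

# Assumption: upstream repository URL (the article does not spell it out).
TRITON_REPO = "https://github.com/openai/triton.git"


def build_plan(workdir="triton"):
    """Return (env, commands) for building Triton with triton-shared enabled.

    Nothing is executed here. The commands are only assembled so that they
    can be inspected, or passed one by one to subprocess.run(cmd, env=env).
    """
    env = dict(os.environ)
    # The switch described in the article.
    env["TRITON_CODEGEN_TRITON_SHARED"] = "1"
    commands = [
        ["git", "clone", "--recursive", TRITON_REPO, workdir],
        ["git", "-C", workdir, "checkout", TRITON_COMMIT],
        # Re-sync submodules so third_party matches the checked-out commit.
        ["git", "-C", workdir, "submodule", "update", "--init", "--recursive"],
        # Assumption: editable install from the python/ directory.
        ["pip", "install", "-e", os.path.join(workdir, "python")],
    ]
    return env, commands


if __name__ == "__main__":
    env, commands = build_plan()
    for cmd in commands:
        print(" ".join(cmd))
```

Printing the plan first makes it easy to compare it against the README of the commit you actually check out before running anything.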

Using triton-shared

Having covered what triton-shared is and how to install it, let us now look at how to use the compiled triton-shared. After you build Triton following the steps above, you will find, under this path:

/triton/build/tools/triton-shared-opt

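an executable named triton-shared-opt. Readers familiar with MLIR will quickly recognize it as MLIR's most basic opt tool: the binary lowers one dialect to another. Let us use --help to see everything triton-shared-opt can do. If your terminal prints information like the following, triton-shared has been installed completely.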

OVERVIEW: Triton-Shared test driver

Available Dialects: arith, builtin, cf, gpu, math, scf, triton_gpu, tt
USAGE: triton-shared-opt [options]

OPTIONS:

Color Options:

  --color - Use colors in output (default=autodetect)

General options:

  --abort-on-max-devirt-iterations-reached - Abort when the max iterations for devirtualization CGSCC repeat pass is reached
  --allow-unregistered-dialect - Allow operation with no registered dialects
  Compiler passes to run
    Passes:
      --affine-data-copy-generate - Generate explicit copying for affine memory operations
        --fast-mem-capacity= - Set fast memory space capacity in KiB (default: unlimited)
        --fast-mem-space= - Fast memory space identifier for copy generation (default: 1)
        --generate-dma - Generate DMA instead of point-wise copy
        --min-dma-transfer= - Minimum DMA transfer size supported by the target in bytes
        --skip-non-unit-stride-loops - Testing purposes: avoid non-unit stride loop choice depths for copy placement
        --slow-mem-space= - Slow memory space identifier for copy generation (default: 0)
        --tag-mem-space= - Tag memory space identifier for copy generation (default: 0)
      --affine-expand-index-ops - Lower affine operations operating on indices into more fundamental operations
      --affine-loop-coalescing - Coalesce nested loops with independent bounds into a single loop
      --affine-loop-fusion - Fuse affine loop nests
      ...
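Because the help text is long, it can be handy to check programmatically that the pass you need is registered. The following is a minimal sketch: `read_help`, `list_flags` and `has_flag` are hypothetical helper names, and the `sample` string is a hand-written excerpt in the same shape as the output above, not a complete capture.

```python
import re
import subprocess


def read_help(opt_path):
    """Run `<opt_path> --help` and return its standard output as text."""
    result = subprocess.run(
        [opt_path, "--help"], capture_output=True, text=True, check=True
    )
    return result.stdout


def list_flags(help_text):
    """Collect every '--name' that starts a line of the help output."""
    flags = []
    for line in help_text.splitlines():
        match = re.match(r"\s*--([A-Za-z0-9][A-Za-z0-9-]*)", line)
        if match:
            flags.append(match.group(1))
    return flags


def has_flag(help_text, name):
    """Return True if a flag called `name` is listed in the help output."""
    return name in list_flags(help_text)


# Hand-written excerpt used for illustration only.
sample = """\
  --affine-loop-fusion   - Fuse affine loop nests
  --fast-mem-space=      - Fast memory space identifier for copy generation
  --triton-to-linalg     - Convert Triton to Linalg dialect
"""

if __name__ == "__main__":
    print(has_flag(sample, "triton-to-linalg"))
```

With a real build, `has_flag(read_help("/triton/build/tools/triton-shared-opt"), "triton-to-linalg")` would perform the same check against the actual binary.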

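Here we first demonstrate the use of the

--triton-to-linalg - Convert Triton to Linalg dialect

pass, because this conversion is what triton-shared is mainly for. It takes the Triton dialect as input and, through the triton-to-linalg pass, lowers it to the Linalg dialect with the same semantics. Where does the Triton dialect input come from? Don't worry: the triton-shared repo provides many files in MLIR format so that we can try this out. They are located at:

/triton/third_party/triton_shared/test/Conversion/TritonToLinalg/*

In this tutorial we use dot.mlir as the example. Its code is shown below: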

// RUN: triton-shared-opt --triton-to-linalg %s | FileCheck %s
module {
  tt.func @kernel(
    %arg0 : !tt.ptr<bf16>,
    %arg1 : !tt.ptr<bf16>,
    %arg2 : !tt.ptr<bf16>
  )
  {
    %0 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
    %c64 = arith.constant 128 : i32
    %1 = tt.splat %c64 : (i32) -> tensor<128xi32>
    %2 = arith.muli %0, %1 : tensor<128xi32>
    %3 = tt.expand_dims %2 {axis = 1 : i32} : (tensor<128xi32>) -> tensor<128x1xi32>
    %4 = tt.broadcast %3 : (tensor<128x1xi32>) -> tensor<128x64xi32>
    %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
    %6 = tt.expand_dims %5 {axis = 0 : i32} : (tensor<64xi32>) -> tensor<1x64xi32>
    %7 = tt.broadcast %6 : (tensor<1x64xi32>) -> tensor<128x64xi32>
    %8 = arith.addi %4, %7 : tensor<128x64xi32>
    %10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
    %11 = tt.expand_dims %10 {axis = 1 : i32} : (tensor<256xi32>) -> tensor<256x1xi32>
    %12 = tt.broadcast %11 : (tensor<256x1xi32>) -> tensor<256x64xi32>
    %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
    %c256 = arith.constant 256 : i32
    %14 = tt.splat %c256 : (i32) -> tensor<64xi32>
    %15 = arith.muli %13, %14 : tensor<64xi32>
    %16 = tt.expand_dims %15 {axis = 0 : i32} : (tensor<64xi32>) -> tensor<1x64xi32>
    %17 = tt.broadcast %16 : (tensor<1x64xi32>) -> tensor<256x64xi32>
    %18 = arith.addi %12, %17 : tensor<256x64xi32>
    %20 = tt.splat %c256 : (i32) -> tensor<128xi32>
    %21 = arith.muli %0, %20 : tensor<128xi32>
    %22 = tt.expand_dims %21 {axis = 1 : i32} : (tensor<128xi32>) -> tensor<128x1xi32>
    %23 = tt.broadcast %22 : (tensor<128x1xi32>) -> tensor<128x256xi32>
    %24 = tt.expand_dims %10 {axis = 0 : i32} : (tensor<256xi32>) -> tensor<1x256xi32>
    %25 = tt.broadcast %24 {axis = 0 : i32} : (tensor<1x256xi32>) -> tensor<128x256xi32>
    %26 = arith.addi %23, %25 : tensor<128x256xi32>
    %30 = tt.splat %arg0 : (!tt.ptr<bf16>) -> tensor<128x64x!tt.ptr<bf16>>
    %31 = tt.addptr %30, %8 : tensor<128x64x!tt.ptr<bf16>>, tensor<128x64xi32>
    %32 = tt.load %31 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128x64xbf16>
    %40 = tt.splat %arg1 : (!tt.ptr<bf16>) -> tensor<256x64x!tt.ptr<bf16>>
    %41 = tt.addptr %40, %18 : tensor<256x64x!tt.ptr<bf16>>, tensor<256x64xi32>
    %42 = tt.load %41 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<256x64xbf16>
    %43 = tt.trans %42 : (tensor<256x64xbf16>) -> tensor<64x256xbf16>
    %50 = tt.splat %arg2 : (!tt.ptr<bf16>) -> tensor<128x256x!tt.ptr<bf16>>
    %51 = tt.addptr %50, %26 : tensor<128x256x!tt.ptr<bf16>>, tensor<128x256xi32>
    %52 = tt.load %51 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128x256xbf16>
    %60 = tt.dot %32, %43, %52 {allowTF32 = false, maxNumImpreciseAcc = 0 : i32} : tensor<128x64xbf16> * tensor<64x256xbf16> -> tensor<128x256xbf16>
    tt.store %51, %60 : tensor<128x256xbf16>
    tt.return
  }
}

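The MLIR above is actually easy to read. Everything from %0 onward is Triton dialect content, i.e. what the upper-level Triton DSL is lowered into. The tt prefix indicates that an operation belongs to the Triton dialect, and tt.xxx names the operations that the dialect supports (the arith.xxx operations come from MLIR's arith dialect). How to define an MLIR dialect is something I plan to cover in a separate tutorial.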
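To make the semantics of the kernel concrete, here is a minimal pure-Python sketch of what dot.mlir computes, using tiny stand-in shapes: the second operand is loaded and transposed (tt.trans), and tt.dot then multiplies the two tiles and adds the third loaded tile as an accumulator. The helper names and the small shapes are illustrative assumptions, and bf16 rounding is ignored.

```python
def transpose(matrix):
    """Transpose a row-major nested list (the role of tt.trans)."""
    return [list(column) for column in zip(*matrix)]


def dot_accumulate(a, b, c):
    """Return a @ b + c for nested lists (the role of tt.dot with an accumulator)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    assert len(c) == rows and all(len(row) == cols for row in c)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j] for j in range(cols)]
        for i in range(rows)
    ]


# Stand-ins for the 128x64, 256x64 and 128x256 tiles of the real kernel.
a_tile = [[1, 2], [3, 4], [5, 6]]          # 3x2
b_tile = [[1, 0], [0, 1], [1, 1], [2, 2]]  # 4x2, transposed to 2x4 before the dot
c_tile = [[10] * 4 for _ in range(3)]      # 3x4 accumulator

result = dot_accumulate(a_tile, transpose(b_tile), c_tile)

if __name__ == "__main__":
    for row in result:
        print(row)
```

The result has the shape of the accumulator tile, which is also the tile that the kernel stores back through the third pointer.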

Next, simply type the following in the terminal

./triton-shared-opt --triton-to-linalg /triton/third_party/triton_shared/test/Conversion/TritonToLinalg/dot.mlir

to get the corresponding output after converting from the Triton dialect to the Linalg dialect:

#map = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @kernel(%arg0: memref<*xbf16>, %arg1: memref<*xbf16>, %arg2: memref<*xbf16>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32) {
    %c256 = arith.constant 256 : index
    %c128 = arith.constant 128 : index
    %reinterpret_cast = memref.reinterpret_cast %arg0 to offset: [0], sizes: [128, 64], strides: [%c128, 1] : memref<*xbf16> to memref<128x64xbf16, strided<[?, 1]>>
    %alloc = memref.alloc() : memref<128x64xbf16>
    memref.copy %reinterpret_cast, %alloc : memref<128x64xbf16, strided<[?, 1]>> to memref<128x64xbf16>
    %0 = bufferization.to_tensor %alloc restrict writable : memref<128x64xbf16>
    %reinterpret_cast_0 = memref.reinterpret_cast %arg1 to offset: [0], sizes: [256, 64], strides: [1, %c256] : memref<*xbf16> to memref<256x64xbf16, strided<[1, ?]>>
    %alloc_1 = memref.alloc() : memref<256x64xbf16>
    memref.copy %reinterpret_cast_0, %alloc_1 : memref<256x64xbf16, strided<[1, ?]>> to memref<256x64xbf16>
    %1 = bufferization.to_tensor %alloc_1 restrict writable : memref<256x64xbf16>
    %2 = tensor.empty() : tensor<64x256xbf16>
    %transposed = linalg.transpose ins(%1 : tensor<256x64xbf16>) outs(%2 : tensor<64x256xbf16>) permutation = [1, 0]
    %reinterpret_cast_2 = memref.reinterpret_cast %arg2 to offset: [0], sizes: [128, 256], strides: [%c256, 1] : memref<*xbf16> to memref<128x256xbf16, strided<[?, 1]>>
    %alloc_3 = memref.alloc() : memref<128x256xbf16>
    memref.copy %reinterpret_cast_2, %alloc_3 : memref<128x256xbf16, strided<[?, 1]>> to memref<128x256xbf16>
    %3 = bufferization.to_tensor %alloc_3 restrict writable : memref<128x256xbf16>
    %4 = tensor.empty() : tensor<128x256xbf16>
    %5 = linalg.matmul ins(%0, %transposed : tensor<128x64xbf16>, tensor<64x256xbf16>) outs(%4 : tensor<128x256xbf16>) -> tensor<128x256xbf16>
    %6 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%5, %3 : tensor<128x256xbf16>, tensor<128x256xbf16>) outs(%5 : tensor<128x256xbf16>) {
    ^bb0(%in: bf16, %in_4: bf16, %out: bf16):
      %7 = arith.addf %in, %in_4 : bf16
      linalg.yield %7 : bf16
    } -> tensor<128x256xbf16>
    memref.tensor_store %6, %reinterpret_cast_2 : memref<128x256xbf16, strided<[?, 1]>>
    return
  }
}
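One detail worth checking by hand is where the strides in the three memref.reinterpret_cast operations come from. The pointer offsets in dot.mlir (%8, %18 and %26) are affine in the row and column index, so they can be read as a base offset plus two strides. The sketch below redoes that arithmetic in plain Python; it is only a numeric check with hypothetical helper names, not a description of how triton-shared implements its pointer analysis.

```python
def offset_a(i, j):
    """%8: make_range(0, 128) * 128 on axis 0 plus make_range(0, 64) on axis 1."""
    return i * 128 + j


def offset_b(i, j):
    """%18: make_range(0, 256) on axis 0 plus make_range(0, 64) * 256 on axis 1."""
    return i + j * 256


def offset_c(i, j):
    """%26: make_range(0, 128) * 256 on axis 0 plus make_range(0, 256) on axis 1."""
    return i * 256 + j


def infer_layout(offset_fn):
    """Return (base, row_stride, col_stride) of an affine offset function."""
    base = offset_fn(0, 0)
    return base, offset_fn(1, 0) - base, offset_fn(0, 1) - base


def matches_layout(offset_fn, shape):
    """Check that base + i * row_stride + j * col_stride reproduces every offset."""
    base, row_stride, col_stride = infer_layout(offset_fn)
    rows, cols = shape
    return all(
        offset_fn(i, j) == base + i * row_stride + j * col_stride
        for i in range(rows)
        for j in range(cols)
    )


if __name__ == "__main__":
    for name, fn in (("A", offset_a), ("B", offset_b), ("C", offset_c)):
        print(name, infer_layout(fn))
```

The recovered triples are (0, 128, 1), (0, 1, 256) and (0, 256, 1), which line up with `offset: [0]` and the strides `[%c128, 1]`, `[1, %c256]` and `[%c256, 1]` in the lowered output.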

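Other, more specific operators can all be handled by following the same procedure. If your compiler framework is built on MLIR and can be lowered cleanly to Linalg, then plugging in your own backend and adapting to a given ISA later becomes considerably easier. From another angle, this also shows why the current trend is to restructure one's compiler on top of MLIR: the most important reason is to connect to various software and hardware ecosystems at minimal development cost.

Afterword

I have been studying Triton for a while now. I learned it the hard way, working through the source code step by step, and there are no particularly good Chinese tutorials for Triton. So I will use my spare time to write up in detail the optimization methods I currently know for doing codegen with Triton (passes for different backends and at different IR levels) and the details (the design of the underlying layouts), to help more people who want to use Triton for codegen.

Reviewing editor: 湯梓紅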

Original title: OpenAI/Triton MLIR Chapter 3: Unboxing triton-shared

Source: WeChat official account GiantPandaCV (WeChat ID: GiantPandaCV). Please credit the source when reposting.
