Notes of Stable Cascaded Shadow Maps

PD
- PD (Primitive Distributor) does index-buffer loads and distributes primitives across the chip.
VAF
- VAF (Vertex Attribute Fetch) does vertex-buffer loads (before the vertex shader gets launched).
SM
- SM (Streaming Multiprocessor) runs the shaders.
TEX
- TEX performs SRV fetches (and UAV accesses, since Maxwell).
VPC
- VPC (Viewport Culling) does the viewport transform, frustum culling, and perspective correction of attributes.
L2
- L2 is the Level-2 cache attached to each VRAM memory partition.
CROP
- CROP does color writes & blending to render targets.
ZROP
- ZROP does depth-stencil testing.
VRAM
- VRAM (“Memory” in the Range Diagram) is the GPU video memory.

The units are arranged as shown in the figure below.

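4.2. The “Top SOL Units”

Case 1: Top SOL% > 80%

Above 80%, the top unit is running very efficiently, close to its maximum throughput.
In this case, try to move work off the top SOL unit onto other units.
- If the top unit is the SM, skip instructions where possible or move computations to lookup tables.
- When texture throughput is the limiter, use constant buffers (cbuffer) instead of structured buffers (StructuredBuffer).
  - Structured buffers are loaded through the TEX unit, so they eat into texture throughput.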

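Case 2: Top SOL% < 60%

The units are under-utilized and sitting idle.
Which of the following is the cause?
- The application is CPU limited.
- Wait For Idle commands or Graphics <-> Compute switches are draining the GPU pipeline.
- TEX throughput is poor because of the texture format, dimensionality, or filter mode.
  - Example: a TEX SOL% of 50% is expected when sampling a 3D texture with trilinear filtering.
- The memory subsystem is inefficient.
- TEX or L2 hit rates are low.
- VRAM accesses are sparse.
- VB/IB/CB/TEX fetches go to system memory instead of GPU VRAM.
- Input assembly is fetching a 32-bit index buffer (half the rate of 16-bit indices).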
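To illustrate the index-buffer point, here is a minimal sketch that picks an index width from the vertex count. `IndexFormat` and `chooseIndexFormat` are hypothetical names, not part of any API; in Direct3D the two values would correspond to DXGI_FORMAT_R16_UINT and DXGI_FORMAT_R32_UINT. The sketch assumes index 0xFFFF should stay unused, since Direct3D can treat it as the strip-cut value.

```cpp
#include <cstdint>

// Hypothetical enum; maps to DXGI_FORMAT_R16_UINT / DXGI_FORMAT_R32_UINT in Direct3D.
enum class IndexFormat { UInt16, UInt32 };

// Indices run from 0 to vertexCount - 1. Allowing at most 0xFFFF vertices
// keeps the largest index at 0xFFFE, so 0xFFFF is never used as a real index
// (assumption: it may be interpreted as the strip-cut value).
constexpr IndexFormat chooseIndexFormat(std::uint32_t vertexCount) {
    return vertexCount <= 0xFFFFu ? IndexFormat::UInt16 : IndexFormat::UInt32;
}
```

Meshes that exceed the limit can often be split into several draws so that each part still fits 16-bit indices.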

Case 3: Top SOL% in [60, 80]

Apply the approaches for both Case 1 and Case 2 above.
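The three cases can be written down as a small lookup, which is handy when post-processing exported counter values. This is only a sketch: `SolCase`, `classifyTopSol`, and `advice` are hypothetical names, and the thresholds are the 60% and 80% boundaries from the cases above.

```cpp
// Hypothetical labels for the three cases.
enum class SolCase { NearPeak, UnderUtilized, Mixed };

// Classify a top SOL% value (0..100) into one of the three cases.
constexpr SolCase classifyTopSol(double topSolPercent) {
    if (topSolPercent > 80.0) return SolCase::NearPeak;       // Case 1
    if (topSolPercent < 60.0) return SolCase::UnderUtilized;  // Case 2
    return SolCase::Mixed;                                    // Case 3: [60, 80]
}

// One-line reminder of what to try for each case.
constexpr const char* advice(SolCase c) {
    switch (c) {
        case SolCase::NearPeak:      return "move work off the top SOL unit";
        case SolCase::UnderUtilized: return "find what keeps the units idle or stalled";
        case SolCase::Mixed:         return "apply both approaches";
    }
    return "";
}
```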

4.3. Secondary SOL Units and TEX & L2 Hit Rates

The hardware units depend on one another, so look at the second-highest SOL units as well.

TEX (L1) and L2 hit rates

90% or higher: great
70-90%: very good
Below 70%: bad, and a likely cause of lost performance

Step 5: Understand the Performance Limiters

5.1. If the Top SOL% is Low

A low SOL% can have several causes.
Start by looking at “Graphics/Compute Idle%” and “SM Active%”.

5.1.1. The “Graphics/Compute Idle%” metric

This is the percentage of the GPU elapsed cycles during which the whole Graphics & Compute hardware pipeline was fully idle for the current workload.
It becomes high for one of two reasons:
- (A) The CPU is not feeding commands to the GPU fast enough.
- (B) The application is using the synchronous Copy Engine.
  - This happens when Copy calls are made on a Direct queue or the Immediate Context.
Pipeline drains caused by Wait For Idle commands are not counted in “Graphics/Compute Idle%”.

Measuring the CPU time spent in the following calls is recommended.
DX11
- Flush{,1}, Map, UpdateSubresource{,1}
DX12
- Wait, ExecuteCommandLists
DX11 and DX12
- Any Create or Release calls
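One way to measure these calls is to wrap each of them in a wall-clock timer. The following is a minimal sketch, assuming std::chrono::steady_clock is precise enough for the purpose. `timeCallMs` is a hypothetical helper, and `context` in the usage comment stands for an already created ID3D11DeviceContext pointer.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `call` and returns how long it blocked the
// calling CPU thread, in milliseconds.
template <typename Fn>
double timeCallMs(Fn&& call) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(call)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage (assumes `context` is a valid ID3D11DeviceContext*):
//   double ms = timeCallMs([&] { context->Flush(); });
```

Logging the result per frame makes occasional long stalls easy to spot, which an average would hide.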

DX11

- ID3D11DeviceContext::Flush forces a command-buffer kickoff, which may require the Flush() call to stall on the CPU.
- Calling ID3D11DeviceContext::Map on a STAGING resource can cause a CPU stall due to resource contention, when mapping the same staging resource in consecutive frames.
  - In this case, the Map call in the current frame must wait internally until the previous frame (which is using the same resource) has been processed before returning.
- Calling ID3D11DeviceContext::Map with D3D11_MAP_WRITE_DISCARD can cause a CPU stall due to the driver running out of versioning space.
  - That is because each time a Map(WRITE_DISCARD) call is performed, the driver returns a new pointer to a fixed-size memory pool.
  - If the driver runs out of versioning space, the Map call stalls.
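A common way to avoid the staging-resource contention (not spelled out in the notes above) is to rotate through several staging resources, so that the CPU never maps the one the GPU used in the previous frame. The sketch below shows only the indexing. `StagingRing` is a hypothetical class, and `Resource` stands in for something like an ID3D11Texture2D pointer created with D3D11_USAGE_STAGING.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical ring of N staging resources, indexed by frame number.
template <typename Resource, std::size_t N>
class StagingRing {
    static_assert(N >= 2, "need at least two resources to avoid same-frame reuse");

public:
    explicit StagingRing(std::array<Resource, N> resources)
        : resources_(std::move(resources)) {}

    // Resource the GPU copies into during `frame`.
    Resource& writeTarget(std::uint64_t frame) { return resources_[frame % N]; }

    // Resource last written N - 1 frames ago; this is the one to Map on the CPU.
    // During the first N - 1 frames it has not been written yet.
    Resource& readTarget(std::uint64_t frame) { return resources_[(frame + 1) % N]; }

private:
    std::array<Resource, N> resources_;
};
```

The cost is N - 1 frames of latency on the data that is read back, so N should be no larger than the number of frames in flight requires.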

DX12

- Each ExecuteCommandLists (ECL) call has some GPU idle overhead associated with it, for kicking off a new command buffer.
- So, to reduce GPU idle time, batch all command lists into as few ECL calls as possible, unless you really want command-buffer kickoffs to happen at certain points in the frame (for example, to reduce input latency in VR apps with a single frame in flight).
- When an application calls ID3D12CommandQueue::Wait on a fence, the OS (Windows 10) holds off submitting new command buffers to the GPU for that command queue until the Wait call returns.
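The batching recommendation for ExecuteCommandLists can be sketched as collecting command lists during the frame and submitting them in one call. `CommandList` and `CommandQueue` below are stand-in types so the sketch compiles without the Windows SDK; they only mimic the shape of ID3D12CommandList and ID3D12CommandQueue::ExecuteCommandLists(count, lists). `SubmissionBatch` is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for ID3D12CommandList (assumption, not the real type).
struct CommandList {};

// Stand-in for ID3D12CommandQueue that counts submissions.
struct CommandQueue {
    std::size_t submitCalls = 0;
    std::size_t listsSubmitted = 0;

    void ExecuteCommandLists(unsigned count, CommandList* const* lists) {
        (void)lists;  // a real queue would hand these to the GPU
        ++submitCalls;
        listsSubmitted += count;
    }
};

// Hypothetical helper: gathers closed command lists and submits them together.
class SubmissionBatch {
public:
    void add(CommandList* list) { pending_.push_back(list); }

    // Submits everything gathered so far with a single ECL call.
    void flush(CommandQueue& queue) {
        if (pending_.empty()) return;
        queue.ExecuteCommandLists(static_cast<unsigned>(pending_.size()),
                                  pending_.data());
        pending_.clear();
    }

private:
    std::vector<CommandList*> pending_;
};
```

Calling `flush` once at the end of the frame gives a single kickoff; calling it earlier at chosen points covers the latency-sensitive case mentioned above.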

Gregory Igehy

Dancing at hemisphere coordinate

The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload

Step 4: Inspecting the Top SOLs & Cache Hit Rates

4.1. The Per-Unit SOL% Metrics

4.2. The “Top SOL Units”

Case 1: Top SOL% > 80%

Case 2: Top SOL% < 60%

Case 3: Top SOL% in [60, 80]

4.3. Secondary SOL Units and TEX & L2 Hit Rates

TEX (L1) and L2 hit rates

Step 5: Understand the Performance Limiters

5.1. If the Top SOL% is Low

5.1.1. The “Graphics/Compute Idle%” metric

DX11

DX12

5.1.2. The “SM Active%” metric

5.1.3. GPU Trace

5.2. If the Top SOL Unit is the SM

5.2.1. Case 1: “SM Throughput For Active Cycles” > 80%

5.2.2. Case 2: “SM Throughput For Active Cycles” < 60%

5.2.3. Case 3: SM Throughput For Active Cycles % in [60,80]

5.3. If the Top SOL unit is not the SM

5.3.1. If the Top SOL unit is TEX, L2, or VRAM

5.3.2. If the Top SOL unit is CROP or ZROP

5.3.3. If the Top SOL unit is PD

5.3.4. If the Top SOL unit is VAF