@chichunchen
Forked from mrprajesh/LearnNvprof.md
Created March 17, 2022 21:43

Revisions

  1. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -23,9 +23,11 @@ nvprof --query-metrics
    ```

    1. How to collect all metrics?

    ```nvprof --metrics all ./executable```

    2. How to collect a specific metric, say DRAM reads?

    ```nvprof --metrics dram_read_transactions ./executable```
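
Several metrics can also be collected in one run, because `--metrics` accepts a comma-separated list of names. The following is a minimal sketch, assuming `./executable` is a placeholder for your own CUDA binary; `join_metrics` is a hypothetical helper written for this example, not part of nvprof.

```shell
# join_metrics is a hypothetical helper: it joins its arguments with commas,
# the separator that --metrics expects between metric names.
join_metrics() (
  IFS=,
  echo "$*"
)

METRICS=$(join_metrics dram_read_transactions dram_write_transactions)

# Run the profiler only when both nvprof and the placeholder binary exist;
# otherwise just print the command that would be run.
if command -v nvprof >/dev/null 2>&1 && [ -x ./executable ]; then
  nvprof --metrics "$METRICS" ./executable
else
  echo "nvprof --metrics $METRICS ./executable"
fi
```

Both metric names are taken from the list of available metrics in this gist; which ones are supported depends on the device.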


  2. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 0 additions and 2 deletions.
    2 changes: 0 additions & 2 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -20,14 +20,12 @@ In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/

    ```
    nvprof --query-metrics
    ```

    1. How to collect all metrics?
    ```nvprof --metrics all ./executable```

    2. How to collect a specific metric, say DRAM reads?

    ```nvprof --metrics dram_read_transactions ./executable```


  3. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,5 +1,7 @@
    # nvprof - NVIDIA Profiler
    How to use it?
    nvprof is NVIDIA's command-line profiler. It profiles any executable, including CUDA programs.

    ### How to use it?
    ```
    nvprof ./executable
    ```
  4. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -12,7 +12,7 @@ In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/
    3. Time taken to DtoH and HtoD
    4. ..

    ## List of metrics available
    ## How to get all/certain metrics from nvprof

    0. How to find all the metrics available for the device? It is a long list; see the end of the file.

  5. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 5 additions and 3 deletions.
    8 changes: 5 additions & 3 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,8 @@
    # nvprof - NVIDIA Profiler
    How to use it?
    ``` nvprof ./executable```
    ```
    nvprof ./executable
    ```

    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-<MAJOR.minor>/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.
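
When `nvprof` is not on `PATH`, a script can fall back to the default install location mentioned above. This is only a sketch: `find_nvprof` is a hypothetical helper, `./executable` is a placeholder, and a versioned directory such as `/usr/local/cuda-<MAJOR.minor>` would have to be filled in by hand.

```shell
# find_nvprof is a hypothetical helper: it prints the first nvprof it finds,
# looking on PATH first and then in the default CUDA location.
find_nvprof() {
  if command -v nvprof >/dev/null 2>&1; then
    command -v nvprof
  elif [ -x /usr/local/cuda/bin/nvprof ]; then
    echo /usr/local/cuda/bin/nvprof
  else
    echo "nvprof not found" >&2
    return 1
  fi
}

# Leave NVPROF empty when nothing is found, so the script can continue.
NVPROF=$(find_nvprof) || NVPROF=""

# Profile only when the profiler and the placeholder binary both exist.
if [ -n "$NVPROF" ] && [ -x ./executable ]; then
  "$NVPROF" ./executable
fi
```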

    @@ -20,11 +22,11 @@ nvprof --query-metrics
    ```

    1. How to collect all metrics?
    ```nvprof --metrics all ```
    ```nvprof --metrics all ./executable```

    2. How to collect a specific metric, say DRAM reads?

    ```nvprof --metrics dram_read_transactions ```
    ```nvprof --metrics dram_read_transactions ./executable```


    ### List
  6. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,7 @@
    # nvprof - NVIDIA Profiler
    How to use it? ``` nvprof ./executable```
    `
    How to use it?
    ``` nvprof ./executable```

    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-<MAJOR.minor>/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.

    ### What all you can see
  7. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,5 +1,5 @@
    # nvprof - NVIDIA Profiler
    How to use it? ` nvprof ./executable`
    How to use it? ``` nvprof ./executable```
    `
    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-<MAJOR.minor>/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.

  8. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 3 additions and 5 deletions.
    8 changes: 3 additions & 5 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,11 +1,9 @@
    # nvprof - NVIDIA Profiler
    How to use it?
    ```
    nvprof ./executable
    ```
    How to use it? ` nvprof ./executable`
    `
    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-<MAJOR.minor>/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.

    ### What are you can see
    ### What all you can see
    1. Number of times kernel is invoked
    2. Kernel execution time
    3. Time taken to DtoH and HtoD
  9. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -3,7 +3,7 @@ How to use it?
    ```
    nvprof ./executable
    ```
    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-`MAJOR.minor`/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.
    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-<MAJOR.minor>/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.

    ### What are you can see
    1. Number of times kernel is invoked
  10. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 3 additions and 6 deletions.
    9 changes: 3 additions & 6 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,12 +1,9 @@
    # nvprof - NVIDIA Profiler

    How to use
    How to use it?
    ```
    nvprof ./executable
    ```
    ```
    /usr/local/cuda/bin/nvprof ./executable
    ```
    If you need the absolute path, use ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-`MAJOR.minor`/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.

    ### What are you can see
    1. Number of times kernel is invoked
    @@ -16,7 +13,7 @@ nvprof ./executable

    ## List of metrics available

    0. How to find all the metrics available for the device?
    0. How to find all the metrics available for the device? It is a long list; see the end of the file.

    ```
    nvprof --query-metrics
  11. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 5 additions and 2 deletions.
    7 changes: 5 additions & 2 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -1,10 +1,13 @@
    # nvprof - NVIDIA Profiler (Profiling CUDA Programs)
    Learn - How to profile using nvprof
    # nvprof - NVIDIA Profiler

    How to use
    ```
    nvprof ./executable
    ```
    ```
    /usr/local/cuda/bin/nvprof ./executable
    ```

    ### What are you can see
    1. Number of times kernel is invoked
    2. Kernel execution time
  12. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 123 additions and 1 deletion.
    124 changes: 123 additions & 1 deletion LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -11,7 +11,7 @@ nvprof ./executable
    3. Time taken to DtoH and HtoD
    4. ..

    ## List off metrics available
    ## List of metrics available

    0. How to find all the metrics available for the device?

    @@ -26,3 +26,125 @@ nvprof --query-metrics
    2. How to collect a specific metric, say DRAM reads?

    ```nvprof --metrics dram_read_transactions ```


    ### List
    - Available metrics (name: description):
    - inst_per_warp: Average number of instructions executed by each warp
    - branch_efficiency: Ratio of non-divergent branches to total branches
    - warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
    - warp_nonpred_execution_efficiency: Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
    - inst_replay_overhead: Average number of replays for each instruction executed
    - shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
    - shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
    - local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
    - local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
    - gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load.
    - gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
    - shared_store_transactions: Number of shared memory store transactions
    - shared_load_transactions: Number of shared memory load transactions
    - local_load_transactions: Number of local memory load transactions
    - local_store_transactions: Number of local memory store transactions
    - gld_transactions: Number of global memory load transactions
    - gst_transactions: Number of global memory store transactions
    - sysmem_read_transactions: Number of system memory read transactions
    - sysmem_write_transactions: Number of system memory write transactions
    - l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
    - l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
    - dram_read_transactions: Device memory read transactions
    - dram_write_transactions: Device memory write transactions
    - global_hit_rate: Hit rate for global loads in unified l1/tex cache
    - local_hit_rate: Hit rate for local loads and stores
    - gld_requested_throughput: Requested global memory load throughput
    - gst_requested_throughput: Requested global memory store throughput
    - gld_throughput: Global memory load throughput
    - gst_throughput: Global memory store throughput
    - local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
    - tex_cache_hit_rate: Unified cache hit rate
    - l2_tex_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
    - l2_tex_write_hit_rate: Hit Rate at L2 cache for all write requests from texture cache
    - dram_read_throughput: Device memory read throughput
    - dram_write_throughput: Device memory write throughput
    - tex_cache_throughput: Unified cache throughput
    - l2_tex_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
    - l2_tex_write_throughput: Memory write throughput seen at L2 cache for write requests from the texture cache
    - l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
    - l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
    - sysmem_read_throughput: System memory read throughput
    - sysmem_write_throughput: System memory write throughput
    - local_load_throughput: Local memory load throughput
    - local_store_throughput: Local memory store throughput
    - shared_load_throughput: Shared memory load throughput
    - shared_store_throughput: Shared memory store throughput
    - gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput.
    - gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput.
    - tex_cache_transactions: Unified cache read transactions
    - flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
    - flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads.
    - flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
    - flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads.
    - flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
    - flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads.
    - flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
    - flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads.
    - flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads.
    - inst_executed: The number of instructions executed
    - inst_issued: The number of instructions issued
    - dram_utilization: The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
    - sysmem_utilization: The utilization level of the system memory relative to the peak utilization
    - stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
    - stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
    - stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
    - stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
    - stall_sync: Percentage of stalls occurring because the warp is blocked at a `__syncthreads()` call
    - stall_other: Percentage of stalls occurring due to miscellaneous reasons
    - stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
    - stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
    - shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
    - inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
    - inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
    - inst_integer: Number of integer instructions executed by non-predicated threads
    - inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
    - inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
    - inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
    - inst_misc: Number of miscellaneous instructions executed by non-predicated threads
    - inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
    - issue_slots: The number of issue slots used
    - cf_issued: Number of issued control-flow instructions
    - cf_executed: Number of executed control-flow instructions
    - ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
    - ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
    - atomic_transactions: Global memory atomic and reduction transactions
    - atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
    - l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests
    - l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests
    - l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache
    - stall_memory_throttle: Percentage of stalls occurring because of memory throttle
    - stall_not_selected: Percentage of stalls occurring because warp was not selected
    - l2_tex_write_transactions: Memory write transactions seen at L2 cache for write requests from the texture cache
    - flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
    - flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads.
    - flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads.
    - flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
    - inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
    - ipc: Instructions executed per cycle
    - issued_ipc: Instructions issued per cycle
    - issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
    - sm_efficiency: The percentage of time at least one warp is active on a specific multiprocessor
    - achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
    - eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
    - shared_utilization: The utilization level of the shared memory relative to peak utilization
    - l2_utilization: The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
    - tex_utilization: The utilization level of the unified cache relative to the peak utilization
    - ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
    - cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
    - tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
    - special_fu_utilization: The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
    - half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
    - single_precision_fu_utilization: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
    - double_precision_fu_utilization: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
    - flop_hp_efficiency: Ratio of achieved to peak half-precision floating-point operations
    - flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
    - flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
    - sysmem_read_utilization: The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
    - sysmem_write_utilization: The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
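
Because the list is long, it can help to filter it by keyword before choosing what to collect. The sketch below assumes a POSIX shell with `grep`; `filter_metrics` is a hypothetical helper, and the three sample lines are copied from the list above so the example also works on a machine without the CUDA toolkit.

```shell
# filter_metrics is a hypothetical helper: it reads a metric listing on stdin
# and keeps the lines that contain the given keyword (case-insensitive).
filter_metrics() {
  grep -i -- "$1"
}

# Three lines from the list above stand in for real profiler output.
SAMPLE='dram_read_transactions: Device memory read transactions
dram_write_transactions: Device memory write transactions
ipc: Instructions executed per cycle'

DRAM_LINES=$(printf '%s\n' "$SAMPLE" | filter_metrics dram_)

# With nvprof installed, filter the real list for the current device instead.
if command -v nvprof >/dev/null 2>&1; then
  nvprof --query-metrics | filter_metrics dram_
else
  printf '%s\n' "$DRAM_LINES"
fi
```

The same helper works for any other keyword in the list, for example `stall_` or `_efficiency`.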
  13. @mrprajesh mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -6,7 +6,7 @@ How to use
    nvprof ./executable
    ```
    ### What are you can see
    1. # times kernel is invoke
    1. Number of times kernel is invoked
    2. Kernel execution time
    3. Time taken to DtoH and HtoD
    4. ..
  14. @mrprajesh mrprajesh created this gist Jul 29, 2019.
    28 changes: 28 additions & 0 deletions LearnNvprof.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,28 @@
    # nvprof - NVIDIA Profiler (Profiling CUDA Programs)
    Learn - How to profile using nvprof

    How to use
    ```
    nvprof ./executable
    ```
    ### What are you can see
    1. # times kernel is invoke
    2. Kernel execution time
    3. Time taken to DtoH and HtoD
    4. ..

    ## List off metrics available

    0. How to find all the metrics available for the device?

    ```
    nvprof --query-metrics
    ```

    1. How to collect all metrics?
    ```nvprof --metrics all ```

    2. How to collect a specific metric, say DRAM reads?

    ```nvprof --metrics dram_read_transactions ```