If you need the absolute path, it is ``/usr/local/cuda/bin/nvprof`` or ``/usr/local/cuda-<MAJOR.minor>/bin/nvprof``, where `MAJOR.minor` is your installed CUDA version.
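The two locations above can be wrapped in a small shell helper. This is only a sketch: `nvprof_path` is a hypothetical function name, and the version `10.1` in the usage comment is illustrative, not taken from the gist.

```shell
# Hypothetical helper: print the nvprof path for a given CUDA version.
# With no argument it prints the unversioned /usr/local/cuda location.
nvprof_path() {
  if [ -n "$1" ]; then
    printf '/usr/local/cuda-%s/bin/nvprof\n' "$1"
  else
    printf '/usr/local/cuda/bin/nvprof\n'
  fi
}

# Usage (assumes CUDA 10.1 is installed under /usr/local):
#   "$(nvprof_path 10.1)" ./executable
```

The helper only builds the string; it does not check that the binary exists.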
@@ -20,11 +22,11 @@ nvprof --query-metrics
```
1. How to query all metrics?
```nvprof --metrics all ```
```nvprof --metrics all ./executable```
2. How to query a specific metric, say DRAM reads?
### What all you can see
mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.
mrprajesh revised this gist Jul 29, 2019. 1 changed file with 3 additions and 5 deletions.
### What are you can see
### What all you can see
1. Number of times kernel is invoked
2. Kernel execution time
3. Time taken by DtoH (device-to-host) and HtoD (host-to-device) memory copies
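To pull just the transfer rows out of the profile summary, a filter along these lines may help. It is a sketch: `memcpy_lines` is a hypothetical name, and it assumes the summary labels the copies as `[CUDA memcpy HtoD]` and `[CUDA memcpy DtoH]` and that nvprof writes its report to stderr, so check both against your own output.

```shell
# Hypothetical filter: keep only the host<->device memcpy rows
# from an nvprof summary read on standard input.
memcpy_lines() {
  grep -E '\[CUDA memcpy (HtoD|DtoH)\]'
}

# Usage (the report is assumed to arrive on stderr, hence 2>&1):
#   nvprof ./executable 2>&1 | memcpy_lines
```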
mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.
If you need the absolute path: ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-`MAJOR.minor`/bin/nvprof`` where `MAJOR.minor` is your installed CUDA version.
If you need the absolute path: ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof`` where `MAJOR.minor` is your installed CUDA version.
mrprajesh revised this gist Jul 29, 2019. 1 changed file with 3 additions and 6 deletions.
@@ -16,7 +13,7 @@ nvprof ./executable
## List of metrics available
0. How to find all the metrics available for the device?
0. How to find all the metrics available for the device? It is a long list; see the end of the file.
```
nvprof --query-metrics
mrprajesh revised this gist Jul 29, 2019. 1 changed file with 5 additions and 2 deletions.
@@ -26,3 +26,125 @@ nvprof --query-metrics
2. How to query a specific metric, say DRAM reads?
```nvprof --metrics dram_read_transactions ```
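Several metrics can be requested in one run by passing `--metrics` a comma-separated list. The sketch below builds that list from separate names; `join_metrics` is a hypothetical helper, and the two metric names are taken from the list in this gist.

```shell
# Hypothetical helper: join metric names with commas for --metrics.
# The subshell keeps the IFS change from leaking into the caller.
join_metrics() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Usage:
#   nvprof --metrics "$(join_metrics gld_transactions gst_transactions)" ./executable
```

Collecting metrics can replay kernels, so requesting only the few you need is usually quicker than `--metrics all`.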
### List
- Available Metrics: Name Description
- inst_per_warp: Average number of instructions executed by each warp
- branch_efficiency: Ratio of non-divergent branches to total branches
- warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
- warp_nonpred_execution_efficiency: Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
- inst_replay_overhead: Average number of replays for each instruction executed
- shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
- shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
- local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
- local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
- gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load.
- gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
- shared_store_transactions: Number of shared memory store transactions
- shared_load_transactions: Number of shared memory load transactions
- local_load_transactions: Number of local memory load transactions
- local_store_transactions: Number of local memory store transactions
- gld_transactions: Number of global memory load transactions
- gst_transactions: Number of global memory store transactions
- sysmem_read_transactions: Number of system memory read transactions
- sysmem_write_transactions: Number of system memory write transactions
- l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
- l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
- flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads.
- flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
- flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads.
- flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads.
- inst_executed: The number of instructions executed
- inst_issued: The number of instructions issued
- dram_utilization: The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
- sysmem_utilization: The utilization level of the system memory relative to the peak utilization
- stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
- stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
- stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
- stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
- stall_sync: Percentage of stalls occurring because the warp is blocked at a `__syncthreads()` call
- stall_other: Percentage of stalls occurring due to miscellaneous reasons
- stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
- stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
- shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
- inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_integer: Number of integer instructions executed by non-predicated threads
- inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
- inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
- inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
- inst_misc: Number of miscellaneous instructions executed by non-predicated threads
- inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
- issue_slots: The number of issue slots used
- cf_issued: Number of issued control-flow instructions
- cf_executed: Number of executed control-flow instructions
- ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
- ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
- atomic_transactions: Global memory atomic and reduction transactions
- atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
- l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests
- l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests
- l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache
- stall_memory_throttle: Percentage of stalls occurring because of memory throttle
- stall_not_selected: Percentage of stalls occurring because warp was not selected
- l2_tex_write_transactions: Memory write transactions seen at L2 cache for write requests from the texture cache
- flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads.
- flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- ipc: Instructions executed per cycle
- issued_ipc: Instructions issued per cycle
- issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
- sm_efficiency: The percentage of time at least one warp is active on a specific multiprocessor
- achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
- eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
- shared_utilization: The utilization level of the shared memory relative to peak utilization
- l2_utilization: The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
- tex_utilization: The utilization level of the unified cache relative to the peak utilization
- ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
- cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
- tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
- special_fu_utilization: The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
- half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
- single_precision_fu_utilization: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
- double_precision_fu_utilization: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
- flop_hp_efficiency: Ratio of achieved to peak half-precision floating-point operations
- flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
- flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
- sysmem_read_utilization: The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
- sysmem_write_utilization: The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
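The `flop_count_dp*` descriptions above imply a simple relation: each multiply-accumulate counts once in `flop_count_dp_fma` but twice in `flop_count_dp`. Assuming the total is exactly the sum of those three components, it can be checked with a one-line calculation; `dp_flops` is a hypothetical helper name.

```shell
# Hypothetical helper: total double-precision FLOPs from the three
# component counters, counting each FMA as 2 operations.
# Arguments: flop_count_dp_add flop_count_dp_mul flop_count_dp_fma
dp_flops() {
  echo $(( $1 + $2 + 2 * $3 ))
}

# Usage (illustrative numbers): 10 adds, 20 multiplies, 5 FMAs
#   dp_flops 10 20 5
```

The same relation is stated for the single-precision and half-precision counters.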
mrprajesh revised this gist Jul 29, 2019. 1 changed file with 1 addition and 1 deletion.