# nvprof - NVIDIA Profiler (Profiling CUDA Programs)

Learn
- How to profile using nvprof

How to use
```
nvprof ./executable
```

### What you can see
1. Number of times each kernel is invoked
2. Kernel execution time
3. Time taken by DtoH (device-to-host) and HtoD (host-to-device) transfers
4. ..

## List of metrics available
0. How to find all the metrics available for the device?
   ```
   nvprof --query-metrics
   ```
1. How to query all metrics?
   ```
   nvprof --metrics all ./executable
   ```
2. How to query a specific metric, say DRAM reads?
   ```
   nvprof --metrics dram_read_transactions ./executable
   ```

### List - Available Metrics

Name: Description
- inst_per_warp: Average number of instructions executed by each warp
- branch_efficiency: Ratio of non-divergent branches to total branches
- warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
- warp_nonpred_execution_efficiency: Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
- inst_replay_overhead: Average number of replays for each instruction executed
- shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
- shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
- local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
- local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
- gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load.
- gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
- shared_store_transactions: Number of shared memory store transactions
- shared_load_transactions: Number of shared memory load transactions
- local_load_transactions: Number of local memory load transactions
- local_store_transactions: Number of local memory store transactions
- gld_transactions: Number of global memory load transactions
- gst_transactions: Number of global memory store transactions
- sysmem_read_transactions: Number of system memory read transactions
- sysmem_write_transactions: Number of system memory write transactions
- l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
- l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
- dram_read_transactions: Device memory read transactions
- dram_write_transactions: Device memory write transactions
- global_hit_rate: Hit rate for global loads in unified L1/tex cache
- local_hit_rate: Hit rate for local loads and stores
- gld_requested_throughput: Requested global memory load throughput
- gst_requested_throughput: Requested global memory store throughput
- gld_throughput: Global memory load throughput
- gst_throughput: Global memory store throughput
- local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
- tex_cache_hit_rate: Unified cache hit rate
- l2_tex_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
- l2_tex_write_hit_rate: Hit rate at L2 cache for all write requests from texture cache
- dram_read_throughput: Device memory read throughput
- dram_write_throughput: Device memory write throughput
- tex_cache_throughput: Unified cache throughput
- l2_tex_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
- l2_tex_write_throughput: Memory write throughput seen at L2 cache for write requests from the texture cache
- l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
- l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
- sysmem_read_throughput: System memory read throughput
- sysmem_write_throughput: System memory write throughput
- local_load_throughput: Local memory load throughput
- local_store_throughput: Local memory store throughput
- shared_load_throughput: Shared memory load throughput
- shared_store_throughput: Shared memory store throughput
- gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput.
- tex_cache_transactions: Unified cache read transactions
- flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads.
- flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
- flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads.
- flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads.
- inst_executed: The number of instructions executed
- inst_issued: The number of instructions issued
- dram_utilization: The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
- sysmem_utilization: The utilization level of the system memory relative to the peak utilization
- stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
- stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
- stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
- stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
- stall_sync: Percentage of stalls occurring because the warp is blocked at a `__syncthreads()` call
- stall_other: Percentage of stalls occurring due to miscellaneous reasons
- stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
- stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
- shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
- inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_integer: Number of integer instructions executed by non-predicated threads
- inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
- inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
- inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
- inst_misc: Number of miscellaneous instructions executed by non-predicated threads
- inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
- issue_slots: The number of issue slots used
- cf_issued: Number of issued control-flow instructions
- cf_executed: Number of executed control-flow instructions
- ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
- ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
- atomic_transactions: Global memory atomic and reduction transactions
- atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
- l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests
- l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests
- l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache
- stall_memory_throttle: Percentage of stalls occurring because of memory throttle
- stall_not_selected: Percentage of stalls occurring because warp was not selected
- l2_tex_write_transactions: Memory write transactions seen at L2 cache for write requests from the texture cache
- flop_count_hp: Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_hp_add: Number of half-precision floating-point add operations executed by non-predicated threads.
- flop_count_hp_mul: Number of half-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_hp_fma: Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- inst_fp_16: Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- ipc: Instructions executed per cycle
- issued_ipc: Instructions issued per cycle
- issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
- sm_efficiency: The percentage of time at least one warp is active on a specific multiprocessor
- achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
- eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
- shared_utilization: The utilization level of the shared memory relative to peak utilization
- l2_utilization: The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
- tex_utilization: The utilization level of the unified cache relative to the peak utilization
- ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
- cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
- tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
- special_fu_utilization: The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
- half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16-bit floating-point instructions and integer instructions on a scale of 0 to 10
- single_precision_fu_utilization: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
- double_precision_fu_utilization: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
- flop_hp_efficiency: Ratio of achieved to peak half-precision floating-point operations
- flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
- flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
- sysmem_read_utilization: The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
- sysmem_write_utilization: The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
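
### Example: querying several metrics at once

The `--metrics` option takes a comma-separated list, so several metrics from the list above can be collected in a single run. The following is a minimal Python sketch that only assembles such a command line. The helper name `build_nvprof_command` is hypothetical (it is not part of nvprof), and `./executable` stands in for your own CUDA binary.

```python
import shlex


def build_nvprof_command(metrics, executable, app_args=()):
    """Return the argv list for profiling `executable` with the given metrics.

    Hypothetical helper for illustration; it only builds the command and
    does not run nvprof.
    """
    if not metrics:
        raise ValueError("at least one metric name is required")
    # nvprof accepts several metric names as one comma-separated value.
    return ["nvprof", "--metrics", ",".join(metrics), executable, *app_args]


cmd = build_nvprof_command(
    ["dram_read_transactions", "dram_write_transactions"], "./executable"
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

On a machine with nvprof installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Collecting only the metrics you need is usually faster than `--metrics all`, since fewer kernel replays are required.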
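
### Example: how the flop counts relate

The descriptions above state that a multiply-accumulate contributes 1 to `flop_count_*_fma` but 2 to the aggregate `flop_count_*` metric. Assuming the aggregate is exactly the sum of the add, multiply, and weighted multiply-accumulate counts, the relation can be sketched as below. The counter values are made up for illustration and are not measured data; `total_flops` is a hypothetical helper.

```python
def total_flops(add, mul, fma):
    """Combine per-type counts into the aggregate flop count.

    Each multiply-accumulate counts as two operations (one multiply, one add).
    Special operations are not included, matching the flop_count_sp description.
    """
    return add + mul + 2 * fma


# Hypothetical per-kernel counter values, not real profiler output.
counts = {
    "flop_count_sp_add": 1000,
    "flop_count_sp_mul": 500,
    "flop_count_sp_fma": 2000,
}

flop_count_sp = total_flops(
    counts["flop_count_sp_add"],
    counts["flop_count_sp_mul"],
    counts["flop_count_sp_fma"],
)
print(flop_count_sp)  # 5500
```

The same arithmetic applies to the `_dp` and `_hp` variants. Comparing the computed value with the reported aggregate metric is a quick sanity check on collected numbers.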