# nvprof - NVCC Profiler (Profiling CUDA Programs)
Learn - How to profile using nvprof

How to use
```
nvprof ./executable
```
### What are you can see
1. Number of times kernel is invoked
2. Kernel execution time
3. Time taken to DtoH and HtoD
4. ..

## List of meterics available

0. How to find all the metric available for the device?

```
nvprof --query-metrics

```

1. How to query for all metric?
```nvprof --metrics all ```

2. How to query for a specific metric? say  Dram reads.

 ```nvprof --metrics dram_read_transactions ```


### List
- Available Metrics:  Name   Description
- inst_per_warp:  Average number of instructions executed by each warp
- branch_efficiency:  Ratio of non-divergent branches to total branches
- warp_execution_efficiency:  Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
- warp_nonpred_execution_efficiency:  Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
- inst_replay_overhead:  Average number of replays for each instruction executed
- shared_load_transactions_per_request:  Average number of shared memory load transactions performed for each shared memory load
- shared_store_transactions_per_request:  Average number of shared memory store transactions performed for each shared memory store
- local_load_transactions_per_request:  Average number of local memory load transactions performed for each local memory load
- local_store_transactions_per_request:  Average number of local memory store transactions performed for each local memory store
- gld_transactions_per_request:  Average number of global memory load transactions performed for each global memory load.
- gst_transactions_per_request:  Average number of global memory store transactions performed for each global memory store
- shared_store_transactions:  Number of shared memory store transactions
- shared_load_transactions:  Number of shared memory load transactions
- local_load_transactions:  Number of local memory load transactions
- local_store_transactions:  Number of local memory store transactions
- gld_transactions:  Number of global memory load transactions
- gst_transactions:  Number of global memory store transactions
- sysmem_read_transactions:  Number of system memory read transactions
- sysmem_write_transactions:  Number of system memory write transactions
- l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests
- l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests
- dram_read_transactions:  Device memory read transactions
- dram_write_transactions:  Device memory write transactions
- global_hit_rate:  Hit rate for global loads in unified l1/tex cache
- local_hit_rate:  Hit rate for local loads and stores
- gld_requested_throughput:  Requested global memory load throughput
- gst_requested_throughput:  Requested global memory store throughput
- gld_throughput:  Global memory load throughput
- gst_throughput:  Global memory store throughput
- local_memory_overhead:  Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
- tex_cache_hit_rate:  Unified cache hit rate
- l2_tex_read_hit_rate:  Hit rate at L2 cache for all read requests from texture cache
- l2_tex_write_hit_rate:  Hit Rate at L2 cache for all write requests from texture cache
- dram_read_throughput:  Device memory read throughput
- dram_write_throughput:  Device memory write throughput
- tex_cache_throughput:  Unified cache throughput
- l2_tex_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache
- l2_tex_write_throughput:  Memory write throughput seen at L2 cache for write requests from the texture cache
- l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests
- l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests
- sysmem_read_throughput:  System memory read throughput
- sysmem_write_throughput:  System memory write throughput
- local_load_throughput:  Local memory load throughput
- local_store_throughput:  Local memory store throughput
- shared_load_throughput:  Shared memory load throughput
- shared_store_throughput:  Shared memory store throughput
- gld_efficiency:  Ratio of requested global memory load throughput to required global memory load throughput.
- gst_efficiency:  Ratio of requested global memory store throughput to required global memory store throughput.
- tex_cache_transactions:  Unified cache read transactions
- flop_count_dp:  Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_dp_add:  Number of double-precision floating-point add operations executed by non-predicated threads.
- flop_count_dp_fma:  Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_dp_mul:  Number of double-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp:  Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
- flop_count_sp_add:  Number of single-precision floating-point add operations executed by non-predicated threads.
- flop_count_sp_fma:  Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- flop_count_sp_mul:  Number of single-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_sp_special:  Number of single-precision floating-point special operations executed by non-predicated threads.
- inst_executed:  The number of instructions executed
- inst_issued:  The number of instructions issued
- dram_utilization:  The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
- sysmem_utilization:  The utilization level of the system memory relative to the peak utilization
- stall_inst_fetch:  Percentage of stalls occurring because the next assembly instruction has not yet been fetched
- stall_exec_dependency:  Percentage of stalls occurring because an input required by the instruction is not yet available
- stall_memory_dependency:  Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
- stall_texture:  Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
- stall_sync:  Percentage of stalls occurring because the warp is blocked at a `__syncthreads()` call
- stall_other:  Percentage of stalls occurring due to miscellaneous reasons
- stall_constant_memory_dependency:  Percentage of stalls occurring because of immediate constant cache miss
- stall_pipe_busy:  Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
- shared_efficiency:  Ratio of requested shared memory throughput to required shared memory throughput
- inst_fp_32:  Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_fp_64:  Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- inst_integer:  Number of integer instructions executed by non-predicated threads
- inst_bit_convert:  Number of bit-conversion instructions executed by non-predicated threads
- inst_control:  Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
- inst_compute_ld_st:  Number of compute load/store instructions executed by non-predicated threads
- inst_misc:  Number of miscellaneous instructions executed by non-predicated threads
- inst_inter_thread_communication:  Number of inter-thread communication instructions executed by non-predicated threads
- issue_slots:  The number of issue slots used
- cf_issued:  Number of issued control-flow instructions
- cf_executed:  Number of executed control-flow instructions
- ldst_issued:  Number of issued local, global, shared and texture memory load and store instructions
- ldst_executed:  Number of executed local, global, shared and texture memory load and store instructions
- atomic_transactions:  Global memory atomic and reduction transactions
- atomic_transactions_per_request:  Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
- l2_atomic_throughput:  Memory read throughput seen at L2 cache for atomic and reduction requests
- l2_atomic_transactions:  Memory read transactions seen at L2 cache for atomic and reduction requests
- l2_tex_read_transactions:  Memory read transactions seen at L2 cache for read requests from the texture cache
- stall_memory_throttle:  Percentage of stalls occurring because of memory throttle
- stall_not_selected:  Percentage of stalls occurring because warp was not selected
- l2_tex_write_transactions:  Memory write transactions seen at L2 cache for write requests from the texture cache
- flop_count_hp:  Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
- flop_count_hp_add:  Number of half-precision floating-point add operations executed by non-predicated threads.
- flop_count_hp_mul:  Number of half-precision floating-point multiply operations executed by non-predicated threads.
- flop_count_hp_fma:  Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
- inst_fp_16:  Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
- ipc:  Instructions executed per cycle
- issued_ipc:  Instructions issued per cycle
- issue_slot_utilization:  Percentage of issue slots that issued at least one instruction, averaged across all cycles
- sm_efficiency:  The percentage of time at least one warp is active on a specific multiprocessor
- achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
- eligible_warps_per_cycle:  Average number of warps that are eligible to issue per active cycle
- shared_utilization:  The utilization level of the shared memory relative to peak utilization
- l2_utilization:  The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
- tex_utilization:  The utilization level of the unified cache relative to the peak utilization
- ldst_fu_utilization:  The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
- cf_fu_utilization:  The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
- tex_fu_utilization:  The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
- special_fu_utilization:  The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
- half_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
- single_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
- double_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
- flop_hp_efficiency:  Ratio of achieved to peak half-precision floating-point operations
- flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations
- flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations
- sysmem_read_utilization:  The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
- sysmem_write_utilization:  The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10