chichunchen · March 17, 2022 21:43 · Jul 29, 2019
diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -23,9 +23,11 @@ nvprof --query-metrics
 ```
 
 1. How to query for all metric?
+
 ```nvprof --metrics all ./executable```
 
 2. How to query for a specific metric? say  Dram reads.
+
  ```nvprof --metrics dram_read_transactions ./executable```
 
 

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -20,14 +20,12 @@ In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/
 
 ```
 nvprof --query-metrics
-
 ```
 
 1. How to query for all metric?
 ```nvprof --metrics all ./executable```
 
 2. How to query for a specific metric? say  Dram reads.
-
  ```nvprof --metrics dram_read_transactions ./executable```
 
 

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,5 +1,7 @@
 # nvprof - NVCC Profiler 
-How to use it? 
+nvprof is NVIDIA's command-line profiler; it can profile any executable, including CUDA programs.
+
+### How to use it? 
 ``` 
 nvprof ./executable
 ```

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -12,7 +12,7 @@ In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/
 3. Time taken to DtoH and HtoD
 4. ..
 
-## List of meterics available
+## How to get all or specific metrics from nvprof
 
 0. How to find all the metric available for the device? It is a big list see at EOF.
 

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,6 +1,8 @@
 # nvprof - NVCC Profiler 
 How to use it? 
-``` nvprof ./executable```
+``` 
+nvprof ./executable
+```
 
 In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
 
@@ -20,11 +22,11 @@ nvprof --query-metrics
 ```
 
 1. How to query for all metric?
-```nvprof --metrics all ```
+```nvprof --metrics all ./executable```
 
 2. How to query for a specific metric? say  Dram reads.
 
- ```nvprof --metrics dram_read_transactions ```
+ ```nvprof --metrics dram_read_transactions ./executable```
 
 
 ### List

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,6 +1,7 @@
 # nvprof - NVCC Profiler 
-How to use it? ``` nvprof ./executable```
-`
+How to use it? 
+``` nvprof ./executable```
+
 In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
 
 ### What all you can see

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,5 +1,5 @@
 # nvprof - NVCC Profiler 
-How to use it? ` nvprof ./executable`
+How to use it? ``` nvprof ./executable```
 `
 In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
 

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,11 +1,9 @@
 # nvprof - NVCC Profiler 
-How to use it?
-```
-nvprof ./executable
-```
+How to use it? ` nvprof ./executable`
+`
 In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
 
-### What are you can see
+### What all you can see
 1. Number of times kernel is invoked
 2. Kernel execution time
 3. Time taken to DtoH and HtoD

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -3,7 +3,7 @@ How to use it?
 ```
 nvprof ./executable
 ```
-In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-`MAJOR.minor`/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
+In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-<MAJOR.minor>/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
 
 ### What are you can see
 1. Number of times kernel is invoked
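Note on the path line changed in this hunk: a script can check `PATH` first and fall back to the default install location. A minimal sketch, assuming a POSIX shell; `find_nvprof` is a hypothetical helper name, not part of CUDA, and only the unversioned `/usr/local/cuda/bin` path from the text is tried:

```shell
# find_nvprof is a hypothetical helper: prefer nvprof from PATH, otherwise
# fall back to the default CUDA install location mentioned in the text.
find_nvprof() {
    if command -v nvprof >/dev/null 2>&1; then
        command -v nvprof
    elif [ -x /usr/local/cuda/bin/nvprof ]; then
        echo /usr/local/cuda/bin/nvprof
    else
        echo "nvprof not found" >&2
        return 1
    fi
}

# Report which binary would be used; stay quiet if none is installed.
if NVPROF=$(find_nvprof 2>/dev/null); then
    echo "using $NVPROF"
fi
```

A versioned install such as `/usr/local/cuda-<MAJOR.minor>/bin/nvprof` would need its own branch, since the version number depends on the machine.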

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,12 +1,9 @@
 # nvprof - NVCC Profiler 
-
-How to use
+How to use it?
 ```
 nvprof ./executable
 ```
-```
-/usr/local/cuda/bin/nvprof ./executable
-```
+In case if you want the obsolute url ``/usr/local/cuda/bin/nvprof or /usr/local/cuda-`MAJOR.minor`/bin/nvprof`` where `MAJOR.minor` is your CUDA version installed.
 
 ### What are you can see
 1. Number of times kernel is invoked
@@ -16,7 +13,7 @@ nvprof ./executable
 
 ## List of meterics available
 
-0. How to find all the metric available for the device?
+0. How to find all the metric available for the device? It is a big list see at EOF.
 
 ```
 nvprof --query-metrics
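Because the `--query-metrics` listing is long, it can help to filter it by keyword. A minimal sketch, assuming a POSIX shell with `grep`; `filter_metrics` is a hypothetical helper name, and the demonstration runs on two lines copied from the list in this file rather than on real nvprof output:

```shell
# filter_metrics is a hypothetical helper: keep only the lines of a metric
# listing (read from stdin) that mention the given keyword, ignoring case.
filter_metrics() {
    grep -i -- "$1"
}

# Intended use on a machine with CUDA installed (not run here); 2>&1 is
# added in case the listing is written to stderr:
#   nvprof --query-metrics 2>&1 | filter_metrics dram

# Demonstration on two lines copied from the metric list in this file:
printf '%s\n' \
    'dram_read_transactions:  Device memory read transactions' \
    'ipc:  Instructions executed per cycle' | filter_metrics dram
```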

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -1,10 +1,13 @@
-# nvprof - NVCC Profiler (Profiling CUDA Programs)
-Learn - How to profile using nvprof
+# nvprof - NVCC Profiler 
 
 How to use
 ```
 nvprof ./executable
 ```
+```
+/usr/local/cuda/bin/nvprof ./executable
+```
+
 ### What are you can see
 1. Number of times kernel is invoked
 2. Kernel execution time

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -11,7 +11,7 @@ nvprof ./executable
 3. Time taken to DtoH and HtoD
 4. ..
 
-## List off meterics available
+## List of meterics available
 
 0. How to find all the metric available for the device?
 
@@ -26,3 +26,125 @@ nvprof --query-metrics
 2. How to query for a specific metric? say  Dram reads.
 
  ```nvprof --metrics dram_read_transactions ```
+
+
+### List
+Available metrics (name: description):
+- inst_per_warp:  Average number of instructions executed by each warp
+- branch_efficiency:  Ratio of non-divergent branches to total branches
+- warp_execution_efficiency:  Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
+- warp_nonpred_execution_efficiency:  Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
+- inst_replay_overhead:  Average number of replays for each instruction executed
+- shared_load_transactions_per_request:  Average number of shared memory load transactions performed for each shared memory load
+- shared_store_transactions_per_request:  Average number of shared memory store transactions performed for each shared memory store
+- local_load_transactions_per_request:  Average number of local memory load transactions performed for each local memory load
+- local_store_transactions_per_request:  Average number of local memory store transactions performed for each local memory store
+- gld_transactions_per_request:  Average number of global memory load transactions performed for each global memory load.
+- gst_transactions_per_request:  Average number of global memory store transactions performed for each global memory store
+- shared_store_transactions:  Number of shared memory store transactions
+- shared_load_transactions:  Number of shared memory load transactions
+- local_load_transactions:  Number of local memory load transactions
+- local_store_transactions:  Number of local memory store transactions
+- gld_transactions:  Number of global memory load transactions
+- gst_transactions:  Number of global memory store transactions
+- sysmem_read_transactions:  Number of system memory read transactions
+- sysmem_write_transactions:  Number of system memory write transactions
+- l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests
+- l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests
+- dram_read_transactions:  Device memory read transactions
+- dram_write_transactions:  Device memory write transactions
+- global_hit_rate:  Hit rate for global loads in unified l1/tex cache
+- local_hit_rate:  Hit rate for local loads and stores
+- gld_requested_throughput:  Requested global memory load throughput
+- gst_requested_throughput:  Requested global memory store throughput
+- gld_throughput:  Global memory load throughput
+- gst_throughput:  Global memory store throughput
+- local_memory_overhead:  Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
+- tex_cache_hit_rate:  Unified cache hit rate
+- l2_tex_read_hit_rate:  Hit rate at L2 cache for all read requests from texture cache
+- l2_tex_write_hit_rate:  Hit Rate at L2 cache for all write requests from texture cache
+- dram_read_throughput:  Device memory read throughput
+- dram_write_throughput:  Device memory write throughput
+- tex_cache_throughput:  Unified cache throughput
+- l2_tex_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache
+- l2_tex_write_throughput:  Memory write throughput seen at L2 cache for write requests from the texture cache
+- l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests
+- l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests
+- sysmem_read_throughput:  System memory read throughput
+- sysmem_write_throughput:  System memory write throughput
+- local_load_throughput:  Local memory load throughput
+- local_store_throughput:  Local memory store throughput
+- shared_load_throughput:  Shared memory load throughput
+- shared_store_throughput:  Shared memory store throughput
+- gld_efficiency:  Ratio of requested global memory load throughput to required global memory load throughput.
+- gst_efficiency:  Ratio of requested global memory store throughput to required global memory store throughput.
+- tex_cache_transactions:  Unified cache read transactions
+- flop_count_dp:  Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
+- flop_count_dp_add:  Number of double-precision floating-point add operations executed by non-predicated threads.
+- flop_count_dp_fma:  Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
+- flop_count_dp_mul:  Number of double-precision floating-point multiply operations executed by non-predicated threads.
+- flop_count_sp:  Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count. The count does not include special operations.
+- flop_count_sp_add:  Number of single-precision floating-point add operations executed by non-predicated threads.
+- flop_count_sp_fma:  Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
+- flop_count_sp_mul:  Number of single-precision floating-point multiply operations executed by non-predicated threads.
+- flop_count_sp_special:  Number of single-precision floating-point special operations executed by non-predicated threads.
+- inst_executed:  The number of instructions executed
+- inst_issued:  The number of instructions issued
+- dram_utilization:  The utilization level of the device memory relative to the peak utilization on a scale of 0 to 10
+- sysmem_utilization:  The utilization level of the system memory relative to the peak utilization
+- stall_inst_fetch:  Percentage of stalls occurring because the next assembly instruction has not yet been fetched
+- stall_exec_dependency:  Percentage of stalls occurring because an input required by the instruction is not yet available
+- stall_memory_dependency:  Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
+- stall_texture:  Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
+- stall_sync:  Percentage of stalls occurring because the warp is blocked at a `__syncthreads()` call
+- stall_other:  Percentage of stalls occurring due to miscellaneous reasons
+- stall_constant_memory_dependency:  Percentage of stalls occurring because of immediate constant cache miss
+- stall_pipe_busy:  Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
+- shared_efficiency:  Ratio of requested shared memory throughput to required shared memory throughput
+- inst_fp_32:  Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
+- inst_fp_64:  Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
+- inst_integer:  Number of integer instructions executed by non-predicated threads
+- inst_bit_convert:  Number of bit-conversion instructions executed by non-predicated threads
+- inst_control:  Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
+- inst_compute_ld_st:  Number of compute load/store instructions executed by non-predicated threads
+- inst_misc:  Number of miscellaneous instructions executed by non-predicated threads
+- inst_inter_thread_communication:  Number of inter-thread communication instructions executed by non-predicated threads
+- issue_slots:  The number of issue slots used
+- cf_issued:  Number of issued control-flow instructions
+- cf_executed:  Number of executed control-flow instructions
+- ldst_issued:  Number of issued local, global, shared and texture memory load and store instructions
+- ldst_executed:  Number of executed local, global, shared and texture memory load and store instructions
+- atomic_transactions:  Global memory atomic and reduction transactions
+- atomic_transactions_per_request:  Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
+- l2_atomic_throughput:  Memory read throughput seen at L2 cache for atomic and reduction requests
+- l2_atomic_transactions:  Memory read transactions seen at L2 cache for atomic and reduction requests
+- l2_tex_read_transactions:  Memory read transactions seen at L2 cache for read requests from the texture cache
+- stall_memory_throttle:  Percentage of stalls occurring because of memory throttle
+- stall_not_selected:  Percentage of stalls occurring because warp was not selected
+- l2_tex_write_transactions:  Memory write transactions seen at L2 cache for write requests from the texture cache
+- flop_count_hp:  Number of half-precision floating-point operations executed by non-predicated threads (add, multiply, and multiply-accumulate). Each multiply-accumulate operation contributes 2 to the count.
+- flop_count_hp_add:  Number of half-precision floating-point add operations executed by non-predicated threads.
+- flop_count_hp_mul:  Number of half-precision floating-point multiply operations executed by non-predicated threads.
+- flop_count_hp_fma:  Number of half-precision floating-point multiply-accumulate operations executed by non-predicated threads. Each multiply-accumulate operation contributes 1 to the count.
+- inst_fp_16:  Number of half-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
+- ipc:  Instructions executed per cycle
+- issued_ipc:  Instructions issued per cycle
+- issue_slot_utilization:  Percentage of issue slots that issued at least one instruction, averaged across all cycles
+- sm_efficiency:  The percentage of time at least one warp is active on a specific multiprocessor
+- achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
+- eligible_warps_per_cycle:  Average number of warps that are eligible to issue per active cycle
+- shared_utilization:  The utilization level of the shared memory relative to peak utilization
+- l2_utilization:  The utilization level of the L2 cache relative to the peak utilization on a scale of 0 to 10
+- tex_utilization:  The utilization level of the unified cache relative to the peak utilization
+- ldst_fu_utilization:  The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
+- cf_fu_utilization:  The utilization level of the multiprocessor function units that execute control-flow instructions on a scale of 0 to 10
+- tex_fu_utilization:  The utilization level of the multiprocessor function units that execute global, local and texture memory instructions on a scale of 0 to 10
+- special_fu_utilization:  The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions on a scale of 0 to 10
+- half_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions and integer instructions on a scale of 0 to 10
+- single_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
+- double_precision_fu_utilization:  The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
+- flop_hp_efficiency:  Ratio of achieved to peak half-precision floating-point operations
+- flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations
+- flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations
+- sysmem_read_utilization:  The read utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
+- sysmem_write_utilization:  The write utilization level of the system memory relative to the peak utilization on a scale of 0 to 10
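The `flop_count_sp*` descriptions in the list above suggest how the aggregate relates to its parts: each multiply-accumulate counts once in `flop_count_sp_fma` but contributes 2 to `flop_count_sp`. A minimal sketch of that arithmetic, assuming a POSIX shell; the three counter values are made up for illustration and do not come from a real run:

```shell
# Hypothetical counter values, standing in for numbers read off an nvprof run.
flop_count_sp_add=1000
flop_count_sp_mul=500
flop_count_sp_fma=2000

# Per the descriptions in the list: each FMA counts once in flop_count_sp_fma
# but contributes 2 to flop_count_sp (special operations are not included).
flop_count_sp=$((flop_count_sp_add + flop_count_sp_mul + 2 * flop_count_sp_fma))
echo "flop_count_sp = $flop_count_sp"
```

By the same wording, the double-precision and half-precision counters should combine the same way.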
diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -6,7 +6,7 @@ How to use
 nvprof ./executable
 ```
 ### What are you can see
-1. # times kernel is invoke
+1. Number of times kernel is invoked
 2. Kernel execution time
 3. Time taken to DtoH and HtoD
 4. ..

diff --git a/LearnNvprof.md b/LearnNvprof.md
@@ -0,0 +1,28 @@
+# nvprof - NVCC Profiler (Profiling CUDA Programs)
+Learn - How to profile using nvprof
+
+How to use
+```
+nvprof ./executable
+```
+### What are you can see
+1. # times kernel is invoke
+2. Kernel execution time
+3. Time taken to DtoH and HtoD
+4. ..
+
+## List off meterics available
+
+0. How to find all the metric available for the device?
+
+```
+nvprof --query-metrics
+
+```
+
+1. How to query for all metric?
+```nvprof --metrics all ```
+
+2. How to query for a specific metric? say  Dram reads.
+
+ ```nvprof --metrics dram_read_transactions ```
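Related to the `--metrics` queries above: `--metrics` also accepts several metric names separated by commas, so more than one metric can be collected in a single run. A minimal sketch, assuming a POSIX shell; `join_metrics` is a hypothetical helper name and `./executable` is the same placeholder used in the file. The command is only printed, not run:

```shell
# join_metrics is a hypothetical helper: joins its arguments with commas,
# the form nvprof's --metrics option takes for multiple metrics.
# The function body runs in a subshell so the IFS change does not leak out.
join_metrics() (
    IFS=,
    printf '%s\n' "$*"
)

metrics=$(join_metrics dram_read_transactions dram_write_transactions)

# Print the command instead of running it, so the sketch works without a GPU.
echo "nvprof --metrics $metrics ./executable"
```

Collecting only the metrics of interest is usually much faster than `--metrics all`, since the profiler may have to replay kernels to gather each group of counters.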