@vishbin
Forked from eshelman/latency.txt
Created December 9, 2018 08:55

Revisions

  1. Eliot Eshelman revised this gist Jul 2, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -68,7 +68,7 @@ Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz).

    "Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.

    GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depending upon the server platform design.
    GPU NVLink connections are not always 40GB. They range from 20GB to 150GB, depending upon the server platform design.


    Credit
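
A quick sanity check on the NVLink rows above (the GB figures are link bandwidths, i.e. GB/s): the ~30 us "Transfer 1MB" entry follows almost directly from the bandwidth. A minimal Python sketch, assuming (purely as an illustration, not a measurement) that roughly 80% of the peak rate is achievable:

    # Time to move 1 MB over an NVLink connection at the bandwidths quoted above.
    # The ~33 GB/s effective rate in the table corresponds to a 40 GB/s-class link.
    MB = 1e6  # the table uses decimal megabytes (1 MB / 30 us ~= 33 GB/s)

    for peak_gb_per_s in (20, 40, 80, 150):
        effective = peak_gb_per_s * 0.8          # assumed ~80% of peak
        seconds = MB / (effective * 1e9)
        print(f"{peak_gb_per_s:>4} GB/s link -> ~{seconds * 1e6:.0f} us per 1 MB")
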
  2. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions latency.txt
    @@ -30,6 +30,8 @@ Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/
    Random Disk Access (seek+rotation) 10,000,000 ns 10,000 us 10 ms
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    Total CPU pipeline length?


    NVIDIA Tesla GPU values
    -----------------------
    @@ -42,6 +44,7 @@ Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/s
    Floating-point add/mult operation?
    Shift operation?
    Atomic operation in GPU Global Memory?
    Total GPU pipeline length?
    Launch CUDA kernel (via dynamic parallelism)?


  3. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 9 additions and 2 deletions.
    11 changes: 9 additions & 2 deletions latency.txt
    @@ -9,7 +9,7 @@ L3 cache hit (shared line in another core) 25 ns 65 cycl
    Mutex lock/unlock 25 ns
    L3 cache hit (modified in another core) 29 ns 75 cycles
    L3 cache hit (on a remote CPU socket) 40 ns 100 ~ 300 cycles (40 ~ 116 ns)
    QPI hop to another CPU (time per hop) .   40 . ns
    QPI hop to another CPU (time per hop)   40 ns
    64MB main memory reference (local CPU) 46 ns TinyMemBench on "Broadwell" E5-2690v4
    64MB main memory reference (remote CPU) 70 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference (local CPU) 75 ns TinyMemBench on "Broadwell" E5-2690v4
    @@ -35,10 +35,15 @@ NVIDIA Tesla GPU values
    -----------------------
    GPU Shared Memory access 30 ns 30~90 cycles (bank conflicts will introduce more latency)
    GPU Global Memory access 200 ns 200~800 cycles, depending upon GPU generation and access patterns
    Launch CUDA kernel on GPU 10,000 ns 10 us
    Launch CUDA kernel on GPU 10,000 ns 10 us Host CPU instructs GPU to start executing a kernel
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link

    Floating-point add/mult operation?
    Shift operation?
    Atomic operation in GPU Global Memory?
    Launch CUDA kernel (via dynamic parallelism)?


    Intel Xeon CPU values
    ---------------------
    @@ -58,6 +63,8 @@ Notes
    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
    Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.

    "Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.

    GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depending upon the server platform design.
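
The cycle counts and nanosecond figures in this table hang together through the clock assumptions stated in the Notes (2.6 GHz for the Xeon CPU, 1 GHz for the Tesla GPU). A short Python sketch of that conversion, using a few rows from the table as examples:

    # Convert cycle counts to nanoseconds: one cycle takes 1/frequency seconds.
    CPU_GHZ = 2.6   # Xeon assumption from the Notes -> ~0.385 ns per cycle
    GPU_GHZ = 1.0   # Tesla assumption from the Notes -> 1 ns per cycle

    def cycles_to_ns(cycles, ghz):
        return cycles / ghz            # GHz = cycles per ns

    print(f"1 CPU cycle        : {cycles_to_ns(1, CPU_GHZ):.3f} ns")   # ~0.385
    print(f"L1 hit, 4 cycles   : {cycles_to_ns(4, CPU_GHZ):.1f} ns")   # ~1.5
    print(f"L3 hit, 42 cycles  : {cycles_to_ns(42, CPU_GHZ):.0f} ns")  # ~16
    print(f"GPU global, 200 cyc: {cycles_to_ns(200, GPU_GHZ):.0f} ns") # ~200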


  4. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 13 additions and 5 deletions.
    18 changes: 13 additions & 5 deletions latency.txt
    @@ -1,13 +1,19 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 1.5 ns 4 cycles
    L1 cache reference/hit 1.5 ns 4 cycles
    Floating-point add/mult/FMA operation 1.5 ns 4 cycles
    L2 cache reference 5 ns 12 ~ 17 cycles
    L2 cache reference/hit 5 ns 12 ~ 17 cycles
    Branch mispredict 6 ns 15 ~ 20 cycles
    L3 cache reference 16 ns 42 cycles
    L3 cache hit (unshared cache line) 16 ns 42 cycles
    L3 cache hit (shared line in another core) 25 ns 65 cycles
    Mutex lock/unlock 25 ns
    64MB main memory reference 46 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference 75 ns TinyMemBench on "Broadwell" E5-2690v4
    L3 cache hit (modified in another core) 29 ns 75 cycles
    L3 cache hit (on a remote CPU socket) 40 ns 100 ~ 300 cycles (40 ~ 116 ns)
    QPI hop to another CPU (time per hop) .   40 . ns
    64MB main memory reference (local CPU) 46 ns TinyMemBench on "Broadwell" E5-2690v4
    64MB main memory reference (remote CPU) 70 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference (local CPU) 75 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference (remote CPU) 120 ns TinyMemBench on "Broadwell" E5-2690v4
    Send 4KB over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    @@ -65,6 +71,8 @@ Additional Data Gathered/Correlated from:
    -----------------------------------------
    Memory latency tool: https://github.com/ssvb/tinymembench
    CPU data from Agner Fog: http://www.agner.org/optimize/
    CPU cache and QPI data: https://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html
    Intel performance analysis: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
    Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
    Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
    MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
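
For the fabric and Ethernet rows, the wire (serialization) time of a 4 KB message is only a fraction of the measured figure; the remainder is software, NIC and switch overhead. A rough Python sketch of the wire time alone, assuming the nominal link rates:

    # Pure serialization time for a payload on a link, ignoring all overheads.
    def wire_time_ns(payload_bytes, link_gbps):
        return payload_bytes * 8 / (link_gbps * 1e9) * 1e9

    print(f"4 KB at 100 Gbps: ~{wire_time_ns(4096, 100):.0f} ns")  # ~328 ns of the ~1,040 ns measured
    print(f"4 KB at  10 Gbps: ~{wire_time_ns(4096, 10):.0f} ns")   # ~3,277 ns of the ~10,000 ns listed
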
  5. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -21,7 +21,7 @@ Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S351
    Round trip within same datacenter 500,000 ns 500 us One-way ping across Ethernet is ~250us
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms
    Random Disk Access (seek+rotation) 10,000,000 ns 10,000 us 10 ms
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms


  6. Eliot Eshelman revised this gist Dec 29, 2016. 1 changed file with 8 additions and 4 deletions.
    12 changes: 8 additions & 4 deletions latency.txt
    @@ -34,10 +34,13 @@ Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/s
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link


    Other useful values
    ------------------
    Intel Xeon CPU values
    ---------------------
    Wake up from C1 state 500 ns varies from <0.5us to 2us
    Wake up from C3 state 15,000 ns 15 us varies from 10us to 50us
    Wake up from C6 state 30,000 ns 30 us varies from 20us to 60us

    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?


    Notes
    @@ -69,4 +72,5 @@ NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/do
    SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
    GPU optimization: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
    CPU/GPU data locality: https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
    GPU Memory Hierarchy: https://arxiv.org/pdf/1509.02308&ved...qHEz78QnmcIVCSXvg&sig2=IdzxfrzQgNv8yq7e1mkeVg
    Intel Xeon C-state data: http://ena-hpc.org/2014/pdf/paper_06.pdf
  7. Eliot Eshelman revised this gist Dec 28, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -55,7 +55,7 @@ GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depend
    Credit
    ------
    Adapted from: https://gist.github.com/jboner/2841832
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Original curator: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Additional Data Gathered/Correlated from:
  8. Eliot Eshelman revised this gist Dec 28, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -13,7 +13,7 @@ Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    Write 4KB randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 gen 3.0 link
    Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
  9. Eliot Eshelman revised this gist Dec 28, 2016. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions latency.txt
    @@ -18,7 +18,7 @@ Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P360
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Round trip within same datacenter 500,000 ns 500 us One-way ping across Ethernet is ~250us
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms
    @@ -54,11 +54,12 @@ GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depend

    Credit
    ------
    Adapted from: https://gist.github.com/jboner/2841832
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Additional Data Gathered/Correlated from:
    ---------------------------------------
    -----------------------------------------
    Memory latency tool: https://github.com/ssvb/tinymembench
    CPU data from Agner Fog: http://www.agner.org/optimize/
    Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
  10. Eliot Eshelman revised this gist Dec 27, 2016. 1 changed file with 23 additions and 2 deletions.
    25 changes: 23 additions & 2 deletions latency.txt
    @@ -12,27 +12,45 @@ Send 4KB over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH
    Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    Write 4KB randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link
    Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms


    NVIDIA Tesla GPU values
    -----------------------
    GPU Shared Memory access 30 ns 30~90 cycles (bank conflicts will introduce more latency)
    GPU Global Memory access 200 ns 200~800 cycles, depending upon GPU generation and access patterns
    Launch CUDA kernel on GPU 10,000 ns 10 us
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link


    Other useful values
    ------------------
    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?


    Notes
    -----
    1 ns = 10^-9 seconds
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
    Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.

    GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depending upon the server platform design.


    Credit
    ------
    @@ -47,4 +65,7 @@ Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_B
    Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
    MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
    NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
    SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
    GPU optimization: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
    CPU/GPU data locality: https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
    GPU Memory Hierarchy: https://arxiv.org/pdf/1509.02308&ved...qHEz78QnmcIVCSXvg&sig2=IdzxfrzQgNv8yq7e1mkeVg
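
The sequential-read rows above are essentially size divided by throughput. A minimal Python sketch reproducing them from the drive throughputs quoted in the table:

    # Time to read 1 MB sequentially, given the sustained throughput in the table.
    drives = {
        "NVMe SSD  (DC P3608, ~4.8 GB/s)": 4.8e9,
        "SATA SSD  (DC S3510, ~550 MB/s)": 550e6,
        "Server hard disk    (~200 MB/s)": 200e6,
    }
    for name, bytes_per_sec in drives.items():
        microseconds = 1e6 / bytes_per_sec * 1e6
        print(f"{name}: ~{microseconds:,.0f} us per 1 MB")
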
  11. Eliot Eshelman revised this gist Dec 27, 2016. 1 changed file with 12 additions and 15 deletions.
    27 changes: 12 additions & 15 deletions latency.txt
    @@ -8,21 +8,21 @@ L3 cache reference 16 ns 42 cycl
    Mutex lock/unlock 25 ns
    64MB main memory reference 46 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference 75 ns TinyMemBench on "Broadwell" E5-2690v4
    Send 4K bytes over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1K bytes with Google Snappy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Write 4K randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Read 4K randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1 MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4K randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4K randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Send 4KB over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    Write 4KB randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Read 1 MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    More useful values
    Other useful values
    ------------------
    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?
    @@ -32,9 +32,6 @@ Notes
    1 ns = 10^-9 seconds
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Details
    -------
    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.

    Credit
  12. Eliot Eshelman revised this gist Dec 27, 2016. 1 changed file with 23 additions and 9 deletions.
    32 changes: 23 additions & 9 deletions latency.txt
    @@ -1,18 +1,23 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.4 ns 1 cycle
    L1 cache reference 1.5 ns 4 cycles
    Floating-point add/mult/FMA operation 1.5 ns 4 cycles
    Branch mispredict 5 ns 15 ~ 20 cycles
    L2 cache reference 7 ns 14x L1 cache
    L2 cache reference 5 ns 12 ~ 17 cycles
    Branch mispredict 6 ns 15 ~ 20 cycles
    L3 cache reference 16 ns 42 cycles
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Send 1K bytes over 100 Gbps HPC fabric 1,100 ns 1 us MVAPICH2 over Intel Omni-Path
    64MB main memory reference 46 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference 75 ns TinyMemBench on "Broadwell" E5-2690v4
    Send 4K bytes over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1K bytes with Google Snappy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
    Read 1 MB sequentially from memory 250,000 ns 250 us
    Write 4K randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Read 4K randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1 MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4K randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4K randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
    Read 1 MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
    @@ -36,4 +41,13 @@ Credit
    ------
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers
    Much data from Agner Fog http://www.agner.org/optimize/

    Additional Data Gathered/Correlated from:
    ---------------------------------------
    Memory latency tool: https://github.com/ssvb/tinymembench
    CPU data from Agner Fog: http://www.agner.org/optimize/
    Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
    Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
    MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
    NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
    SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
  13. Eliot Eshelman revised this gist Dec 23, 2016. 1 changed file with 16 additions and 11 deletions.
    27 changes: 16 additions & 11 deletions latency.txt
    @@ -1,11 +1,13 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L1 cache reference 0.4 ns 1 cycle
    Floating-point add/mult/FMA operation 1.5 ns 4 cycles
    Branch mispredict 5 ns 15 ~ 20 cycles
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns 3 us
    Send 1K bytes over 100 Gbps HPC fabric 1,100 ns 1 us MVAPICH2 over Intel Omni-Path
    Compress 1K bytes with Google Snappy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
    Read 1 MB sequentially from memory 250,000 ns 250 us
    @@ -15,20 +17,23 @@ Disk seek 10,000,000 ns 10,000 us 10 ms 20x dat
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    More useful values
    ------------------
    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?

    Notes
    -----
    1 ns = 10^-9 seconds
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Details
    -------
    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.

    Credit
    ------
    By Jeff Dean: http://research.google.com/people/jeff/
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Contributions
    -------------
    Some updates from: https://gist.github.com/2843375
    'Humanized' comparison: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Animated presentation: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
    Much data from Agner Fog http://www.agner.org/optimize/
  14. @jboner jboner revised this gist Jan 15, 2016. 1 changed file with 20 additions and 20 deletions.
    40 changes: 20 additions & 20 deletions latency.txt
    @@ -1,25 +1,25 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    Read 4K randomly from SSD* 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD* 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
    Read 1 MB sequentially from memory 250,000 ns 250 us
    Round trip within same datacenter 500,000 ns 500 us
    Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
    Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    Notes
    -----
    1 ns = 10^-9 seconds
    1 ms = 10^-3 seconds
    * Assuming ~1GB/sec SSD
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Credit
    ------
    @@ -28,7 +28,7 @@ Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Contributions
    -------------
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
    Some updates from: https://gist.github.com/2843375
    'Humanized' comparison: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Animated presentation: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
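
The extra us/ms columns introduced in this revision are straight unit conversions (1 us = 1,000 ns and 1 ms = 1,000,000 ns, per the Notes). A small Python helper, purely illustrative, that formats a nanosecond value the way the table does:

    # Format a latency in ns, adding us and ms columns once the value is large enough.
    def fmt(ns):
        parts = [f"{ns:,.0f} ns"]
        if ns >= 1_000:
            parts.append(f"{ns / 1_000:,.0f} us")
        if ns >= 1_000_000:
            parts.append(f"{ns / 1_000_000:,.0f} ms")
        return "   ".join(parts)

    print(fmt(150_000_000))   # 150,000,000 ns   150,000 us   150 ms
    print(fmt(10_000))        # 10,000 ns   10 us
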
  15. @jboner jboner revised this gist Dec 13, 2015. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions latency.txt
    @@ -17,8 +17,8 @@ Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    Notes
    -----
    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    1 ns = 10^-9 seconds
    1 ms = 10^-3 seconds
    * Assuming ~1GB/sec SSD

    Credit
  16. @jboner jboner revised this gist Jun 7, 2012. 1 changed file with 18 additions and 9 deletions.
    27 changes: 18 additions & 9 deletions latency.txt
    @@ -1,25 +1,34 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    Read 4K randomly from SSD 150,000 ns 0.15 ms
    Read 4K randomly from SSD* 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    Read 1 MB sequentially from SSD* 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    Notes
    -----
    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    Assuming ~1GB/sec SSD
    * Assuming ~1GB/sec SSD

    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
    Credit
    ------
    By Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Contributions
    -------------
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
  17. @jboner jboner revised this gist Jun 7, 2012. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion latency.txt
    Original file line number Diff line number Diff line change
    @@ -21,4 +21,5 @@ By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
  18. @jboner jboner revised this gist Jun 2, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -5,7 +5,7 @@ Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    SSD 4K random read 150,000 ns 0.15 ms
    Read 4K randomly from SSD 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
  19. @jboner jboner revised this gist Jun 2, 2012. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions latency.txt
    @@ -5,7 +5,7 @@ Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    SSD random read 150,000 ns
    SSD 4K random read 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    @@ -20,4 +20,5 @@ Assuming ~1GB/sec SSD
    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
  20. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion latency.txt
    @@ -5,6 +5,7 @@ Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    SSD random read 150,000 ns
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    @@ -19,4 +20,4 @@ Assuming ~1GB/sec SSD
    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
  21. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -10,7 +10,7 @@ Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
  22. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions latency.txt
    @@ -12,11 +12,11 @@ Disk seek 10,000,000 ns 10 ms 20x datacenter
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    With some updates from Brendan (http://brenocon.com/dean_perf.html)

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    Assuming ~1GB/sec SSD

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
  23. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions latency.txt
    @@ -13,8 +13,8 @@ Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    By Jeff Dean (http://research.google.com/people/jeff/)
    With some updates from Brendan: http://brenocon.com/dean_perf.html
    Comparisons from https://gist.github.com/2844130
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    With some updates from Brendan (http://brenocon.com/dean_perf.html)

    Assuming ~1GB/sec SSD

  24. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 21 additions and 13 deletions.
    34 changes: 21 additions & 13 deletions latency.txt
    @@ -1,14 +1,22 @@
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    By Jeff Dean (http://research.google.com/people/jeff/):
    By Jeff Dean (http://research.google.com/people/jeff/)
    With some updates from Brendan: http://brenocon.com/dean_perf.html
    Comparisons from https://gist.github.com/2844130

    Assuming ~1GB/sec SSD

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
  25. @jboner jboner revised this gist May 31, 2012. 1 changed file with 11 additions and 11 deletions.
    22 changes: 11 additions & 11 deletions latency.txt
    @@ -1,14 +1,14 @@
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns

    By Jeff Dean (http://research.google.com/people/jeff/):
  26. @jboner jboner revised this gist May 31, 2012. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions latency.txt
    @@ -1,5 +1,3 @@
    By Jeff Dean (http://research.google.com/people/jeff/):

    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    @@ -11,4 +9,6 @@ Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns

    By Jeff Dean (http://research.google.com/people/jeff/):
  27. @jboner jboner revised this gist May 31, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -1,4 +1,4 @@
    By Jeff Dean:
    By Jeff Dean (http://research.google.com/people/jeff/):

    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
  28. @jboner jboner revised this gist May 31, 2012. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions latency.txt
    @@ -1,3 +1,5 @@
    By Jeff Dean:

    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
  29. @jboner jboner revised this gist May 31, 2012. No changes.
  30. @jboner jboner created this gist May 31, 2012.
    12 changes: 12 additions & 0 deletions latency.txt
    @@ -0,0 +1,12 @@
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns