@vishbin
Forked from eshelman/latency.txt
Created December 9, 2018 08:55

Revisions

  1. Eliot Eshelman revised this gist Jul 2, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -68,7 +68,7 @@ Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz).

    "Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.

    GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depending upon the server platform design.
    GPU NVLink connections are not always 40GB. They range from 20GB to 150GB, depending upon the server platform design.


    Credit
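
A quick sanity check on the NVLink rows above (the GB figures are link bandwidths, i.e. GB/s): the ~30 us "Transfer 1MB" entry follows almost directly from the bandwidth. A minimal Python sketch, assuming (purely as an illustration, not a measurement) that roughly 80% of the peak rate is achievable:

    # Time to move 1 MB over an NVLink connection at the bandwidths quoted above.
    # The ~33 GB/s effective rate in the table corresponds to a 40 GB/s-class link.
    MB = 1e6  # the table uses decimal megabytes (1 MB / 30 us ~= 33 GB/s)

    for peak_gb_per_s in (20, 40, 80, 150):
        effective = peak_gb_per_s * 0.8          # assumed ~80% of peak
        seconds = MB / (effective * 1e9)
        print(f"{peak_gb_per_s:>4} GB/s link -> ~{seconds * 1e6:.0f} us per 1 MB")
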
  2. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions latency.txt
    @@ -30,6 +30,8 @@ Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/
    Random Disk Access (seek+rotation) 10,000,000 ns 10,000 us 10 ms
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    Total CPU pipeline length?


    NVIDIA Tesla GPU values
    -----------------------
    @@ -42,6 +44,7 @@ Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/s
    Floating-point add/mult operation?
    Shift operation?
    Atomic operation in GPU Global Memory?
    Total GPU pipeline length?
    Launch CUDA kernel (via dynamic parallelism)?


  3. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 9 additions and 2 deletions.
    11 changes: 9 additions & 2 deletions latency.txt
    @@ -9,7 +9,7 @@ L3 cache hit (shared line in another core) 25 ns 65 cycl
    Mutex lock/unlock 25 ns
    L3 cache hit (modified in another core) 29 ns 75 cycles
    L3 cache hit (on a remote CPU socket) 40 ns 100 ~ 300 cycles (40 ~ 116 ns)
    QPI hop to another CPU (time per hop) .   40 . ns
    QPI hop to another CPU (time per hop)   40 ns
    64MB main memory reference (local CPU) 46 ns TinyMemBench on "Broadwell" E5-2690v4
    64MB main memory reference (remote CPU) 70 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference (local CPU) 75 ns TinyMemBench on "Broadwell" E5-2690v4
    @@ -35,10 +35,15 @@ NVIDIA Tesla GPU values
    -----------------------
    GPU Shared Memory access 30 ns 30~90 cycles (bank conflicts will introduce more latency)
    GPU Global Memory access 200 ns 200~800 cycles, depending upon GPU generation and access patterns
    Launch CUDA kernel on GPU 10,000 ns 10 us
    Launch CUDA kernel on GPU 10,000 ns 10 us Host CPU instructs GPU to start executing a kernel
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link

    Floating-point add/mult operation?
    Shift operation?
    Atomic operation in GPU Global Memory?
    Launch CUDA kernel (via dynamic parallelism)?


    Intel Xeon CPU values
    ---------------------
    @@ -58,6 +63,8 @@ Notes
    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
    Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.

    "Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.

    GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depending upon the server platform design.
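
The cycle counts and nanosecond figures in this table hang together through the clock assumptions stated in the Notes (2.6 GHz for the Xeon CPU, 1 GHz for the Tesla GPU). A short Python sketch of that conversion, using a few rows from the table as examples:

    # Convert cycle counts to nanoseconds: one cycle takes 1/frequency seconds.
    CPU_GHZ = 2.6   # Xeon assumption from the Notes -> ~0.385 ns per cycle
    GPU_GHZ = 1.0   # Tesla assumption from the Notes -> 1 ns per cycle

    def cycles_to_ns(cycles, ghz):
        return cycles / ghz            # GHz = cycles per ns

    print(f"1 CPU cycle        : {cycles_to_ns(1, CPU_GHZ):.3f} ns")   # ~0.385
    print(f"L1 hit, 4 cycles   : {cycles_to_ns(4, CPU_GHZ):.1f} ns")   # ~1.5
    print(f"L3 hit, 42 cycles  : {cycles_to_ns(42, CPU_GHZ):.0f} ns")  # ~16
    print(f"GPU global, 200 cyc: {cycles_to_ns(200, GPU_GHZ):.0f} ns") # ~200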


  4. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 13 additions and 5 deletions.
    18 changes: 13 additions & 5 deletions latency.txt
    @@ -1,13 +1,19 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 1.5 ns 4 cycles
    L1 cache reference/hit 1.5 ns 4 cycles
    Floating-point add/mult/FMA operation 1.5 ns 4 cycles
    L2 cache reference 5 ns 12 ~ 17 cycles
    L2 cache reference/hit 5 ns 12 ~ 17 cycles
    Branch mispredict 6 ns 15 ~ 20 cycles
    L3 cache reference 16 ns 42 cycles
    L3 cache hit (unshared cache line) 16 ns 42 cycles
    L3 cache hit (shared line in another core) 25 ns 65 cycles
    Mutex lock/unlock 25 ns
    64MB main memory reference 46 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference 75 ns TinyMemBench on "Broadwell" E5-2690v4
    L3 cache hit (modified in another core) 29 ns 75 cycles
    L3 cache hit (on a remote CPU socket) 40 ns 100 ~ 300 cycles (40 ~ 116 ns)
    QPI hop to another CPU (time per hop) .   40 . ns
    64MB main memory reference (local CPU) 46 ns TinyMemBench on "Broadwell" E5-2690v4
    64MB main memory reference (remote CPU) 70 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference (local CPU) 75 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference (remote CPU) 120 ns TinyMemBench on "Broadwell" E5-2690v4
    Send 4KB over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    @@ -65,6 +71,8 @@ Additional Data Gathered/Correlated from:
    -----------------------------------------
    Memory latency tool: https://github.com/ssvb/tinymembench
    CPU data from Agner Fog: http://www.agner.org/optimize/
    CPU cache and QPI data: https://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html
    Intel performance analysis: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
    Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
    Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
    MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
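
For the fabric and Ethernet rows, the wire (serialization) time of a 4 KB message is only a fraction of the measured figure; the remainder is software, NIC and switch overhead. A rough Python sketch of the wire time alone, assuming the nominal link rates:

    # Pure serialization time for a payload on a link, ignoring all overheads.
    def wire_time_ns(payload_bytes, link_gbps):
        return payload_bytes * 8 / (link_gbps * 1e9) * 1e9

    print(f"4 KB at 100 Gbps: ~{wire_time_ns(4096, 100):.0f} ns")  # ~328 ns of the ~1,040 ns measured
    print(f"4 KB at  10 Gbps: ~{wire_time_ns(4096, 10):.0f} ns")   # ~3,277 ns of the ~10,000 ns listed
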
  5. Eliot Eshelman revised this gist Dec 30, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -21,7 +21,7 @@ Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S351
    Round trip within same datacenter 500,000 ns 500 us One-way ping across Ethernet is ~250us
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms
    Random Disk Access (seek+rotation) 10,000,000 ns 10,000 us 10 ms
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms


  6. Eliot Eshelman revised this gist Dec 29, 2016. 1 changed file with 8 additions and 4 deletions.
    12 changes: 8 additions & 4 deletions latency.txt
    @@ -34,10 +34,13 @@ Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/s
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link


    Other useful values
    ------------------
    Intel Xeon CPU values
    ---------------------
    Wake up from C1 state 500 ns varies from <0.5us to 2us
    Wake up from C3 state 15,000 ns 15 us varies from 10us to 50us
    Wake up from C6 state 30,000 ns 30 us varies from 20us to 60us

    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?


    Notes
    @@ -69,4 +72,5 @@ NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/do
    SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
    GPU optimization: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
    CPU/GPU data locality: https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
    GPU Memory Hierarchy: https://arxiv.org/pdf/1509.02308&ved...qHEz78QnmcIVCSXvg&sig2=IdzxfrzQgNv8yq7e1mkeVg
    Intel Xeon C-state data: http://ena-hpc.org/2014/pdf/paper_06.pdf
  7. Eliot Eshelman revised this gist Dec 28, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -55,7 +55,7 @@ GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depend
    Credit
    ------
    Adapted from: https://gist.github.com/jboner/2841832
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Original curator: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Additional Data Gathered/Correlated from:
  8. Eliot Eshelman revised this gist Dec 28, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -13,7 +13,7 @@ Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    Write 4KB randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 gen 3.0 link
    Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
  9. Eliot Eshelman revised this gist Dec 28, 2016. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions latency.txt
    @@ -18,7 +18,7 @@ Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P360
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Round trip within same datacenter 500,000 ns 500 us One-way ping across Ethernet is ~250us
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms
    @@ -54,11 +54,12 @@ GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depend

    Credit
    ------
    Adapted from: https://gist.github.com/jboner/2841832
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Additional Data Gathered/Correlated from:
    ---------------------------------------
    -----------------------------------------
    Memory latency tool: https://github.com/ssvb/tinymembench
    CPU data from Agner Fog: http://www.agner.org/optimize/
    Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
  10. Eliot Eshelman revised this gist Dec 27, 2016. 1 changed file with 23 additions and 2 deletions.
    25 changes: 23 additions & 2 deletions latency.txt
    @@ -12,27 +12,45 @@ Send 4KB over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH
    Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    Write 4KB randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link
    Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms


    NVIDIA Tesla GPU values
    -----------------------
    GPU Shared Memory access 30 ns 30~90 cycles (bank conflicts will introduce more latency)
    GPU Global Memory access 200 ns 200~800 cycles, depending upon GPU generation and access patterns
    Launch CUDA kernel on GPU 10,000 ns 10 us
    Transfer 1MB to/from NVLink GPU 30,000 ns 30 us ~33GB/sec on NVIDIA 40GB NVLink
    Transfer 1MB to/from PCI-E GPU 80,000 ns 80 us ~12GB/sec on PCI-Express x16 link


    Other useful values
    ------------------
    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?


    Notes
    -----
    1 ns = 10^-9 seconds
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
    Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.

    GPU NVLink connections are not always 40GB. They range from 20GB to 80GB, depending upon the server platform design.


    Credit
    ------
    @@ -47,4 +65,7 @@ Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_B
    Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
    MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
    NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
    SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
    GPU optimization: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
    CPU/GPU data locality: https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
    GPU Memory Hierarchy: https://arxiv.org/pdf/1509.02308&ved...qHEz78QnmcIVCSXvg&sig2=IdzxfrzQgNv8yq7e1mkeVg
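
The sequential-read rows above are essentially size divided by throughput. A minimal Python sketch reproducing them from the drive throughputs quoted in the table:

    # Time to read 1 MB sequentially, given the sustained throughput in the table.
    drives = {
        "NVMe SSD  (DC P3608, ~4.8 GB/s)": 4.8e9,
        "SATA SSD  (DC S3510, ~550 MB/s)": 550e6,
        "Server hard disk    (~200 MB/s)": 200e6,
    }
    for name, bytes_per_sec in drives.items():
        microseconds = 1e6 / bytes_per_sec * 1e6
        print(f"{name}: ~{microseconds:,.0f} us per 1 MB")
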
  11. Eliot Eshelman revised this gist Dec 27, 2016. 1 changed file with 12 additions and 15 deletions.
    27 changes: 12 additions & 15 deletions latency.txt
    @@ -8,21 +8,21 @@ L3 cache reference 16 ns 42 cycl
    Mutex lock/unlock 25 ns
    64MB main memory reference 46 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference 75 ns TinyMemBench on "Broadwell" E5-2690v4
    Send 4K bytes over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1K bytes with Google Snappy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Write 4K randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Read 4K randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1 MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4K randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4K randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Send 4KB over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1KB with Google Snappy 3,000 ns 3 us
    Send 4KB over 10 Gbps ethernet 10,000 ns 10 us
    Write 4KB randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Read 4KB randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4KB randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4KB randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Read 1 MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Read 1MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Read 1MB sequentially from disk 5,000,000 ns 5,000 us 5 ms ~200MB/sec server hard disk (seek time would be additional latency)
    Disk Access (seek + rotation time) 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    More useful values
    Other useful values
    ------------------
    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?
    @@ -32,9 +32,6 @@ Notes
    1 ns = 10^-9 seconds
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Details
    -------
    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.

    Credit
  12. Eliot Eshelman revised this gist Dec 27, 2016. 1 changed file with 23 additions and 9 deletions.
    32 changes: 23 additions & 9 deletions latency.txt
    @@ -1,18 +1,23 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.4 ns 1 cycle
    L1 cache reference 1.5 ns 4 cycles
    Floating-point add/mult/FMA operation 1.5 ns 4 cycles
    Branch mispredict 5 ns 15 ~ 20 cycles
    L2 cache reference 7 ns 14x L1 cache
    L2 cache reference 5 ns 12 ~ 17 cycles
    Branch mispredict 6 ns 15 ~ 20 cycles
    L3 cache reference 16 ns 42 cycles
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Send 1K bytes over 100 Gbps HPC fabric 1,100 ns 1 us MVAPICH2 over Intel Omni-Path
    64MB main memory reference 46 ns TinyMemBench on "Broadwell" E5-2690v4
    256MB main memory reference 75 ns TinyMemBench on "Broadwell" E5-2690v4
    Send 4K bytes over 100 Gbps HPC fabric 1,040 ns 1 us MVAPICH2 over Intel Omni-Path / Mellanox EDR
    Compress 1K bytes with Google Snappy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
    Read 1 MB sequentially from memory 250,000 ns 250 us
    Write 4K randomly to NVMe SSD 30,000 ns 30 us DC P3608 NVMe SSD (best case; QOS 99% is 500us)
    Read 4K randomly from NVMe SSD 120,000 ns 120 us DC P3608 NVMe SSD (QOS 99%)
    Read 1 MB sequentially from NVMe SSD 208,000 ns 208 us ~4.8GB/sec DC P3608 NVMe SSD
    Write 4K randomly to SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Read 4K randomly from SATA SSD 500,000 ns 500 us DC S3510 SATA SSD (QOS 99.9%)
    Round trip within same datacenter 500,000 ns 500 us
    Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
    Read 1 MB sequentially from SATA SSD 1,818,000 ns 1,818 us 2 ms ~550MB/sec DC S3510 SATA SSD
    Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
    @@ -36,4 +41,13 @@ Credit
    ------
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers
    Much data from Agner Fog http://www.agner.org/optimize/

    Additional Data Gathered/Correlated from:
    ---------------------------------------
    Memory latency tool: https://github.com/ssvb/tinymembench
    CPU data from Agner Fog: http://www.agner.org/optimize/
    Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
    Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
    MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
    NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
    SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
  13. Eliot Eshelman revised this gist Dec 23, 2016. 1 changed file with 16 additions and 11 deletions.
    27 changes: 16 additions & 11 deletions latency.txt
    @@ -1,11 +1,13 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L1 cache reference 0.4 ns 1 cycle
    Floating-point add/mult/FMA operation 1.5 ns 4 cycles
    Branch mispredict 5 ns 15 ~ 20 cycles
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns 3 us
    Send 1K bytes over 100 Gbps HPC fabric 1,100 ns 1 us MVAPICH2 over Intel Omni-Path
    Compress 1K bytes with Google Snappy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
    Read 1 MB sequentially from memory 250,000 ns 250 us
    @@ -15,20 +17,23 @@ Disk seek 10,000,000 ns 10,000 us 10 ms 20x dat
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    More useful values
    ------------------
    Warm up Intel SkyLake AVX units 14,000 ns 14 us AVX units go to sleep after ~675 us
    Timings of C-state changes?

    Notes
    -----
    1 ns = 10^-9 seconds
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Details
    -------
    Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.

    Credit
    ------
    By Jeff Dean: http://research.google.com/people/jeff/
    Curated by Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Contributions
    -------------
    Some updates from: https://gist.github.com/2843375
    'Humanized' comparison: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Animated presentation: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
    Much data from Agner Fog http://www.agner.org/optimize/
  14. @jboner jboner revised this gist Jan 15, 2016. 1 changed file with 20 additions and 20 deletions.
    40 changes: 20 additions & 20 deletions latency.txt
    @@ -1,25 +1,25 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    Read 4K randomly from SSD* 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD* 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns 3 us
    Send 1K bytes over 1 Gbps network 10,000 ns 10 us
    Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
    Read 1 MB sequentially from memory 250,000 ns 250 us
    Round trip within same datacenter 500,000 ns 500 us
    Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
    Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

    Notes
    -----
    1 ns = 10^-9 seconds
    1 ms = 10^-3 seconds
    * Assuming ~1GB/sec SSD
    1 us = 10^-6 seconds = 1,000 ns
    1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

    Credit
    ------
    @@ -28,7 +28,7 @@ Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Contributions
    -------------
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
    Some updates from: https://gist.github.com/2843375
    'Humanized' comparison: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Animated presentation: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
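
The extra us/ms columns introduced in this revision are straight unit conversions (1 us = 1,000 ns and 1 ms = 1,000,000 ns, per the Notes). A small Python helper, purely illustrative, that formats a nanosecond value the way the table does:

    # Format a latency in ns, adding us and ms columns once the value is large enough.
    def fmt(ns):
        parts = [f"{ns:,.0f} ns"]
        if ns >= 1_000:
            parts.append(f"{ns / 1_000:,.0f} us")
        if ns >= 1_000_000:
            parts.append(f"{ns / 1_000_000:,.0f} ms")
        return "   ".join(parts)

    print(fmt(150_000_000))   # 150,000,000 ns   150,000 us   150 ms
    print(fmt(10_000))        # 10,000 ns   10 us
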
  15. @jboner jboner revised this gist Dec 13, 2015. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions latency.txt
    @@ -17,8 +17,8 @@ Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    Notes
    -----
    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    1 ns = 10^-9 seconds
    1 ms = 10^-3 seconds
    * Assuming ~1GB/sec SSD

    Credit
  16. @jboner jboner revised this gist Jun 7, 2012. 1 changed file with 18 additions and 9 deletions.
    27 changes: 18 additions & 9 deletions latency.txt
    @@ -1,25 +1,34 @@
    Latency Comparison Numbers
    --------------------------
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    Read 4K randomly from SSD 150,000 ns 0.15 ms
    Read 4K randomly from SSD* 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    Read 1 MB sequentially from SSD* 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    Notes
    -----
    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    Assuming ~1GB/sec SSD
    * Assuming ~1GB/sec SSD

    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
    Credit
    ------
    By Jeff Dean: http://research.google.com/people/jeff/
    Originally by Peter Norvig: http://norvig.com/21-days.html#answers

    Contributions
    -------------
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
  17. @jboner jboner revised this gist Jun 7, 2012. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion latency.txt
    Original file line number Diff line number Diff line change
    @@ -21,4 +21,5 @@ By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
    Nice animated presentation of the data: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
  18. @jboner jboner revised this gist Jun 2, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -5,7 +5,7 @@ Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    SSD 4K random read 150,000 ns 0.15 ms
    Read 4K randomly from SSD 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
  19. @jboner jboner revised this gist Jun 2, 2012. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions latency.txt
    @@ -5,7 +5,7 @@ Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    SSD random read 150,000 ns
    SSD 4K random read 150,000 ns 0.15 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    @@ -20,4 +20,5 @@ Assuming ~1GB/sec SSD
    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
    Visual comparison chart: http://i.imgur.com/k0t1e.png
  20. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion latency.txt
    @@ -5,6 +5,7 @@ Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    SSD random read 150,000 ns
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    @@ -19,4 +20,4 @@ Assuming ~1GB/sec SSD
    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
  21. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -10,7 +10,7 @@ Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
  22. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions latency.txt
    @@ -12,11 +12,11 @@ Disk seek 10,000,000 ns 10 ms 20x datacenter
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    With some updates from Brendan (http://brenocon.com/dean_perf.html)

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    Assuming ~1GB/sec SSD

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
    By Jeff Dean (http://research.google.com/people/jeff/)
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    Some updates from: https://gist.github.com/2843375
    Great 'humanized' comparison version: https://gist.github.com/2843375
  23. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions latency.txt
    @@ -13,8 +13,8 @@ Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    By Jeff Dean (http://research.google.com/people/jeff/)
    With some updates from Brendan: http://brenocon.com/dean_perf.html
    Comparisons from https://gist.github.com/2844130
    Originally by Peter Norvig (http://norvig.com/21-days.html#answers)
    With some updates from Brendan (http://brenocon.com/dean_perf.html)

    Assuming ~1GB/sec SSD

  24. @jboner jboner revised this gist Jun 1, 2012. 1 changed file with 21 additions and 13 deletions.
    34 changes: 21 additions & 13 deletions latency.txt
    @@ -1,14 +1,22 @@
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns 14x L1 cache
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns 20x L2 cache, 200x L1 cache
    Compress 1K bytes with Zippy 3,000 ns
    Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
    Read 1 MB sequentially from memory 250,000 ns 0.25 ms
    Round trip within same datacenter 500,000 ns 0.5 ms
    Read 1 MB sequentially from SSD 1,000,000 ns 1 ms 4X memory
    Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip
    Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD
    Send packet CA->Netherlands->CA 150,000,000 ns 150 ms

    By Jeff Dean (http://research.google.com/people/jeff/):
    By Jeff Dean (http://research.google.com/people/jeff/)
    With some updates from Brendan: http://brenocon.com/dean_perf.html
    Comparisons from https://gist.github.com/2844130

    Assuming ~1GB/sec SSD

    1 ns = 10-9 seconds
    1 ms = 10-3 seconds
  25. @jboner jboner revised this gist May 31, 2012. 1 changed file with 11 additions and 11 deletions.
    22 changes: 11 additions & 11 deletions latency.txt
    @@ -1,14 +1,14 @@
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns

    By Jeff Dean (http://research.google.com/people/jeff/):
  26. @jboner jboner revised this gist May 31, 2012. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions latency.txt
    @@ -1,5 +1,3 @@
    By Jeff Dean (http://research.google.com/people/jeff/):

    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    @@ -11,4 +9,6 @@ Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns

    By Jeff Dean (http://research.google.com/people/jeff/):
  27. @jboner jboner revised this gist May 31, 2012. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion latency.txt
    @@ -1,4 +1,4 @@
    By Jeff Dean:
    By Jeff Dean (http://research.google.com/people/jeff/):

    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
  28. @jboner jboner revised this gist May 31, 2012. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions latency.txt
    @@ -1,3 +1,5 @@
    By Jeff Dean:

    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
  29. @jboner jboner revised this gist May 31, 2012. No changes.
  30. @jboner jboner created this gist May 31, 2012.
    12 changes: 12 additions & 0 deletions latency.txt
    @@ -0,0 +1,12 @@
    L1 cache reference 0.5 ns
    Branch mispredict 5 ns
    L2 cache reference 7 ns
    Mutex lock/unlock 25 ns
    Main memory reference 100 ns
    Compress 1K bytes with Zippy 3,000 ns
    Send 2K bytes over 1 Gbps network 20,000 ns
    Read 1 MB sequentially from memory 250,000 ns
    Round trip within same datacenter 500,000 ns
    Disk seek 10,000,000 ns
    Read 1 MB sequentially from disk 20,000,000 ns
    Send packet CA->Netherlands->CA 150,000,000 ns