@iakashpaul
Last active February 22, 2024 08:37
Revisions

  1. iakashpaul revised this gist Feb 22, 2024. 1 changed file with 9 additions and 9 deletions.
    18 changes: 9 additions & 9 deletions benchmark.md
    @@ -7,18 +7,18 @@ DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${D
    ```

    -| Device | Type | FP32-TFLOPS | BW | BF16 | F16 | INT8 |
    +| Device | Type | FP32 (TFLOPS) | BW | F16 | BF16 | INT8 |
     |-|-|-|-|-|-|-|
    -| Apple M1 Pro CPU 10-core | CPU | 0.33 | 96 | | | |
    -| Apple M1 Pro GPU 16-core | GPU | 3.74 | 176 | | | |
    +| Apple M1 Pro CPU 10-core | CPU | 0.33 | 96 | | | 0.008 |
    +| Apple M1 Pro GPU 16-core | GPU | 3.74 | 176 | 4.3 | | |
     | Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | |
    -| Intel Xeon 6330 56-core | CPU | 5.7 | 81 | | | |
    -| Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | | | |
    -| AMD Ryzen 5 3600 6-core | CPU | | | | | |
    -| Nvidia A100 80GB | GPU | 18.9 | 1490 | 33| | |
    +| Intel Xeon 6330 56-core | CPU | 5.7 | 81 | NA | 0.75 | 0.02 |
    +| Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | NA | 0.61 | 0.014 |
    +| AMD Ryzen 5 3600 6-core | CPU | 0.36 | 14 | | | |
    +| Nvidia A100 80GB | GPU | 19 | 1490 | 32| 33 | NA | * revise these with idle card
     | Nvidia A10 24GB | GPU | 14.48 | 469 | | | |
    -| Nvidia V100 32GB | GPU | 13 | 766 | | | |
    -| Nvidia RTX 2070S 8GB | GPU | | | | | |
    +| Nvidia V100 32GB | GPU | 13 | 766 | 84 | 9.4 | NA |
    +| Nvidia RTX 2070S 8GB | GPU | 8 | 376 | 37 | 5 | NA |

    ## Ryzen 5 3600
    ```
  2. iakashpaul revised this gist Feb 20, 2024. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions benchmark.md
    @@ -472,10 +472,10 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.025612187385559083, 14.822047421613195
    0.268435456, 0.027147817611694335, 19.775840536394895
    0.37962506, 0.0396291971206665, 19.15885698335416
    ```

    ## XEON 6230 INT8
    ```
    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.0011035919189453125, 0.030404746015236777
  3. iakashpaul revised this gist Feb 20, 2024. 1 changed file with 9 additions and 9 deletions.
    18 changes: 9 additions & 9 deletions benchmark.md
    @@ -21,7 +21,7 @@ DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${D
    | Nvidia RTX 2070S 8GB | GPU | | | | | |

    ## Ryzen 5 3600

    ```
    benchmarking cpu using torch.float32
    size, elapsed_time, tops
    256, 0.0007100820541381836, 0.04725430223796394
    @@ -59,7 +59,7 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.02665371894836426, 14.242855067821507
    0.268435456, 0.03752543926239014, 14.306852166233767
    0.37962506, 0.053798246383666995, 14.1129157739704

    ```

    ## A100 float16
    ```
    @@ -232,7 +232,7 @@ size (GB), elapsed_time, bandwidth (GB/s)
    Need to revise torch & cuda versions

    ## RTX 2070S F32

    ```
    benchmarking cuda using torch.float32
    size, elapsed_time, tops
    256, 0.014125776290893555, 0.00237540445983358
    @@ -270,9 +270,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.0009937524795532227, 382.0116817929089
    0.268435456, 0.0014146089553833008, 379.51895466018027
    0.37962506, 0.002018284797668457, 376.1858192050465

    ## RTYX 2070S float16

    ```
    ## RTX 2070S float16
    ```
    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.005084848403930664, 0.006598905087133359
    @@ -310,9 +310,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.001004338264465332, 377.9852559954953
    0.268435456, 0.0014100074768066406, 380.75749301407643
    0.37962506, 0.0019683837890625, 385.7226035993798

    ```
    ## RTX 2070S bfloat16

    ```
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.0027062654495239257, 0.012398795545316055
    @@ -350,7 +350,7 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.001009511947631836, 376.0481061076529
    0.268435456, 0.001414942741394043, 379.4294258657132
    0.37962506, 0.0020305871963500976, 373.9066814588031

    ```


    ## XEON 6330 bfloat16
  4. iakashpaul revised this gist Feb 20, 2024. 1 changed file with 165 additions and 0 deletions.
    165 changes: 165 additions & 0 deletions benchmark.md
    @@ -14,9 +14,52 @@ DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${D
    | Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | |
    | Intel Xeon 6330 56-core | CPU | 5.7 | 81 | | | |
    | Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | | | |
    | AMD Ryzen 5 3600 6-core | CPU | | | | | |
    | Nvidia A100 80GB | GPU | 18.9 | 1490 | 33| | |
    | Nvidia A10 24GB | GPU | 14.48 | 469 | | | |
    | Nvidia V100 32GB | GPU | 13 | 766 | | | |
    | Nvidia RTX 2070S 8GB | GPU | | | | | |

    ## Ryzen 5 3600

    benchmarking cpu using torch.float32
    size, elapsed_time, tops
    256, 0.0007100820541381836, 0.04725430223796394
    304, 0.00020635128021240234, 0.27229745287823454
    362, 0.0003533363342285156, 0.268514293066278
    430, 0.0005476951599121093, 0.2903330385930698
    512, 0.0006838560104370118, 0.3925321294294962
    608, 0.001193690299987793, 0.37657290505300817
    724, 0.0021503925323486327, 0.3529619995336488
    861, 0.003527235984802246, 0.36191362514452513
    1024, 0.00509192943572998, 0.421742617431252
    1217, 0.008618521690368652, 0.418281783757488
    1448, 0.013982748985290528, 0.43425329242394584
    1722, 0.02307753562927246, 0.4425272377457036
    2048, 0.0389744758605957, 0.44079795313859077
    2435, 0.0737607479095459, 0.3914727896388784
    2896, 0.14375429153442382, 0.3379129607436293
    3444, 0.24540278911590577, 0.3329200334777477
    4096, 0.37633934020996096, 0.3651995387867831
    4870, 0.6344788789749145, 0.36408242047901673
    5792, 0.9949984312057495, 0.3905649436100884
    6888, 1.7714987516403198, 0.3689508883586864
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 5.555152893066406e-05, 151.00588879327037
    0.00593164, 8.046627044677734e-05, 147.43171187294814
    0.008388608, 0.00013594627380371095, 123.41063517654155
    0.01186328, 0.0003290891647338867, 72.09766392395856
    0.016777216, 0.0010283470153808593, 32.629483528546785
    0.023726564, 0.0030118942260742186, 15.755243855907796
    0.033554432, 0.004802894592285156, 13.97258730345578
    0.047453132, 0.006703615188598633, 14.157474934034783
    0.067108864, 0.009397172927856445, 14.282777281040833
    0.094906264, 0.013365435600280761, 14.201746480751643
    0.134217728, 0.018894267082214356, 14.207243648666582
    0.189812528, 0.02665371894836426, 14.242855067821507
    0.268435456, 0.03752543926239014, 14.306852166233767
    0.37962506, 0.053798246383666995, 14.1129157739704
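The `tops` and `bandwidth` columns in these logs can be reproduced from the other two columns. The sketch below is inferred from the logged numbers, not taken from `benchmark.py`: it assumes `tops` counts 2·n³ operations for an n×n matrix multiply, and that the bandwidth figure counts each buffer twice (one read plus one write). The function names are hypothetical.

```python
def matmul_tops(n: int, elapsed_s: float) -> float:
    """Tera-operations per second for an n x n matmul, assuming 2 * n^3 operations."""
    return 2 * n**3 / elapsed_s / 1e12


def copy_bandwidth_gbs(size_gb: float, elapsed_s: float) -> float:
    """GB/s, assuming the buffer is read once and written once."""
    return 2 * size_gb / elapsed_s


# First row of each section of the Ryzen 5 3600 float32 log above.
print(matmul_tops(256, 0.0007100820541381836))
print(copy_bandwidth_gbs(0.004194304, 5.555152893066406e-05))
```

Both printed values agree with the logged `tops` and `bandwidth` entries for those rows, which is what suggests the two formulas.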


    ## A100 float16
    ```
    @@ -188,6 +231,128 @@ size (GB), elapsed_time, bandwidth (GB/s)

    Need to revise torch & cuda versions

    ## RTX 2070S F32

    benchmarking cuda using torch.float32
    size, elapsed_time, tops
    256, 0.014125776290893555, 0.00237540445983358
    304, 5.047321319580078e-05, 1.1132425388101652
    362, 5.1856040954589844e-05, 1.8296008382722941
    430, 6.949901580810547e-05, 2.2880036235197254
    512, 7.755756378173828e-05, 3.4611125325626313
    608, 0.00010981559753417969, 4.093329491378411
    724, 0.00015544891357421875, 4.882677083732809
    861, 0.00023860931396484374, 5.349978761466475
    1024, 0.00034000873565673826, 6.315966099671126
    1217, 0.0005458593368530273, 6.604211712825641
    1448, 0.0008722305297851563, 6.961525166397971
    1722, 0.0014461994171142579, 7.061569777408616
    2048, 0.0022562503814697265, 7.614345165366353
    2435, 0.004026150703430176, 7.171943595007254
    2896, 0.006442856788635254, 7.539580634119545
    3444, 0.009186863899230957, 8.893078820383868
    4096, 0.015436434745788574, 8.903542543040686
    4870, 0.02744767665863037, 8.416107813896367
    5792, 0.04533388614654541, 8.572208103222879
    6888, 0.08088059425354004, 8.080999455755025
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 3.304481506347656e-05, 253.85549847642136
    0.00593164, 4.2486190795898435e-05, 279.2267270320988
    0.008388608, 5.2881240844726565e-05, 317.26214687855725
    0.01186328, 7.462501525878906e-05, 317.9437875854313
    0.016777216, 9.911060333251953e-05, 338.55542062864566
    0.023726564, 0.00013742446899414062, 345.30333897104794
    0.033554432, 0.000186920166015625, 359.0242049880816
    0.047453132, 0.0002597332000732422, 365.39904784308425
    0.067108864, 0.00036420822143554685, 368.51921538446715
    0.094906264, 0.0005475997924804688, 346.6263694882062
    0.134217728, 0.0007352352142333985, 365.10146794300016
    0.189812528, 0.0009937524795532227, 382.0116817929089
    0.268435456, 0.0014146089553833008, 379.51895466018027
    0.37962506, 0.002018284797668457, 376.1858192050465

    ## RTYX 2070S float16

    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.005084848403930664, 0.006598905087133359
    304, 2.4533271789550783e-05, 2.290315310652206
    362, 0.0006063461303710937, 0.15647144633698648
    430, 0.00015423297882080078, 1.0309986957118566
    512, 3.454685211181641e-05, 7.77018569249568
    608, 5.1641464233398436e-05, 8.704467053226667
    724, 6.458759307861328e-05, 11.751588994439986
    861, 0.000567626953125, 2.2489326043664515
    1024, 9.047985076904297e-05, 23.734385387986805
    1217, 0.0002730607986450195, 13.202080430030826
    1448, 0.00027284622192382815, 22.254494642389318
    1722, 0.0004123687744140625, 24.765304091006698
    2048, 0.00050201416015625, 34.22188166694906
    2435, 0.0011893272399902343, 24.27870545556251
    2896, 0.0013867616653442383, 35.02868552394098
    3444, 0.002403569221496582, 33.990909867422005
    4096, 0.0035764694213867186, 38.428667291306034
    4870, 0.006640505790710449, 34.78689926348
    5792, 0.010618138313293456, 36.59883632232168
    6888, 0.017475819587707518, 37.40002206269852
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 3.1566619873046876e-05, 265.74299160749246
    0.00593164, 4.220008850097656e-05, 281.1197895882486
    0.008388608, 5.3262710571289064e-05, 314.98990231720677
    0.01186328, 7.414817810058594e-05, 319.9884421679743
    0.016777216, 9.582042694091796e-05, 350.1803641585668
    0.023726564, 0.00013625621795654297, 348.26394502696763
    0.033554432, 0.0001840829849243164, 364.55766961618446
    0.047453132, 0.00025680065155029295, 369.5717414541417
    0.067108864, 0.00035834312438964844, 374.5508672131151
    0.094906264, 0.0005070447921752929, 374.3506114828194
    0.134217728, 0.000709366798400879, 378.4155906438423
    0.189812528, 0.001004338264465332, 377.9852559954953
    0.268435456, 0.0014100074768066406, 380.75749301407643
    0.37962506, 0.0019683837890625, 385.7226035993798

    ## RTX 2070S bfloat16

    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.0027062654495239257, 0.012398795545316055
    304, 4.4178962707519534e-05, 1.2718480597199782
    362, 5.137920379638672e-05, 1.8465808924557958
    430, 6.778240203857422e-05, 2.34594814018994
    512, 8.475780487060547e-05, 3.167088345548872
    608, 0.00012900829315185547, 3.484360679595077
    724, 0.00019524097442626953, 3.8875387209595704
    861, 0.0002892017364501953, 4.4140632683228755
    1024, 0.00046432018280029297, 4.6250060358105225
    1217, 0.0008016824722290039, 4.496756198219868
    1448, 0.0012295007705688476, 4.93863438669556
    1722, 0.002044367790222168, 4.995401583239668
    2048, 0.0033872127532958984, 5.071978182440201
    2435, 0.005963873863220215, 4.841706315768501
    2896, 0.009510970115661621, 5.107411513364935
    3444, 0.01592259407043457, 5.1310423670035945
    4096, 0.027903199195861816, 4.9255625674773125
    4870, 0.04716935157775879, 4.8973029790157625
    5792, 0.08068933486938476, 4.816144621901542
    6888, 0.13136739730834962, 4.975329126829385
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 3.173351287841797e-05, 264.3453951076784
    0.00593164, 4.298686981201172e-05, 275.9745022580144
    0.008388608, 5.3429603576660155e-05, 314.0059981154128
    0.01186328, 7.274150848388672e-05, 326.1763537012127
    0.016777216, 9.78231430053711e-05, 343.0111829279259
    0.023726564, 0.00013377666473388672, 354.7190243858706
    0.033554432, 0.0001855134963989258, 361.7465322075003
    0.047453132, 0.0002609729766845703, 363.66318538302215
    0.067108864, 0.00036041736602783204, 372.3952857189337
    0.094906264, 0.000502777099609375, 377.5281892263429
    0.134217728, 0.0007108211517333985, 377.641345288329
    0.189812528, 0.001009511947631836, 376.0481061076529
    0.268435456, 0.001414942741394043, 379.4294258657132
    0.37962506, 0.0020305871963500976, 373.9066814588031



    ## XEON 6330 bfloat16
    ```
    benchmarking cpu using torch.bfloat16
  5. iakashpaul revised this gist Feb 19, 2024. 1 changed file with 7 additions and 0 deletions.
    7 changes: 7 additions & 0 deletions benchmark.md
    @@ -1,4 +1,11 @@
    # Runs for dtypes
    ```
    DEVICE=cuda && DTYPE=float32 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    DEVICE=cuda && DTYPE=float16 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    DEVICE=cuda && DTYPE=bfloat16 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    ```
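The four commands differ only in `DTYPE`, so the sweep can be scripted. A minimal sketch, assuming `benchmark.py` accepts the `--device` and `--dtype` flags exactly as shown above; the helper name and the `tag` parameter (the `H100` prefix in the log names) are hypothetical.

```python
import shlex

DTYPES = ["float32", "float16", "bfloat16", "int8"]


def benchmark_commands(device: str, tag: str) -> list[tuple[list[str], str]]:
    """Return (argv, log_path) pairs, one per dtype, mirroring the commands above."""
    runs = []
    for dtype in DTYPES:
        argv = ["python", "benchmark.py", "--device", device, "--dtype", dtype]
        runs.append((argv, f"{tag}_{device}_{dtype}.log"))
    return runs


for argv, log_path in benchmark_commands("cuda", "H100"):
    # To execute for real: subprocess.run(argv, stdout=open(log_path, "w"), check=True)
    print(shlex.join(argv), ">", log_path)
```

The loop only prints the commands; running them is left as a comment so the sketch has no side effects.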

    | Device | Type | FP32-TFLOPS | BW | BF16 | F16 | INT8 |
    |-|-|-|-|-|-|-|
  6. iakashpaul revised this gist Feb 19, 2024. 1 changed file with 11 additions and 0 deletions.
    11 changes: 11 additions & 0 deletions benchmark.md
    @@ -1,5 +1,16 @@
    # Runs for dtypes

    | Device | Type | FP32-TFLOPS | BW | BF16 | F16 | INT8 |
    |-|-|-|-|-|-|-|
    | Apple M1 Pro CPU 10-core | CPU | 0.33 | 96 | | | |
    | Apple M1 Pro GPU 16-core | GPU | 3.74 | 176 | | | |
    | Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | |
    | Intel Xeon 6330 56-core | CPU | 5.7 | 81 | | | |
    | Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | | | |
    | Nvidia A100 80GB | GPU | 18.9 | 1490 | 33| | |
    | Nvidia A10 24GB | GPU | 14.48 | 469 | | | |
    | Nvidia V100 32GB | GPU | 13 | 766 | | | |

    ## A100 float16
    ```
    benchmarking cuda using torch.float16
  7. iakashpaul revised this gist Feb 19, 2024. 1 changed file with 39 additions and 2 deletions.
    41 changes: 39 additions & 2 deletions benchmark.md
    @@ -86,8 +86,45 @@ size (GB), elapsed_time, bandwidth (GB/s)
    Need to revise torch & cuda versions

    ## V100 float16

    Need to revise torch & cuda versions
    ```
    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.005288243293762207, 0.006345099901053988
    304, 4.7779083251953126e-05, 1.1760151969366865
    362, 5.8317184448242186e-05, 1.6268936317425347
    430, 5.1641464233398436e-05, 3.079192318818098
    512, 6.201267242431641e-05, 4.328719365023544
    608, 4.842281341552734e-05, 9.283050535346607
    724, 0.0006131649017333985, 1.2378510998498296
    861, 0.00011780261993408204, 10.836386853826447
    1024, 7.870197296142579e-05, 27.286274628115695
    1217, 0.00018236637115478515, 19.767737895822073
    1448, 0.0001292705535888672, 46.97167773653695
    1722, 0.0002488374710083008, 41.04059591434817
    2048, 0.0002832174301147461, 60.65964646681365
    2435, 0.0006921768188476562, 41.71668996091485
    2896, 0.0007654905319213867, 63.457921746037535
    3444, 0.0011915206909179688, 68.5674242929489
    4096, 0.0020316600799560546, 67.64859674506812
    4870, 0.0030206918716430666, 76.4734093432539
    5792, 0.004576373100280762, 84.91691950382248
    6888, 0.00769500732421875, 84.93767589888
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 2.5534629821777342e-05, 328.51888038125117
    0.00593164, 3.478527069091797e-05, 341.04319915777927
    0.008388608, 3.7288665771484376e-05, 449.9280318264962
    0.01186328, 4.84466552734375e-05, 489.7460901291339
    0.016777216, 5.822181701660156e-05, 576.3205911356594
    0.023726564, 7.894039154052735e-05, 601.1260784745152
    0.033554432, 0.00010094642639160156, 664.7968273751912
    0.047453132, 0.00013959407806396484, 679.8731387194808
    0.067108864, 0.0001867055892944336, 718.8736475818057
    0.094906264, 0.00026137828826904296, 726.1985272649019
    0.134217728, 0.00035915374755859377, 747.4109843618055
    0.189812528, 0.0005042552947998047, 752.8429744118316
    0.268435456, 0.0007027387619018555, 763.9694024377432
    0.37962506, 0.000991511344909668, 765.75031026919
    ```
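Where a full log is visible, the summary table entries roughly match the last (largest-size) row of each section: for this V100 float16 run, about 84.9 TOPS and 765.8 GB/s against the table's 84 and 766. The sketch below parses a log of this shape and pulls out those rows. It assumes the two-section layout shown here (a `size, elapsed_time, tops` header followed by a `size (GB), elapsed_time, bandwidth (GB/s)` header); the function name is hypothetical.

```python
def parse_benchmark_log(text: str) -> dict[str, list[tuple[float, float, float]]]:
    """Split a log into 'tops' and 'bandwidth' sections of (size, elapsed, value) rows."""
    sections: dict[str, list[tuple[float, float, float]]] = {"tops": [], "bandwidth": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("size (GB)"):
            current = "bandwidth"
        elif line.startswith("size,"):
            current = "tops"
        elif current and line.count(",") == 2:
            size, elapsed, value = (float(x) for x in line.split(","))
            sections[current].append((size, elapsed, value))
    return sections


# Excerpt of the V100 float16 log above (last two rows of each section).
LOG = """\
benchmarking cuda using torch.float16
size, elapsed_time, tops
5792, 0.004576373100280762, 84.91691950382248
6888, 0.00769500732421875, 84.93767589888
size (GB), elapsed_time, bandwidth (GB/s)
0.268435456, 0.0007027387619018555, 763.9694024377432
0.37962506, 0.000991511344909668, 765.75031026919
"""

parsed = parse_benchmark_log(LOG)
print(parsed["tops"][-1][2], parsed["bandwidth"][-1][2])
```

Taking the last row rather than the maximum is a choice that matches the visible logs; whether every table entry was filled in that way is not stated in the gist.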

    ## V100 bfloat16
    ```
  8. iakashpaul revised this gist Feb 19, 2024. 1 changed file with 45 additions and 0 deletions.
    45 changes: 45 additions & 0 deletions benchmark.md
    @@ -1,5 +1,45 @@
    # Runs for dtypes

    ## A100 float16
    ```
    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.01777644157409668, 0.0018875786731633935
    304, 0.008939647674560547, 0.006285362694985866
    362, 0.009391403198242188, 0.010102415368318777
    430, 0.009010767936706543, 0.01764710856132868
    512, 0.009000349044799804, 0.029825005081896894
    608, 0.008980417251586914, 0.050054625682405526
    724, 0.009735321998596192, 0.07796422636143384
    861, 0.009606742858886718, 0.1328811211824123
    1024, 0.009710216522216797, 0.22115713311712432
    1217, 0.00892808437347412, 0.4037787363110709
    1448, 0.007884597778320313, 0.7701159849518099
    1722, 0.007819414138793945, 1.3060362214777321
    2048, 0.008268022537231445, 2.0778691768966437
    2435, 0.009387540817260741, 3.075920127762037
    2896, 0.009256076812744141, 5.24805911345917
    3444, 0.009479331970214843, 8.618698556470994
    4096, 0.009059309959411621, 15.1710178907408
    4870, 0.010232973098754882, 22.57433922386718
    5792, 0.009942054748535156, 39.08764495923313
    6888, 0.019882059097290038, 32.87365936021618
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 0.008298230171203614, 1.0108912173959712
    0.00593164, 0.008292603492736816, 1.4305857033186993
    0.008388608, 0.00836634635925293, 2.0053217114834005
    0.01186328, 0.008056378364562989, 2.9450652546762592
    0.016777216, 0.007922005653381348, 4.235598088178334
    0.023726564, 0.00619211196899414, 7.663480285500778
    0.033554432, 0.00462348461151123, 14.5147804391772
    0.047453132, 0.0036125898361206053, 26.2709768629354
    0.067108864, 0.0042188167572021484, 31.814069139379033
    0.094906264, 0.006849765777587891, 27.710805619231184
    0.134217728, 0.007091808319091797, 37.85148214981321
    0.189812528, 0.0068720817565917965, 55.241638479614764
    0.268435456, 0.008846306800842285, 60.68870592967483
    0.37962506, 0.009415888786315918, 80.63499232312684
    ```
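In this A100 run `elapsed_time` stays near 9 ms from size 304 up to 5792, while the work per matmul grows by several thousand times. One way to quantify that is the log-log slope of time against size: a compute-bound matmul should scale close to n³. The sketch below is an illustrative check, not part of `benchmark.py`; the function name is hypothetical, and reading a flat slope as fixed overhead or a busy device is an assumption.

```python
import math


def scaling_exponent(n_small: int, t_small: float, n_large: int, t_large: float) -> float:
    """Slope of log(elapsed) versus log(size) between two log rows."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Rows for sizes 304 and 5792, copied from the logs in this gist.
a100_fp16 = scaling_exponent(304, 0.008939647674560547, 5792, 0.009942054748535156)
ryzen_fp32 = scaling_exponent(304, 0.00020635128021240234, 5792, 0.9949984312057495)

print(f"A100 float16 slope:  {a100_fp16:.2f}")
print(f"Ryzen float32 slope: {ryzen_fp32:.2f}")
```

A slope close to 3, as in the Ryzen float32 log, is what cubic scaling predicts. A slope close to 0, as in this A100 log, means the timings barely depend on problem size, so the reported TOPS for this run are probably not a measure of peak throughput.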
    ## A100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16
    @@ -40,10 +80,15 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.268435456, 0.0083709716796875, 64.1348379307911
    0.37962506, 0.006459522247314453, 117.53967103614487
    ```

    ## A100 INT8

    Need to revise torch & cuda versions

    ## V100 float16

    Need to revise torch & cuda versions

    ## V100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16
  9. iakashpaul revised this gist Feb 19, 2024. 1 changed file with 14 additions and 12 deletions.
    26 changes: 14 additions & 12 deletions benchmark.md
    @@ -1,6 +1,7 @@
    # Runs for dtypes

    ## A100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.015097665786743163, 0.0022224913754193185
    @@ -38,13 +39,13 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.00883655548095703, 42.96075057957824
    0.268435456, 0.0083709716796875, 64.1348379307911
    0.37962506, 0.006459522247314453, 117.53967103614487

    ```
    ## A100 INT8

    Need to revise torch & cuda versions

    ## V100 bfloat16

    ```
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.02667853832244873, 0.0012577312742717067
    @@ -82,13 +83,13 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.0005018472671508789, 756.4553617183828
    0.268435456, 0.0007009267807006836, 765.9443565037069
    0.37962506, 0.0009888172149658202, 767.8366724493611

    ```
    ## V100 INT8

    Need to revise torch & cuda versions

    ## XEON 6330 bfloat16

    ```
    benchmarking cpu using torch.bfloat16
    size, elapsed_time, tops
    256, 0.0021901369094848634, 0.01532070066244957
    @@ -126,9 +127,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.00473179817199707, 80.22849711693813
    0.268435456, 0.00640721321105957, 83.79164143832462
    0.37962506, 0.009680533409118652, 78.43060789241413

    ```
    ## XEON 6330 int8

    ```
    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.0016846656799316406, 0.019917561329652986
    @@ -166,9 +167,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.004472994804382324, 84.8704442106819
    0.268435456, 0.006230497360229492, 86.16822718310648
    0.37962506, 0.008964014053344727, 84.6997913525919

    ```
    ## XEON 6230 bfloat16

    ```
    benchmarking cpu using torch.bfloat16
    size, elapsed_time, tops
    256, 0.001166057586669922, 0.028775964741009238
    @@ -247,10 +248,10 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.020318937301635743, 18.683312535712137
    0.268435456, 0.02902853488922119, 18.494592098733502
    0.37962506, 0.04167752265930176, 18.217256486346063

    ```

    ## M1 Pro CPU - INT8

    ```
    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.004259657859802246, 0.007877259889027275
    @@ -288,9 +289,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.004046845436096192, 93.8076489440148
    0.268435456, 0.005482769012451172, 97.91966628190707
    0.37962506, 0.008221673965454101, 92.34738852333764

    ```
    ## M1 Pro GPU - FP16

    ```
    benchmarking mps using torch.float16
    size, elapsed_time, tops
    256, 0.009627270698547363, 0.003485352500274346
    @@ -328,3 +329,4 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.0022737979888916016, 166.95636897148202
    0.268435456, 0.0034802913665771484, 154.26033496960062
    0.37962506, 0.004375720024108886, 173.51432811440466
    ```
  10. iakashpaul created this gist Feb 19, 2024.
    330 changes: 330 additions & 0 deletions benchmark.md
    @@ -0,0 +1,330 @@
    # Runs for dtypes

    ## A100 bfloat16
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.015097665786743163, 0.0022224913754193185
    304, 0.005758213996887207, 0.009758047899986834
    362, 0.006917119026184082, 0.013716094177482947
    430, 0.00832064151763916, 0.019110786068946943
    512, 0.006926321983337402, 0.03875584424832877
    608, 0.006831693649291992, 0.06579794807493826
    724, 0.006945896148681641, 0.10927414285398761
    861, 0.008063292503356934, 0.15831681183195834
    1024, 0.008281826972961426, 0.25930071408290967
    1217, 0.009384751319885254, 0.3841306501496171
    1448, 0.008578181266784668, 0.7078487379966464
    1722, 0.008532881736755371, 1.1968334275640953
    2048, 0.0074500560760498045, 2.306005351990474
    2435, 0.008992719650268554, 3.2109669680559523
    2896, 0.008328795433044434, 5.832348586600332
    3444, 0.007643985748291016, 10.688076542563643
    4096, 0.008878946304321289, 15.47919637774022
    4870, 0.010046720504760742, 22.992836905389876
    5792, 0.00858457088470459, 45.26860007276564
    6888, 0.01943228244781494, 33.63454807222059
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 0.007399177551269532, 1.1337216794535097
    0.00593164, 0.008727836608886718, 1.3592463438099611
    0.008388608, 0.008515739440917968, 1.9701420077962688
    0.01186328, 0.008480381965637208, 2.797817373809437
    0.016777216, 0.008199071884155274, 4.09246710775204
    0.023726564, 0.006315088272094727, 7.514246191884141
    0.033554432, 0.00802006721496582, 8.367618649725495
    0.047453132, 0.008075571060180664, 11.752266594243402
    0.067108864, 0.0070595979690551754, 19.012092273288914
    0.094906264, 0.007315444946289063, 25.946819283533397
    0.134217728, 0.008006620407104491, 33.52668696043214
    0.189812528, 0.00883655548095703, 42.96075057957824
    0.268435456, 0.0083709716796875, 64.1348379307911
    0.37962506, 0.006459522247314453, 117.53967103614487

    ## A100 INT8

    Need to revise torch & cuda versions

    ## V100 bfloat16

    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.02667853832244873, 0.0012577312742717067
    304, 6.992816925048828e-05, 0.8035235099424206
    362, 8.306503295898437e-05, 1.1421876645356601
    430, 8.804798126220703e-05, 1.8059925704197128
    512, 0.00011758804321289062, 2.2828465264448985
    608, 0.00013427734375, 3.347634168552727
    724, 0.00016186237335205078, 4.6892111630487445
    861, 0.00022852420806884766, 5.586081110564057
    1024, 0.000320124626159668, 6.708273817487892
    1217, 0.0006142377853393555, 5.86901475624512
    1448, 0.000863027572631836, 7.03575989522911
    1722, 0.0012133121490478516, 8.416991541718449
    2048, 0.002041149139404297, 8.416763308639903
    2435, 0.0033989667892456053, 8.495324473708326
    2896, 0.005703592300415039, 8.516814616722376
    3444, 0.009201717376708985, 8.878723549453332
    4096, 0.014371323585510253, 9.563416525571206
    4870, 0.024599909782409668, 9.390384275522017
    5792, 0.04042065143585205, 9.614182166082358
    6888, 0.06916227340698242, 9.45018152162151
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 2.446174621582031e-05, 342.9276032049903
    0.00593164, 3.101825714111328e-05, 382.46120489715605
    0.008388608, 3.483295440673828e-05, 481.6478040907872
    0.01186328, 4.6205520629882815e-05, 513.5005444491228
    0.016777216, 5.626678466796875e-05, 596.3452896412203
    0.023726564, 7.715225219726563e-05, 615.0582341869962
    0.033554432, 9.870529174804688e-05, 679.8912480933719
    0.047453132, 0.00013720989227294922, 691.6867466902797
    0.067108864, 0.00018470287322998048, 726.668327638198
    0.094906264, 0.00025992393493652345, 730.2618285090001
    0.134217728, 0.0003578662872314453, 750.0998713142066
    0.189812528, 0.0005018472671508789, 756.4553617183828
    0.268435456, 0.0007009267807006836, 765.9443565037069
    0.37962506, 0.0009888172149658202, 767.8366724493611

    ## V100 INT8

    Need to revise torch & cuda versions

    ## XEON 6330 bfloat16

    benchmarking cpu using torch.bfloat16
    size, elapsed_time, tops
    256, 0.0021901369094848634, 0.01532070066244957
    304, 0.0007275581359863281, 0.07722946830060035
    362, 0.0006016969680786132, 0.15768046214852163
    430, 0.0004467487335205078, 0.3559360957711602
    512, 0.000615072250366211, 0.4364291444463229
    608, 0.0010200977325439454, 0.4406552525893741
    724, 0.0014643907546997071, 0.5183089592474548
    861, 0.0031775951385498045, 0.4017361263280998
    1024, 0.003713393211364746, 0.5783076355683746
    1217, 0.005452871322631836, 0.6611141933677716
    1448, 0.008942294120788574, 0.6790265117632406
    1722, 0.015332889556884766, 0.666047848196651
    2048, 0.02421088218688965, 0.7095928620603098
    2435, 0.04082856178283691, 0.7072334779653765
    2896, 0.06611626148223877, 0.7347124169301285
    3444, 0.11195228099822999, 0.7297707919795908
    4096, 0.1859917163848877, 0.7389520143337268
    4870, 0.3130293369293213, 0.7379583276955218
    5792, 0.5199612140655517, 0.7473855658145446
    6888, 0.8682275295257569, 0.7527934970007311
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 1.6951560974121092e-05, 494.8575539920113
    0.00593164, 2.9134750366210938e-05, 407.18660194042553
    0.008388608, 3.204345703125e-05, 523.5769656076191
    0.01186328, 4.3773651123046874e-05, 542.028352474074
    0.016777216, 0.00019590854644775392, 171.2759989720433
    0.023726564, 0.00010991096496582031, 431.74152837941864
    0.033554432, 0.00019106864929199218, 351.22907001579233
    0.047453132, 0.000468754768371582, 202.46463695654137
    0.067108864, 0.0011372804641723634, 118.01638402157438
    0.094906264, 0.0020810604095458985, 91.20952334171712
    0.134217728, 0.0030328989028930663, 88.5078812696133
    0.189812528, 0.00473179817199707, 80.22849711693813
    0.268435456, 0.00640721321105957, 83.79164143832462
    0.37962506, 0.009680533409118652, 78.43060789241413

    ## XEON 6330 int8

    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.0016846656799316406, 0.019917561329652986
    304, 0.004039764404296875, 0.013908961606829084
    362, 0.005875968933105468, 0.016146418927687863
    430, 0.00986475944519043, 0.01611939965525742
    512, 0.011141633987426758, 0.024093006133833438
    608, 0.018270087242126466, 0.02460368240407379
    724, 0.0413280725479126, 0.01836540639827966
    861, 0.07472686767578125, 0.01708294220946889
    1024, 0.08651659488677979, 0.024821638563217972
    1217, 0.15329647064208984, 0.023516331530011113
    1448, 0.2395930290222168, 0.025343203050523455
    1722, 0.3982081890106201, 0.025645977098998424
    2048, 0.4549932241439819, 0.03775851655004747
    2435, 0.727786374092102, 0.0396755514776178
    2896, 1.2666751861572265, 0.03834956175258211
    3444, 2.090726399421692, 0.039077090522508635
    4096, 3.1241865873336794, 0.043991915857143654
    4870, 5.981458854675293, 0.038619776815724684
    5792, 11.528981041908263, 0.033707359285558985
    6888, 24.265810799598693, 0.026934852642748253
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 5.822181701660156e-05, 144.08014778391484
    0.00593164, 9.953975677490234e-05, 119.1813239691497
    0.008388608, 8.499622344970703e-05, 197.38778170452736
    0.01186328, 4.9614906311035155e-05, 478.2143465364728
    0.016777216, 6.074905395507813e-05, 552.3449307508947
    0.023726564, 0.00011255741119384766, 421.59043546475743
    0.033554432, 0.00021660327911376953, 309.8238598906505
    0.047453132, 0.00045404434204101565, 209.02421902975004
    0.067108864, 0.0012057304382324218, 111.31652958580084
    0.094906264, 0.001912236213684082, 99.26207162153382
    0.134217728, 0.003130984306335449, 85.73516496292531
    0.189812528, 0.004472994804382324, 84.8704442106819
    0.268435456, 0.006230497360229492, 86.16822718310648
    0.37962506, 0.008964014053344727, 84.6997913525919

    ## XEON 6230 bfloat16

    benchmarking cpu using torch.bfloat16
    size, elapsed_time, tops
    256, 0.001166057586669922, 0.028775964741009238
    304, 0.0003167867660522461, 0.17737144988794462
    362, 0.0004210948944091797, 0.22530754293071226
    430, 0.0005740880966186524, 0.2769853632858507
    512, 0.000850057601928711, 0.31578501902805406
    608, 0.0014193534851074218, 0.31670153257557215
    724, 0.002286386489868164, 0.33196786779638704
    861, 0.003532099723815918, 0.3614152662204194
    1024, 0.005914664268493653, 0.3630778604694872
    1217, 0.008795619010925293, 0.40985979741984746
    1448, 0.014617276191711426, 0.41540261703771425
    1722, 0.024225759506225585, 0.42155285547912696
    2048, 0.038063979148864745, 0.45134191348758107
    2435, 0.06337082386016846, 0.4556564676153674
    2896, 0.08836662769317627, 0.5497147457144741
    3444, 0.1559471845626831, 0.5238921433375465
    4096, 0.2487639904022217, 0.552487332470337
    4870, 0.41010868549346924, 0.5632716744880513
    5792, 0.636151385307312, 0.610878974960134
    6888, 1.0622331142425536, 0.6153037684294559
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 5.710124969482422e-05, 146.90760788656365
    0.00593164, 7.467269897460937e-05, 158.8703791734355
    0.008388608, 0.00011763572692871093, 142.6200733231942
    0.01186328, 0.00030057430267333985, 78.93742009537559
    0.016777216, 0.0006120920181274414, 54.81926084031005
    0.023726564, 0.0010343313217163086, 45.87807311225872
    0.033554432, 0.002415776252746582, 27.779420351409424
    0.047453132, 0.005098819732666016, 18.613378973171983
    0.067108864, 0.006124520301818847, 21.91481477498577
    0.094906264, 0.00918436050415039, 20.666929168799957
    0.134217728, 0.013114047050476075, 20.469307069495002
    0.189812528, 0.025612187385559083, 14.822047421613195
    0.268435456, 0.027147817611694335, 19.775840536394895
    0.37962506, 0.0396291971206665, 19.15885698335416
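
The `tops` column in the matmul table above can be reproduced from the other two columns. This is a minimal sketch, assuming the benchmark multiplies two square `size x size` matrices and counts `2 * size**3` operations; that assumption is inferred from the logged numbers, not from the benchmark source. The helper name `matmul_tops` is hypothetical.

```python
def matmul_tops(n, elapsed_s):
    """Tera-operations per second for an n x n by n x n matmul (assumed shape)."""
    ops = 2 * n**3  # one multiply and one add per inner-product term
    return ops / elapsed_s / 1e12


# (size, elapsed_time, logged tops) copied from the Xeon 6230 bfloat16 run
ROWS = [
    (256, 0.001166057586669922, 0.028775964741009238),
    (1024, 0.005914664268493653, 0.3630778604694872),
    (6888, 1.0622331142425536, 0.6153037684294559),
]

for n, elapsed, logged in ROWS:
    computed = matmul_tops(n, elapsed)
    print(f"n={n:5d} computed={computed:.6f} logged={logged:.6f}")
```

For floating-point dtypes such as bfloat16 the column is effectively TFLOPS; the log prints `tops` for every dtype.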


    ## Xeon 6230 - INT8

    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.0011035919189453125, 0.030404746015236777
    304, 0.0028478622436523436, 0.019730212767573505
    362, 0.003856062889099121, 0.024604333157586422
    430, 0.006611490249633789, 0.0240511585128342
    512, 0.007130289077758789, 0.03764720519359018
    608, 0.0132371187210083, 0.03395840389998101
    724, 0.029151320457458496, 0.02603679133875408
    861, 0.052414536476135254, 0.024354975696126307
    1024, 0.06348373889923095, 0.033827302632706384
    1217, 0.11844491958618164, 0.030435840039360992
    1448, 0.21307857036590577, 0.028496787704051427
    1722, 0.3561347484588623, 0.028675769888204704
    2048, 0.5653518438339233, 0.030387924566576144
    2435, 0.9055092811584473, 0.03188849231126473
    2896, 1.6462963581085206, 0.029506496830139946
    3444, 3.509342336654663, 0.02328057423029335
    4096, 8.10109314918518, 0.01696548242823548
    4870, 14.660134220123291, 0.01575719584360376
    5792, 25.486044383049013, 0.015248011826993006
    6888, 44.70684518814087, 0.014619596515778656
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 6.196498870849609e-05, 135.37657594779532
    0.00593164, 9.491443634033204e-05, 124.98920561949258
    0.008388608, 0.00016396045684814454, 102.32476977993892
    0.01186328, 0.000305485725402832, 77.66830993072661
    0.016777216, 0.0006456851959228515, 51.96716946877188
    0.023726564, 0.0012836456298828125, 36.967467418817236
    0.033554432, 0.0023816823959350586, 28.177083608854897
    0.047453132, 0.00407719612121582, 23.277335987384127
    0.067108864, 0.006319093704223633, 21.24002812464862
    0.094906264, 0.009533262252807618, 19.91055348803593
    0.134217728, 0.014014458656311036, 19.154179450172144
    0.189812528, 0.020318937301635743, 18.683312535712137
    0.268435456, 0.02902853488922119, 18.494592098733502
    0.37962506, 0.04167752265930176, 18.217256486346063
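
The `bandwidth (GB/s)` column above matches twice the buffer size divided by the elapsed time. A plausible reading is that the benchmark counts one read and one write per copied byte, but that interpretation is an assumption drawn from the numbers only. The sketch below checks the relationship on three logged rows; `copy_bandwidth_gbs` is a hypothetical helper name.

```python
def copy_bandwidth_gbs(size_gb, elapsed_s):
    """Effective bandwidth in GB/s, counting the buffer twice (read + write, assumed)."""
    return 2 * size_gb / elapsed_s


# (size in GB, elapsed_time, logged bandwidth) copied from the Xeon 6230 INT8 run
BW_ROWS = [
    (0.004194304, 6.196498870849609e-05, 135.37657594779532),
    (0.134217728, 0.014014458656311036, 19.154179450172144),
    (0.37962506, 0.04167752265930176, 18.217256486346063),
]

for size_gb, elapsed, logged in BW_ROWS:
    computed = copy_bandwidth_gbs(size_gb, elapsed)
    print(f"{size_gb:.6f} GB computed={computed:.3f} logged={logged:.3f}")
```

The small buffers report well over 100 GB/s and the large ones settle near 18 GB/s, which would be consistent with the small buffers fitting in cache.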


    ## M1 Pro CPU - INT8

    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.004259657859802246, 0.007877259889027275
    304, 0.007328653335571289, 0.007667019495556467
    362, 0.01215658187866211, 0.007804484595010316
    430, 0.020352959632873535, 0.007812819504794035
    512, 0.03251914978027344, 0.008254688631583984
    608, 0.05446903705596924, 0.008252604567584112
    724, 0.09682230949401856, 0.007839173140637484
    861, 0.1643320083618164, 0.007768144348296151
    1024, 0.2544929265975952, 0.008438284225461425
    1217, 0.4326848745346069, 0.008331630796841428
    1448, 0.7112574338912964, 0.008537070397675465
    1722, 1.2066871166229247, 0.008463203058453855
    2048, 1.9730838775634765, 0.008707115485234763
    2435, 3.3893067121505736, 0.008519537534470614
    2896, 5.669154453277588, 0.008568550861039925
    3444, 9.958435702323914, 0.008204050034578674
    4096, 16.59977195262909, 0.008279568771439191
    4870, 28.29214940071106, 0.008164901249750726
    5792, 47.3701210975647, 0.00820372625553576
    6888, 80.74701671600342, 0.008094367627757355
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 3.3855438232421875e-05, 247.77726823121128
    0.00593164, 7.402896881103516e-05, 160.2518607314654
    0.008388608, 9.772777557373046e-05, 171.67295481254942
    0.01186328, 0.0001516103744506836, 156.49694215165906
    0.016777216, 0.0002319812774658203, 144.64284517505448
    0.023726564, 0.0004257917404174805, 111.44680249897083
    0.033554432, 0.0006222724914550781, 107.8448186630866
    0.047453132, 0.0009853601455688476, 96.3163209175775
    0.067108864, 0.0013759851455688477, 97.54300650136226
    0.094906264, 0.0025406122207641602, 74.71133392521767
    0.134217728, 0.003232312202453613, 83.04750258846703
    0.189812528, 0.004046845436096192, 93.8076489440148
    0.268435456, 0.005482769012451172, 97.91966628190707
    0.37962506, 0.008221673965454101, 92.34738852333764

    ## M1 Pro GPU - FP16

    benchmarking mps using torch.float16
    size, elapsed_time, tops
    256, 0.009627270698547363, 0.003485352500274346
    304, 0.006162810325622559, 0.009117419656157253
    362, 0.011301898956298828, 0.008394682731358462
    430, 0.007365679740905762, 0.021588503110840648
    512, 0.0011082172393798828, 0.24222277587939936
    608, 0.0034063577651977537, 0.1319624816255623
    724, 0.008305120468139648, 0.09139022737981042
    861, 0.0017158985137939453, 0.7439570299396482
    1024, 0.0019629955291748046, 1.0939829541551476
    1217, 0.0027472972869873047, 1.3121880340635514
    1448, 0.014558553695678711, 0.4170781597489533
    1722, 0.0037976980209350588, 2.6891127308446503
    2048, 0.004535555839538574, 3.787820014084051
    2435, 0.0075850248336791996, 3.8068861187885794
    2896, 0.022348809242248534, 2.1735582305732133
    3444, 0.01922299861907959, 4.250091590128399
    4096, 0.02993953227996826, 4.590551121066
    4870, 0.052914762496948244, 4.365560669639642
    5792, 0.0893251657485962, 4.35052656123515
    6888, 0.15182690620422362, 4.304876220456225
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 0.00025799274444580076, 32.51489888996581
    0.00593164, 0.000435328483581543, 27.251329622169887
    0.008388608, 0.0002537250518798828, 66.12360851124225
    0.01186328, 0.0002813577651977539, 84.32879036881621
    0.016777216, 0.00034019947052001955, 98.63164086854579
    0.023726564, 0.0004561424255371094, 104.03138437325528
    0.033554432, 0.0005932807922363281, 113.11484355837325
    0.047453132, 0.000733184814453125, 129.44384843920918
    0.067108864, 0.0009582281112670898, 140.06866050143364
    0.094906264, 0.0012568950653076172, 151.01700471196023
    0.134217728, 0.0016906261444091797, 158.7787204685692
    0.189812528, 0.0022737979888916016, 166.95636897148202
    0.268435456, 0.0034802913665771484, 154.26033496960062
    0.37962506, 0.004375720024108886, 173.51432811440466
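
Each run prints two tables back to back, so a small parser makes it easier to pull out a headline number. The sketch below is one way to do that, assuming the two header lines always read `size, elapsed_time, tops` and `size (GB), elapsed_time, bandwidth (GB/s)` as they do in these logs; `parse_benchmark_log` is a hypothetical helper, not part of the benchmark script. The 4.3 listed for M1 Pro GPU F16 in the summary table agrees with the largest-size row here, so picking the largest size is treated as the likely convention, not a confirmed one.

```python
def parse_benchmark_log(text):
    """Split one run's log into (matmul_rows, bandwidth_rows) of float tuples."""
    matmul, bandwidth = [], []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("size (GB)"):
            current = bandwidth  # second table: size (GB), elapsed, GB/s
        elif line.startswith("size,"):
            current = matmul  # first table: size, elapsed, tops
        elif current is not None and line[0].isdigit():
            current.append(tuple(float(x) for x in line.split(",")))
    return matmul, bandwidth


# Excerpt of the M1 Pro GPU FP16 run (last rows of each table)
LOG = """
benchmarking mps using torch.float16
size, elapsed_time, tops
4096, 0.02993953227996826, 4.590551121066
4870, 0.052914762496948244, 4.365560669639642
5792, 0.0893251657485962, 4.35052656123515
6888, 0.15182690620422362, 4.304876220456225
size (GB), elapsed_time, bandwidth (GB/s)
0.189812528, 0.0022737979888916016, 166.95636897148202
0.268435456, 0.0034802913665771484, 154.26033496960062
0.37962506, 0.004375720024108886, 173.51432811440466
"""

matmul, bandwidth = parse_benchmark_log(LOG)
largest = max(matmul, key=lambda row: row[0])  # row with the biggest matrix
peak = max(matmul, key=lambda row: row[2])  # row with the highest throughput
print(f"largest size {int(largest[0])}: {largest[2]:.2f} TOPS")
print(f"peak at size {int(peak[0])}: {peak[2]:.2f} TOPS")
```

Reporting the largest size gives a steadier figure than the peak, since mid-sized runs in these logs fluctuate noticeably from row to row.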