Last active
February 22, 2024 08:37
Revisions
iakashpaul revised this gist
Feb 22, 2024. 1 changed file with 9 additions and 9 deletions.

    @@ -7,18 +7,18 @@ DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${D
    ```
    | Device | Type | FP32 (TFLOPS) | BW | F16 | BF16 | INT8 |
    |-|-|-|-|-|-|-|
    | Apple M1 Pro CPU 10-core | CPU | 0.33 | 96 | | | 0.008 |
    | Apple M1 Pro GPU 16-core | GPU | 3.74 | 176 | 4.3 | | |
    | Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | |
    | Intel Xeon 6330 56-core | CPU | 5.7 | 81 | NA | 0.75 | 0.02 |
    | Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | NA | 0.61 | 0.014 |
    | AMD Ryzen 5 3600 6-core | CPU | 0.36 | 14 | | | |
    | Nvidia A100 80GB | GPU | 19 | 1490 | 32| 33 | NA | * revise these with idle card
    | Nvidia A10 24GB | GPU | 14.48 | 469 | | | |
    | Nvidia V100 32GB | GPU | 13 | 766 | 84 | 9.4 | NA |
    | Nvidia RTX 2070S 8GB | GPU | 8 | 376 | 37 | 5 | NA |
    ## Ryzen 5 3600
    ```
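For several devices, the summary values seem to follow the last (largest-size) row of the matching log. The Ryzen 5 3600 bandwidth of 14 sits next to a final logged 14.11 GB/s, and the V100 F16 value of 84 next to a final logged 84.9. This is an inference from the numbers, not something the gist states, and it does not hold for every row (the A100 rows are flagged for revision). A minimal Python sketch of that reading, where `last_row_metrics` is a hypothetical helper name:

```python
def last_row_metrics(log_text):
    """Return (tops, bandwidth) taken from the final data row of each log section.

    Assumes the two-section layout seen in the logs: a "size, elapsed_time, tops"
    header followed by rows, then a "size (GB), elapsed_time, bandwidth (GB/s)"
    header followed by rows.
    """
    section = None
    last = {}
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("size (GB)"):
            section = "bandwidth"
        elif line.startswith("size,"):
            section = "tops"
        elif section and line and line[0].isdigit():
            # Data rows are "size, elapsed_time, metric"; keep overwriting so
            # the value left at the end is the last row of the section.
            last[section] = float(line.split(",")[2])
    return last.get("tops"), last.get("bandwidth")


# Excerpt of the Ryzen 5 3600 float32 log (first and last row of each section).
sample = """benchmarking cpu using torch.float32
size, elapsed_time, tops
256, 0.0007100820541381836, 0.04725430223796394
6888, 1.7714987516403198, 0.3689508883586864
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.555152893066406e-05, 151.00588879327037
0.37962506, 0.053798246383666995, 14.1129157739704
"""

tops, bandwidth = last_row_metrics(sample)
print(tops, bandwidth)
```

Taking the last row rather than the maximum avoids the inflated small-size bandwidth figures (151 GB/s on the Ryzen), which probably reflect cache effects rather than main-memory throughput.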
iakashpaul revised this gist
Feb 20, 2024. 1 changed file with 2 additions and 2 deletions.

    @@ -472,10 +472,10 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.025612187385559083, 14.822047421613195
    0.268435456, 0.027147817611694335, 19.775840536394895
    0.37962506, 0.0396291971206665, 19.15885698335416
    ```
    ## XEON 6230 INT8
    ```
    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.0011035919189453125, 0.030404746015236777
iakashpaul revised this gist
Feb 20, 2024. 1 changed file with 9 additions and 9 deletions.

    @@ -21,7 +21,7 @@ DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${D
    | Nvidia RTX 2070S 8GB | GPU | | | | | |
    ## Ryzen 5 3600
    ```
    benchmarking cpu using torch.float32
    size, elapsed_time, tops
    256, 0.0007100820541381836, 0.04725430223796394

    @@ -59,7 +59,7 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.02665371894836426, 14.242855067821507
    0.268435456, 0.03752543926239014, 14.306852166233767
    0.37962506, 0.053798246383666995, 14.1129157739704
    ```
    ## A100 float16
    ```

    @@ -232,7 +232,7 @@ size (GB), elapsed_time, bandwidth (GB/s)
    Need to revise torch & cuda versions
    ## RTX 2070S F32
    ```
    benchmarking cuda using torch.float32
    size, elapsed_time, tops
    256, 0.014125776290893555, 0.00237540445983358

    @@ -270,9 +270,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.0009937524795532227, 382.0116817929089
    0.268435456, 0.0014146089553833008, 379.51895466018027
    0.37962506, 0.002018284797668457, 376.1858192050465
    ```
    ## RTX 2070S float16
    ```
    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.005084848403930664, 0.006598905087133359

    @@ -310,9 +310,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.001004338264465332, 377.9852559954953
    0.268435456, 0.0014100074768066406, 380.75749301407643
    0.37962506, 0.0019683837890625, 385.7226035993798
    ```
    ## RTX 2070S bfloat16
    ```
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.0027062654495239257, 0.012398795545316055

    @@ -350,7 +350,7 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.001009511947631836, 376.0481061076529
    0.268435456, 0.001414942741394043, 379.4294258657132
    0.37962506, 0.0020305871963500976, 373.9066814588031
    ```
    ## XEON 6330 bfloat16
iakashpaul revised this gist
Feb 20, 2024. 1 changed file with 165 additions and 0 deletions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -14,9 +14,52 @@ DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${D | Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | | | Intel Xeon 6330 56-core | CPU | 5.7 | 81 | | | | | Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | | | | | AMD Ryzen 5 3600 6-core | CPU | | | | | | | Nvidia A100 80GB | GPU | 18.9 | 1490 | 33| | | | Nvidia A10 24GB | GPU | 14.48 | 469 | | | | | Nvidia V100 32GB | GPU | 13 | 766 | | | | | Nvidia RTX 2070S 8GB | GPU | | | | | | ## Ryzen 5 3600 benchmarking cpu using torch.float32 size, elapsed_time, tops 256, 0.0007100820541381836, 0.04725430223796394 304, 0.00020635128021240234, 0.27229745287823454 362, 0.0003533363342285156, 0.268514293066278 430, 0.0005476951599121093, 0.2903330385930698 512, 0.0006838560104370118, 0.3925321294294962 608, 0.001193690299987793, 0.37657290505300817 724, 0.0021503925323486327, 0.3529619995336488 861, 0.003527235984802246, 0.36191362514452513 1024, 0.00509192943572998, 0.421742617431252 1217, 0.008618521690368652, 0.418281783757488 1448, 0.013982748985290528, 0.43425329242394584 1722, 0.02307753562927246, 0.4425272377457036 2048, 0.0389744758605957, 0.44079795313859077 2435, 0.0737607479095459, 0.3914727896388784 2896, 0.14375429153442382, 0.3379129607436293 3444, 0.24540278911590577, 0.3329200334777477 4096, 0.37633934020996096, 0.3651995387867831 4870, 0.6344788789749145, 0.36408242047901673 5792, 0.9949984312057495, 0.3905649436100884 6888, 1.7714987516403198, 0.3689508883586864 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 5.555152893066406e-05, 151.00588879327037 0.00593164, 8.046627044677734e-05, 147.43171187294814 0.008388608, 0.00013594627380371095, 
123.41063517654155 0.01186328, 0.0003290891647338867, 72.09766392395856 0.016777216, 0.0010283470153808593, 32.629483528546785 0.023726564, 0.0030118942260742186, 15.755243855907796 0.033554432, 0.004802894592285156, 13.97258730345578 0.047453132, 0.006703615188598633, 14.157474934034783 0.067108864, 0.009397172927856445, 14.282777281040833 0.094906264, 0.013365435600280761, 14.201746480751643 0.134217728, 0.018894267082214356, 14.207243648666582 0.189812528, 0.02665371894836426, 14.242855067821507 0.268435456, 0.03752543926239014, 14.306852166233767 0.37962506, 0.053798246383666995, 14.1129157739704 ## A100 float16 ``` @@ -188,6 +231,128 @@ size (GB), elapsed_time, bandwidth (GB/s) Need to revise torch & cuda versions ## RTX 2070S F32 benchmarking cuda using torch.float32 size, elapsed_time, tops 256, 0.014125776290893555, 0.00237540445983358 304, 5.047321319580078e-05, 1.1132425388101652 362, 5.1856040954589844e-05, 1.8296008382722941 430, 6.949901580810547e-05, 2.2880036235197254 512, 7.755756378173828e-05, 3.4611125325626313 608, 0.00010981559753417969, 4.093329491378411 724, 0.00015544891357421875, 4.882677083732809 861, 0.00023860931396484374, 5.349978761466475 1024, 0.00034000873565673826, 6.315966099671126 1217, 0.0005458593368530273, 6.604211712825641 1448, 0.0008722305297851563, 6.961525166397971 1722, 0.0014461994171142579, 7.061569777408616 2048, 0.0022562503814697265, 7.614345165366353 2435, 0.004026150703430176, 7.171943595007254 2896, 0.006442856788635254, 7.539580634119545 3444, 0.009186863899230957, 8.893078820383868 4096, 0.015436434745788574, 8.903542543040686 4870, 0.02744767665863037, 8.416107813896367 5792, 0.04533388614654541, 8.572208103222879 6888, 0.08088059425354004, 8.080999455755025 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 3.304481506347656e-05, 253.85549847642136 0.00593164, 4.2486190795898435e-05, 279.2267270320988 0.008388608, 5.2881240844726565e-05, 317.26214687855725 0.01186328, 7.462501525878906e-05, 
317.9437875854313 0.016777216, 9.911060333251953e-05, 338.55542062864566 0.023726564, 0.00013742446899414062, 345.30333897104794 0.033554432, 0.000186920166015625, 359.0242049880816 0.047453132, 0.0002597332000732422, 365.39904784308425 0.067108864, 0.00036420822143554685, 368.51921538446715 0.094906264, 0.0005475997924804688, 346.6263694882062 0.134217728, 0.0007352352142333985, 365.10146794300016 0.189812528, 0.0009937524795532227, 382.0116817929089 0.268435456, 0.0014146089553833008, 379.51895466018027 0.37962506, 0.002018284797668457, 376.1858192050465 ## RTYX 2070S float16 benchmarking cuda using torch.float16 size, elapsed_time, tops 256, 0.005084848403930664, 0.006598905087133359 304, 2.4533271789550783e-05, 2.290315310652206 362, 0.0006063461303710937, 0.15647144633698648 430, 0.00015423297882080078, 1.0309986957118566 512, 3.454685211181641e-05, 7.77018569249568 608, 5.1641464233398436e-05, 8.704467053226667 724, 6.458759307861328e-05, 11.751588994439986 861, 0.000567626953125, 2.2489326043664515 1024, 9.047985076904297e-05, 23.734385387986805 1217, 0.0002730607986450195, 13.202080430030826 1448, 0.00027284622192382815, 22.254494642389318 1722, 0.0004123687744140625, 24.765304091006698 2048, 0.00050201416015625, 34.22188166694906 2435, 0.0011893272399902343, 24.27870545556251 2896, 0.0013867616653442383, 35.02868552394098 3444, 0.002403569221496582, 33.990909867422005 4096, 0.0035764694213867186, 38.428667291306034 4870, 0.006640505790710449, 34.78689926348 5792, 0.010618138313293456, 36.59883632232168 6888, 0.017475819587707518, 37.40002206269852 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 3.1566619873046876e-05, 265.74299160749246 0.00593164, 4.220008850097656e-05, 281.1197895882486 0.008388608, 5.3262710571289064e-05, 314.98990231720677 0.01186328, 7.414817810058594e-05, 319.9884421679743 0.016777216, 9.582042694091796e-05, 350.1803641585668 0.023726564, 0.00013625621795654297, 348.26394502696763 0.033554432, 0.0001840829849243164, 
364.55766961618446 0.047453132, 0.00025680065155029295, 369.5717414541417 0.067108864, 0.00035834312438964844, 374.5508672131151 0.094906264, 0.0005070447921752929, 374.3506114828194 0.134217728, 0.000709366798400879, 378.4155906438423 0.189812528, 0.001004338264465332, 377.9852559954953 0.268435456, 0.0014100074768066406, 380.75749301407643 0.37962506, 0.0019683837890625, 385.7226035993798 ## RTX 2070S bfloat16 benchmarking cuda using torch.bfloat16 size, elapsed_time, tops 256, 0.0027062654495239257, 0.012398795545316055 304, 4.4178962707519534e-05, 1.2718480597199782 362, 5.137920379638672e-05, 1.8465808924557958 430, 6.778240203857422e-05, 2.34594814018994 512, 8.475780487060547e-05, 3.167088345548872 608, 0.00012900829315185547, 3.484360679595077 724, 0.00019524097442626953, 3.8875387209595704 861, 0.0002892017364501953, 4.4140632683228755 1024, 0.00046432018280029297, 4.6250060358105225 1217, 0.0008016824722290039, 4.496756198219868 1448, 0.0012295007705688476, 4.93863438669556 1722, 0.002044367790222168, 4.995401583239668 2048, 0.0033872127532958984, 5.071978182440201 2435, 0.005963873863220215, 4.841706315768501 2896, 0.009510970115661621, 5.107411513364935 3444, 0.01592259407043457, 5.1310423670035945 4096, 0.027903199195861816, 4.9255625674773125 4870, 0.04716935157775879, 4.8973029790157625 5792, 0.08068933486938476, 4.816144621901542 6888, 0.13136739730834962, 4.975329126829385 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 3.173351287841797e-05, 264.3453951076784 0.00593164, 4.298686981201172e-05, 275.9745022580144 0.008388608, 5.3429603576660155e-05, 314.0059981154128 0.01186328, 7.274150848388672e-05, 326.1763537012127 0.016777216, 9.78231430053711e-05, 343.0111829279259 0.023726564, 0.00013377666473388672, 354.7190243858706 0.033554432, 0.0001855134963989258, 361.7465322075003 0.047453132, 0.0002609729766845703, 363.66318538302215 0.067108864, 0.00036041736602783204, 372.3952857189337 0.094906264, 0.000502777099609375, 377.5281892263429 
0.134217728, 0.0007108211517333985, 377.641345288329 0.189812528, 0.001009511947631836, 376.0481061076529 0.268435456, 0.001414942741394043, 379.4294258657132 0.37962506, 0.0020305871963500976, 373.9066814588031 ## XEON 6330 bfloat16 ``` benchmarking cpu using torch.bfloat16 -
iakashpaul revised this gist
Feb 19, 2024. 1 changed file with 7 additions and 0 deletions.

    @@ -1,4 +1,11 @@
    # Runs for dtypes
    ```
    DEVICE=cuda && DTYPE=float32 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    DEVICE=cuda && DTYPE=float16 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    DEVICE=cuda && DTYPE=bfloat16 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE} --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
    ```
    | Device | Type | FP32-TFLOPS | BW | BF16 | F16 | INT8 |
    |-|-|-|-|-|-|-|
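The four run commands differ only in the dtype, so they can be generated rather than typed out. A minimal sketch, assuming `benchmark.py` accepts exactly the `--device` and `--dtype` flags shown in the commands; `build_commands` and its `tag` parameter are hypothetical names, and `tag` defaults to the `H100` log prefix used in the gist:

```python
DTYPES = ["float32", "float16", "bfloat16", "int8"]


def build_commands(device="cuda", tag="H100", dtypes=DTYPES):
    """Build one shell command per dtype, mirroring the commands in the gist.

    Nothing is executed here; the strings can be pasted into a shell or
    passed to subprocess.run(cmd, shell=True) from the benchmark directory.
    """
    return [
        f"python benchmark.py --device {device} --dtype {dtype} "
        f"> {tag}_{device}_{dtype}.log"
        for dtype in dtypes
    ]


for cmd in build_commands():
    print(cmd)
```

Changing `tag` per machine (for example `build_commands(device="cpu", tag="XEON6330")`) keeps the log file names in line with the section headings.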
iakashpaul revised this gist
Feb 19, 2024. 1 changed file with 11 additions and 0 deletions.

    @@ -1,5 +1,16 @@
    # Runs for dtypes
    | Device | Type | FP32-TFLOPS | BW | BF16 | F16 | INT8 |
    |-|-|-|-|-|-|-|
    | Apple M1 Pro CPU 10-core | CPU | 0.33 | 96 | | | |
    | Apple M1 Pro GPU 16-core | GPU | 3.74 | 176 | | | |
    | Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | |
    | Intel Xeon 6330 56-core | CPU | 5.7 | 81 | | | |
    | Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | | | |
    | Nvidia A100 80GB | GPU | 18.9 | 1490 | 33| | |
    | Nvidia A10 24GB | GPU | 14.48 | 469 | | | |
    | Nvidia V100 32GB | GPU | 13 | 766 | | | |
    ## A100 float16
    ```
    benchmarking cuda using torch.float16
iakashpaul revised this gist
Feb 19, 2024. 1 changed file with 39 additions and 2 deletions.

    @@ -86,8 +86,45 @@ size (GB), elapsed_time, bandwidth (GB/s)
    Need to revise torch & cuda versions
    ## V100 float16
    ```
    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.005288243293762207, 0.006345099901053988
    304, 4.7779083251953126e-05, 1.1760151969366865
    362, 5.8317184448242186e-05, 1.6268936317425347
    430, 5.1641464233398436e-05, 3.079192318818098
    512, 6.201267242431641e-05, 4.328719365023544
    608, 4.842281341552734e-05, 9.283050535346607
    724, 0.0006131649017333985, 1.2378510998498296
    861, 0.00011780261993408204, 10.836386853826447
    1024, 7.870197296142579e-05, 27.286274628115695
    1217, 0.00018236637115478515, 19.767737895822073
    1448, 0.0001292705535888672, 46.97167773653695
    1722, 0.0002488374710083008, 41.04059591434817
    2048, 0.0002832174301147461, 60.65964646681365
    2435, 0.0006921768188476562, 41.71668996091485
    2896, 0.0007654905319213867, 63.457921746037535
    3444, 0.0011915206909179688, 68.5674242929489
    4096, 0.0020316600799560546, 67.64859674506812
    4870, 0.0030206918716430666, 76.4734093432539
    5792, 0.004576373100280762, 84.91691950382248
    6888, 0.00769500732421875, 84.93767589888
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 2.5534629821777342e-05, 328.51888038125117
    0.00593164, 3.478527069091797e-05, 341.04319915777927
    0.008388608, 3.7288665771484376e-05, 449.9280318264962
    0.01186328, 4.84466552734375e-05, 489.7460901291339
    0.016777216, 5.822181701660156e-05, 576.3205911356594
    0.023726564, 7.894039154052735e-05, 601.1260784745152
    0.033554432, 0.00010094642639160156, 664.7968273751912
    0.047453132, 0.00013959407806396484, 679.8731387194808
    0.067108864, 0.0001867055892944336, 718.8736475818057
    0.094906264, 0.00026137828826904296, 726.1985272649019
    0.134217728, 0.00035915374755859377, 747.4109843618055
    0.189812528, 0.0005042552947998047, 752.8429744118316
    0.268435456, 0.0007027387619018555, 763.9694024377432
    0.37962506, 0.000991511344909668, 765.75031026919
    ```
    ## V100 bfloat16
    ```
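The `tops` column can be reproduced from the other two columns. The logged values match `2 * size**3 / elapsed_time / 1e12`, the usual operation count for multiplying two square `size x size` matrices. This is inferred from the numbers alone; the source of `benchmark.py` is not shown here, so treat the formula as an assumption. A small check against two rows of the V100 float16 log, with `matmul_tops` as a hypothetical helper name:

```python
def matmul_tops(n, elapsed_s):
    """Tera-operations per second for one n x n by n x n matrix multiply.

    Counts 2 * n**3 operations (one multiply and one add per inner-product
    term), which is the assumption being tested against the log.
    """
    return 2 * n**3 / elapsed_s / 1e12


# (size, elapsed_time, logged tops) copied from the V100 float16 log.
rows = [
    (256, 0.005288243293762207, 0.006345099901053988),
    (6888, 0.00769500732421875, 84.93767589888),
]

for n, elapsed, logged in rows:
    computed = matmul_tops(n, elapsed)
    print(f"n={n}: computed {computed:.6f}, logged {logged:.6f}")
```

The first row of most logs (size 256) is far slower than the next, which is consistent with one-time warm-up cost landing in the first measurement.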
iakashpaul revised this gist
Feb 19, 2024. 1 changed file with 45 additions and 0 deletions.

    @@ -1,5 +1,45 @@
    # Runs for dtypes
    ## A100 float16
    ```
    benchmarking cuda using torch.float16
    size, elapsed_time, tops
    256, 0.01777644157409668, 0.0018875786731633935
    304, 0.008939647674560547, 0.006285362694985866
    362, 0.009391403198242188, 0.010102415368318777
    430, 0.009010767936706543, 0.01764710856132868
    512, 0.009000349044799804, 0.029825005081896894
    608, 0.008980417251586914, 0.050054625682405526
    724, 0.009735321998596192, 0.07796422636143384
    861, 0.009606742858886718, 0.1328811211824123
    1024, 0.009710216522216797, 0.22115713311712432
    1217, 0.00892808437347412, 0.4037787363110709
    1448, 0.007884597778320313, 0.7701159849518099
    1722, 0.007819414138793945, 1.3060362214777321
    2048, 0.008268022537231445, 2.0778691768966437
    2435, 0.009387540817260741, 3.075920127762037
    2896, 0.009256076812744141, 5.24805911345917
    3444, 0.009479331970214843, 8.618698556470994
    4096, 0.009059309959411621, 15.1710178907408
    4870, 0.010232973098754882, 22.57433922386718
    5792, 0.009942054748535156, 39.08764495923313
    6888, 0.019882059097290038, 32.87365936021618
    size (GB), elapsed_time, bandwidth (GB/s)
    0.004194304, 0.008298230171203614, 1.0108912173959712
    0.00593164, 0.008292603492736816, 1.4305857033186993
    0.008388608, 0.00836634635925293, 2.0053217114834005
    0.01186328, 0.008056378364562989, 2.9450652546762592
    0.016777216, 0.007922005653381348, 4.235598088178334
    0.023726564, 0.00619211196899414, 7.663480285500778
    0.033554432, 0.00462348461151123, 14.5147804391772
    0.047453132, 0.0036125898361206053, 26.2709768629354
    0.067108864, 0.0042188167572021484, 31.814069139379033
    0.094906264, 0.006849765777587891, 27.710805619231184
    0.134217728, 0.007091808319091797, 37.85148214981321
    0.189812528, 0.0068720817565917965, 55.241638479614764
    0.268435456, 0.008846306800842285, 60.68870592967483
    0.37962506, 0.009415888786315918, 80.63499232312684
    ```
    ## A100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16

    @@ -40,10 +80,15 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.268435456, 0.0083709716796875, 64.1348379307911
    0.37962506, 0.006459522247314453, 117.53967103614487
    ```
    ## A100 INT8
    Need to revise torch & cuda versions
    ## V100 float16
    Need to revise torch & cuda versions
    ## V100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16
iakashpaul revised this gist
Feb 19, 2024. 1 changed file with 14 additions and 12 deletions.

    @@ -1,6 +1,7 @@
    # Runs for dtypes
    ## A100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.015097665786743163, 0.0022224913754193185

    @@ -38,13 +39,13 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.00883655548095703, 42.96075057957824
    0.268435456, 0.0083709716796875, 64.1348379307911
    0.37962506, 0.006459522247314453, 117.53967103614487
    ```
    ## A100 INT8
    Need to revise torch & cuda versions
    ## V100 bfloat16
    ```
    benchmarking cuda using torch.bfloat16
    size, elapsed_time, tops
    256, 0.02667853832244873, 0.0012577312742717067

    @@ -82,13 +83,13 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.0005018472671508789, 756.4553617183828
    0.268435456, 0.0007009267807006836, 765.9443565037069
    0.37962506, 0.0009888172149658202, 767.8366724493611
    ```
    ## V100 INT8
    Need to revise torch & cuda versions
    ## XEON 6330 bfloat16
    ```
    benchmarking cpu using torch.bfloat16
    size, elapsed_time, tops
    256, 0.0021901369094848634, 0.01532070066244957

    @@ -126,9 +127,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.00473179817199707, 80.22849711693813
    0.268435456, 0.00640721321105957, 83.79164143832462
    0.37962506, 0.009680533409118652, 78.43060789241413
    ```
    ## XEON 6330 int8
    ```
    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.0016846656799316406, 0.019917561329652986

    @@ -166,9 +167,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.004472994804382324, 84.8704442106819
    0.268435456, 0.006230497360229492, 86.16822718310648
    0.37962506, 0.008964014053344727, 84.6997913525919
    ```
    ## XEON 6230 bfloat16
    ```
    benchmarking cpu using torch.bfloat16
    size, elapsed_time, tops
    256, 0.001166057586669922, 0.028775964741009238

    @@ -247,10 +248,10 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.020318937301635743, 18.683312535712137
    0.268435456, 0.02902853488922119, 18.494592098733502
    0.37962506, 0.04167752265930176, 18.217256486346063
    ```
    ## M1 Pro CPU - INT8
    ```
    benchmarking cpu using torch.int8
    size, elapsed_time, tops
    256, 0.004259657859802246, 0.007877259889027275

    @@ -288,9 +289,9 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.004046845436096192, 93.8076489440148
    0.268435456, 0.005482769012451172, 97.91966628190707
    0.37962506, 0.008221673965454101, 92.34738852333764
    ```
    ## M1 Pro GPU - FP16
    ```
    benchmarking mps using torch.float16
    size, elapsed_time, tops
    256, 0.009627270698547363, 0.003485352500274346

    @@ -328,3 +329,4 @@ size (GB), elapsed_time, bandwidth (GB/s)
    0.189812528, 0.0022737979888916016, 166.95636897148202
    0.268435456, 0.0034802913665771484, 154.26033496960062
    0.37962506, 0.004375720024108886, 173.51432811440466
    ```
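The `bandwidth (GB/s)` column follows the same pattern. The logged values match `2 * size_gb / elapsed_time`, not `size_gb / elapsed_time`. The factor of 2 would fit a copy-style test that counts each byte once when read and once when written, but that interpretation is an assumption, since the benchmark source is not part of this page. A check against two rows of the Xeon 6230 INT8 log, with `copy_bandwidth_gbs` as a hypothetical helper name:

```python
def copy_bandwidth_gbs(size_gb, elapsed_s):
    """Effective bandwidth in GB/s, counting one read and one write per byte.

    The factor of 2 is inferred from the logged numbers, not taken from the
    benchmark source.
    """
    return 2 * size_gb / elapsed_s


# (size in GB, elapsed_time, logged bandwidth) copied from the Xeon 6230 INT8 log.
rows = [
    (0.189812528, 0.020318937301635743, 18.683312535712137),
    (0.37962506, 0.04167752265930176, 18.217256486346063),
]

for size_gb, elapsed, logged in rows:
    computed = copy_bandwidth_gbs(size_gb, elapsed)
    print(f"{size_gb} GB: computed {computed:.4f}, logged {logged:.4f}")
```

When comparing these figures with vendor memory-bandwidth specifications, it is worth confirming which convention the specification uses, since a one-directional figure would be half of what is logged here.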
iakashpaul created this gist
Feb 19, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,330 @@ # Runs for dtypes ## A100 bfloat16 benchmarking cuda using torch.bfloat16 size, elapsed_time, tops 256, 0.015097665786743163, 0.0022224913754193185 304, 0.005758213996887207, 0.009758047899986834 362, 0.006917119026184082, 0.013716094177482947 430, 0.00832064151763916, 0.019110786068946943 512, 0.006926321983337402, 0.03875584424832877 608, 0.006831693649291992, 0.06579794807493826 724, 0.006945896148681641, 0.10927414285398761 861, 0.008063292503356934, 0.15831681183195834 1024, 0.008281826972961426, 0.25930071408290967 1217, 0.009384751319885254, 0.3841306501496171 1448, 0.008578181266784668, 0.7078487379966464 1722, 0.008532881736755371, 1.1968334275640953 2048, 0.0074500560760498045, 2.306005351990474 2435, 0.008992719650268554, 3.2109669680559523 2896, 0.008328795433044434, 5.832348586600332 3444, 0.007643985748291016, 10.688076542563643 4096, 0.008878946304321289, 15.47919637774022 4870, 0.010046720504760742, 22.992836905389876 5792, 0.00858457088470459, 45.26860007276564 6888, 0.01943228244781494, 33.63454807222059 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 0.007399177551269532, 1.1337216794535097 0.00593164, 0.008727836608886718, 1.3592463438099611 0.008388608, 0.008515739440917968, 1.9701420077962688 0.01186328, 0.008480381965637208, 2.797817373809437 0.016777216, 0.008199071884155274, 4.09246710775204 0.023726564, 0.006315088272094727, 7.514246191884141 0.033554432, 0.00802006721496582, 8.367618649725495 0.047453132, 0.008075571060180664, 11.752266594243402 0.067108864, 0.0070595979690551754, 19.012092273288914 0.094906264, 0.007315444946289063, 25.946819283533397 0.134217728, 0.008006620407104491, 33.52668696043214 
0.189812528, 0.00883655548095703, 42.96075057957824 0.268435456, 0.0083709716796875, 64.1348379307911 0.37962506, 0.006459522247314453, 117.53967103614487 ## A100 INT8 Need to revise torch & cuda versions ## V100 bfloat16 benchmarking cuda using torch.bfloat16 size, elapsed_time, tops 256, 0.02667853832244873, 0.0012577312742717067 304, 6.992816925048828e-05, 0.8035235099424206 362, 8.306503295898437e-05, 1.1421876645356601 430, 8.804798126220703e-05, 1.8059925704197128 512, 0.00011758804321289062, 2.2828465264448985 608, 0.00013427734375, 3.347634168552727 724, 0.00016186237335205078, 4.6892111630487445 861, 0.00022852420806884766, 5.586081110564057 1024, 0.000320124626159668, 6.708273817487892 1217, 0.0006142377853393555, 5.86901475624512 1448, 0.000863027572631836, 7.03575989522911 1722, 0.0012133121490478516, 8.416991541718449 2048, 0.002041149139404297, 8.416763308639903 2435, 0.0033989667892456053, 8.495324473708326 2896, 0.005703592300415039, 8.516814616722376 3444, 0.009201717376708985, 8.878723549453332 4096, 0.014371323585510253, 9.563416525571206 4870, 0.024599909782409668, 9.390384275522017 5792, 0.04042065143585205, 9.614182166082358 6888, 0.06916227340698242, 9.45018152162151 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 2.446174621582031e-05, 342.9276032049903 0.00593164, 3.101825714111328e-05, 382.46120489715605 0.008388608, 3.483295440673828e-05, 481.6478040907872 0.01186328, 4.6205520629882815e-05, 513.5005444491228 0.016777216, 5.626678466796875e-05, 596.3452896412203 0.023726564, 7.715225219726563e-05, 615.0582341869962 0.033554432, 9.870529174804688e-05, 679.8912480933719 0.047453132, 0.00013720989227294922, 691.6867466902797 0.067108864, 0.00018470287322998048, 726.668327638198 0.094906264, 0.00025992393493652345, 730.2618285090001 0.134217728, 0.0003578662872314453, 750.0998713142066 0.189812528, 0.0005018472671508789, 756.4553617183828 0.268435456, 0.0007009267807006836, 765.9443565037069 0.37962506, 0.0009888172149658202, 
767.8366724493611 ## V100 INT8 Need to revise torch & cuda versions ## XEON 6330 bfloat16 benchmarking cpu using torch.bfloat16 size, elapsed_time, tops 256, 0.0021901369094848634, 0.01532070066244957 304, 0.0007275581359863281, 0.07722946830060035 362, 0.0006016969680786132, 0.15768046214852163 430, 0.0004467487335205078, 0.3559360957711602 512, 0.000615072250366211, 0.4364291444463229 608, 0.0010200977325439454, 0.4406552525893741 724, 0.0014643907546997071, 0.5183089592474548 861, 0.0031775951385498045, 0.4017361263280998 1024, 0.003713393211364746, 0.5783076355683746 1217, 0.005452871322631836, 0.6611141933677716 1448, 0.008942294120788574, 0.6790265117632406 1722, 0.015332889556884766, 0.666047848196651 2048, 0.02421088218688965, 0.7095928620603098 2435, 0.04082856178283691, 0.7072334779653765 2896, 0.06611626148223877, 0.7347124169301285 3444, 0.11195228099822999, 0.7297707919795908 4096, 0.1859917163848877, 0.7389520143337268 4870, 0.3130293369293213, 0.7379583276955218 5792, 0.5199612140655517, 0.7473855658145446 6888, 0.8682275295257569, 0.7527934970007311 size (GB), elapsed_time, bandwidth (GB/s) 0.004194304, 1.6951560974121092e-05, 494.8575539920113 0.00593164, 2.9134750366210938e-05, 407.18660194042553 0.008388608, 3.204345703125e-05, 523.5769656076191 0.01186328, 4.3773651123046874e-05, 542.028352474074 0.016777216, 0.00019590854644775392, 171.2759989720433 0.023726564, 0.00010991096496582031, 431.74152837941864 0.033554432, 0.00019106864929199218, 351.22907001579233 0.047453132, 0.000468754768371582, 202.46463695654137 0.067108864, 0.0011372804641723634, 118.01638402157438 0.094906264, 0.0020810604095458985, 91.20952334171712 0.134217728, 0.0030328989028930663, 88.5078812696133 0.189812528, 0.00473179817199707, 80.22849711693813 0.268435456, 0.00640721321105957, 83.79164143832462 0.37962506, 0.009680533409118652, 78.43060789241413 ## XEON 6330 int8 benchmarking cpu using torch.int8 size, elapsed_time, tops 256, 0.0016846656799316406, 
0.019917561329652986
304, 0.004039764404296875, 0.013908961606829084
362, 0.005875968933105468, 0.016146418927687863
430, 0.00986475944519043, 0.01611939965525742
512, 0.011141633987426758, 0.024093006133833438
608, 0.018270087242126466, 0.02460368240407379
724, 0.0413280725479126, 0.01836540639827966
861, 0.07472686767578125, 0.01708294220946889
1024, 0.08651659488677979, 0.024821638563217972
1217, 0.15329647064208984, 0.023516331530011113
1448, 0.2395930290222168, 0.025343203050523455
1722, 0.3982081890106201, 0.025645977098998424
2048, 0.4549932241439819, 0.03775851655004747
2435, 0.727786374092102, 0.0396755514776178
2896, 1.2666751861572265, 0.03834956175258211
3444, 2.090726399421692, 0.039077090522508635
4096, 3.1241865873336794, 0.043991915857143654
4870, 5.981458854675293, 0.038619776815724684
5792, 11.528981041908263, 0.033707359285558985
6888, 24.265810799598693, 0.026934852642748253
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.822181701660156e-05, 144.08014778391484
0.00593164, 9.953975677490234e-05, 119.1813239691497
0.008388608, 8.499622344970703e-05, 197.38778170452736
0.01186328, 4.9614906311035155e-05, 478.2143465364728
0.016777216, 6.074905395507813e-05, 552.3449307508947
0.023726564, 0.00011255741119384766, 421.59043546475743
0.033554432, 0.00021660327911376953, 309.8238598906505
0.047453132, 0.00045404434204101565, 209.02421902975004
0.067108864, 0.0012057304382324218, 111.31652958580084
0.094906264, 0.001912236213684082, 99.26207162153382
0.134217728, 0.003130984306335449, 85.73516496292531
0.189812528, 0.004472994804382324, 84.8704442106819
0.268435456, 0.006230497360229492, 86.16822718310648
0.37962506, 0.008964014053344727, 84.6997913525919

## XEON 6230 bfloat16

benchmarking cpu using torch.bfloat16
size, elapsed_time, tops
256, 0.001166057586669922, 0.028775964741009238
304, 0.0003167867660522461, 0.17737144988794462
362, 0.0004210948944091797, 0.22530754293071226
430, 0.0005740880966186524, 0.2769853632858507
512, 0.000850057601928711, 0.31578501902805406
608, 0.0014193534851074218, 0.31670153257557215
724, 0.002286386489868164, 0.33196786779638704
861, 0.003532099723815918, 0.3614152662204194
1024, 0.005914664268493653, 0.3630778604694872
1217, 0.008795619010925293, 0.40985979741984746
1448, 0.014617276191711426, 0.41540261703771425
1722, 0.024225759506225585, 0.42155285547912696
2048, 0.038063979148864745, 0.45134191348758107
2435, 0.06337082386016846, 0.4556564676153674
2896, 0.08836662769317627, 0.5497147457144741
3444, 0.1559471845626831, 0.5238921433375465
4096, 0.2487639904022217, 0.552487332470337
4870, 0.41010868549346924, 0.5632716744880513
5792, 0.636151385307312, 0.610878974960134
6888, 1.0622331142425536, 0.6153037684294559
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.710124969482422e-05, 146.90760788656365
0.00593164, 7.467269897460937e-05, 158.8703791734355
0.008388608, 0.00011763572692871093, 142.6200733231942
0.01186328, 0.00030057430267333985, 78.93742009537559
0.016777216, 0.0006120920181274414, 54.81926084031005
0.023726564, 0.0010343313217163086, 45.87807311225872
0.033554432, 0.002415776252746582, 27.779420351409424
0.047453132, 0.005098819732666016, 18.613378973171983
0.067108864, 0.006124520301818847, 21.91481477498577
0.094906264, 0.00918436050415039, 20.666929168799957
0.134217728, 0.013114047050476075, 20.469307069495002
0.189812528, 0.025612187385559083, 14.822047421613195
0.268435456, 0.027147817611694335, 19.775840536394895
0.37962506, 0.0396291971206665, 19.15885698335416

## XEON 6230 INT8

benchmarking cpu using torch.int8
size, elapsed_time, tops
256, 0.0011035919189453125, 0.030404746015236777
304, 0.0028478622436523436, 0.019730212767573505
362, 0.003856062889099121, 0.024604333157586422
430, 0.006611490249633789, 0.0240511585128342
512, 0.007130289077758789, 0.03764720519359018
608, 0.0132371187210083, 0.03395840389998101
724, 0.029151320457458496, 0.02603679133875408
861, 0.052414536476135254, 0.024354975696126307
1024, 0.06348373889923095, 0.033827302632706384
1217, 0.11844491958618164, 0.030435840039360992
1448, 0.21307857036590577, 0.028496787704051427
1722, 0.3561347484588623, 0.028675769888204704
2048, 0.5653518438339233, 0.030387924566576144
2435, 0.9055092811584473, 0.03188849231126473
2896, 1.6462963581085206, 0.029506496830139946
3444, 3.509342336654663, 0.02328057423029335
4096, 8.10109314918518, 0.01696548242823548
4870, 14.660134220123291, 0.01575719584360376
5792, 25.486044383049013, 0.015248011826993006
6888, 44.70684518814087, 0.014619596515778656
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 6.196498870849609e-05, 135.37657594779532
0.00593164, 9.491443634033204e-05, 124.98920561949258
0.008388608, 0.00016396045684814454, 102.32476977993892
0.01186328, 0.000305485725402832, 77.66830993072661
0.016777216, 0.0006456851959228515, 51.96716946877188
0.023726564, 0.0012836456298828125, 36.967467418817236
0.033554432, 0.0023816823959350586, 28.177083608854897
0.047453132, 0.00407719612121582, 23.277335987384127
0.067108864, 0.006319093704223633, 21.24002812464862
0.094906264, 0.009533262252807618, 19.91055348803593
0.134217728, 0.014014458656311036, 19.154179450172144
0.189812528, 0.020318937301635743, 18.683312535712137
0.268435456, 0.02902853488922119, 18.494592098733502
0.37962506, 0.04167752265930176, 18.217256486346063

## M1 Pro CPU - INT8

benchmarking cpu using torch.int8
size, elapsed_time, tops
256, 0.004259657859802246, 0.007877259889027275
304, 0.007328653335571289, 0.007667019495556467
362, 0.01215658187866211, 0.007804484595010316
430, 0.020352959632873535, 0.007812819504794035
512, 0.03251914978027344, 0.008254688631583984
608, 0.05446903705596924, 0.008252604567584112
724, 0.09682230949401856, 0.007839173140637484
861, 0.1643320083618164, 0.007768144348296151
1024, 0.2544929265975952, 0.008438284225461425
1217, 0.4326848745346069, 0.008331630796841428
1448, 0.7112574338912964, 0.008537070397675465
1722, 1.2066871166229247, 0.008463203058453855
2048, 1.9730838775634765, 0.008707115485234763
2435, 3.3893067121505736, 0.008519537534470614
2896, 5.669154453277588, 0.008568550861039925
3444, 9.958435702323914, 0.008204050034578674
4096, 16.59977195262909, 0.008279568771439191
4870, 28.29214940071106, 0.008164901249750726
5792, 47.3701210975647, 0.00820372625553576
6888, 80.74701671600342, 0.008094367627757355
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 3.3855438232421875e-05, 247.77726823121128
0.00593164, 7.402896881103516e-05, 160.2518607314654
0.008388608, 9.772777557373046e-05, 171.67295481254942
0.01186328, 0.0001516103744506836, 156.49694215165906
0.016777216, 0.0002319812774658203, 144.64284517505448
0.023726564, 0.0004257917404174805, 111.44680249897083
0.033554432, 0.0006222724914550781, 107.8448186630866
0.047453132, 0.0009853601455688476, 96.3163209175775
0.067108864, 0.0013759851455688477, 97.54300650136226
0.094906264, 0.0025406122207641602, 74.71133392521767
0.134217728, 0.003232312202453613, 83.04750258846703
0.189812528, 0.004046845436096192, 93.8076489440148
0.268435456, 0.005482769012451172, 97.91966628190707
0.37962506, 0.008221673965454101, 92.34738852333764

## M1 Pro GPU - FP16

benchmarking mps using torch.float16
size, elapsed_time, tops
256, 0.009627270698547363, 0.003485352500274346
304, 0.006162810325622559, 0.009117419656157253
362, 0.011301898956298828, 0.008394682731358462
430, 0.007365679740905762, 0.021588503110840648
512, 0.0011082172393798828, 0.24222277587939936
608, 0.0034063577651977537, 0.1319624816255623
724, 0.008305120468139648, 0.09139022737981042
861, 0.0017158985137939453, 0.7439570299396482
1024, 0.0019629955291748046, 1.0939829541551476
1217, 0.0027472972869873047, 1.3121880340635514
1448, 0.014558553695678711, 0.4170781597489533
1722, 0.0037976980209350588, 2.6891127308446503
2048, 0.004535555839538574, 3.787820014084051
2435, 0.0075850248336791996, 3.8068861187885794
2896, 0.022348809242248534, 2.1735582305732133
3444, 0.01922299861907959, 4.250091590128399
4096, 0.02993953227996826, 4.590551121066
4870, 0.052914762496948244, 4.365560669639642
5792, 0.0893251657485962, 4.35052656123515
6888, 0.15182690620422362, 4.304876220456225
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 0.00025799274444580076, 32.51489888996581
0.00593164, 0.000435328483581543, 27.251329622169887
0.008388608, 0.0002537250518798828, 66.12360851124225
0.01186328, 0.0002813577651977539, 84.32879036881621
0.016777216, 0.00034019947052001955, 98.63164086854579
0.023726564, 0.0004561424255371094, 104.03138437325528
0.033554432, 0.0005932807922363281, 113.11484355837325
0.047453132, 0.000733184814453125, 129.44384843920918
0.067108864, 0.0009582281112670898, 140.06866050143364
0.094906264, 0.0012568950653076172, 151.01700471196023
0.134217728, 0.0016906261444091797, 158.7787204685692
0.189812528, 0.0022737979888916016, 166.95636897148202
0.268435456, 0.0034802913665771484, 154.26033496960062
0.37962506, 0.004375720024108886, 173.51432811440466
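
## Reading the columns

The logged `tops` and `bandwidth (GB/s)` values appear consistent with two simple formulas: `2 * size^3` operations for a square matmul, and twice the buffer size (one read plus one write) for the copy test. This is inferred from the numbers in the logs, not taken from `benchmark.py`, so treat the sketch below as an assumption. The helper names `matmul_tops` and `copy_bandwidth_gbs` are hypothetical.

```python
import math


def matmul_tops(size: int, elapsed_s: float) -> float:
    """Tera-ops per second, assuming a size x size matmul costs 2 * size**3 ops."""
    return 2 * size**3 / elapsed_s / 1e12


def copy_bandwidth_gbs(size_gb: float, elapsed_s: float) -> float:
    """GB/s, assuming the buffer is read once and written once (2 * size_gb moved)."""
    return 2 * size_gb / elapsed_s


# Rows copied from the "M1 Pro GPU - FP16" log: (input, elapsed_time, logged value).
tops_row = (4096, 0.02993953227996826, 4.590551121066)
bw_row = (0.37962506, 0.004375720024108886, 173.51432811440466)

tops = matmul_tops(tops_row[0], tops_row[1])
bw = copy_bandwidth_gbs(bw_row[0], bw_row[1])

# Compare the recomputed values against the logged ones.
print(f"tops: computed={tops:.6f} logged={tops_row[2]:.6f}")
print(f"bandwidth: computed={bw:.6f} logged={bw_row[2]:.6f}")
print("match:", math.isclose(tops, tops_row[2], rel_tol=1e-6),
      math.isclose(bw, bw_row[2], rel_tol=1e-6))
```

The same check can be run against any other row to spot transcription errors when updating the summary table.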