Last active: April 16, 2021 05:04
Revisions
mingfeima revised this gist
Apr 16, 2021. 1 changed file with 1 addition and 1 deletion.

@@ -14,7 +14,7 @@ Task list:
- [x] Softmax (bf16)
- [x] cumsum (int64_t)
- [ ] transposed copy (fp32/bf16)
- [x] offset range (int64_t)
- [x] sigmoid/sigmoid_backward (bf16)

## LAMB optimizer
mingfeima revised this gist
Apr 15, 2021. 1 changed file with 1 addition and 1 deletion.

@@ -1,5 +1,5 @@
_This Gist records optimization effort of **DLRM** on PyTorch CPU path._

Branch on track: [dlrm](https://github.com/mingfeima/pytorch/commits/dlrm)
mingfeima revised this gist
Apr 15, 2021. 1 changed file with 1 addition and 1 deletion.

@@ -14,6 +14,7 @@ Task list:
- [x] Softmax (bf16)
- [x] cumsum (int64_t)
- [ ] transposed copy (fp32/bf16)
- [ ] offset range (int64_t)
- [x] sigmoid/sigmoid_backward (bf16)

## LAMB optimizer

@@ -102,7 +103,6 @@ optimizer = optim.Lamb(model.parameters(), lr=0.01, fused=True)
```bash
### LAMB unfused (fp32): 0.4526 ms; fused (fp32): 0.0940 ms; split fused (bf16): 0.0879 ms
```

#### Testing
mingfeima revised this gist
Apr 15, 2021. 1 changed file with 25 additions and 2 deletions.

@@ -86,10 +86,33 @@ To reproduce the result (notice that jemalloc is applied):

## Split SGD (BFloat16)

The basic idea of the algorithm is to keep a copy of the master weight in fp32 by splitting it into its upper 16 bits and lower 16 bits. The lower half is stored in the optimizer as a state, so the weight can be updated in fp32 through packing and unpacking (see the sketch below).



#### Usage
The usage is identical to the normal fp32 fused kernel: with `fused=True`, a parameter with data type `torch.bfloat16` automatically uses the split-SGD algorithm:
```python
### fused=True will use native C++ fused kernel from ATen
### fused=False will fallback to imperative torch impl, used for validation purposes
optimizer = optim.Lamb(model.parameters(), lr=0.01, fused=True)
```

#### Performance
```bash
### LAMB unfused (fp32): 0.4526 ms; fused (fp32): 0.0940 ms; split fused (bf16): 0.0879 ms
###
```

#### Testing
```bash
python test_optim.py TestSplitSGD.test_lamb_bfloat16_cpu
python test_optim.py TestSplitSGD.test_adagrad_bfloat16_cpu
```

[Notes]: Known issue: this impl is expected to hit a runtime error on AVX-only machines; make sure you have an AVX2+ CPU. (I did not register the AVX kernels.)

## Generic BF16 Operator Optimization
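To make the split trick above concrete, here is a minimal scalar C++ sketch of the packing/unpacking idea. It is illustrative only, not the ATen kernel: the names `pack_fp32`, `unpack_fp32` and `split_sgd_step` are hypothetical, and the real code is vectorized inside the fused optimizer kernels.

```C++
#include <cstdint>
#include <cstring>

// Illustrative sketch of split SGD (not the actual ATen implementation).
// Reassemble an fp32 master weight from the bf16 weight (upper 16 bits)
// and the trailing 16 bits kept as optimizer state.
static inline float pack_fp32(uint16_t top, uint16_t trail) {
  uint32_t bits = (static_cast<uint32_t>(top) << 16) | trail;
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Split an fp32 master weight back into bf16 (upper half) + trail (lower half).
static inline void unpack_fp32(float v, uint16_t& top, uint16_t& trail) {
  uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  top = static_cast<uint16_t>(bits >> 16);
  trail = static_cast<uint16_t>(bits & 0xFFFF);
}

// One plain-SGD step over a flat bf16 parameter (bit patterns as uint16_t).
void split_sgd_step(uint16_t* weight, uint16_t* trail,
                    const uint16_t* grad, int64_t n, float lr) {
  for (int64_t i = 0; i < n; ++i) {
    float w = pack_fp32(weight[i], trail[i]);  // bf16 + trail -> fp32 master weight
    float g = pack_fp32(grad[i], 0);           // bf16 grad -> fp32 (zero-extend)
    w -= lr * g;                               // update in full fp32 precision
    unpack_fp32(w, weight[i], trail[i]);       // store back as bf16 + trail
  }
}
```

Because bf16 is exactly the upper half of fp32, keeping the otherwise-discarded lower half as optimizer state makes the bf16 parameter update equivalent to an fp32 master-weight update.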
mingfeima revised this gist
Apr 15, 2021. 1 changed file with 7 additions and 1 deletion.

@@ -7,7 +7,7 @@ Task list:
- [x] LAMB fused optimizer (fp32)
- [x] Adagrad fused optimizer (fp32)
- [x] Split-SGD (bf16)
- [x] Bucketize (bf16)
- [x] Sum (bf16)
- [x] LayerNorm (bf16)

@@ -85,6 +85,12 @@ To reproduce the result (notice that jemalloc is applied):
```bash
./run.sh test_fused_adagrad.py
```

## Split SGD (BFloat16)
The algorithm is described in paper:
Basic idea is to store a copy of the master weight in fp32 by splitting the upper 16 bits and lower 16 bits


## Generic BF16 Operator Optimization
#### Principle
mingfeima revised this gist
Apr 2, 2021. 1 changed file with 2 additions and 2 deletions.

@@ -12,9 +12,9 @@ Task list:
- [x] Sum (bf16)
- [x] LayerNorm (bf16)
- [x] Softmax (bf16)
- [x] cumsum (int64_t)
- [ ] transposed copy (fp32/bf16)
- [x] sigmoid/sigmoid_backward (bf16)

## LAMB optimizer
mingfeima revised this gist
Mar 23, 2021. No changes.
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 5 additions and 0 deletions.

@@ -133,6 +133,11 @@ cd pytorch/build/bin/
```
vec256_test_all_types_AVX
vec256_test_all_types_AVX2
vec256_test_all_types_DEFAULT
```

```bash
python test_nn.py TestNN.test_log_softmax_cpu
python test_nn.py TestNN.test_softmax_cpu
```

#### Sum
Naive Impl:
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 29 additions and 0 deletions.

@@ -131,4 +131,33 @@ Test:
```
cd pytorch/build/bin/
vec256_test_all_types_AVX
vec256_test_all_types_AVX2
vec256_test_all_types_DEFAULT
```

#### Sum
Naive Impl:
```bash
sum size: 128x30678, fp32: 0.588 ms; bf16: 0.899 ms
```
Functional Specialization:
```bash
sum size: 128x30678, fp32: 0.590 ms; bf16: 0.335 ms
```

#### LayerNorm
Naive Impl:
```bash
LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.806 ms; bf16: 9.901 ms
tensor max (abs) diff: 0.1355377435684204
```
Functional Specialization:
```bash
LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.813 ms; bf16: 2.306 ms
tensor max (abs) diff: 0.04277598857879639
```
Test
```bash
python test_nn.py TestNNDeviceTypeCPU.test_LayerNorm_general_cpu
```
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 23 additions and 2 deletions.

@@ -96,8 +96,8 @@ BFloat16 is not an actual data type, we need to handle BFloat16 operator in the
We have multiple ways to enable BFloat16 OP on PyTorch, namely:
1. **Naive Impl**: add `kBFloat16` to the `AT_DISPATCH_FLOATING_TYPES` macro; since on PyTorch both the scalar and Vec256<> logic have specializations for `BFloat16`, this runs smoothly. But this naive impl is not good.
2. **Functional Specialization**: specialize `vec256::Map<>` from functional.cpp with `BFloat16`. Similar to the oneDNN implementation.
3. **Cache FP32 Data**: convert bf16 data to fp32 per input row and cache it (possibly) in L1. Similar to the CUDA counterpart implementation.

Consider the following example:

@@ -111,3 +111,24 @@ Impl-1 will end up with 3 pairs of dtype conversion, each for ".exp()", "+" and
2. less rounding error since intermediate results are kept in fp32;
3. accumulation done in fp32.

Comparing Impl-2 and Impl-3: with emulated dtype conversion, Impl-3 is faster in most cases; with native conversion assembly, Impl-2 is faster. So I follow Impl-2 in these patches (see the sketch below).

#### Softmax
Naive Impl:
```bash
Softmax: 128x1024: fp32: 150.324 us; bf16: 356.587 us
tensor max (abs) diff: 2.9515125788748264e-05
```
Functional Specialization:
```bash
log_softmax: 128x1024: fp32: 150.132 us; bf16: 194.974 us
tensor max (abs) diff: 1.509662251919508e-05
```
Test:
```
cd pytorch/build/bin/
vec256_test_all_types_AVX
vec256_test_all_types_AVX2
vec256_test_all_types_DEFAULT
```
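As a concrete illustration of the "load bf16 -> compute in fp32 -> store bf16" pattern that Impl-2 specializes, here is a minimal scalar C++ sketch. Assumptions: `bf16_to_fp32`, `fp32_to_bf16` and `map_bf16` are hypothetical stand-ins for the vectorized `Vec256<BFloat16>` conversions and the `vec256::map` specialization, and the store-side conversion simply truncates where the real code rounds.

```C++
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative scalar sketch, not the ATen kernel.
// Widen one bf16 bit pattern to fp32 (bf16 is the upper 16 bits of fp32).
static inline float bf16_to_fp32(uint16_t x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Narrow fp32 to bf16 by truncation (the real kernels round to nearest even).
static inline uint16_t fp32_to_bf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// "Impl-2"-style map: one conversion on load, one on store, and the whole
// lambda runs in fp32. Impl-1 would instead bounce through bf16 after every
// elementwise op (exp, +, /).
template <typename F>
void map_bf16(F op, uint16_t* out, const uint16_t* in, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    float x = bf16_to_fp32(in[i]);   // load: bf16 -> fp32
    float y = op(x);                 // all intermediate math in fp32
    out[i] = fp32_to_bf16(y);        // store: fp32 -> bf16
  }
}

int main() {
  std::vector<uint16_t> x(1024, fp32_to_bf16(0.5f)), y(1024);
  // Same expression as the Vec256 example in the text: 1 / (1 + exp(x)).
  map_bf16([](float v) { return 1.0f / (1.0f + std::exp(v)); },
           y.data(), x.data(), static_cast<int64_t>(x.size()));
  return 0;
}
```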
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 7 additions and 1 deletion.

@@ -97,11 +97,17 @@ We have multiple ways to enable BFloat16 OP on PyTorch, namely:
1. **Naive Impl**: add `kBFloat16` to the `AT_DISPATCH_FLOATING_TYPES` macro; since on PyTorch both the scalar and Vec256<> logic have specializations for `BFloat16`, this runs smoothly. But this naive impl is not good.
2. **Functional Specialization**: specialize `vec256::Map<>` from functional.cpp with `BFloat16`.
3. **Cache FP32 Data**: convert bf16 data to fp32 per input row and cache it (possibly) in L1.

Consider the following example:
```C++
using Vec = Vec256<BFloat16>;
Vec one = Vec(BFloat16(1));
vec256::map([](Vec x) { return one / (one + x.exp()); }, y_ptr, x_ptr, N);
```
Impl-1 will end up with 3 pairs of dtype conversion, one each for `.exp()`, `+` and `/`. Both Impl-2 and Impl-3 only need dtype conversion for the input and output. Benefits:
1. better performance since we have fewer dtype conversions;
2. less rounding error since intermediate results are kept in fp32;
3. accumulation done in fp32.
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 10 additions and 1 deletion.

@@ -95,4 +95,13 @@ BFloat16 is not an actual data type, we need to handle BFloat16 operator in the

#### Implementation Details
We have multiple ways to enable BFloat16 OP on PyTorch, namely:
1. **Naive Impl**: add `kBFloat16` to the `AT_DISPATCH_FLOATING_TYPES` macro; since on PyTorch both the scalar and Vec256<> logic have specializations for `BFloat16`, this runs smoothly. But this naive impl is not good.
2. **Functional Specialization**: specialize `vec256::Map<>` from functional.cpp with `BFloat16`.
3. **Cache FP32 **

Consider the following example:
```C++
using Vec = Vec256<BFloat16>;
Vec one = Vec(BFloat16(1));
vec256::map([](Vec x) { return one / (one + x.exp()); }, y_ptr, x_ptr, N);
```
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 0 additions and 1 deletion.

@@ -96,4 +96,3 @@ BFloat16 is not an actual data type, we need to handle BFloat16 operator in the
We have multiple ways to enable BFloat16 OP on PyTorch, namely:
**Naive Impl**: add `kBFloat16` to the `AT_DISPATCH_FLOATING_TYPES` macro; since on PyTorch both the scalar and Vec256<> logic have specializations for `BFloat16`, this runs smoothly. But this naive impl is not good.
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 3 additions and 1 deletion.

@@ -94,4 +94,6 @@ BFloat16 is not an actual data type, we need to handle BFloat16 operator in the
#### Implementation Details
We have multiple ways to enable BFloat16 OP on PyTorch, namely:
**Naive Impl**: add `kBFloat16` to the `AT_DISPATCH_FLOATING_TYPES` macro; since on PyTorch both the scalar and Vec256<> logic have specializations for `BFloat16`, this runs smoothly. But this naive impl is not good.
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 7 additions and 0 deletions.

@@ -88,3 +88,10 @@ To reproduce the result (notice that jemalloc is applied):

## Generic BF16 Operator Optimization
#### Principle
BFloat16 is not an actual data type, so we need to handle BFloat16 operators in the following manner:
- input/output: load: bf16->fp32; store: fp32->bf16
- intermediate operations (including accumulation): use fp32 (see the sketch below)

#### Implementation Details
We have multiple ways to enable BFloat16 OP on PyTorch, namely:
- Naive Impl: add `kBFloat16` to the `AT_DISPATCH_FLOATING_TYPES` macro; since on PyTorch both the scalar and Vec256<> logic have specializations for `BFloat16`, this runs smoothly. But this naive impl is not good
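To make the accumulation rule concrete, here is a minimal scalar C++ sketch of a bf16 sum reduction that keeps its accumulator in fp32. It is illustrative only: `bf16_to_fp32` and `sum_bf16` are hypothetical helpers, while the actual kernels work on `Vec256<BFloat16>` lanes and reduce in parallel.

```C++
#include <cstdint>
#include <cstring>

// Illustrative sketch, not the ATen implementation.
// Widen one bf16 bit pattern to fp32 (bf16 is the upper 16 bits of fp32).
static inline float bf16_to_fp32(uint16_t x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Sum a bf16 buffer following the principle above: each element is widened
// on load and the running accumulator stays in fp32, so no precision is
// discarded between partial sums. Only a final store to a bf16 output
// (not shown) would narrow the result back.
float sum_bf16(const uint16_t* data, int64_t n) {
  float acc = 0.0f;                  // accumulator in fp32, never in bf16
  for (int64_t i = 0; i < n; ++i) {
    acc += bf16_to_fp32(data[i]);    // load: bf16 -> fp32
  }
  return acc;
}
```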
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 13 additions and 1 deletion.

@@ -7,6 +7,14 @@ Task list:
- [x] LAMB fused optimizer (fp32)
- [x] Adagrad fused optimizer (fp32)
- [ ] Split-SGD (bf16)
- [x] Bucketize (bf16)
- [x] Sum (bf16)
- [x] LayerNorm (bf16)
- [x] Softmax (bf16)
- [ ] cumsum (int64_t)
- [ ] transposed copy (fp32/bf16)
- [ ] sigmoid/sigmoid_backward (bf16)

## LAMB optimizer

@@ -75,4 +83,8 @@ unfused: 0.1022 ms; fused: 0.0321 ms
To reproduce the result (notice that jemalloc is applied):
```bash
./run.sh test_fused_adagrad.py
```

## Generic BF16 Operator Optimization
#### Principle
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 2 additions and 2 deletions.

@@ -5,8 +5,8 @@ Branch on track: [dlrm](https://github.com/mingfeima/pytorch/commits/dlrm)
Task list:
- [x] LAMB fused optimizer (fp32)
- [x] Adagrad fused optimizer (fp32)

## LAMB optimizer
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 1 addition and 0 deletions.

@@ -4,6 +4,7 @@ _This Gist records optimization effort of **DLRRM** on PyTorch CPU path._
Branch on track: [dlrm](https://github.com/mingfeima/pytorch/commits/dlrm)

Task list:
[x] - LAMB fused optimizer (fp32)
[x] - Adagrad fused optimizer (fp32)
mingfeima revised this gist
Mar 23, 2021. 1 changed file with 5 additions and 1 deletion.

@@ -1,8 +1,12 @@
_This Gist records optimization effort of **DLRRM** on PyTorch CPU path._

Branch on track: [dlrm](https://github.com/mingfeima/pytorch/commits/dlrm)

Task list:
[x] - LAMB fused optimizer (fp32)
[x] - Adagrad fused optimizer (fp32)

## LAMB optimizer

**LAMB optimizer** - proposed in the paper [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/pdf/1904.00962.pdf).
mingfeima revised this gist
Mar 2, 2021. 1 changed file with 5 additions and 0 deletions.

@@ -65,4 +65,9 @@
```bash
### ADAGRAD optimier bench:
unfused: 0.1022 ms; fused: 0.0321 ms
```
To reproduce the result (notice that jemalloc is applied):
```bash
./run.sh test_fused_adagrad.py
```
mingfeima revised this gist
Mar 2, 2021. 1 changed file with 24 additions and 1 deletion.

@@ -1,5 +1,6 @@
_This Gist records optimization effort of **LAMB optimizer** on PyTorch CPU path._

Branch on track: [dlrm](https://github.com/mingfeima/pytorch/commits/dlrm)

## LAMB optimizer

@@ -42,4 +43,26 @@ To reproduce the result (notice that jemalloc is applied):
[Notes]
- perf speedup primarily comes from: a) reduced memory bandwidth for intermediate tensors; b) the kernel does no additional memory allocation: for the temp result of `adam_step` it reuses the memory of `grad`, so the kernel rewrites the gradient tensor since the gradient is no longer used after the weight update.
- the 4.9x perf speedup is measured on a weight of size nn.Linear(1024, 1024); the speedup ratio would be greater for bigger weight tensors.
- thread synchronization: the algorithm itself requires thread sync (e.g. the norms of the weight and adam_step). Ideally we could do this with `#pragma omp barrier` and finish the whole computation within a single omp session. But this would trigger a bug: the PyTorch omp wrapper `at::parallel` does not guarantee that all omp threads in the same team are used (N=64 launches 16 threads even though #cores is 20), so the unused threads would never reach the barrier and keep waiting. So I break the code into 2 omp sessions.

## Adagrad Fusion

#### Usage
```python
### fused=True will use native C++ fused kernel from ATen
### fused=False will fallback to imperative torch impl, used for validation purposes
optimizer = optim.Adagrad(model.parameters(), lr=0.01, fused=True)
```

#### Testing
The mnist from pytorch/examples converges as
```bash
Test set: Average loss: 0.0363, Accuracy: 9881/10000 (99%)
```

#### Performance
I tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores per socket, dual sockets. For a **single socket** run (with jemalloc), the update step of a [1024, 1024] weight tensor achieves a **3.2x** speedup:
```bash
### ADAGRAD optimier bench:
unfused: 0.1022 ms; fused: 0.0321 ms
```
mingfeima revised this gist
Mar 2, 2021. 1 changed file with 1 addition and 0 deletions.

@@ -1,5 +1,6 @@
_This Gist records optimization effort of **LAMB optimizer** on PyTorch CPU path._

Branch on track: [dlrm](https://github.com/mingfeima/pytorch/commits/dlrm)

## LAMB optimizer
mingfeima renamed this gist
Mar 2, 2021. 1 changed file with 6 additions and 5 deletions.

@@ -1,21 +1,22 @@
_This Gist records optimization effort of **LAMB optimizer** on PyTorch CPU path._

## LAMB optimizer

**LAMB optimizer** - proposed in the paper [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/pdf/1904.00962.pdf).

This implementation refers to fbgemm's gpu code at [gpu_ref](https://github.com/pytorch/FBGEMM/blob/29fa98eb8ad521169366ead870a8b16f6b907b70/fbgemm_gpu/codegen/embedding_backward_code_generator.py#L373).

To use this CPU fused LAMB kernel, you need to cherry-pick [cf5e826b](https://github.com/mingfeima/pytorch/commit/cf5e826b185c83bc88fb57f8f26f29fac927379b) and build from source.

#### Usage
```python
### fused=True will use native C++ fused kernel from ATen
### fused=False will fallback to imperative torch impl, used for validation purposes
optimizer = optim.Lamb(model.parameters(), lr=0.01, fused=True)
```

#### Testing
Test case posted below as `test_fused_lamb.py`; both contiguous and non-contiguous cases are tested. The weight tensor could be non-contiguous on occasion of explicit fusion of multiple `nn.Linear` modules.

@@ -24,7 +25,7 @@ The mnist from pytorch/examples converges as
```bash
Test set: Average loss: 0.0297, Accuracy: 9934/10000 (99%)
```

#### Performance
I tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores per socket, dual sockets. For a **single socket** run (with jemalloc), the update step of a [1024, 1024] weight tensor achieves a **4.9x** speedup:
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 2 additions and 0 deletions.

@@ -6,6 +6,8 @@ This Gist records optimization effort of **LAMB optimizer** on PyTorch CPU path.
This implementation refers to fbgemm's gpu code at [gpu_ref](https://github.com/pytorch/FBGEMM/blob/29fa98eb8ad521169366ead870a8b16f6b907b70/fbgemm_gpu/codegen/embedding_backward_code_generator.py#L373).

To use this CPU fused LAMB kernel, you need to cherry-pick [cf5e826b](https://github.com/mingfeima/pytorch/commit/cf5e826b185c83bc88fb57f8f26f29fac927379b) and build from source.

## Usage
```python
### fused=True will use native C++ fused kernel from ATen
```
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 1 addition and 1 deletion.

@@ -30,7 +30,7 @@ I tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores per socket, dual
```bash
unfused: 0.4495 ms; fused: 0.0923 ms
```
To reproduce the result (notice that jemalloc is applied):
```bash
./run.sh test_fused_lamb.py
```
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 6 additions and 2 deletions.

@@ -30,8 +30,12 @@ I tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores per socket, dual
```bash
unfused: 0.4495 ms; fused: 0.0923 ms
```
To reproduce the result:
```bash
./run.sh test_fused_lamb.py
```

[Notes]
- perf speedup primarily comes from: a) reduced memory bandwidth for intermediate tensors; b) the kernel does no additional memory allocation: for the temp result of `adam_step` it reuses the memory of `grad`, so the kernel rewrites the gradient tensor since the gradient is no longer used after the weight update.
- the 4.9x perf speedup is measured on a weight of size nn.Linear(1024, 1024); the speedup ratio would be greater for bigger weight tensors.
- thread synchronization: the algorithm itself requires thread sync (e.g. the norms of the weight and adam_step). Ideally we could do this with `#pragma omp barrier` and finish the whole computation within a single omp session. But this would trigger a bug: the PyTorch omp wrapper `at::parallel` does not guarantee that all omp threads in the same team are used (N=64 launches 16 threads even though #cores is 20), so the unused threads would never reach the barrier and keep waiting. So I break the code into 2 omp sessions (see the sketch below).
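A minimal C++/OpenMP sketch of that two-session structure, assuming the standard LAMB update with bias correction. The function name `fused_lamb_step` and its flat-array signature are illustrative, not the actual ATen kernel, which is vectorized with Vec256 and uses the PyTorch parallel wrappers.

```C++
#include <cmath>
#include <cstdint>

// Illustrative two-pass fused LAMB step (not the ATen implementation).
// Pass 1: compute adam_step in place (reusing the grad buffer) and accumulate
//         ||w||^2 and ||adam_step||^2.
// Pass 2: scale by the trust ratio and update the weight.
void fused_lamb_step(float* w, float* grad, float* exp_avg, float* exp_avg_sq,
                     int64_t n, float lr, float beta1, float beta2,
                     float eps, float weight_decay, int64_t step) {
  const float bc1 = 1.0f - static_cast<float>(std::pow(beta1, step));
  const float bc2 = 1.0f - static_cast<float>(std::pow(beta2, step));

  double w_norm_sq = 0.0, d_norm_sq = 0.0;

  // First omp session: elementwise adam_step + norm accumulation.
  #pragma omp parallel for reduction(+ : w_norm_sq, d_norm_sq)
  for (int64_t i = 0; i < n; ++i) {
    exp_avg[i] = beta1 * exp_avg[i] + (1.0f - beta1) * grad[i];
    exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1.0f - beta2) * grad[i] * grad[i];
    float adam_step = (exp_avg[i] / bc1) / (std::sqrt(exp_avg_sq[i] / bc2) + eps);
    adam_step += weight_decay * w[i];
    grad[i] = adam_step;                     // reuse the grad buffer, no extra allocation
    w_norm_sq += double(w[i]) * w[i];
    d_norm_sq += double(adam_step) * adam_step;
  }

  const float w_norm = std::sqrt(static_cast<float>(w_norm_sq));
  const float d_norm = std::sqrt(static_cast<float>(d_norm_sq));
  const float trust_ratio = (w_norm > 0.0f && d_norm > 0.0f) ? w_norm / d_norm : 1.0f;

  // Second omp session: apply the scaled update.
  #pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    w[i] -= lr * trust_ratio * grad[i];
  }
}
```

The trust-ratio computation is exactly the point that needs all threads synchronized, which is why the note above splits the kernel into two omp sessions instead of relying on a barrier.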
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 2 additions and 1 deletion.

@@ -33,4 +33,5 @@ unfused: 0.4495 ms; fused: 0.0923 ms
[Notes]
- perf speedup primarily comes from: a) reduced memory bandwidth for intermediate tensors; b) the kernel does no additional memory allocation: for the temp result of `adam_step` it reuses the memory of `grad`, so the kernel rewrites the gradient tensor since the gradient is no longer used after the weight update.
- the 4.9x perf speedup is measured on a weight of size nn.Linear(1024, 1024); the speedup ratio would be greater for bigger weight tensors.
- thread synchronization: the algorithm itself requires thread sync (e.g. the norms of the weight and adam_step). Ideally we could do this with `#pragma omp barrier` and finish the whole computation within a single omp session. But this would trigger a bug: the PyTorch omp wrapper `at::parallel` does not guarantee that all omp threads in the same team are used (N=64 launches 16 threads even though #cores is 20), so the unused threads would never reach the barrier and keep waiting. So I break the code into 2 omp sessions.
- the kernel downcasts the input parameters `eps` and `weight_decay` from double to float for vectorization purposes. So if `eps` is smaller than float's [numerical limit](https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon), it would be inaccurate! All the default parameters (e.g.
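A quick, hedged illustration of that precision caveat (a standalone snippet, not taken from the kernel): once `eps` is downcast to float, a value well below float's epsilon contributes nothing when added to a quantity of order one.

```C++
#include <cstdio>
#include <limits>

int main() {
  // float carries ~7 decimal digits; its machine epsilon is ~1.19e-07.
  std::printf("float epsilon: %g\n", std::numeric_limits<float>::epsilon());

  const double eps_d = 1e-10;                     // a reasonable-looking eps in double
  const float eps_f = static_cast<float>(eps_d);  // what the fused kernel actually uses
  // Adding it to a value of order 1 is a no-op in fp32: this prints 1.
  std::printf("1.0f + eps_f == 1.0f ? %d\n", (1.0f + eps_f == 1.0f) ? 1 : 0);
  return 0;
}
```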
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 2 additions and 1 deletion.

@@ -32,4 +32,5 @@ unfused: 0.4495 ms; fused: 0.0923 ms
[Notes]
- perf speedup primarily comes from: a) reduced memory bandwidth for intermediate tensors; b) the kernel does no additional memory allocation: for the temp result of `adam_step` it reuses the memory of `grad`, so the kernel rewrites the gradient tensor since the gradient is no longer used after the weight update.
- the 4.9x perf speedup is measured on a weight of size nn.Linear(1024, 1024); the speedup ratio would be greater for bigger weight tensors.
- thread synchronization: the algorithm itself requires thread sync (e.g. the norms of the weight and adam_step). Ideally we could do this with `#pragma omp barrier` and finish the whole computation within a single omp session. But this would trigger a bug: the PyTorch omp wrapper `at::parallel` does not guarantee that all omp threads in the same team are used (N=64 launches 16 threads even though #cores is 20), so the unused threads would never reach the barrier and keep waiting.
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 4 additions and 0 deletions.

@@ -29,3 +29,7 @@
```bash
### LAMB optimier bench:
unfused: 0.4495 ms; fused: 0.0923 ms
```

[Notes]
- perf speedup primarily comes from: a) reduced memory bandwidth for intermediate tensors; b) the kernel does no additional memory allocation: for the temp result of `adam_step` it reuses the memory of `grad`, so the kernel rewrites the gradient tensor since the gradient is no longer used after the weight update.
- the 4.9x perf speedup is measured on a weight of size nn.Linear(1024, 1024); the speedup ratio would be greater for bigger weight tensors.
mingfeima revised this gist
Feb 26, 2021. 1 changed file with 6 additions and 1 deletion.

@@ -23,4 +23,9 @@ Test set: Average loss: 0.0297, Accuracy: 9934/10000 (99%)

## Performance
I tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores per socket, dual sockets. For **single socket** run (with jemalloc), the update step of a [1024, 1024] weight tensor achieves **4.9x** speedup:
```bash
### LAMB optimier bench:
unfused: 0.4495 ms; fused: 0.0923 ms
```