
Revisions

  1. @Willian-Zhang revised this gist May 11, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion tensorflow_1_8_high_sierra_gpu.md
    @@ -1,4 +1,4 @@
    -# Tensorflow 1.8 with CUDA on macOS High Sierra 10.13.4 for eGPU
    +# Tensorflow 1.8 with CUDA on macOS High Sierra 10.13.4

    Largely based on the [Tensorflow 1.6 gist](https://gist.github.com/mattiasarro/1f3498a26ad111a8d99199eaf64551be),
    and [Tensorflow 1.7 gist for xcode](https://gist.github.com/pavelmalik/d51036d508c8753c86aed1f3ff1e6967)
  2. @Willian-Zhang revised this gist May 11, 2018. 1 changed file with 33 additions and 33 deletions.
    66 changes: 33 additions & 33 deletions tensorflow_1_8_high_sierra_gpu.md
    @@ -1,7 +1,8 @@
    # Tensorflow 1.8 with CUDA on macOS High Sierra 10.13.4 for eGPU

    Largely based on the [Tensorflow 1.6 gist](https://gist.github.com/mattiasarro/1f3498a26ad111a8d99199eaf64551be),
    -and [Tensorflow 1.7 gist for xcode](https://gist.github.com/pavelmalik/d51036d508c8753c86aed1f3ff1e6967), this should hopefully simplify things a bit.
    +and [Tensorflow 1.7 gist for xcode](https://gist.github.com/pavelmalik/d51036d508c8753c86aed1f3ff1e6967)
    +and [Tensorflow 1.7 gist for eGPU](https://gist.github.com/Willian-Zhang/088e017774536880bd425178b46b8c17), this should hopefully simplify things a bit.

    ## Requirements

    @@ -196,11 +197,11 @@ git checkout v1.8.0
    ```

    #### Apply Patch
    -Apply the following [patch](https://gist.github.com/Willian-Zhang/a3bd10da2d8b343875f3862b2a62eb3b#file-xtensorflow18macos.patch) to fix a couple build issues:
    +Apply the following [patch](https://gist.github.com/Willian-Zhang/a3bd10da2d8b343875f3862b2a62eb3b#file-xtensorflow18macos-patch) to fix a couple build issues:

    ``` bash
    wget https://gist.github.com/Willian-Zhang/a3bd10da2d8b343875f3862b2a62eb3b/raw/xtensorflow18macos.patch
    -git apply xtensorflow17macos.patch
    +git apply xtensorflow18macos.patch
    ```


    @@ -362,64 +363,63 @@ wget https://gist.github.com/Willian-Zhang/290dceb96679c8f413e42491c9
    python mnist_cnn.py
    ```
    ```
    -/usr/local/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    +/usr/local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    from ._conv import register_converters as _register_converters
    Using TensorFlow backend.
    x_train shape: (60000, 28, 28, 1)
    60000 train samples
    10000 test samples
    Train on 60000 samples, validate on 10000 samples
    Epoch 1/12
    -2018-04-08 03:29:00.155517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
    -2018-04-08 03:29:00.155661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
    +2018-05-11 04:51:10.335377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
    +2018-05-11 04:51:10.336052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
    name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
    pciBusID: 0000:c4:00.0
    -totalMemory: 11.00GiB freeMemory: 10.11GiB
    -2018-04-08 03:29:00.155677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
    -2018-04-08 03:29:00.562343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
    -2018-04-08 03:29:00.562373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
    -2018-04-08 03:29:00.562403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
    -2018-04-08 03:29:00.562536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9781 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:c4:00.0, compute capability: 6.1)
    -2018-04-08 03:29:00.563022: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 9.55G (10256140800 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    -2018-04-08 03:29:00.868307: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    -2018-04-08 03:29:00.906005: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    -2018-04-08 03:29:00.973462: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    -59904/60000 [============================>.] - ETA: 0s - loss: 0.2624 - acc: 0.92022018-04-08 03:29:07.381067: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    -60000/60000 [==============================] - 8s 129us/step - loss: 0.2620 - acc: 0.9203 - val_loss: 0.0587 - val_acc: 0.9825
    +totalMemory: 11.00GiB freeMemory: 9.37GiB
    +2018-05-11 04:51:10.336075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
    +2018-05-11 04:51:11.063831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
    +2018-05-11 04:51:11.063856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
    +2018-05-11 04:51:11.063864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
    +2018-05-11 04:51:11.064768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9065 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:c4:00.0, compute capability: 6.1)
    +2018-05-11 04:51:11.534095: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    +2018-05-11 04:51:11.579370: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    +2018-05-11 04:51:11.644835: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    +59264/60000 [============================>.] - ETA: 0s - loss: 0.2604 - acc: 0.92082018-05-11 04:51:19.228205: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    +60000/60000 [==============================] - 10s 159us/step - loss: 0.2588 - acc: 0.9213 - val_loss: 0.0561 - val_acc: 0.9829
    Epoch 2/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0891 - acc: 0.9733 - val_loss: 0.0437 - val_acc: 0.9850
    +60000/60000 [==============================] - 4s 66us/step - loss: 0.0875 - acc: 0.9742 - val_loss: 0.0427 - val_acc: 0.9857
    Epoch 3/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0681 - acc: 0.9789 - val_loss: 0.0341 - val_acc: 0.9881
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0662 - acc: 0.9803 - val_loss: 0.0356 - val_acc: 0.9875
    Epoch 4/12
    -60000/60000 [==============================] - 4s 67us/step - loss: 0.0569 - acc: 0.9829 - val_loss: 0.0398 - val_acc: 0.9859
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0549 - acc: 0.9839 - val_loss: 0.0325 - val_acc: 0.9896
    Epoch 5/12
    -60000/60000 [==============================] - 4s 70us/step - loss: 0.0480 - acc: 0.9856 - val_loss: 0.0303 - val_acc: 0.9898
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0471 - acc: 0.9859 - val_loss: 0.0309 - val_acc: 0.9901
    Epoch 6/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0438 - acc: 0.9869 - val_loss: 0.0288 - val_acc: 0.9897
    +60000/60000 [==============================] - 4s 68us/step - loss: 0.0421 - acc: 0.9873 - val_loss: 0.0297 - val_acc: 0.9903
    Epoch 7/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0379 - acc: 0.9881 - val_loss: 0.0287 - val_acc: 0.9905
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0377 - acc: 0.9884 - val_loss: 0.0259 - val_acc: 0.9908
    Epoch 8/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0357 - acc: 0.9892 - val_loss: 0.0277 - val_acc: 0.9915
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0357 - acc: 0.9883 - val_loss: 0.0285 - val_acc: 0.9908
    Epoch 9/12
    -60000/60000 [==============================] - 4s 65us/step - loss: 0.0329 - acc: 0.9898 - val_loss: 0.0268 - val_acc: 0.9906
    +60000/60000 [==============================] - 4s 68us/step - loss: 0.0315 - acc: 0.9904 - val_loss: 0.0327 - val_acc: 0.9901
    Epoch 10/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0312 - acc: 0.9903 - val_loss: 0.0295 - val_acc: 0.9911
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0288 - acc: 0.9910 - val_loss: 0.0272 - val_acc: 0.9911
    Epoch 11/12
    -60000/60000 [==============================] - 4s 66us/step - loss: 0.0281 - acc: 0.9908 - val_loss: 0.0292 - val_acc: 0.9908
    +60000/60000 [==============================] - 4s 67us/step - loss: 0.0282 - acc: 0.9912 - val_loss: 0.0248 - val_acc: 0.9920
    Epoch 12/12
    -60000/60000 [==============================] - 4s 65us/step - loss: 0.0277 - acc: 0.9917 - val_loss: 0.0260 - val_acc: 0.9919
    -Test loss: 0.02598250026818114
    -Test accuracy: 0.9919
    +60000/60000 [==============================] - 4s 66us/step - loss: 0.0255 - acc: 0.9923 - val_loss: 0.0283 - val_acc: 0.9912
    +Test loss: 0.028254894825743667
    +Test accuracy: 0.9912
    ```


    You can use [cuda-smi](https://github.com/phvu/cuda-smi) to watch the GPU memory usage. In the case of the mnist example in Keras, you should see the free memory drop to maybe 2% and the fans spin up. Not quite sure what the grappler/clusters/utils.cc:127 warning is, however.

    ```
    -$ ./cuda-smi.dms
    +$ cuda-smi
    Device 0 [PCIe 0:196:0.0]: GeForce GTX 1080 Ti (CC 6.1): 10350 of 11264 MB (i.e. 91.9%) Free
    # when GPU
    -$ ./cuda-smi.dms
    +$ cuda-smi
    Device 0 [PCIe 0:196:0.0]: GeForce GTX 1080 Ti (CC 6.1): 1181.1 of 11264 MB (i.e. 10.5%) Free
    ```

  3. @Willian-Zhang revised this gist May 11, 2018. 2 changed files with 157 additions and 43 deletions.
    101 changes: 58 additions & 43 deletions tensorflow_1_8_high_sierra_gpu.md
    @@ -28,9 +28,9 @@ The rest steps are the same as normal GPU setup.
    #### Check and use pre-compilation (Optional, Risky, Please Skip if you don't understand)
    If you are like me using a MacBook Pro (15-inch, 2016) running 10.13.4 (17E199) and eGPU: NVIDIA GeForce GTX 1080 Ti 11 GiB (or any 6.1-compatible card listed on the [nvidia page](https://developer.nvidia.com/cuda-gpus)).
    You could, at your own risk, skip the `Prepare` and `Compile` steps below,
    -[download .whl from here](https://github.com/Willian-Zhang/tensorflow-precompile/raw/r1.7/tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl) and install it:
    +[download .whl from here](https://github.com/Willian-Zhang/tensorflow-precompile/raw/r1.8/tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl) and install it:
    ``` bash
    -$ pip install tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl
    +pip install tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl
    ```

    And be sure to test after installation.
    @@ -39,8 +39,8 @@ But remember this is **not safe**.
    #### Install Homebrew (Optional)
    For package management, ignore if you have your own `python`, `wget` or you want to download manually.
    ``` bash
    -$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    -$ brew install wget
    +/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    +brew install wget
    ```

    #### NVIDIA Graphics driver
    @@ -63,14 +63,14 @@ Unarchive and rename `XCode.app` to `Xcode8.2.app` in case you want to build and
    #### Install Bazel

    If you have Homebrew installed
    -```
    -$ brew install bazel
    +``` bash
    +brew install bazel
    ```
    or Download the binary [here](https://github.com/bazelbuild/bazel/releases/download/0.10.0/bazel-0.10.0-installer-darwin-x86_64.sh)

    -```
    -$ chmod 755 bazel-0.10.0-installer-darwin-x86_64.sh
    -$ ./bazel-0.10.0-installer-darwin-x86_64.sh
    +```bash
    +chmod 755 bazel-0.10.0-installer-darwin-x86_64.sh
    +./bazel-0.10.0-installer-darwin-x86_64.sh
    ```


    @@ -82,7 +82,15 @@ It should be something along the lines of cuda_9.1.128_mac.dmg
    #### Install NCCL
    Download `NCCL 2.1.15 O/S agnostic and CUDA 9` from [NVIDIA](https://developer.nvidia.com/nccl/nccl-download).

    -Unarchive it and move the corresponding files to `/usr/local/cuda`.
    +Unarchive it and move it to a permanent place, e.g. `/usr/local/nccl`.
    +``` bash
    +sudo mkdir -p /usr/local/nccl
    +cd nccl_2.1.15-1+cuda9.1_x86_64
    +sudo mv * /usr/local/nccl
    +sudo mkdir -p /usr/local/include/third_party/nccl
    +sudo ln -s /usr/local/nccl/include/nccl.h /usr/local/include/third_party/nccl
    +```
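
    To confirm the symlinked header ends up where the build will look for it, a quick check (paths as set up above):
    ``` bash
    # both should list nccl.h without a "No such file or directory" error
    ls -l /usr/local/nccl/include/nccl.h
    ls -l /usr/local/include/third_party/nccl/nccl.h
    ```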


    #### Set up your env paths

    @@ -98,12 +106,13 @@ export PATH=$DYLD_LIBRARY_PATH:$PATH:/Developer/NVIDIA/CUDA-9.1/bin

    #### Compile Samples
    We want to compile a CUDA sample to check that the GPU is correctly recognized and supported.
    +``` bash
    +cd /Developer/NVIDIA/CUDA-9.1/samples
    +chown -R $(whoami) *
    +make -C 1_Utilities/deviceQuery
    +./bin/x86_64/darwin/release/deviceQuery
    +```
    -```
    -$ cd /Developer/NVIDIA/CUDA-9.1/samples
    -$ chown -R $(whoami) *
    -$ make -C 1_Utilities/deviceQuery
    -$ ./bin/x86_64/darwin/release/deviceQuery
    CUDA Device Query (Runtime API) version (CUDART static linking)
    Detected 1 CUDA Capable device(s)
    @@ -153,10 +162,10 @@ Download [cuDNN 7.0.5](https://developer.nvidia.com/compute/machine-learning/cud

    Change into your download directory and follow the post installation steps.
    ``` bash
    -$ tar -xzvf cudnn-9.1-osx-x64-v7-ga.tgz
    -$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    -$ sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
    -$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
    +tar -xzvf cudnn-9.1-osx-x64-v7-ga.tgz
    +sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    +sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
    +sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
    ```


    @@ -171,39 +180,40 @@ $ which pip

    Or download [get-pip](https://bootstrap.pypa.io/get-pip.py) and run it with python. More info [here](https://pip.pypa.io/en/stable/installing/)
    -```
    +``` bash
    python get-pip.py
    ```
    pip will automatically install the TensorFlow dependencies (wheel, six, etc.); if it doesn't, you can install them manually.


    ## Compile
    #### Clone TensorFlow from Repository
    -```
    -$ cd /tmp
    -$ git clone https://github.com/tensorflow/tensorflow
    -$ cd tensorflow
    -$ git checkout v1.7.0
    +``` bash
    +cd /tmp
    +git clone https://github.com/tensorflow/tensorflow
    +cd tensorflow
    +git checkout v1.8.0
    ```

    #### Apply Patch
    -Apply the following [patch](https://gist.github.com/Willian-Zhang/088e017774536880bd425178b46b8c17#file-xtensorflow17macos-patch) to fix a couple build issues:
    +Apply the following [patch](https://gist.github.com/Willian-Zhang/a3bd10da2d8b343875f3862b2a62eb3b#file-xtensorflow18macos.patch) to fix a couple build issues:

    -```
    -$ wget https://gist.github.com/Willian-Zhang/088e017774536880bd425178b46b8c17/raw/xtensorflow17macos.patch
    -$ git apply xtensorflow17macos.patch
    +``` bash
    +wget https://gist.github.com/Willian-Zhang/a3bd10da2d8b343875f3862b2a62eb3b/raw/xtensorflow18macos.patch
    +git apply xtensorflow17macos.patch
    ```



    #### Configure Build
    Except for *CUDA support*, *CUDA SDK version* and *Cuda compute capabilities*, I left the other settings untouched.

    Pay attention to `Cuda compute capabilities`; you might want to look up your own card's value at https://developer.nvidia.com/cuda-gpus.


    ``` bash
    -$ ./configure
    +./configure
    +```
    +```
    You have bazel 0.10.0 installed.
    Please specify the location of python. [Default is /usr/bin/python]:
    @@ -282,26 +292,27 @@ Configuration finished
    Takes about 47 minutes on my machine.

    ``` bash
    -$ bazel build --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package
    +bazel clean
    +bazel build --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package
    ```

    #### Create wheel file and install it

    ``` bash
    -$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    -$ ls ls /tmp/tensorflow_pkg
    -tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl
    +bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    +ls /tmp/tensorflow_pkg
    +tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl
    ```

    If you want to use virtualenv or something, now is the time. Or just:
    ``` bash
    -$ pip install /tmp/tensorflow_pkg/tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl
    +pip install /tmp/tensorflow_pkg/tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl
    ```

    #### Backup your wheel if nothing goes wrong (Optional)

    Files in `/tmp` will be cleaned after reboot.
    -```
    +``` bash
    cp /tmp/tensorflow_pkg/*.whl ~/
    ```

    @@ -310,8 +321,10 @@ It's useful to leave the .whl file lying around in case you want to install it f
    #### Test Installation
    See if everything got linked correctly
    ``` bash
    -$ cd ~
    -$ python
    +cd ~
    +python
    +```
    +``` python
    >>> import tensorflow as tf
    >>> tf.Session()
    2018-04-08 03:25:15.740635: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
    @@ -329,7 +342,7 @@ totalMemory: 11.00GiB freeMemory: 10.18GiB

    ##### Try out new Tensorflow feature (Optional)
    ``` bash
    -$ python
    +python
    ```
    ``` python
    import tensorflow as tf
    @@ -343,10 +356,12 @@ print("hello, {}".format(m)) # => "hello, [[4.]]"

    #### Test GPU Acceleration

    +```bash
    +pip install keras
    +wget https://gist.github.com/Willian-Zhang/290dceb96679c8f413e42491c92722b0/raw/mnist-cnn.py
    +python mnist_cnn.py
    +```
    +```
    -$ pip install keras
    -$ wget https://gist.github.com/Willian-Zhang/290dceb96679c8f413e42491c92722b0/raw/mnist-cnn.py
    -$ python mnist_cnn.py
    /usr/local/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    from ._conv import register_converters as _register_converters
    Using TensorFlow backend.
    99 changes: 99 additions & 0 deletions xtensorflow18macos.patch
    @@ -0,0 +1,99 @@
    diff --git a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
    index 0f7adaf24a..934ccbada6 100644
    --- a/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
    +++ b/tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc
    @@ -69,7 +69,7 @@ __global__ void concat_variable_kernel(
    IntType num_inputs = input_ptr_data.size;

    // verbose declaration needed due to template
    - extern __shared__ __align__(sizeof(T)) unsigned char smem[];
    + extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char smem[];
    IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

    if (useSmem) {
    diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
    index 94989089ec..1d26d4bacb 100644
    --- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
    +++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
    @@ -172,7 +172,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
    const DepthwiseArgs args, const T* input, const T* filter, T* output) {
    assert(CanLaunchDepthwiseConv2dGPUSmall(args));
    // Holds block plus halo and filter data for blockDim.x depths.
    - extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
    + extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
    T* const shared_data = reinterpret_cast<T*>(shared_memory);

    const int num_batches = args.batch;
    @@ -452,7 +452,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
    const DepthwiseArgs args, const T* input, const T* filter, T* output) {
    assert(CanLaunchDepthwiseConv2dGPUSmall(args));
    // Holds block plus halo and filter data for blockDim.z depths.
    - extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
    + extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
    T* const shared_data = reinterpret_cast<T*>(shared_memory);

    const int num_batches = args.batch;
    @@ -1118,7 +1118,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
    const DepthwiseArgs args, const T* output, const T* input, T* filter) {
    assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.z));
    // Holds block plus halo and filter data for blockDim.x depths.
    - extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
    + extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
    T* const shared_data = reinterpret_cast<T*>(shared_memory);

    const int num_batches = args.batch;
    @@ -1388,7 +1388,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
    const DepthwiseArgs args, const T* output, const T* input, T* filter) {
    assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.x));
    // Holds block plus halo and filter data for blockDim.z depths.
    - extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
    + extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char shared_memory[];
    T* const shared_data = reinterpret_cast<T*>(shared_memory);

    const int num_batches = args.batch;
    diff --git a/tensorflow/core/kernels/split_lib_gpu.cu.cc b/tensorflow/core/kernels/split_lib_gpu.cu.cc
    index 393818730b..58a1294005 100644
    --- a/tensorflow/core/kernels/split_lib_gpu.cu.cc
    +++ b/tensorflow/core/kernels/split_lib_gpu.cu.cc
    @@ -121,7 +121,7 @@ __global__ void split_v_kernel(const T* input_ptr,
    int num_outputs = output_ptr_data.size;

    // verbose declaration needed due to template
    - extern __shared__ __align__(sizeof(T)) unsigned char smem[];
    + extern __shared__ __align__(sizeof(T) > 16 ? sizeof(T) : 16) unsigned char smem[];
    IntType* smem_col_scan = reinterpret_cast<IntType*>(smem);

    if (useSmem) {
    diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
    index 0ce5cda517..d4dc2235ac 100644
    --- a/tensorflow/workspace.bzl
    +++ b/tensorflow/workspace.bzl
    @@ -361,11 +361,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
    tf_http_archive(
    name = "protobuf_archive",
    urls = [
    - "https://mirror.bazel.build/github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
    - "https://github.com/google/protobuf/archive/396336eb961b75f03b25824fe86cf6490fb75e3a.tar.gz",
    + "https://mirror.bazel.build/github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
    + "https://github.com/dtrebbien/protobuf/archive/50f552646ba1de79e07562b41f3999fe036b4fd0.tar.gz",
    ],
    - sha256 = "846d907acf472ae233ec0882ef3a2d24edbbe834b80c305e867ac65a1f2c59e3",
    - strip_prefix = "protobuf-396336eb961b75f03b25824fe86cf6490fb75e3a",
    + sha256 = "eb16b33431b91fe8cee479575cee8de202f3626aaf00d9bf1783c6e62b4ffbc7",
    + strip_prefix = "protobuf-50f552646ba1de79e07562b41f3999fe036b4fd0",
    )

    # We need to import the protobuf library under the names com_google_protobuf
    diff --git a/third_party/gpus/cuda/BUILD.tpl b/third_party/gpus/cuda/BUILD.tpl
    index 2a37c65bc7..43446dd99b 100644
    --- a/third_party/gpus/cuda/BUILD.tpl
    +++ b/third_party/gpus/cuda/BUILD.tpl
    @@ -110,7 +110,7 @@ cc_library(
    ".",
    "cuda/include",
    ],
    - linkopts = ["-lgomp"],
    + #linkopts = ["-lgomp"],
    linkstatic = 1,
    visibility = ["//visibility:public"],
    )
  4. @Willian-Zhang revised this gist May 10, 2018. 1 changed file with 5 additions and 5 deletions.
    10 changes: 5 additions & 5 deletions tensorflow_1_8_high_sierra_gpu.md
    @@ -1,4 +1,4 @@
    -# Tensorflow 1.7 with CUDA on macOS High Sierra 10.13.4 for eGPU
    +# Tensorflow 1.8 with CUDA on macOS High Sierra 10.13.4 for eGPU

    Largely based on the [Tensorflow 1.6 gist](https://gist.github.com/mattiasarro/1f3498a26ad111a8d99199eaf64551be),
    and [Tensorflow 1.7 gist for xcode](https://gist.github.com/pavelmalik/d51036d508c8753c86aed1f3ff1e6967), this should hopefully simplify things a bit.
    @@ -43,7 +43,6 @@ $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/inst
    $ brew install wget
    ```


    #### NVIDIA Graphics driver

    Download and install from http://www.nvidia.com/download/driverResults.aspx/130460/en-us
    @@ -61,7 +60,7 @@ Or Find `XCode 8.2` on https://developer.apple.com/download/more/
    Unarchive and rename `XCode.app` to `Xcode8.2.app` in case you want to build and use it next time.


    -#### Install Bazel 0.10
    +#### Install Bazel

    If you have Homebrew installed
    ```
    @@ -75,14 +74,15 @@ $ ./bazel-0.10.0-installer-darwin-x86_64.sh
    ```




    #### Install CUDA Toolkit 9.1
    [Download CUDA-9.1](https://developer.nvidia.com/cuda-downloads?target_os=MacOSX&target_arch=x86_64&target_version=1013&target_type=dmglocal)

    It should be something along the lines of cuda_9.1.128_mac.dmg

    +#### Install NCCL
    +Download `NCCL 2.1.15 O/S agnostic and CUDA 9` from [NVIDIA](https://developer.nvidia.com/nccl/nccl-download).
    +
    +Unarchive it and move the corresponding files to `/usr/local/cuda`.

    #### Set up your env paths

  5. @Willian-Zhang created this gist May 10, 2018.
    411 changes: 411 additions & 0 deletions tensorflow_1_8_high_sierra_gpu.md
    @@ -0,0 +1,411 @@
    # Tensorflow 1.7 with CUDA on macOS High Sierra 10.13.4 for eGPU

    Largely based on the [Tensorflow 1.6 gist](https://gist.github.com/mattiasarro/1f3498a26ad111a8d99199eaf64551be),
    and [Tensorflow 1.7 gist for xcode](https://gist.github.com/pavelmalik/d51036d508c8753c86aed1f3ff1e6967), this should hopefully simplify things a bit.

    ## Requirements

    * NVIDIA Web-Drivers 387.10.10.10.30.106 for 10.13.4 (17E199) __(w/o Security Update)__
    * CUDA-Drivers 387.128
    * CUDA 9.1 Toolkit
    * cuDNN 7.0.5 __(latest for macOS)__
    * NCCL 2.1.15 __(latest for macOS)__
    * Python 2.7
    * XCode 8.2
    * bazel stable 0.13.0 __(latest on HomeBrew)__
    * Tensorflow 1.8 Source Code


    ## eGPU Only
    #### Checkout eGPU setup before install (required for eGPU, ignore otherwise)
    If you don't know how to set up an eGPU on a Mac, check out [these steps](https://egpu.io/forums/mac-setup/script-enable-egpu-on-tb1-2-macs-on-macos-10-13-4/paged/6/#post-33535).
    Make sure you have the eGPU working before installation.
    (You should see your specific graphics card name in Apple > About this Mac > System Report ... > Graphics/Displays)
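
    On the command line, `system_profiler` reports the same information; the grep below is just one way to filter it (the pattern assumes an NVIDIA card):
    ``` bash
    # list attached displays/GPUs; the eGPU should appear by name
    system_profiler SPDisplaysDataType | grep -A 3 -i "nvidia"
    ```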

    The remaining steps are the same as a normal GPU setup.

    ## Prepare
    #### Check and use pre-compilation (Optional, Risky, Please Skip if you don't understand)
    If you are like me using a MacBook Pro (15-inch, 2016) running 10.13.4 (17E199) and eGPU: NVIDIA GeForce GTX 1080 Ti 11 GiB (or any 6.1-compatible card listed on the [nvidia page](https://developer.nvidia.com/cuda-gpus)).
    You could, at your own risk, skip the `Prepare` and `Compile` steps below,
    [download .whl from here](https://github.com/Willian-Zhang/tensorflow-precompile/raw/r1.7/tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl) and install it:
    ``` bash
    $ pip install tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl
    ```

    And be sure to test after installation.
    But remember this is **not safe**.

    #### Install Homebrew (Optional)
    For package management, ignore if you have your own `python`, `wget` or you want to download manually.
    ``` bash
    $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    $ brew install wget
    ```


    #### NVIDIA Graphics driver

    Download and install from http://www.nvidia.com/download/driverResults.aspx/130460/en-us

    #### NVIDIA Cuda driver

    Download and install from http://www.nvidia.com/object/macosx-cuda-387.178-driver.html


    #### Install XCode 8.2

    Download [XCode_8.2.xip](https://download.developer.apple.com/Developer_Tools/Xcode_8.2/Xcode_8.2.xip).
    Or find `XCode 8.2` at https://developer.apple.com/download/more/

    Unarchive and rename `XCode.app` to `Xcode8.2.app` in case you want to build and use it next time.


    #### Install Bazel 0.10

    If you have Homebrew installed
    ```
    $ brew install bazel
    ```
    or Download the binary [here](https://github.com/bazelbuild/bazel/releases/download/0.10.0/bazel-0.10.0-installer-darwin-x86_64.sh)

    ```
    $ chmod 755 bazel-0.10.0-installer-darwin-x86_64.sh
    $ ./bazel-0.10.0-installer-darwin-x86_64.sh
    ```




    #### Install CUDA Toolkit 9.1
    [Download CUDA-9.1](https://developer.nvidia.com/cuda-downloads?target_os=MacOSX&target_arch=x86_64&target_version=1013&target_type=dmglocal)

    It should be something along the lines of cuda_9.1.128_mac.dmg



    #### Set up your env paths

    Edit `~/.bash_profile` and add the following:

    ```
    export CUDA_HOME=/usr/local/cuda
    export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib
    export LD_LIBRARY_PATH=$DYLD_LIBRARY_PATH
    export PATH=$DYLD_LIBRARY_PATH:$PATH:/Developer/NVIDIA/CUDA-9.1/bin
    ```
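
    To make the new variables take effect in the current shell and sanity-check that the toolkit is found (assuming CUDA installed to the default locations above):
    ``` bash
    source ~/.bash_profile
    nvcc --version            # should report the CUDA 9.1 toolchain
    echo $DYLD_LIBRARY_PATH   # should include /usr/local/cuda/lib
    ```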


    #### Compile Samples
    We want to compile a CUDA sample to check that the GPU is correctly recognized and supported.
    ```
    $ cd /Developer/NVIDIA/CUDA-9.1/samples
    $ chown -R $(whoami) *
    $ make -C 1_Utilities/deviceQuery
    $ ./bin/x86_64/darwin/release/deviceQuery
    CUDA Device Query (Runtime API) version (CUDART static linking)
    Detected 1 CUDA Capable device(s)
    Device 0: "GeForce GTX 1080 Ti"
    CUDA Driver Version / Runtime Version 9.1 / 9.1
    CUDA Capability Major/Minor version number: 6.1
    Total amount of global memory: 11264 MBytes (11810963456 bytes)
    (28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
    GPU Max Clock rate: 1645 MHz (1.64 GHz)
    Memory Clock rate: 5505 Mhz
    Memory Bus Width: 352-bit
    L2 Cache Size: 2883584 bytes
    Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
    Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 65536
    Warp size: 32
    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
    Maximum memory pitch: 2147483647 bytes
    Texture alignment: 512 bytes
    Concurrent copy and kernel execution: Yes with 2 copy engine(s)
    Run time limit on kernels: Yes
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Disabled
    Device supports Unified Addressing (UVA): Yes
    Supports Cooperative Kernel Launch: Yes
    Supports MultiDevice Co-op Kernel Launch: No
    Device PCI Domain ID / Bus ID / location ID: 0 / 196 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1
    Result = PASS
    ```

    #### NVIDIA cuDNN - Deep Learning Primitives
    If not already done, register at [https://developer.nvidia.com/cudnn](https://developer.nvidia.com/cudnn)
    Download [cuDNN 7.0.5](https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.1_20171129/cudnn-9.1-osx-x64-v7-ga)

    Change into your download directory and follow the post installation steps.
    ``` bash
    $ tar -xzvf cudnn-9.1-osx-x64-v7-ga.tgz
    $ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    $ sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
    $ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
    ```
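
    A quick way to confirm the copies landed and are readable:
    ``` bash
    # the header and the dylibs should all be listed with read permissions
    ls -l /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
    ```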


    #### Install pip for python 2.7 (Optional)
    Skip if you have your own idea of which python/pip to use:
    ``` bash
    $ which python
    /usr/local/bin/python
    $ which pip
    /usr/local/bin/pip
    ```

    Or download [get-pip](https://bootstrap.pypa.io/get-pip.py) and run it with python. More info [here](https://pip.pypa.io/en/stable/installing/)
    ```
    python get-pip.py
    ```
    pip will automatically install the TensorFlow dependencies (wheel, six, etc.); if it doesn't, you can install them manually.
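
    If pip doesn't pull them in for some reason, a manual install of the packages mentioned above is a sketch like:
    ``` bash
    # add any other packages the tensorflow install later complains about
    pip install wheel six
    ```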


    ## Compile
    #### Clone TensorFlow from Repository
    ```
    $ cd /tmp
    $ git clone https://github.com/tensorflow/tensorflow
    $ cd tensorflow
    $ git checkout v1.7.0
    ```

    #### Apply Patch
    Apply the following [patch](https://gist.github.com/Willian-Zhang/088e017774536880bd425178b46b8c17#file-xtensorflow17macos-patch) to fix a couple build issues:

    ```
    $ wget https://gist.github.com/Willian-Zhang/088e017774536880bd425178b46b8c17/raw/xtensorflow17macos.patch
    $ git apply xtensorflow17macos.patch
    ```
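
    git can dry-run a patch before applying it; `--check` reports problems without touching the tree, which catches checkout/patch version mismatches early:
    ``` bash
    git apply --check xtensorflow17macos.patch && echo "patch applies cleanly"
    ```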



    #### Configure Build
    Except for *CUDA support*, *CUDA SDK version* and *Cuda compute capabilities*, I left the other settings untouched.

    Pay attention to `Cuda compute capabilities`; you might want to look up your own card's value at https://developer.nvidia.com/cuda-gpus.
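
    If you built the deviceQuery sample earlier, it already printed the value to enter; one way to pull it back out rather than consulting the web page:
    ``` bash
    # look for "CUDA Capability Major/Minor version number: 6.1" -> answer 6.1 below
    /Developer/NVIDIA/CUDA-9.1/samples/bin/x86_64/darwin/release/deviceQuery | grep "Capability"
    ```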


    ``` bash
    $ ./configure
    You have bazel 0.10.0 installed.
    Please specify the location of python. [Default is /usr/bin/python]:


    Found possible Python library paths:
    /Library/Python/2.7/site-packages
    Please input the desired Python library path to use. Default is [/Library/Python/2.7/site-packages]

    Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]:
    No Google Cloud Platform support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Hadoop File System support? [Y/n]:
    No Hadoop File System support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]:
    No Amazon S3 File System support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]:
    No Apache Kafka Platform support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with XLA JIT support? [y/N]:
    No XLA JIT support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with GDR support? [y/N]:
    No GDR support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with VERBS support? [y/N]:
    No VERBS support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
    No OpenCL SYCL support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with CUDA support? [y/N]: y
    CUDA support will be enabled for TensorFlow.

    Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1


    Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


    Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:


    Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


    Please specify a list of comma-separated Cuda compute capabilities you want to build with.
    You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
    Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2] (type your own, check on https://developer.nvidia.com/cuda-gpus, mine is 6.1 for GTX 1080 Ti)


    Do you want to use clang as CUDA compiler? [y/N]:
    nvcc will be used as CUDA compiler.

    Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:


    Do you wish to build TensorFlow with MPI support? [y/N]:
    No MPI support will be enabled for TensorFlow.

    Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:


    Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
    Not configuring the WORKSPACE for Android builds.

    Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
    --config=mkl # Build with MKL support.
    --config=monolithic # Config for mostly static monolithic build.
    Configuration finished

    ```

    #### Build Process
    Takes about 47 minutes on my machine.

    ``` bash
    $ bazel build --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package
    ```
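
    If the build starves your machine, Bazel's generic `--jobs` flag (not specific to this project) caps parallel compile actions at the cost of wall-clock time, e.g.:
    ``` bash
    # cap at 4 concurrent actions; pick a number that suits your RAM/cores
    bazel build --jobs 4 --config=cuda --config=opt --action_env PATH --action_env LD_LIBRARY_PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package
    ```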

    #### Create wheel file and install it

    ``` bash
    $ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    $ ls ls /tmp/tensorflow_pkg
    tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl
    ```

    If you want to use virtualenv or something, now is the time. Or just:
    ``` bash
    $ pip install /tmp/tensorflow_pkg/tensorflow-1.7.0-cp36-cp36m-macosx_10_7_x86_64.whl
    ```

    #### Backup your wheel if nothing goes wrong (Optional)

    Files in `/tmp` will be cleaned after reboot.
    ```
    cp /tmp/tensorflow_pkg/*.whl ~/
    ```

    It's useful to leave the .whl file lying around in case you want to install it for another environment.

    #### Test Installation
    See if everything got linked correctly
    ``` bash
    $ cd ~
    $ python
    >>> import tensorflow as tf
    >>> tf.Session()
    2018-04-08 03:25:15.740635: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
    2018-04-08 03:25:15.741260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
    name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
    pciBusID: 0000:c4:00.0
    totalMemory: 11.00GiB freeMemory: 10.18GiB
    2018-04-08 03:25:15.741288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
    2018-04-08 03:25:16.157590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
    2018-04-08 03:25:16.157614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
    2018-04-08 03:25:16.157620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
    2018-04-08 03:25:16.157753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9849 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:c4:00.0, compute capability: 6.1)
    <tensorflow.python.client.session.Session object at 0x10968ef60>
    ```
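
    Another quick check that works on TF 1.x is listing the devices the runtime actually registered; the GPU should show up alongside the CPU:
    ``` bash
    python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"
    # expect something like ['/device:CPU:0', '/device:GPU:0']
    ```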

    ##### Try out new Tensorflow feature (Optional)
    ``` bash
    $ python
    ```
    ``` python
    import tensorflow as tf
    tf.enable_eager_execution()
    tf.executing_eagerly() # => True

    x = [[2.]]
    m = tf.matmul(x, x)
    print("hello, {}".format(m)) # => "hello, [[4.]]"
    ```

    #### Test GPU Acceleration

    ```
    $ pip install keras
    $ wget https://gist.github.com/Willian-Zhang/290dceb96679c8f413e42491c92722b0/raw/mnist-cnn.py
    $ python mnist_cnn.py
    /usr/local/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    from ._conv import register_converters as _register_converters
    Using TensorFlow backend.
    x_train shape: (60000, 28, 28, 1)
    60000 train samples
    10000 test samples
    Train on 60000 samples, validate on 10000 samples
    Epoch 1/12
    2018-04-08 03:29:00.155517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
    2018-04-08 03:29:00.155661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
    name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
    pciBusID: 0000:c4:00.0
    totalMemory: 11.00GiB freeMemory: 10.11GiB
    2018-04-08 03:29:00.155677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
    2018-04-08 03:29:00.562343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
    2018-04-08 03:29:00.562373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
    2018-04-08 03:29:00.562403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
    2018-04-08 03:29:00.562536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9781 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:c4:00.0, compute capability: 6.1)
    2018-04-08 03:29:00.563022: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 9.55G (10256140800 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2018-04-08 03:29:00.868307: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    2018-04-08 03:29:00.906005: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    2018-04-08 03:29:00.973462: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    59904/60000 [============================>.] - ETA: 0s - loss: 0.2624 - acc: 0.92022018-04-08 03:29:07.381067: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
    60000/60000 [==============================] - 8s 129us/step - loss: 0.2620 - acc: 0.9203 - val_loss: 0.0587 - val_acc: 0.9825
    Epoch 2/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0891 - acc: 0.9733 - val_loss: 0.0437 - val_acc: 0.9850
    Epoch 3/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0681 - acc: 0.9789 - val_loss: 0.0341 - val_acc: 0.9881
    Epoch 4/12
    60000/60000 [==============================] - 4s 67us/step - loss: 0.0569 - acc: 0.9829 - val_loss: 0.0398 - val_acc: 0.9859
    Epoch 5/12
    60000/60000 [==============================] - 4s 70us/step - loss: 0.0480 - acc: 0.9856 - val_loss: 0.0303 - val_acc: 0.9898
    Epoch 6/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0438 - acc: 0.9869 - val_loss: 0.0288 - val_acc: 0.9897
    Epoch 7/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0379 - acc: 0.9881 - val_loss: 0.0287 - val_acc: 0.9905
    Epoch 8/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0357 - acc: 0.9892 - val_loss: 0.0277 - val_acc: 0.9915
    Epoch 9/12
    60000/60000 [==============================] - 4s 65us/step - loss: 0.0329 - acc: 0.9898 - val_loss: 0.0268 - val_acc: 0.9906
    Epoch 10/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0312 - acc: 0.9903 - val_loss: 0.0295 - val_acc: 0.9911
    Epoch 11/12
    60000/60000 [==============================] - 4s 66us/step - loss: 0.0281 - acc: 0.9908 - val_loss: 0.0292 - val_acc: 0.9908
    Epoch 12/12
    60000/60000 [==============================] - 4s 65us/step - loss: 0.0277 - acc: 0.9917 - val_loss: 0.0260 - val_acc: 0.9919
    Test loss: 0.02598250026818114
    Test accuracy: 0.9919
    ```


    You can use [cuda-smi](https://github.com/phvu/cuda-smi) to watch the GPU memory usage. In the case of the mnist example in Keras, you should see the free memory drop to maybe 2% and the fans spin up. Not quite sure what the grappler/clusters/utils.cc:127 warning is, however.

    ```
    $ ./cuda-smi.dms
    Device 0 [PCIe 0:196:0.0]: GeForce GTX 1080 Ti (CC 6.1): 10350 of 11264 MB (i.e. 91.9%) Free
    # when GPU
    $ ./cuda-smi.dms
    Device 0 [PCIe 0:196:0.0]: GeForce GTX 1080 Ti (CC 6.1): 1181.1 of 11264 MB (i.e. 10.5%) Free
    ```
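
    The `CUDA_ERROR_OUT_OF_MEMORY` line in the training log above shows TensorFlow trying to grab nearly all free GPU memory up front. If that's a problem (e.g. when the GPU also drives a display), TF 1.x sessions can be told to allocate on demand instead; a minimal sketch:
    ``` bash
    python <<'EOF'
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # grow allocations as needed instead of reserving ~all memory
    sess = tf.Session(config=config)
    print(sess)  # the device-creation log should show a much smaller initial allocation
    EOF
    ```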

    Tested on a MacBook Pro (15-inch, 2016) 10.13.4 (17E199) 2.7 GHz Intel Core i7 and NVIDIA GeForce GTX 1080 Ti 11 GiB