Forked from bhavikngala/fast_ai_mooc_important_points.md
Created March 4, 2019 16:31
Revisions
bhavikngala revised this gist
Feb 20, 2019. 1 changed file with 1 addition and 1 deletion.
@@ -4,7 +4,7 @@

Before beginning, I want to thank Jeremy Howard, Rachel Thomas, and the entire fast.ai team for making this awesome, practically oriented MOOC.

1. Progressive image resolution training: Train the network on a lower resolution first and then increase the resolution to get better performance. This can be thought of as transfer learning from the same dataset but at a different resolution. There is a paper by NVIDIA as well that used such an approach to train GANs.
2. Cyclical learning rates: Gradually increasing the learning rate at the start of training helps to avoid getting stuck in saddle points and to explore more of the loss landscape. [https://arxiv.org/abs/1506.01186]
3. To reduce memory usage you can use lower-precision floating point, i.e. float16 instead of float32.
bhavikngala revised this gist
Feb 20, 2019. 1 changed file with 5 additions and 5 deletions.
@@ -53,7 +53,7 @@ out = sigmoid(x) * (max_range - min_range) + min_range

24. When using transfer learning, you should use the stats of the dataset on which the model was trained to normalize your dataset.
25. Paper read: Visualizing the loss landscape of neural networks. [https://arxiv.org/abs/1712.09913]
26. Densenet works very well for smaller datasets and on the segmentation task. Resnet works very well on the segmentation task as well.

@@ -126,7 +126,7 @@ class SomeLoss(nn.Module):

54. Don't implement a paper mindlessly. You can have ideas that the authors didn't have.
55. Paper read: A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. [https://arxiv.org/abs/1803.09820]
56. Google's Fire library.

@@ -148,13 +148,13 @@ class SomeLoss(nn.Module):

65. In neural networks, replace all the operations in the forward function with their `_` version, for example, replace `+` with `add_`, to perform operations in place and save GPU memory.
66. Paper read: Wide residual networks. [https://arxiv.org/abs/1605.07146]. The fast.ai team got first place on the DAWNBench benchmark.
67. Topic read: Stochastic weight averaging.
68. The fastai train phase API.
69. Paper read: LARS. [https://arxiv.org/abs/1708.03888]
70. You can train the model with different optimizers during different training phases.

@@ -163,7 +163,7 @@ TODO: insert image.

72. The very initial stage of the backbone network, where the input channels (say 3) are increased to a higher number (say 64), is called the stem of the backbone network. The Inception network's stem is better than those of other networks. One can try the Inception stem on the ResNet main backbone.
73. Paper read: Progressive growing of GANs. [https://arxiv.org/abs/1710.10196]
74. The most interesting layers to grab output from are the ones before the max-pooling layers because they represent the data best before the grid size changes.
bhavikngala revised this gist
Feb 12, 2019. 1 changed file with 1 addition and 1 deletion.
@@ -1,4 +1,4 @@

This gist contains a list of points I found very useful while watching the fast.ai "Practical deep learning for coders" and "Cutting edge deep learning for coders" MOOC by Jeremy Howard and team. This list may not be complete, as I watched the videos at 1.5x speed in a marathon, but I did write down as many things as I found to be very useful for getting a model working. A fair warning: the points are in no particular order, so you may find the topics all jumbled up.

Before beginning, I want to thank Jeremy Howard, Rachel Thomas, and the entire fast.ai team for making this awesome, practically oriented MOOC.
bhavikngala revised this gist
Feb 12, 2019. 1 changed file with 1 addition and 1 deletion.
@@ -158,7 +158,7 @@ class SomeLoss(nn.Module):

70. You can train the model with different optimizers during different training phases.
71. You can break a 7x7 filter into two filters, 1x7 and 7x1: linearly separable filters. This reduces computation. TODO: insert image.
72. The very initial stage of the backbone network, where the input channels (say 3) are increased to a higher number (say 64), is called the stem of the backbone network. The Inception network's stem is better than those of other networks. One can try the Inception stem on the ResNet main backbone.
bhavikngala revised this gist
Feb 12, 2019. 2 changed files with 171 additions and 99 deletions.
@@ -0,0 +1,171 @@

This gist contains a list of points I found very useful while going through the fast.ai "Practical deep learning for coders" and "Cutting edge deep learning for coders" MOOC by Jeremy Howard and team. This list may not be complete, as I watched the videos at 1.5x speed in a marathon, but I did write down as many things as I found to be very useful for getting a model working. A fair warning: the points are in no particular order, so you may find the topics all jumbled up.

Before beginning, I want to thank Jeremy Howard, Rachel Thomas, and the entire fast.ai team for making this awesome, practically oriented MOOC.

1. Progressive image resolution training: Train the network on a lower resolution first and then increase the resolution to get better performance. This can be thought of as transfer learning from the same dataset but at a different resolution. There is a paper by NVIDIA as well that used such an approach to train GANs.
2. Cyclical learning rates: Gradually increasing the learning rate at the start of training helps to avoid getting stuck in saddle points and to explore more of the loss landscape.
3. To reduce memory usage you can use lower-precision floating point, i.e. float16 instead of float32.
4. Self-supervised learning - the labels are built into the data.
5. For NLP tasks other than language modelling, you can use a language model for transfer learning, i.e. first train the model as a language model and then add the actual functionality.
5. When using transfer learning for NLP, the language model can and should use the entire dataset, i.e. both the train and test sets.
6. Discriminative learning rates: use different learning rates for different layer groups in your network.
7. Random forests can be used to find optimal hyperparameters.
8. Use embeddings for categorical variables.
9. For missing values - replace them with the median of the variable and add a new boolean column saying missing=True/False.
10. Wherever possible use transfer learning; it always increases performance.
11. You can give a range to the sigmoid function in the last layer; it can increase the model performance.

```
out = sigmoid(x) * (max_range - min_range) + min_range
```

13. Complexity is not measured by the number of parameters.
14. You can use the date/time data given in the dataset to extract various useful information like the day of the week, the day of the month, the day of the year, year, month, week, is it a holiday, etc. It is useful for detecting patterns, e.g. a certain event increased because it was a payday or a holiday.
15. More data is always useful.
16. Too much dropout reduces the capacity of the network; experiment with multiple values.
17. You can apply dropout to the output of the embedding layer too (see the sketch after point 18).
18. Batch normalization helps to smooth the loss landscape, thus allowing higher learning rates.
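A minimal PyTorch sketch of points 8, 17, and 18 above: an embedding for a categorical variable, dropout applied to the embedding output, and batch norm on the continuous inputs. All layer sizes and shapes here are made up for illustration.

```
import torch
import torch.nn as nn

class TabularBlock(nn.Module):
    """Toy tabular head: categorical embedding + embedding dropout + BN on continuous features."""
    def __init__(self, n_categories=10, emb_size=4, n_continuous=3, n_out=1):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_size)  # point 8: embeddings for categorical variables
        self.emb_drop = nn.Dropout(0.1)                  # point 17: dropout on the embedding output
        self.bn_cont = nn.BatchNorm1d(n_continuous)      # point 18: batch norm on continuous inputs
        self.fc = nn.Linear(emb_size + n_continuous, n_out)

    def forward(self, x_cat, x_cont):
        e = self.emb_drop(self.emb(x_cat))               # (batch, emb_size)
        c = self.bn_cont(x_cont)                         # (batch, n_continuous)
        return self.fc(torch.cat([e, c], dim=1))

model = TabularBlock()
out = model(torch.randint(0, 10, (8,)), torch.randn(8, 3))  # -> shape (8, 1)
```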
19. Reflection padding works better than zero padding.
20. A larger kernel for the first layer in a CNN is better since the number of input channels is just 3 (or very small) at the beginning.
21. `t[None]` can add a new dimension to a tensor, i.e. convert a 3D tensor to a 4D tensor.
22. Use forward hooks to grab outputs of intermediate layers; this simplifies implementations of pyramid-style networks a lot.
23. Ethics in AI: the privileged are processed by people and the poor are processed by algorithms - Cathy O'Neil.
24. When using transfer learning, you should use the stats of the dataset on which the model was trained to normalize your dataset.
25. Paper read: Visualizing the loss landscape of neural networks.
26. Densenet works very well for smaller datasets and on the segmentation task. Resnet works very well on the segmentation task as well.
27. You can apply modern methods to old papers and get SOTA results.
28. A new UNET-style network: resnet34 + subpixel convolutional upsampling.
29. Subpixel convolutions for upsampling: a lot of improvement in removing checkerboard artifacts.
30. Pretrained discriminator and generator in a GAN.
31. Spectral normalization in GANs.
32. Don't use momentum in GANs, they don't like it.
33. The loss values for the generator and discriminator should converge; the only way to confirm GAN training is by visual inspection.
34. Perceptual losses for style transfer and super resolution.
35. Say there is a network with a complex loss function, or a loss function which requires intermediate layer outputs; then do this:

```
class SomeLoss(nn.Module):
    def __init__(self, network, ...):
        super().__init__()
        self.net = network
        self.hooks = ...  # apply hooks to the network to get intermediate layer outputs
        # additional statements

    def forward(self, x, target):
        y_hat = self.net(x)
        # intermediate outputs are in self.hooks
        # compute the loss
        return loss
```

36. Gradual unfreezing: take a trained model -> replace the last layers -> fine-tune the last layer -> fine-tune the earlier layers.
37. Five steps to avoid overfitting: more data -> data augmentation -> generalizable architecture -> regularization -> reduce architecture complexity.
38. Use lambda functions to reduce lines of code wherever possible.
39. Functions should be 5 lines or less wherever possible.
40. Python debugger: pdb - useful commands \[s, n, l, c, u, h, p\].
41. In case of multiple losses, find a multiplier to make all the losses approximately equal.
42. Batch norm after ReLU makes better sense, since BN normalizes activations and a ReLU after BN would shift the mean and variance again.
43. BN should not be used right after a dropout layer.
44. Receptive field.
45. Use the chunksize parameter in pandas (e.g. in read_csv) to get an iterator over large datasets.
46. NLP tokenization: add a beginning-of-sentence token and field tokens; when converting UPPER case to lower case, add a token denoting UPPER case before the word.
47. Limit the vocabulary to ~60000 words; remove tokens that do not appear more than 2 times.
48. For NLP tasks, the model can be pretrained on a subset of Wikipedia articles.
49. `wget -r`
50. Command line tools can be run in a Jupyter notebook by placing `!` before them.
51. Since sequences cannot be randomly shuffled, we can vary the length of the sequence to add randomness.
52. `perplexity = exp(cross_entropy)` (a short sketch follows point 54).
53. Accuracy can be used as a metric in NLP.
54. Don't implement a paper mindlessly. You can have ideas that the authors didn't have.
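A short PyTorch sketch of points 52 and 53: perplexity is just the exponential of the mean cross-entropy, and accuracy is a simple argmax comparison. The logits and targets below are random placeholders, not real model outputs.

```
import torch
import torch.nn.functional as F

# random stand-ins for a language model's next-token logits and the true token ids
logits = torch.randn(32, 10000)           # (batch, vocab_size)
targets = torch.randint(0, 10000, (32,))  # true next-token indices

cross_entropy = F.cross_entropy(logits, targets)             # mean negative log-likelihood (nats)
perplexity = torch.exp(cross_entropy)                        # point 52: perplexity = exp(cross_entropy)
accuracy = (logits.argmax(dim=1) == targets).float().mean()  # point 53: accuracy as an NLP metric
```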
55. Paper read: A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay.
56. Google's Fire library.
57. VNC port forwarding to access Jupyter notebooks on servers.
58. `!pip install git+URL` for installing a library from git.
59. To free a CNN from a fixed input image size, use adaptive average pooling.
60. Using a bi-directional LSTM in the seq2seq model improves performance. Teacher forcing and attention.
61. In high-dimensional spaces everything is on the edge, so distance does not matter but the angle does. Thus a cosine similarity loss is way better than an L1/L2 loss.
62. The Python library `nmslib` for nearest-neighbor queries.
63. Get the word vectors of the ImageNet classes from WordNet -> train an ImageNet model to predict word vectors -> now you have a search engine for images -> input a word -> get its word vector -> get the images with similar word vectors. I apologize, I do not have the link for this paper.
64. In practice, LeakyReLU is useful for smaller datasets.
65. In neural networks, replace all the operations in the forward function with their `_` version, for example, replace `+` with `add_`, to perform operations in place and save GPU memory.
66. Paper read: Wide residual networks. The fast.ai team got first place on the DAWNBench benchmark.
67. Topic read: Stochastic weight averaging.
68. The fastai train phase API.
69. Paper read: LARS.
70. You can train the model with different optimizers during different training phases.
71. You can break a 7x7 filter into two filters, 1x7 and 7x1: linearly separable filters. This reduces computation (see the sketch after this list). TODO: insert image.
72. The very initial stage of the backbone network, where the input channels (say 3) are increased to a higher number (say 64), is called the stem of the backbone network. The Inception network's stem is better than those of other networks. One can try the Inception stem on the ResNet main backbone.
73. Paper read: Progressive growing of GANs.
74. The most interesting layers to grab output from are the ones before the max-pooling layers because they represent the data best before the grid size changes.

I may have missed some points and there may be some mistakes. I haven't included any paper citations, but all of the above points are from the MOOC and the papers presented in the MOOC.
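A minimal PyTorch sketch of points 59 and 71 from the list above: a 7x7 convolution factorized into a 1x7 followed by a 7x1, plus an adaptive-average-pooling head that makes the classifier independent of the input size. Channel counts and shapes are made up for illustration.

```
import torch
import torch.nn as nn

# point 71: a 7x7 conv factorized into a 1x7 followed by a 7x1 conv
# (roughly 14*C*C weights instead of 49*C*C); padding keeps the spatial size
factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0)),
)

# point 59: adaptive average pooling makes the head independent of the input size
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

x = torch.randn(2, 64, 56, 56)    # any spatial size works, e.g. 56x56 or 224x224
print(head(factorized(x)).shape)  # torch.Size([2, 10])
```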
bhavikngala revised this gist
Feb 12, 2019. 1 changed file with 70 additions and 1 deletion.
bhavikngala revised this gist
Feb 11, 2019. 1 changed file with 7 additions and 2 deletions.
bhavikngala renamed this gist
Feb 11, 2019. 1 changed file with 0 additions and 0 deletions.
File renamed without changes. -
bhavikngala created this gist
Feb 11, 2019.