Last active
June 10, 2024 03:29
Revisions
ammarsaf revised this gist
Jun 10, 2024. 1 changed file with 1 addition and 1 deletion.

@@ -1,4 +1,4 @@
# Unsolved issue of training YOLOv5 on SageMaker all night [solved]
Datetime: 1/6/2023, 10:25 am
ammarsaf revised this gist
Jun 10, 2024. 1 changed file with 1 addition and 1 deletion.

@@ -1,4 +1,4 @@
# Title: Unsolved issue of training YOLOv5 on SageMaker all night [solved]
Datetime: 1/6/2023, 10:25 am
ammarsaf revised this gist
Jun 10, 2024. 1 changed file with 2 additions and 1 deletion.

@@ -23,4 +23,5 @@ and wake up on the next morning monitoring it if it still running, or get the `.
that is idle.

## Solution
1. The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback exists precisely for this problem: you can continue training from the `last_checkpoint` weights until the required accuracy or metric is reached.
2. Spawn an EC2 server to train the model.
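Solution 1 above can be sketched as a small helper that locates the newest checkpoint and builds the command to continue from it. This is only a sketch under assumptions that are not stated in the note: that training uses the YOLOv5 repository's `train.py` with its `--resume` option, and that checkpoints are saved as `runs/train/<run>/weights/last.pt` (the YOLOv5 default layout). The helper names `find_latest_checkpoint` and `build_resume_command` are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def find_latest_checkpoint(runs_dir: Path) -> Optional[Path]:
    """Return the most recently modified last.pt under runs_dir, or None."""
    # Assumed layout: <runs_dir>/<run name>/weights/last.pt
    candidates = list(runs_dir.glob("*/weights/last.pt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_resume_command(checkpoint: Path) -> List[str]:
    """Build the command that continues training from the given checkpoint."""
    # Assumes the command is run from the YOLOv5 repository root.
    return ["python", "train.py", "--resume", str(checkpoint)]


if __name__ == "__main__":
    ckpt = find_latest_checkpoint(Path("runs/train"))
    if ckpt is None:
        print("No last.pt found; start a fresh training run instead.")
    else:
        print(" ".join(build_resume_command(ckpt)))
```

Each session then only needs to run while the laptop is open: when the instance stops, the next session picks up from the newest `last.pt` instead of restarting from epoch 0.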
ammarsaf revised this gist
Sep 19, 2023. 1 changed file with 1 addition and 1 deletion.

@@ -23,4 +23,4 @@ and wake up on the next morning monitoring it if it still running, or get the `.
that is idle.

## Solution
* The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback exists precisely for this problem: you can continue training from the `last_checkpoint` weights until the required accuracy or metric is reached.
ammarsaf revised this gist
Jun 1, 2023. 1 changed file with 2 additions and 1 deletion.

@@ -1,4 +1,5 @@
# Title: Unsolved issue of training YOLOv5 on SageMaker all night
Datetime: 1/6/2023, 10:25 am

## Observation
ammarsaf created this gist
Jun 1, 2023

@@ -0,0 +1,25 @@
# Details
Datetime: 1/6/2023, 10:25 am

## Observation
* What I want is to train the model all night on AWS. By running all night, I mean that I shut down my laptop, let the model train in the cloud, and check the next morning whether it is still running, or collect the `.pt` file once training has completed.

## Issues
1. Cannot close the browser and let it train
2. The instance is assumed to be idle even while a training process is running

## Actions
* I have tried 2 ways to solve these issues:
  1. Running directly in the notebook.
     * Status: Failed. Training stopped running after I closed the JupyterLab browser.
  2. Running in a terminal with `tmux`
     * Status: Failed
     * I created a session and started the training. Then I detached.
     * I closed the browser and the training kept running. This solves the first issue.
     * However, after several epochs (monitored on ClearML), the training failed. I presume SageMaker assumed that the instance was idle.

## Solution
* not yet found
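
For reference, the `tmux` attempt listed under Actions can be sketched as below. This is a sketch of that attempt only, and as noted it addresses just the first issue, not the idle shutdown. The helper name `build_tmux_command`, the session name, and the training arguments (`data.yaml`, `yolov5s.pt`, 100 epochs) are hypothetical placeholders, not values from the original run.

```python
import shlex
from typing import List


def build_tmux_command(session: str, train_cmd: List[str]) -> List[str]:
    """Build a tmux command that starts train_cmd in a detached session."""
    # -d: start detached, -s: session name.
    # The final argument is the shell command tmux runs inside the session.
    return ["tmux", "new-session", "-d", "-s", session, shlex.join(train_cmd)]


if __name__ == "__main__":
    # Placeholder training arguments; replace with the real dataset and weights.
    train_cmd = ["python", "train.py", "--data", "data.yaml",
                 "--weights", "yolov5s.pt", "--epochs", "100"]
    # Print the command rather than running it, so this works without tmux.
    print(shlex.join(build_tmux_command("yolo-train", train_cmd)))
```

Passing the resulting list to `subprocess.run(..., check=True)` would start the session; `tmux attach -t yolo-train` reattaches to it later.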