@ammarsaf
Last active June 10, 2024 03:29

Revisions

  1. ammarsaf revised this gist Jun 10, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion observation-train-yolov5-sagemaker.md
    Original file line number Diff line number Diff line change
    @@ -1,4 +1,4 @@
    # Title: Unsolve training Yolov5 on Sagemaker all night issue [solved]
    # Unsolve training Yolov5 on Sagemaker all night issue [solved]

    Datetime: 1/6/2023, 10:25 am

  2. ammarsaf revised this gist Jun 10, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion observation-train-yolov5-sagemaker.md
    @@ -1,4 +1,4 @@
    # Title: Unsolve training Yolov5 on Sagemaker all night issue
    # Title: Unsolve training Yolov5 on Sagemaker all night issue [solved]

    Datetime: 1/6/2023, 10:25 am

  3. ammarsaf revised this gist Jun 10, 2024. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion observation-train-yolov5-sagemaker.md
    @@ -23,4 +23,5 @@ and wake up on the next morning monitoring it if it still running, or get the `.
    that is idle.
    ## Solution
    * The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback feature exists to solve exactly this problem: you can continue training from the last_checkpoint weights until the required accuracy or metrics are reached.
    1. The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback feature exists to solve exactly this problem: you can continue training from the last_checkpoint weights until the required accuracy or metrics are reached.
    2. Spawn an EC2 server to train the model.
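The checkpoint-resume idea in the solution above can be sketched in Python. This is a minimal sketch, not the author's actual script. It assumes YOLOv5's default output layout (`runs/train/<run>/weights/last.pt`) and uses the `--resume` option of YOLOv5's `train.py`; the helper names `find_latest_checkpoint` and `build_resume_command` are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def find_latest_checkpoint(runs_dir: str = "runs/train") -> Optional[Path]:
    """Return the most recently modified last.pt under runs_dir, or None.

    Assumes the default YOLOv5 layout: <runs_dir>/<run name>/weights/last.pt
    """
    candidates = list(Path(runs_dir).glob("*/weights/last.pt"))
    if not candidates:
        return None
    # The newest file belongs to the run that was interrupted most recently.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_resume_command(checkpoint: Path) -> List[str]:
    """Build the command that asks train.py to continue from a checkpoint."""
    return ["python", "train.py", "--resume", str(checkpoint)]


if __name__ == "__main__":
    ckpt = find_latest_checkpoint()
    if ckpt is None:
        print("No last.pt found; start a fresh training run first.")
    else:
        # Print the command; run it from the YOLOv5 repository root.
        print(" ".join(build_resume_command(ckpt)))
```

Running the printed command in a new session picks up where the interrupted run stopped, so several shorter sessions can replace one all-night run.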
  4. ammarsaf revised this gist Sep 19, 2023. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion observation-train-yolov5-sagemaker.md
    @@ -23,4 +23,4 @@ and wake up on the next morning monitoring it if it still running, or get the `.
    that is idle.
    ## Solution
    * not yet found
    * The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback feature exists to solve exactly this problem: you can continue training from the last_checkpoint weights until the required accuracy or metrics are reached.
  5. ammarsaf revised this gist Jun 1, 2023. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion observation-train-yolov5-sagemaker.md
    @@ -1,4 +1,5 @@
    # Details
    # Title: Unsolve training Yolov5 on Sagemaker all night issue

    Datetime: 1/6/2023, 10:25 am

    ## Observation
  6. ammarsaf created this gist Jun 1, 2023.
    25 changes: 25 additions & 0 deletions observation-train-yolov5-sagemaker.md
    @@ -0,0 +1,25 @@
    # Details
    Datetime: 1/6/2023, 10:25 am

    ## Observation
    * What I want is to train the model all night on AWS. By running all night, I mean that I shut off my laptop, let it train in the cloud,
    and wake up the next morning to check whether it is still running, or to collect the `.pt` file if the training has completed.

    ## Issues
    1. Cannot close the browser and let it train
    2. The instance is assumed to be idle even though a training process is running

    ## Actions
    * I have tried 2 ways to solve these issues:

    1. Running directly in the notebook.
    * Status: Failed; training stopped after I closed the JupyterLab browser
    2. Running in a terminal with `tmux`
    * Status: Failed
    * I created a session and started the training. Then I detached.
    * I closed the browser and the training kept running. This solves the first issue.
    * However, after several epochs (monitored on ClearML), the training failed. I presume SageMaker assumed that
    the instance was idle.
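
The `tmux` attempt above can be sketched as a small Python launcher. This is only an illustration under assumptions: the session name `yolo`, the file `data.yaml`, and the helper `build_tmux_command` are hypothetical, and the training flags shown are examples rather than the ones actually used. It relies on `tmux new-session -d -s <name> <command>`, which starts a command in a detached session.

```python
import shlex
from typing import List


def build_tmux_command(session: str, train_cmd: List[str]) -> List[str]:
    """Build a tmux command that starts train_cmd in a detached session."""
    # -d: do not attach to the new session; -s: session name.
    # The final argument is a single shell string that tmux executes.
    return ["tmux", "new-session", "-d", "-s", session, shlex.join(train_cmd)]


if __name__ == "__main__":
    # Hypothetical training invocation; adjust the flags to your own setup.
    train_cmd = ["python", "train.py", "--data", "data.yaml", "--epochs", "100"]
    cmd = build_tmux_command("yolo", train_cmd)
    print(shlex.join(cmd))
    # To launch for real: import subprocess; subprocess.run(cmd, check=True)
```

As noted above, detaching this way only addresses the first issue (closing the browser); it does not stop the instance from being treated as idle.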
    ## Solution
    * not yet found