@seanorama
Last active February 19, 2020 13:48

Revisions

  1. seanorama revised this gist Feb 13, 2020. 1 changed file with 3 additions and 4 deletions.
     7 changes: 3 additions & 4 deletions in `hfds-compress-files.md`

     ````diff
     @@ -4,11 +4,10 @@ This hacky method processes 1 file at a time:
      1. **copy to a local disk**
      2. compress
      3. put back onto HDFS
     -4. delete original file from HDFS
     +4. delete original file from HDFS and compressed file from local disk.
      
     -So **BE CAREFUL**:
     -- inspect the list of files to be compressed before executing!
     -- if not you could fill the local disk, or being dealing with a compression that takes a very long time.
     +BE CAREFUL: **Before executing, inspect the size of each file!**
     +- The risk is: a single large file could fill the local disk or you could leave the server compressing a single large file for hours.
      
      # How
     ````
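
     The warning added here has no accompanying command in the gist. A minimal sketch of such a size check, assuming `${files}` has already been built as in step 4 of the gist (see the created file in revision 5 below), could be:

     ```
     # Print a human-readable size for every candidate file before compressing anything
     for file in ${files}; do
       hdfs dfs -du -h "${file}"
     done
     ```

     Skip (or handle separately) anything comparable in size to the free space on the local partition.
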
  2. seanorama revised this gist Feb 13, 2020. 1 changed file with 10 additions and 2 deletions.
     12 changes: 10 additions & 2 deletions in `hfds-compress-files.md`

     ````diff
     @@ -1,8 +1,16 @@
      # Compress files which are already on HDFS
      
     -This is a hacky approach involving copying each file locally, compressing, and then putting back onto HDFS. This obviously won't work if the file is larger than can fit in any local disk.
     +This hacky method processes 1 file at a time:
     +1. **copy to a local disk**
     +2. compress
     +3. put back onto HDFS
     +4. delete original file from HDFS
      
     -# How:
     +So **BE CAREFUL**:
     +- inspect the list of files to be compressed before executing!
     +- if not you could fill the local disk, or being dealing with a compression that takes a very long time.
     +
     +# How
      
      1. (optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.
     ````
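
     The 1-file-at-a-time cycle described above is what the loop in step 5 of the gist automates. A minimal manual dry run on a single file, with a hypothetical path standing in for a real audit log, might look like:

     ```
     file=/ranger/audit/yarn/20200101/example.log   # hypothetical path, substitute a real file
     hdfs dfs -copyToLocal "${file}" .              # 1. copy to a local disk
     gzip "$(basename ${file})"                     # 2. compress
     hdfs dfs -moveFromLocal "$(basename ${file}).gz" "$(dirname ${file})/"   # 3. put back (also removes the local .gz)
     hdfs dfs -stat "${file}.gz"                    # confirm the compressed copy landed
     hdfs dfs -rm -skipTrash "${file}"              # 4. delete original file from HDFS
     ```
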
  3. seanorama revised this gist Jan 22, 2020. 1 changed file with 1 addition and 1 deletion.
     2 changes: 1 addition & 1 deletion in `hfds-compress-files.md`

     ````diff
     @@ -8,7 +8,7 @@ This is a hacky approach involving copying each file locally, compressing, and t
      
      2. (optional) Become HDFS and kinit. You can do this as any user that can access the files.
      ```
     -sudo -u hfds -i
     +sudo -u hdfs -i
      keytab=/etc/security/keytabs/hdfs.headless.keytab
      kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
     ````
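
     After this fix (`hfds` to `hdfs`) the kinit line obtains a ticket from the headless keytab. A quick sanity check, not part of the gist, is:

     ```
     klist            # should show a valid ticket for the principal taken from the keytab
     hdfs dfs -ls /   # confirms HDFS accepts the ticket
     ```
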
  4. seanorama revised this gist Jan 22, 2020. 1 changed file with 3 additions and 2 deletions.
     5 changes: 3 additions & 2 deletions in `hfds-compress-files.md`

     ````diff
     @@ -4,11 +4,12 @@ This is a hacky approach involving copying each file locally, compressing, and t
      
      # How:
      
     -1. SSH to a data node.
     +1. (optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.
      
     -2. Become HDFS and kinit.
     +2. (optional) Become HDFS and kinit. You can do this as any user that can access the files.
      ```
      sudo -u hfds -i
      keytab=/etc/security/keytabs/hdfs.headless.keytab
      kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
      ```
     ````
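
     Since both steps are now optional and any user with access to the files will do, the non-hdfs variant (with `alice` as a purely hypothetical principal) is simply:

     ```
     kinit alice   # password-based kinit as any user who can read and write the files
     ```
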
  5. seanorama created this gist Jan 22, 2020.
     33 changes: 33 additions & 0 deletions in `hfds-compress-files.md`

     ````diff
     @@ -0,0 +1,33 @@
     +# Compress files which are already on HDFS
     +
     +This is a hacky approach involving copying each file locally, compressing, and then putting back onto HDFS. This obviously won't work if the file is larger than can fit in any local disk.
     +
     +# How:
     +
     +1. SSH to a data node.
     +
     +2. Become HDFS and kinit.
     +```
     +sudo -u hfds -i
     +keytab=/etc/security/keytabs/hdfs.headless.keytab
     +kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
     +```
     +
     +3. Change to a partition that is big enough to hold 1-2 of the uncompressed files:
     +
     +4. Get list of files (this example is getting Ranger YARN audits)
     +```
     +files=$(hdfs dfs -find /ranger/audit/yarn | grep -Ev "($(date '+%Y%m%d')|$(date -d yesterday +'%Y%m%d'))" | grep .log$)
     +```
     +5. Compress and remove uncompressed
     +```
     +for file in ${files}; do
     +filename="$(basename ${file})"
     +filedir="$(dirname ${file})"
     +hdfs dfs -copyToLocal "${file}" &&
     +gzip "${filename}" &&
     +hdfs dfs -moveFromLocal "${filename}".gz "${filedir}/" &&
     +hdfs dfs -stat "${file}.gz" &&
     +hdfs dfs -rm -skipTrash "${file}"
     +done
     +```
     ````
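
     Step 3 of the created gist has no command attached. A minimal sketch of that step, plus an overall check after step 5 finishes, with `/data01/tmp` as a hypothetical local partition, might be:

     ```
     # Step 3: pick a local partition with enough free space for 1-2 uncompressed files
     df -h /data01                            # compare free space against the sizes seen in HDFS
     mkdir -p /data01/tmp && cd /data01/tmp

     # After step 5: confirm the .gz files replaced the originals
     hdfs dfs -du -h /ranger/audit/yarn | tail
     ```

     Note also that the `grep -Ev` in step 4 excludes any path containing today's or yesterday's date, so the newest logs are left untouched.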