@seanorama
Last active February 19, 2020 13:48

Revisions

  1. seanorama revised this gist Feb 13, 2020. 1 changed file with 3 additions and 4 deletions.
     7 changes: 3 additions & 4 deletions in `hfds-compress-files.md`

     ````diff
     @@ -4,11 +4,10 @@ This hacky method processes 1 file at a time:
      1. **copy to a local disk**
      2. compress
      3. put back onto HDFS
     -4. delete original file from HDFS
     +4. delete original file from HDFS and compressed file from local disk.
      
     -So **BE CAREFUL**:
     -- inspect the list of files to be compressed before executing!
     -- if not you could fill the local disk, or being dealing with a compression that takes a very long time.
     +BE CAREFUL: **Before executing, inspect the size of each file!**
     +- The risk is: a single large file could fill the local disk or you could leave the server compressing a single large file for hours.
      
      # How
     ````
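
     The warning added here has no accompanying command in the gist. A minimal sketch of such a size check, assuming `${files}` has already been built as in step 4 of the gist (see the created file in revision 5 below), could be:

     ```
     # Print a human-readable size for every candidate file before compressing anything
     for file in ${files}; do
       hdfs dfs -du -h "${file}"
     done
     ```

     Skip (or handle separately) anything comparable in size to the free space on the local partition.
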
  2. seanorama revised this gist Feb 13, 2020. 1 changed file with 10 additions and 2 deletions.
     12 changes: 10 additions & 2 deletions in `hfds-compress-files.md`

     ````diff
     @@ -1,8 +1,16 @@
      # Compress files which are already on HDFS
      
     -This is a hacky approach involving copying each file locally, compressing, and then putting back onto HDFS. This obviously won't work if the file is larger than can fit in any local disk.
     +This hacky method processes 1 file at a time:
     +1. **copy to a local disk**
     +2. compress
     +3. put back onto HDFS
     +4. delete original file from HDFS
      
     -# How:
     +So **BE CAREFUL**:
     +- inspect the list of files to be compressed before executing!
     +- if not you could fill the local disk, or being dealing with a compression that takes a very long time.
     +
     +# How
      
      1. (optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.
     ````
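
     The 1-file-at-a-time cycle described above is what the loop in step 5 of the gist automates. A minimal manual dry run on a single file, with a hypothetical path standing in for a real audit log, might look like:

     ```
     file=/ranger/audit/yarn/20200101/example.log   # hypothetical path, substitute a real file
     hdfs dfs -copyToLocal "${file}" .              # 1. copy to a local disk
     gzip "$(basename ${file})"                     # 2. compress
     hdfs dfs -moveFromLocal "$(basename ${file}).gz" "$(dirname ${file})/"   # 3. put back (also removes the local .gz)
     hdfs dfs -stat "${file}.gz"                    # confirm the compressed copy landed
     hdfs dfs -rm -skipTrash "${file}"              # 4. delete original file from HDFS
     ```
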
  3. seanorama revised this gist Jan 22, 2020. 1 changed file with 1 addition and 1 deletion.
     2 changes: 1 addition & 1 deletion in `hfds-compress-files.md`

     ````diff
     @@ -8,7 +8,7 @@ This is a hacky approach involving copying each file locally, compressing, and t
      
      2. (optional) Become HDFS and kinit. You can do this as any user that can access the files.
      ```
     -sudo -u hfds -i
     +sudo -u hdfs -i
      keytab=/etc/security/keytabs/hdfs.headless.keytab
      kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
     ````
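
     After this fix (`hfds` to `hdfs`) the kinit line obtains a ticket from the headless keytab. A quick sanity check, not part of the gist, is:

     ```
     klist            # should show a valid ticket for the principal taken from the keytab
     hdfs dfs -ls /   # confirms HDFS accepts the ticket
     ```
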
  4. seanorama revised this gist Jan 22, 2020. 1 changed file with 3 additions and 2 deletions.
     5 changes: 3 additions & 2 deletions in `hfds-compress-files.md`

     ````diff
     @@ -4,11 +4,12 @@ This is a hacky approach involving copying each file locally, compressing, and t
      
      # How:
      
     -1. SSH to a data node.
     +1. (optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.
      
     -2. Become HDFS and kinit.
     +2. (optional) Become HDFS and kinit. You can do this as any user that can access the files.
      ```
      sudo -u hfds -i
      keytab=/etc/security/keytabs/hdfs.headless.keytab
      kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
      ```
     ````
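
     Since both steps are now optional and any user with access to the files will do, the non-hdfs variant (with `alice` as a purely hypothetical principal) is simply:

     ```
     kinit alice   # password-based kinit as any user who can read and write the files
     ```
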
  5. seanorama created this gist Jan 22, 2020.
     33 changes: 33 additions & 0 deletions in `hfds-compress-files.md`

     ````diff
     @@ -0,0 +1,33 @@
     +# Compress files which are already on HDFS
     +
     +This is a hacky approach involving copying each file locally, compressing, and then putting back onto HDFS. This obviously won't work if the file is larger than can fit in any local disk.
     +
     +# How:
     +
     +1. SSH to a data node.
     +
     +2. Become HDFS and kinit.
     +```
     +sudo -u hfds -i
     +keytab=/etc/security/keytabs/hdfs.headless.keytab
     +kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
     +```
     +
     +3. Change to a partition that is big enough to hold 1-2 of the uncompressed files:
     +
     +4. Get list of files (this example is getting Ranger YARN audits)
     +```
     +files=$(hdfs dfs -find /ranger/audit/yarn | grep -Ev "($(date '+%Y%m%d')|$(date -d yesterday +'%Y%m%d'))" | grep .log$)
     +```
     +5. Compress and remove uncompressed
     +```
     +for file in ${files}; do
     +filename="$(basename ${file})"
     +filedir="$(dirname ${file})"
     +hdfs dfs -copyToLocal "${file}" &&
     +gzip "${filename}" &&
     +hdfs dfs -moveFromLocal "${filename}".gz "${filedir}/" &&
     +hdfs dfs -stat "${file}.gz" &&
     +hdfs dfs -rm -skipTrash "${file}"
     +done
     +```
     ````
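
     Step 3 of the created gist has no command attached. A minimal sketch of that step, plus an overall check after step 5 finishes, with `/data01/tmp` as a hypothetical local partition, might be:

     ```
     # Step 3: pick a local partition with enough free space for 1-2 uncompressed files
     df -h /data01                            # compare free space against the sizes seen in HDFS
     mkdir -p /data01/tmp && cd /data01/tmp

     # After step 5: confirm the .gz files replaced the originals
     hdfs dfs -du -h /ranger/audit/yarn | tail
     ```

     Note also that the `grep -Ev` in step 4 excludes any path containing today's or yesterday's date, so the newest logs are left untouched.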