@chamrc
Forked from larryxiao/* - pdf
Created October 18, 2017 08:56
Extract text from PDF files, then clean it up: replace each '\n' with '||' and each '\f' with ' '.
20130607
CONVERT
EXTRACT
CLEANUP
libreoffice --convert-to pdf *.ppt
pdf2txt - extracts text contents of PDF files
pdftk - used here to merge PDF files
in the order given: pdftk 1.pdf 2.pdf 3.pdf cat output merged.pdf
in alphabetical order: pdftk *.pdf cat output merged.pdf
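One thing to watch with the glob form: the shell expands *.pdf before pdftk runs, so on a second run merged.pdf itself would be among the inputs. A minimal dry-run sketch that only prints the command it would build; the helper name print_merge_command is hypothetical, and the printed line is for inspection only (file names containing spaces are not quoted).

```shell
# print_merge_command DIR [OUTPUT]  (hypothetical helper, not part of pdftk)
# Prints the pdftk merge command for every PDF in DIR, in glob order,
# leaving out a previous merge result so it is not fed back in as input.
print_merge_command() {
    dir=${1:-.}
    out=${2:-merged.pdf}
    inputs=""
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=${f##*/}                    # strip the directory part
        [ "$name" = "$out" ] && continue # skip the old output file
        inputs="$inputs $name"
    done
    echo "pdftk$inputs cat output $out"
}
```

Glob order follows the current locale's collation, which is what "alphabetical" means here.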
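#!/bin/bash
# Cleanup: flatten each extracted .txt file onto a single line.
# Assumes ./out is meant to be a directory; without it, mv would overwrite one file named "out".
mkdir -p out
for f in *.txt
do
    [ -e "$f" ] || continue   # skip the literal pattern when no .txt file matches
    echo "Processing $f file..."
    # tr maps one character to one character, so tr '\n' '||' would write a single '|';
    # awk ends every line with the two-character separator '||' instead
    awk '{ printf "%s||", $0 }' "$f" > "$f.temp"
    # replace form feeds (page breaks) with spaces
    tr '\f' ' ' < "$f.temp" > "$f.out"
    mv "$f.out" ./out
    rm "$f.temp"
done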
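Note on the separator: tr translates character for character, so tr '\n' '||' emits one '|' per newline, not two. A minimal sketch of the same cleanup as a single filter, assuming awk is available; the function name join_lines is hypothetical.

```shell
# join_lines (hypothetical name): read text on stdin, turn form feeds
# (page breaks) into spaces, and end every line with "||" instead of a newline.
join_lines() {
    tr '\f' ' ' | awk '{ printf "%s||", $0 }'
}

# For comparison, tr alone ignores the second '|':
#   printf 'a\nb\n' | tr '\n' '||'     prints a|b|
#   printf 'a\nb\n' | join_lines       prints a||b||
```

Usage: join_lines < input.txt > out/input.txt.out (the file names here are placeholders).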
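#!/bin/bash
# Extract: write the text of each PDF to output<name>.pdf.txt
# Globbing with FILES=./*.pdf puts "./" inside the output name:
#   Processing ./20130604202323560.pdf file... "output./20130604202323560.pdf
# Globbing with *.pdf does not:
#   Processing 20130604202323560.pdf file... "output20130604202323560.pdf
for f in *.pdf
do
    [ -e "$f" ] || continue   # skip the literal pattern when no PDF matches
    echo "Processing $f file... output$f.txt"
    # quote "$f" so file names containing spaces stay a single argument
    pdf2txt -o "output$f.txt" "$f"
done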
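The output name built in the loop keeps the .pdf extension (output20130604202323560.pdf.txt). If a shorter name is preferred, one possible variant strips the directory and the extension with parameter expansion; the helper name txt_name_for is hypothetical and this naming is an assumption, not what the script currently produces.

```shell
# txt_name_for PATH  (hypothetical helper)
# Builds "output<name>.txt" from a PDF path, dropping any leading
# directory such as "./" and the trailing ".pdf".
txt_name_for() {
    base=${1##*/}                  # ./20130604202323560.pdf -> 20130604202323560.pdf
    echo "output${base%.pdf}.txt"  # 20130604202323560.pdf   -> output20130604202323560.txt
}
```

Inside the loop it could be used as: pdf2txt -o "$(txt_name_for "$f")" "$f"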