@ablwr
Last active May 14, 2020 15:01

Revisions

  1. ablwr renamed this gist May 14, 2020. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  2. ablwr created this gist May 14, 2020.
    39 changes: 39 additions & 0 deletions gistfile1.txt
    @@ -0,0 +1,39 @@
    #!/bin/bash
    # Convert every PDF under the current directory to ASCII text with pdftotext,
    # mirroring the directory layout under $basefolder. A PDF is converted only
    # when its text copy is missing or older than the PDF itself.

    basefolder="/home/ashley/Development/personal/vasulka-archive-archive/ocr"

    # -print0 paired with read -d '' keeps paths with spaces or newlines intact.
    find . -iname '*.pdf' -print0 | while IFS= read -r -d '' i
    do
      i="${i#./}"                       # drop the leading "./" that find adds
      output="$basefolder/${i%.*}.txt"  # same relative path, final extension swapped for .txt

      # mkdir -p creates every missing parent directory in one call.
      mkdir -p "$(dirname "$output")"

      if [ ! -f "$output" ] || [ "$i" -nt "$output" ]
      then
        echo "$i"
        pdftotext -enc ASCII7 "$i" "$output"
      fi
    done
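
The two decisions the script makes for each PDF (where the text copy goes, and whether it needs regenerating) can be tried out separately, without running pdftotext. The sketch below is illustrative only: `txt_path` and `needs_update` are hypothetical helper names that do not appear in the gist, and the helper assumes a `/` separator between the base folder and the relative path. It uses the suffix-stripping expansion `${pdf%.*}`, so only the final extension is replaced.

```shell
#!/bin/bash

# Hypothetical helper (not part of the gist): map a PDF's relative path to the
# path of its text copy under a base folder.
txt_path() {
  local base="$1" pdf="$2"
  pdf="${pdf#./}"                           # strip a leading "./" if present
  printf '%s/%s.txt\n' "$base" "${pdf%.*}"  # swap the final extension for .txt
}

# Hypothetical helper: succeed when the text copy is missing or older than the PDF.
needs_update() {
  local pdf="$1" txt="$2"
  [ ! -f "$txt" ] || [ "$pdf" -nt "$txt" ]
}
```

For example, `txt_path /tmp/ocr "./a b/report.v2.PDF"` prints `/tmp/ocr/a b/report.v2.txt`: the space in the directory name survives because every expansion is quoted, and only the last `.PDF` is replaced.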