@y4code
Forked from julienma/DT3 - Add OCR to PDF.scpt
Created March 1, 2023 22:49

Revisions

  1. @julienma julienma revised this gist Jul 16, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion DT3 - Add OCR to PDF.scpt
    @@ -11,7 +11,7 @@ on performSmartRule(theRecords)
    try
    step progress indicator filename of theRecord as string
    set strRecordPath to quoted form of (path of theRecord as string)
    -set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & strRecordPath & space & strRecordPath
    +set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & space & strRecordPath & space & strRecordPath
    do shell script strCmd
    on error error_message number error_number
    set tags of theRecord to (tags of theRecord) & "ocr_error"
  2. @julienma julienma revised this gist Jul 16, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion DT3 - Add OCR to PDF.scpt
    @@ -11,7 +11,7 @@ on performSmartRule(theRecords)
    try
    step progress indicator filename of theRecord as string
    set strRecordPath to quoted form of (path of theRecord as string)
    -set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean --mask-barcodes " & strRecordPath & space & strRecordPath
    +set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & strRecordPath & space & strRecordPath
    do shell script strCmd
    on error error_message number error_number
    set tags of theRecord to (tags of theRecord) & "ocr_error"
  3. @julienma julienma revised this gist May 17, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -30,7 +30,7 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
    - Search in: `Databases`
    - Search all:
    - `Kind` is `PDF/PS`
    -- `Extension` is `PDF document`
    +- `Extension` is `PDF document` (_required as some .AI files are recognized as PDF kind as well_)
    - `Word Count` is `0`
    - `Tag` is not `ocr_error` (_this is how we automatically exclude files which couldn't be OCR'd for some reason_)
    - `Tag` is not `ocr_ignore` (_this is how we **manually** exclude files which we don't want to OCR_)
  4. @julienma julienma revised this gist May 17, 2019. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion README.md
    @@ -20,7 +20,9 @@ brew install tesseract-lang
    - Set the [OCRmyPDF parameters you need](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#image-processing) in `strCmd`, specifically if you want to [use other languages](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-languages-other-than-english), e.g. for French: `-l fra` (you can get other language codes with `tesseract --list-langs`).
    - Finally copy the script file to `~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules` (path for DEVONthink 3 beta)

    -# Create a smart rule in DEVONthink
    +# Create smart rules in DEVONthink

    +## Automatic OCR
    +
    Create a new smart rule by right-clicking in sidebar > New Smart Rule...

    @@ -35,6 +37,8 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
    - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
    - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    +## OCR failures
    +
    To list files which couldn't be OCR'd for some reason, create another smart rule:

    - Name: `OCR errors`
  5. @julienma julienma revised this gist May 17, 2019. 1 changed file with 13 additions and 5 deletions.
    18 changes: 13 additions & 5 deletions README.md
    @@ -35,9 +35,6 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
    - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
    - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    -Now this group will show you all the PDF files which require OCR.
    -The rule will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
    -
    To list files which couldn't be OCR'd for some reason, create another smart rule:

    - Name: `OCR errors`
    @@ -49,8 +46,19 @@ To list files which couldn't be OCR'd for some reason, create another smart rule
    - Bounce Dock Icon
    - Display Notification: `Some PDFs cannot be OCR'd.`

    -Now you'll get a weekly reminder when there's some file waiting to be checked manually.
    -To get details about the issue, try running the `ocrmypdf` command manually on the files.
    +# Usage
    +
    +## Automatic OCR
    +
    +First rule `PDFs without OCR` will show you all the PDF files which require OCR.
    +OCR will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
    +
    +To bypass OCR for some files, add tag `ocr_ignore`.
    +
    +## OCR failures
    +
    +Second rule will show a weekly reminder when there's some file waiting to be checked manually.
    +To get details about why OCR didn't succeed, try running the `ocrmypdf` command manually on the files.

    One possible fix is to try to [force OCR](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#redo-existing-ocr) (try first with `--redo-ocr` before doing `--force-ocr`).
    Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF), before OCR'ing it again.
  6. @julienma julienma revised this gist May 17, 2019. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions README.md
    @@ -28,8 +28,10 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
    - Search in: `Databases`
    - Search all:
    - `Kind` is `PDF/PS`
    +- `Extension` is `PDF document`
    - `Word Count` is `0`
    -- `Tag` is not `ocr_error` (_this is how we exclude files which couldn't be OCR'd for some reason_)
    +- `Tag` is not `ocr_error` (_this is how we automatically exclude files which couldn't be OCR'd for some reason_)
    +- `Tag` is not `ocr_ignore` (_this is how we **manually** exclude files which we don't want to OCR_)
    - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
    - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    @@ -41,8 +43,6 @@ To list files which couldn't be OCR'd for some reason, create another smart rule
    - Name: `OCR errors`
    - Search in: `Databases`
    - Search all:
    -- `Kind` is `PDF/PS`
    -- `Word Count` is `0`
    - `Tag` is `ocr_error`
    - Perform the following actions: `Weekly`
    - Actions:
  7. @julienma julienma created this gist May 15, 2019.
    23 changes: 23 additions & 0 deletions DT3 - Add OCR to PDF.scpt
    @@ -0,0 +1,23 @@
    -- Script for DEVONthink 3
    -- Run OCRmyPDF on PDFs without OCR
    -- Requires https://github.com/jbarlow83/OCRmyPDF to be installed e.g. with brew

    on performSmartRule(theRecords)
        tell application id "DNtp"
            set strExportPath to "PATH=/usr/local/bin:$PATH "
            set intRecordsCount to count of theRecords
            show progress indicator "Adding OCR to PDF..." steps intRecordsCount
            repeat with theRecord in theRecords
                try
                    step progress indicator filename of theRecord as string
                    set strRecordPath to quoted form of (path of theRecord as string)
                    set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean --mask-barcodes " & strRecordPath & space & strRecordPath
                    do shell script strCmd
                on error error_message number error_number
                    set tags of theRecord to (tags of theRecord) & "ocr_error"
                    if the error_number is not -128 then display notification error_message with title "Error with OCR" subtitle (filename of theRecord as string)
                end try
            end repeat
            hide progress indicator
        end tell
    end performSmartRule
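
For reference, the command that the script assembles in `strCmd` corresponds roughly to the shell call sketched below. This is not part of the gist: the function name `ocr_in_place` is hypothetical, and the flags are the ones from the latest revision of the script (without `--mask-barcodes`). Passing the same path as input and output asks OCRmyPDF to replace the file in place.

```shell
# Hypothetical helper mirroring the command built in strCmd.
# The double quotes play the role of AppleScript's "quoted form of",
# so a path containing spaces is passed as a single argument.
ocr_in_place() {
    PATH="/usr/local/bin:$PATH" ocrmypdf --skip-text -l fra \
        --rotate-pages --deskew --clean "$1" "$1"
}
```

It would be called as `ocr_in_place "/path/to/My Scan.pdf"`; adjust `-l fra` and the `PATH` prefix to your own setup.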
    56 changes: 56 additions & 0 deletions README.md
    @@ -0,0 +1,56 @@
    DEVONthink 3 script to automatically OCR PDFs using a local install of https://ocrmypdf.readthedocs.io/

    # Install OCRmyPDF

    The easiest way is to use Homebrew:

    ```
    brew install ocrmypdf
    ```

    If you need languages other than English, install the additional language pack:

    ```
    brew install tesseract-lang
    ```

    # Customize the `.scpt` script

    - Set `strExportPath` to include the `ocrmypdf` binary path. The default value is valid for an install with Homebrew.
    - Set the [OCRmyPDF parameters you need](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#image-processing) in `strCmd`, specifically if you want to [use other languages](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-languages-other-than-english), e.g. for French: `-l fra` (you can get other language codes with `tesseract --list-langs`).
    - Finally copy the script file to `~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules` (path for DEVONthink 3 beta)
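
To find the right value for `strExportPath` on a given machine, one option is to ask the shell where `ocrmypdf` lives. The sketch below assumes a POSIX shell, and the helper name `bin_dir` is made up for this example. Homebrew's default prefix is `/usr/local` on Intel Macs and `/opt/homebrew` on Apple Silicon, so the default in the script may need adjusting.

```shell
# Hypothetical helper: print the directory that contains a command,
# i.e. the directory to put in front of $PATH in strExportPath.
bin_dir() {
    cmd_path=$(command -v "$1") || {
        echo "$1 not found in PATH" >&2
        return 1
    }
    dirname "$cmd_path"
}
```

Running `bin_dir ocrmypdf` prints the directory to use, and `tesseract --list-langs` lists the language codes that can be passed to `-l`.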

    # Create a smart rule in DEVONthink

    Create a new smart rule by right-clicking in sidebar > New Smart Rule...

    - Name: `PDFs without OCR`
    - Search in: `Databases`
    - Search all:
    - `Kind` is `PDF/PS`
    - `Word Count` is `0`
    - `Tag` is not `ocr_error` (_this is how we exclude files which couldn't be OCR'd for some reason_)
    - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
    - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    Now this group will show you all the PDF files which require OCR.
    The rule will be triggered every day (early morning for me, when laptop automatically wakes up to backup).

    To list files which couldn't be OCR'd for some reason, create another smart rule:

    - Name: `OCR errors`
    - Search in: `Databases`
    - Search all:
    - `Kind` is `PDF/PS`
    - `Word Count` is `0`
    - `Tag` is `ocr_error`
    - Perform the following actions: `Weekly`
    - Actions:
    - Bounce Dock Icon
    - Display Notification: `Some PDFs cannot be OCR'd.`

    Now you'll get a weekly reminder when there's some file waiting to be checked manually.
    To get details about the issue, try running the `ocrmypdf` command manually on the files.

    One possible fix is to try to [force OCR](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#redo-existing-ocr) (try first with `--redo-ocr` before doing `--force-ocr`).
    Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF), before OCR'ing it again.
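
The manual check described above could look like the following sketch. The function name `retry_ocr` and the choice to write to a separate output file are assumptions for illustration; only the `--redo-ocr` and `--force-ocr` flags (and `-l` for the language) come from the text. The other flags from the script are left out to keep the call minimal.

```shell
# Hypothetical helper: retry OCR on a file tagged ocr_error.
# Writes to a separate output file so the original stays untouched.
# Usage: retry_ocr input.pdf output.pdf [language]
retry_ocr() {
    ocr_in=$1
    ocr_out=$2
    ocr_lang=${3:-fra}   # same language code as in the script; adjust as needed

    # First attempt: redo OCR while keeping the rest of the PDF as it is.
    if ocrmypdf --redo-ocr -l "$ocr_lang" "$ocr_in" "$ocr_out"; then
        echo "redo-ocr succeeded: $ocr_out"
        return 0
    fi

    # Fallback: force OCR on every page.
    if ocrmypdf --force-ocr -l "$ocr_lang" "$ocr_in" "$ocr_out"; then
        echo "force-ocr succeeded: $ocr_out"
        return 0
    fi

    echo "OCR still failing for: $ocr_in" >&2
    return 1
}
```

For example, `retry_ocr broken.pdf broken-ocr.pdf` tries both modes in turn. Any error messages printed by `ocrmypdf` along the way are the details to look at.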