y4code · March 1, 2023 22:49 · Jul 16, 2019 · Jul 16, 2019 · May 17, 2019 · May 17, 2019
diff --git a/DT3 - Add OCR to PDF.scpt b/DT3 - Add OCR to PDF.scpt
@@ -11,7 +11,7 @@ on performSmartRule(theRecords)
 			try
 				step progress indicator filename of theRecord as string
 				set strRecordPath to quoted form of (path of theRecord as string)
-				set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & strRecordPath & space & strRecordPath
+				set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & space & strRecordPath & space & strRecordPath
 				do shell script strCmd
 			on error error_message number error_number
 				set tags of theRecord to (tags of theRecord) & "ocr_error"

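The missing separator fixed above is easy to overlook. The concatenation can be reproduced in plain shell. This is a minimal sketch, assuming a hypothetical record path `/tmp/My Scan.pdf` and a hypothetical helper `count_args`, neither of which is in the repository. Without the space, the first path is glued onto `--clean`, so `ocrmypdf` receives one argument fewer than intended.

```shell
#!/bin/sh
# Hypothetical path for illustration; the real script gets it from DEVONthink.
# The inner single quotes mimic what AppleScript's "quoted form of" produces.
record_path="'/tmp/My Scan.pdf'"
opts="ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean"

# Before the fix: no separator between the last option and the input path.
broken_cmd="${opts}${record_path} ${record_path}"
# After the fix: a space separates the options from the input path.
fixed_cmd="${opts} ${record_path} ${record_path}"

# count_args (hypothetical helper): number of arguments the shell would
# pass to ocrmypdf, i.e. the parsed word count minus the command name.
count_args() {
    eval "set -- $1"
    echo $(( $# - 1 ))
}

echo "broken: $(count_args "$broken_cmd") arguments"
echo "fixed:  $(count_args "$fixed_cmd") arguments"
```

In the broken form, `--clean'/tmp/My Scan.pdf'` is parsed as a single word, which is why the input and output paths no longer arrive as two separate arguments.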
diff --git a/DT3 - Add OCR to PDF.scpt b/DT3 - Add OCR to PDF.scpt
@@ -11,7 +11,7 @@ on performSmartRule(theRecords)
 			try
 				step progress indicator filename of theRecord as string
 				set strRecordPath to quoted form of (path of theRecord as string)
-				set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean --mask-barcodes " & strRecordPath & space & strRecordPath
+				set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & strRecordPath & space & strRecordPath
 				do shell script strCmd
 			on error error_message number error_number
 				set tags of theRecord to (tags of theRecord) & "ocr_error"

diff --git a/README.md b/README.md
@@ -30,7 +30,7 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
 - Search in: `Databases`
 - Search all:
   - `Kind` is `PDF/PS`
-  - `Extension` is `PDF document`
+  - `Extension` is `PDF document` (_required as some .AI files are recognized as PDF kind as well_)
   - `Word Count` is `0`
   - `Tag` is not `ocr_error` (_this is how we automatically exclude files which couldn't be OCR'd for some reason_)
   - `Tag` is not `ocr_ignore` (_this is how we **manually** exclude files which we don't want to OCR_)

diff --git a/README.md b/README.md
@@ -20,7 +20,9 @@ brew install tesseract-lang
 - Set the [OCRmyPDF parameters you need](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#image-processing) in `strCmd`, specifically if you want to [use other languages](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-languages-other-than-english), e.g. for French: `-l fra` (you can get other language codes with `tesseract --list-langs)`.
 - Finally copy the script file to `~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules` (path for DEVONthink 3 beta)
 
-# Create a smart rule in DEVONthink
+# Create smart rules in DEVONthink
+
+## Automatic OCR
 
 Create a new smart rule by right-clicking in sidebar > New Smart Rule...
 
@@ -35,6 +37,8 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
 - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
 - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`
 
+## OCR failures
+
 To list files which couldn't be OCR'd for some reason, create another smart rule:
 
 - Name: `OCR errors`

diff --git a/README.md b/README.md
@@ -35,9 +35,6 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
 - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
 - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`
 
-Now this group will show you all the PDF files which require OCR.
-The rule will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
-
 To list files which couldn't be OCR'd for some reason, create another smart rule:
 
 - Name: `OCR errors`
@@ -49,8 +46,19 @@ To list files which couldn't be OCR'd for some reason, create another smart rule
   - Bounce Dock Icon
   - Display Notification: `Some PDFs cannot be OCR'd.`
 
-Now you'll get a weekly reminder when there's some file waiting to be checked manually.
-To get details about the issue, try running the `ocrmypdf` command manually on the files.
+# Usage
+
+## Automatic OCR
+
+First rule `PDFs without OCR` will show you all the PDF files which require OCR.
+OCR will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
+
+To bypass OCR for some files, add tag `ocr_ignore`.
+
+## OCR failures
+
+Second rule will show a weekly reminder when there's some file waiting to be checked manually.
+To get details about why OCR didn't succeed, try running the `ocrmypdf` command manually on the files.
 
 One possible fix is to try to [force OCR](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#redo-existing-ocr) (try first with `--redo-ocr` before doing `--force-ocr`).
 Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF), before OCR'ing it again.
diff --git a/README.md b/README.md
@@ -28,8 +28,10 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
 - Search in: `Databases`
 - Search all:
   - `Kind` is `PDF/PS`
+  - `Extension` is `PDF document`
   - `Word Count` is `0`
-  - `Tag` is not `ocr_error` (_this is how we exclude files which couldn't be OCR'd for some reason_)
+  - `Tag` is not `ocr_error` (_this is how we automatically exclude files which couldn't be OCR'd for some reason_)
+  - `Tag` is not `ocr_ignore` (_this is how we **manually** exclude files which we don't want to OCR_)
 - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
 - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`
 
@@ -41,8 +43,6 @@ To list files which couldn't be OCR'd for some reason, create another smart rule
 - Name: `OCR errors`
 - Search in: `Databases`
 - Search all:
-  - `Kind` is `PDF/PS`
-  - `Word Count` is `0`
   - `Tag` is `ocr_error`
 - Perform the following actions: `Weekly`
 - Actions: 

diff --git a/DT3 - Add OCR to PDF.scpt b/DT3 - Add OCR to PDF.scpt
@@ -0,0 +1,23 @@
+-- Script for DEVONthink 3
+-- Run OCRmyPDF on PDFs without OCR
+-- Requires https://github.com/jbarlow83/OCRmyPDF to be installed e.g. with brew
+
+on performSmartRule(theRecords)
+	tell application id "DNtp"
+		set strExportPath to "PATH=/usr/local/bin:$PATH "
+		set intRecordsCount to count of theRecords
+		show progress indicator "Adding OCR to PDF..." steps intRecordsCount
+		repeat with theRecord in theRecords
+			try
+				step progress indicator filename of theRecord as string
+				set strRecordPath to quoted form of (path of theRecord as string)
+				set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean --mask-barcodes " & strRecordPath & space & strRecordPath
+				do shell script strCmd
+			on error error_message number error_number
+				set tags of theRecord to (tags of theRecord) & "ocr_error"
+				if the error_number is not -128 then display notification error_message with title "Error with OCR" subtitle (filename of theRecord as string)
+			end try
+		end repeat
+		hide progress indicator
+	end tell
+end performSmartRule
diff --git a/README.md b/README.md
@@ -0,0 +1,56 @@
+DEVONthink 3 script to automatically OCR PDFs using a local install of https://ocrmypdf.readthedocs.io/
+
+# Install OCRmyPDF
+
+Easiest is to use homebrew:
+
+```
+brew install ocrmypdf
+```
+
+If you need languages other than English, install additional language pack:
+
+```
+brew install tesseract-lang
+```
+
+# Customize the `.scpt` script
+
+- Set `strExportPath` to include the `ocrmypdf` binary path. Default value is valid for an install with homebrew.
+- Set the [OCRmyPDF parameters you need](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#image-processing) in `strCmd`, specifically if you want to [use other languages](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-languages-other-than-english), e.g. for French: `-l fra` (you can get other language codes with `tesseract --list-langs)`.
+- Finally copy the script file to `~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules` (path for DEVONthink 3 beta)
+
+# Create a smart rule in DEVONthink
+
+Create a new smart rule by right-clicking in sidebar > New Smart Rule...
+
+- Name: `PDFs without OCR`
+- Search in: `Databases`
+- Search all:
+  - `Kind` is `PDF/PS`
+  - `Word Count` is `0`
+  - `Tag` is not `ocr_error` (_this is how we exclude files which couldn't be OCR'd for some reason_)
+- Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
+- Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`
+
+Now this group will show you all the PDF files which require OCR.
+The rule will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
+
+To list files which couldn't be OCR'd for some reason, create another smart rule:
+
+- Name: `OCR errors`
+- Search in: `Databases`
+- Search all:
+  - `Kind` is `PDF/PS`
+  - `Word Count` is `0`
+  - `Tag` is `ocr_error`
+- Perform the following actions: `Weekly`
+- Actions: 
+  - Bounce Dock Icon
+  - Display Notification: `Some PDFs cannot be OCR'd.`
+
+Now you'll get a weekly reminder when there's some file waiting to be checked manually.
+To get details about the issue, try running the `ocrmypdf` command manually on the files.
+
+One possible fix is to try to [force OCR](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#redo-existing-ocr) (try first with `--redo-ocr` before doing `--force-ocr`).
+Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF), before OCR'ing it again.
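
The manual retry described in the README ("try first with `--redo-ocr` before doing `--force-ocr`") could be wrapped in a small shell function. This is a sketch only: `ocr_retry` is a hypothetical helper, not part of the repository, and it assumes `ocrmypdf` is on `PATH`, that French (`-l fra`) is the wanted language, and that overwriting the file in place is acceptable, as the script does. The image-cleaning options from the script are left out to keep the sketch minimal.

```shell
#!/bin/sh
# ocr_retry (hypothetical helper): try the gentler --redo-ocr first,
# then fall back to --force-ocr if that fails.
ocr_retry() {
    pdf="$1"
    for mode in --redo-ocr --force-ocr; do
        # Input and output are the same file, mirroring the smart rule script.
        if ocrmypdf "$mode" -l fra "$pdf" "$pdf"; then
            echo "OCR succeeded with $mode"
            return 0
        fi
    done
    echo "OCR failed with both modes; the PDF may need to be rebuilt" >&2
    return 1
}
```

It would be called as `ocr_retry "/path/to/file.pdf"`. Working on a copy of the file first is safer, since `--force-ocr` rasterizes the pages.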