Revisions
julienma revised this gist
Jul 16, 2019. 1 changed file with 1 addition and 1 deletion.

    @@ -11,7 +11,7 @@ on performSmartRule(theRecords)
                 try
                     step progress indicator filename of theRecord as string
                     set strRecordPath to quoted form of (path of theRecord as string)
    -                set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & strRecordPath & space & strRecordPath
    +                set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & space & strRecordPath & space & strRecordPath
                     do shell script strCmd
                 on error error_message number error_number
                     set tags of theRecord to (tags of theRecord) & "ocr_error"
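The `strCmd` line in this revision has `& space` between the option string and the quoted record path. As a rough illustration of why that separator matters, the shell sketch below builds the command string both ways. The path `/tmp/scan.pdf` is a hypothetical placeholder, and the sketch only prints the strings; it does not run `ocrmypdf`.

```shell
#!/bin/sh
# Hypothetical record path, already single-quoted the way AppleScript's
# "quoted form of" would produce it.
record_path="'/tmp/scan.pdf'"
opts="ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean"

# No separator: the last option and the first path fuse into one shell word.
without_space="${opts}${record_path} ${record_path}"

# With a separator (the "& space" in the script): option and paths stay distinct.
with_space="${opts} ${record_path} ${record_path}"

echo "$without_space"
echo "$with_space"
```

In the first string the shell would pass `--clean/tmp/scan.pdf` as a single argument, so `ocrmypdf` would see neither a valid `--clean` option nor separate input and output paths.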
julienma revised this gist
Jul 16, 2019. 1 changed file with 1 addition and 1 deletion.

    @@ -11,7 +11,7 @@ on performSmartRule(theRecords)
                 try
                     step progress indicator filename of theRecord as string
                     set strRecordPath to quoted form of (path of theRecord as string)
    -                set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean --mask-barcodes " & strRecordPath & space & strRecordPath
    +                set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & strRecordPath & space & strRecordPath
                     do shell script strCmd
                 on error error_message number error_number
                     set tags of theRecord to (tags of theRecord) & "ocr_error"
julienma revised this gist
May 17, 2019. 1 changed file with 1 addition and 1 deletion.

    @@ -30,7 +30,7 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
     - Search in: `Databases`
     - Search all:
       - `Kind` is `PDF/PS`
    -  - `Extension` is `PDF document`
    +  - `Extension` is `PDF document` (_required as some .AI files are recognized as PDF kind as well_)
       - `Word Count` is `0`
       - `Tag` is not `ocr_error` (_this is how we automatically exclude files which couldn't be OCR'd for some reason_)
       - `Tag` is not `ocr_ignore` (_this is how we **manually** exclude files which we don't want to OCR_)
julienma revised this gist
May 17, 2019. 1 changed file with 5 additions and 1 deletion.

    @@ -20,7 +20,9 @@ brew install tesseract-lang
     - Set the [OCRmyPDF parameters you need](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#image-processing) in `strCmd`, specifically if you want to [use other languages](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-languages-other-than-english), e.g. for French: `-l fra` (you can get other language codes with `tesseract --list-langs)`.
     - Finally copy the script file to `~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules` (path for DEVONthink 3 beta)

    -# Create a smart rule in DEVONthink
    +# Create smart rules in DEVONthink
    +
    +## Automatic OCR

     Create a new smart rule by right-clicking in sidebar > New Smart Rule...

    @@ -35,6 +37,8 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
     - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
       - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    +## OCR failures
    +
     To list files which couldn't be OCR'd for some reason, create another smart rule:

     - Name: `OCR errors`
julienma revised this gist
May 17, 2019. 1 changed file with 13 additions and 5 deletions.

    @@ -35,9 +35,6 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
     - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
       - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    -Now this group will show you all the PDF files which require OCR.
    -The rule will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
    -
     To list files which couldn't be OCR'd for some reason, create another smart rule:

     - Name: `OCR errors`
    @@ -49,8 +46,19 @@ To list files which couldn't be OCR'd for some reason, create another smart rule
       - Bounce Dock Icon
       - Display Notification: `Some PDFs cannot be OCR'd.`

    -Now you'll get a weekly reminder when there's some file waiting to be checked manually.
    -To get details about the issue, try running the `ocrmypdf` command manually on the files.
    +# Usage
    +
    +## Automatic OCR
    +
    +First rule `PDFs without OCR` will show you all the PDF files which require OCR.
    +OCR will be triggered every day (early morning for me, when laptop automatically wakes up to backup).
    +
    +To bypass OCR for some files, add tag `ocr_ignore`.
    +
    +## OCR failures
    +
    +Second rule will show a weekly reminder when there's some file waiting to be checked manually.
    +To get details about why OCR didn't succeed, try running the `ocrmypdf` command manually on the files.

     One possible fix is to try to [force OCR](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#redo-existing-ocr) (try first with `--redo-ocr` before doing `--force-ocr`).
     Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF), before OCR'ing it again.
julienma revised this gist
May 17, 2019. 1 changed file with 3 additions and 3 deletions.

    @@ -28,8 +28,10 @@ Create a new smart rule by right-clicking in sidebar > New Smart Rule...
     - Search in: `Databases`
     - Search all:
       - `Kind` is `PDF/PS`
    +  - `Extension` is `PDF document`
       - `Word Count` is `0`
    -  - `Tag` is not `ocr_error` (_this is how we exclude files which couldn't be OCR'd for some reason_)
    +  - `Tag` is not `ocr_error` (_this is how we automatically exclude files which couldn't be OCR'd for some reason_)
    +  - `Tag` is not `ocr_ignore` (_this is how we **manually** exclude files which we don't want to OCR_)
     - Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
       - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

    @@ -41,8 +43,6 @@ To list files which couldn't be OCR'd for some reason, create another smart rule
     - Name: `OCR errors`
     - Search in: `Databases`
     - Search all:
    -  - `Kind` is `PDF/PS`
    -  - `Word Count` is `0`
       - `Tag` is `ocr_error`
     - Perform the following actions: `Weekly`
     - Actions:
julienma created this gist
May 15, 2019.

    @@ -0,0 +1,23 @@
    -- Script for DEVONthink 3
    -- Run OCRmyPDF on PDFs without OCR
    -- Requires https://github.com/jbarlow83/OCRmyPDF to be installed e.g. with brew

    on performSmartRule(theRecords)
        tell application id "DNtp"
            set strExportPath to "PATH=/usr/local/bin:$PATH "
            set intRecordsCount to count of theRecords
            show progress indicator "Adding OCR to PDF..." steps intRecordsCount
            repeat with theRecord in theRecords
                try
                    step progress indicator filename of theRecord as string
                    set strRecordPath to quoted form of (path of theRecord as string)
                    set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean --mask-barcodes " & strRecordPath & space & strRecordPath
                    do shell script strCmd
                on error error_message number error_number
                    set tags of theRecord to (tags of theRecord) & "ocr_error"
                    if the error_number is not -128 then display notification error_message with title "Error with OCR" subtitle (filename of theRecord as string)
                end try
            end repeat
            hide progress indicator
        end tell
    end performSmartRule
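The script wraps each record path with `quoted form of` before handing it to `do shell script`, so paths containing spaces or apostrophes reach `ocrmypdf` as a single argument. A minimal shell sketch of the same quoting rule follows. `shell_quote` is a hypothetical helper written for illustration, not part of the gist or of DEVONthink.

```shell
#!/bin/sh
# shell_quote: hypothetical helper that mimics AppleScript's "quoted form of".
# It wraps the argument in single quotes and rewrites each embedded single
# quote as '\'' so the shell reads the whole value back as one word.
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Example call (hypothetical file name):
#   shell_quote "/Users/me/Scans/July's invoice.pdf"
```

Passing the unquoted path instead would split a name like `my scan.pdf` into two arguments once the command string is evaluated by the shell.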
    @@ -0,0 +1,56 @@

DEVONthink 3 script to automatically OCR PDFs using a local install of https://ocrmypdf.readthedocs.io/

# Install OCRmyPDF

The easiest way is to use Homebrew:

```
brew install ocrmypdf
```

If you need languages other than English, install the additional language pack:

```
brew install tesseract-lang
```

# Customize the `.scpt` script

- Set `strExportPath` to include the `ocrmypdf` binary path. The default value is valid for an install with Homebrew.
- Set the [OCRmyPDF parameters you need](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#image-processing) in `strCmd`, specifically if you want to [use other languages](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-languages-other-than-english), e.g. for French: `-l fra` (you can get other language codes with `tesseract --list-langs`).
- Finally, copy the script file to `~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules` (path for DEVONthink 3 beta)

# Create a smart rule in DEVONthink

Create a new smart rule by right-clicking in the sidebar > New Smart Rule...

- Name: `PDFs without OCR`
- Search in: `Databases`
- Search all:
  - `Kind` is `PDF/PS`
  - `Word Count` is `0`
  - `Tag` is not `ocr_error` (_this is how we exclude files which couldn't be OCR'd for some reason_)
- Perform the following actions: `Daily` (_as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs_)
  - Action: `Execute Script` - `External` - `DT3 - Add OCR to PDF`

Now this group will show you all the PDF files which require OCR.
The rule will be triggered every day (early morning for me, when my laptop automatically wakes up to back up).
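Before the daily rule runs for the first time, it may be worth confirming that the language passed with `-l` in `strCmd` is actually installed. The sketch below assumes the output of `tesseract --list-langs` is a header line followed by one language code per line. `has_lang` is a hypothetical helper name.

```shell
#!/bin/sh
# has_lang: hypothetical helper. Reads the output of `tesseract --list-langs`
# on stdin and succeeds only if the given code appears as a whole line.
has_lang() {
    grep -Fqx -- "$1"
}

# Intended use (requires tesseract to be installed):
#   tesseract --list-langs | has_lang fra && echo "fra is available"
```

Matching whole lines with `-x` keeps a short code from matching inside a longer one or inside the header line.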
To list files which couldn't be OCR'd for some reason, create another smart rule:

- Name: `OCR errors`
- Search in: `Databases`
- Search all:
  - `Kind` is `PDF/PS`
  - `Word Count` is `0`
  - `Tag` is `ocr_error`
- Perform the following actions: `Weekly`
- Actions:
  - Bounce Dock Icon
  - Display Notification: `Some PDFs cannot be OCR'd.`

Now you'll get a weekly reminder when files are waiting to be checked manually.
To get details about the issue, try running the `ocrmypdf` command manually on the files.

One possible fix is to [force OCR](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#redo-existing-ocr) (try `--redo-ocr` first, before `--force-ocr`).
Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF) before OCR'ing it again.
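The manual retry described above (first `--redo-ocr`, then `--force-ocr`) could be sketched as a small shell function. This is an illustration under stated assumptions, not the gist's own method. `retry_ocr` and the `OCR_BIN` override are hypothetical names, `-l fra` is taken from the script's example, and the image-cleaning options are left out because they may not combine with `--redo-ocr`.

```shell
#!/bin/sh
# OCR_BIN is a hypothetical override, so the sketch can be dry-run with a
# stand-in command on a machine where ocrmypdf is not installed.
OCR_BIN="${OCR_BIN:-ocrmypdf}"

# retry_ocr: hypothetical helper. Tries the gentler --redo-ocr first and
# falls back to --force-ocr, writing the result over the input file as the
# smart rule script does.
retry_ocr() {
    pdf="$1"
    for mode in --redo-ocr --force-ocr; do
        if "$OCR_BIN" "$mode" -l fra "$pdf" "$pdf"; then
            echo "OCR succeeded with $mode: $pdf"
            return 0
        fi
    done
    echo "OCR still failing, the PDF may need to be rebuilt: $pdf" >&2
    return 1
}
```

Running it on a copy of the failing file, for example `retry_ocr ./copy-of-failing.pdf`, keeps the original untouched while you read the error output.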