# So you want to modify the text of a PDF by hand...

If you, like me, resent every dollar spent on commercial PDF tools,
you might want to know how to change the text content of a PDF without
having to pay for Adobe Acrobat or another PDF tool. I didn't see an
obvious open-source tool that lets you dig into PDF internals, but I
did discover a few facts about how PDFs are structured that I think
may prove useful to others (or myself) in the future. They are
recorded here. They are surely not universally applicable --
the PDF standard is truly Byzantine -- but they worked for my case.

This guide is Mac-oriented, but the tools are all available via most
Linux distributions as well.

## Viewing compressed text data

You can open a PDF in a text editor and see some stuff that looks kinda
readable, in a vague way, but find that none of it is the actual text
of the PDF. It turns out that many PDFs store the text data in a
compressed form. To view the compressed data, you can use a command line
tool called `qpdf`. For Macs, there's a [Homebrew formula](https://formulae.brew.sh/formula/qpdf).

Here's a command that decompresses all compressed text streams in a
given PDF (via [this Stack Overflow post](https://stackoverflow.com/a/11732099/577088)):

    qpdf --qdf --object-streams=disable in.pdf out.pdf

You can recompress the streams like so:

    qpdf out-edited.pdf out-recompressed.pdf

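This second command generated some errors for me, but the resulting PDF
was readable using Preview. (The errors most likely come from stream
lengths and cross-reference offsets that no longer match the file after
hand-editing; qpdf ships a `fix-qdf` helper meant to repair those.)
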
## Finding the text data

Once you've decompressed the compressed text streams, you can open the
PDF in a text editor and view them! Except you have to find them. Here's
what they look like in a basic form:

    BT
    /Font_0 12 Tf
    288 720 Td
    <002a004800570003003600480057> Tj
    ET

The [PDF Reference](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf)
(Third Edition, p. 293) has this to say about the above:

> The five lines of this example perform the following steps:
>
> 1. Begin a text object.
> 2. Set the font and font size to use, installing them as parameters in the text state...
> 3. Specify a starting position on the page, setting parameters in the text object.
> 4. Paint the glyphs for a string of characters there.
> 5. End the text object.

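If the file is large, hunting for `BT` ... `ET` pairs by eye gets old fast. Here's a rough Python sketch of one way to list the hex strings inside text objects. It assumes the file has already been through `qpdf --qdf`, and that the text is painted with plain `<...> Tj` operators like the example above. Other operators, such as `TJ` arrays or literal `(...)` strings, aren't handled. The name `find_hex_strings` is made up for illustration.

```python
import re

# Naive patterns: they assume "BT" and "ET" only ever appear as operators,
# which holds for the sample above but is not guaranteed for every PDF.
BT_ET = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
HEX_TJ = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")


def find_hex_strings(pdf_bytes):
    """Return the hex operand of every `<...> Tj` found inside BT ... ET blocks."""
    found = []
    for block in BT_ET.finditer(pdf_bytes):
        for match in HEX_TJ.finditer(block.group(1)):
            found.append(match.group(1).decode("ascii"))
    return found


# The text object from the example above, as raw bytes.
sample = b"""BT
/Font_0 12 Tf
288 720 Td
<002a004800570003003600480057> Tj
ET
"""

print(find_hex_strings(sample))
```

For a real file you'd pass in something like `open("out.pdf", "rb").read()` instead of `sample`.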
## Actually reading the text

As you can see from the above example, we *still* can't read the text.
It is encoded. And if you thought to yourself "look at that hex string,
I bet it's a bunch of Unicode code points" -- well, I wish we lived in
a kinder world too. It seems there are a million ways to specify encodings
in PDFs, including *custom encodings that are embedded in the file itself*.
Those encodings *do* map to Unicode code points (most of the time?), so that's
good. Let's assume that the file you're working with does have embedded
encodings (because I have no idea how to handle other cases).

### Identifying fonts associated with embedded encodings

Text encodings in PDFs are linked to specific fonts. Information about those
encodings is embedded in the PDF in ways I don't understand, but there's an
existing command line tool that extracts it: `pdffonts`. Here's an example
of the output it generates:

    $ pdffonts sample.pdf
    name                                 type              emb sub uni prob object ID
    ------------------------------------ ----------------- --- --- --- ---- ---------
    CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
    YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0

Here, the relevant fields are "emb" (meaning the font is embedded in
the PDF) and "uni" (meaning the PDF contains an explicit ToUnicode map
for that font, so its character codes can be translated to Unicode code
points). Assuming both are set to "yes," we're in luck.

In the text example above, you'll notice the `/Font_0` descriptor. Not
all fonts in all PDFs will work this way, but in my case, those labels
lined up in a straightforward way with the listing of fonts above. (So
`/Font_0` is referring to the font named `CLDQZB+TrebuchetMS,Bold` in the
above table.)

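If a PDF has a lot of fonts, you may want to filter that listing rather than read it. Below is a minimal sketch that takes the text printed by `pdffonts` and keeps the names of fonts whose "emb" and "uni" columns both say "yes". It leans on an assumption: that font names contain no spaces, and that the three yes/no columns always appear together in that order. The function name `decodable_fonts` is hypothetical.

```python
import re

# One row: a name without spaces, a free-form type, then emb/sub/uni flags.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.*?)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\b"
)


def decodable_fonts(pdffonts_output):
    """Return names of fonts that are embedded and have a ToUnicode map."""
    names = []
    for line in pdffonts_output.splitlines():
        row = ROW.match(line)  # header and ruler lines simply fail to match
        if row and row["emb"] == "yes" and row["uni"] == "yes":
            names.append(row["name"])
    return names


# The listing from the example above.
sample_output = """\
name                                 type              emb sub uni prob object ID
------------------------------------ ----------------- --- --- --- ---- ---------
CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0
"""

print(decodable_fonts(sample_output))
```

Matching on the yes/no flags instead of fixed column positions keeps the sketch tolerant of whitespace differences, at the cost of breaking on any font name that contains a space.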
### Finding the embedded encoding table for the given font

Once you have determined the full name of your text's font (like
`CLDQZB+TrebuchetMS,Bold`) you can search for it. In my case it appeared
several times, but one of those occurrences sat in a short block of
entries including one that looked like this:

    /ToUnicode 19 0 R

This is an indirect reference: `19 0 R` points to object number 19,
generation 0, which holds the encoding table. If you then
search for `19 0 obj`, you'll find the table. (Or at least that's how
it worked in my case!)

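The two searches described above (font name, then `N G obj`) can be chained together. The following is only a sketch under assumptions: it expects a decompressed file, it assumes the font name and the `/ToUnicode` entry live in the same object, and the `/BaseFont` key in the sample data is my guess at what the surrounding block looks like. `find_tounicode_object` is a made-up name.

```python
import re

OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)
TOUNICODE = re.compile(rb"/ToUnicode\s+(\d+)\s+(\d+)\s+R\b")


def find_tounicode_object(pdf_bytes, font_name):
    """Return ((number, generation), body) of the font's ToUnicode object, or None."""
    # Index every "N G obj ... endobj" block by its (number, generation) pair.
    objects = {
        (int(m.group(1)), int(m.group(2))): m.group(3)
        for m in OBJ.finditer(pdf_bytes)
    }
    needle = font_name.encode("ascii")
    for body in objects.values():
        ref = TOUNICODE.search(body)
        if needle in body and ref:
            key = (int(ref.group(1)), int(ref.group(2)))
            return key, objects.get(key)
    return None


# Tiny stand-in for a decompressed PDF; the layout is illustrative only.
sample = b"""9 0 obj
<<
  /BaseFont /CLDQZB+TrebuchetMS,Bold
  /ToUnicode 19 0 R
>>
endobj
19 0 obj
<< /Length 5 >>
stream
dummy
endstream
endobj
"""

key, body = find_tounicode_object(sample, "CLDQZB+TrebuchetMS,Bold")
print(key)
```

Indexing all objects first avoids a second regex that would have to guard against `19 0 obj` matching inside `119 0 obj`.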
### The encoding table format

The salient part of the encoding table looks like this:

    38 beginbfrange^M
    <0036><0036><0053>^M
    <0057><0057><0074>^M
    <0044><0044><0061>^M
    <0048><0048><0065>^M
    <0050><0050><006D>^M
    ...

If yours looks different, check out the [ToUnicode mapping file tutorial](https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf)
which describes a bunch of possible variations. (The `^M` marks are just
how a text editor displays carriage returns at the ends of lines.) In this
case, the table is mapping ranges of custom encoding points to Unicode
points, with each line in the form `<first><last><target>` -- except these
are ranges of just one character. So here, the custom point `0036` maps to the
Unicode point `0053` -- that is, the letter `S`.

To perform this translation in an automated way, I used Python to convert the
table into a dictionary, and wrote some simple encoding and decoding functions.
This isn't a Python tutorial, sadly, but if you know Python or any other scripting
language, you can probably work out a few different ways to solve this part of
the problem.

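For what it's worth, here is one way that dictionary and those two functions might look. This is a sketch, not the code actually used for this guide. It assumes two-byte codes written as four hex digits, and it only understands the simple `<first><last><target>` form of `bfrange` shown above. The names `parse_bfrange`, `decode`, and `encode` are invented here. The sample table contains only the five rows visible in the excerpt, so it can decode the last three codes of the earlier `Tj` string but not the whole thing.

```python
import re

BFRANGE_LINE = re.compile(
    r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>"
)


def parse_bfrange(table_text):
    """Build {custom code: Unicode character} from bfrange lines."""
    decode_map = {}
    for first, last, target in BFRANGE_LINE.findall(table_text):
        first, last, target = int(first, 16), int(last, 16), int(target, 16)
        # A range maps first..last onto consecutive code points from target.
        for offset in range(last - first + 1):
            decode_map[first + offset] = chr(target + offset)
    return decode_map


def decode(hex_string, decode_map):
    """Turn a hex string like '003600480057' into readable text."""
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    return "".join(decode_map[code] for code in codes)


def encode(text, decode_map):
    """Turn readable text into the custom-encoded hex string."""
    encode_map = {char: code for code, char in decode_map.items()}
    return "".join(f"{encode_map[char]:04x}" for char in text)


# The five rows shown in the excerpt above (trailing ^M omitted).
table = """
<0036><0036><0053>
<0057><0057><0074>
<0044><0044><0061>
<0048><0048><0065>
<0050><0050><006D>
"""

mapping = parse_bfrange(table)
print(decode("003600480057", mapping))  # Set
print(encode("Sam", mapping))           # 003600440050
```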
Equipped with my encoder and decoder, I determined the custom-encoded version of
the text I wanted to replace, wrote the replacement text and custom-encoded it,
and used find-and-replace to swap them out. The end!

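The find-and-replace step can be scripted too. The sketch below assumes the text you want to change sits in a single `<...>` hex string, written with lowercase hex digits as in the `Tj` example earlier, and that every character of the replacement already has an entry in the font's table. That last point matters for subset fonts (the "sub" column in `pdffonts`), which only carry the glyphs the document originally used. `swap_text` and `to_hex` are hypothetical names, and `encode_map` here is a hand-written character-to-code dictionary based on the five table rows shown above.

```python
def to_hex(text, encode_map):
    """Custom-encode text as four hex digits per character."""
    return "".join(f"{encode_map[char]:04x}" for char in text)


def swap_text(pdf_bytes, old_text, new_text, encode_map):
    """Replace the encoded form of old_text with the encoded form of new_text."""
    old_hex = b"<" + to_hex(old_text, encode_map).encode("ascii") + b">"
    new_hex = b"<" + to_hex(new_text, encode_map).encode("ascii") + b">"
    if old_hex not in pdf_bytes:
        raise ValueError(f"encoded form of {old_text!r} not found")
    return pdf_bytes.replace(old_hex, new_hex)


# Character -> custom code, taken from the five table rows shown earlier.
encode_map = {"S": 0x36, "t": 0x57, "a": 0x44, "e": 0x48, "m": 0x50}

before = b"BT\n<003600480057> Tj\nET\n"
after = swap_text(before, "Set", "Sat", encode_map)
print(after)
```

With real files, you'd read `out.pdf` as bytes, call `swap_text`, and write the result to `out-edited.pdf` before recompressing with `qpdf`. A character missing from `encode_map` raises a `KeyError`, which is a useful early warning that the font can't draw it.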