Opfour · September 4, 2023 21:25 · Sep 23, 2020 · Sep 23, 2020
diff --git a/hand-modify-pdf.txt → hand-modify-pdf.md b/hand-modify-pdf.txt → hand-modify-pdf.md
diff --git a/hand-modify-pdf.txt b/hand-modify-pdf.txt
@@ -0,0 +1,132 @@
+# So you want to modify the text of a PDF by hand...
+
+If you, like me, resent every dollar spent on commercial PDF tools, 
+you might want to know how to change the text content of a PDF without
+having to pay for Adobe Acrobat or another PDF tool. I didn't see an
+obvious open-source tool that lets you dig into PDF internals, but I
+did discover a few useful facts about how PDFs are structured that
+I think may prove useful to others (or myself) in the future. They 
+are recorded here. They are surely not universally applicable --  
+the PDF standard is truly Byzantine -- but they worked for my case.
+
+This guide is Mac-oriented, but the tools are all available via most
+Linux distributions as well.
+
+## Viewing compressed text data
+
+You can open a PDF in a text editor and see some stuff that looks kinda
+readable, in a vague way, but find that none of it is the actual text 
+of the PDF. It turns out that many PDFs store the text data in a
+compressed form. To read it, you first have to decompress it, which you
+can do with a command line tool called `qpdf`. For Macs, there's a
+[Homebrew formula](https://formulae.brew.sh/formula/qpdf).
+
+Here's a command that decompresses all compressed streams in a given
+PDF, text included (via [this Stack Overflow answer](https://stackoverflow.com/a/11732099/577088)):
+
+    qpdf --qdf --object-streams=disable in.pdf out.pdf
+
+You can recompress the streams like so:
+
+    qpdf out-edited.pdf out-recompressed.pdf
+
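+This second command generated some errors for me, but the resulting PDF
+was readable using Preview. (The errors are probably qpdf noticing that
+hand edits shifted byte offsets and stream lengths; it repairs those as
+it rewrites the file. qpdf also ships a `fix-qdf` helper meant for
+cleaning up hand-edited QDF files before this step.)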
+
+## Finding the text data
+
+Once you've decompressed the compressed text streams, you can open the
+PDF in a text editor and view them! Except you have to find them. Here's 
+what they look like in a basic form: 
+
+    BT
+      /Font_0 12 Tf
+      288 720 Td
+      <002a004800570003003600480057> Tj
+    ET
+
+The [PDF Reference](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf)
+(Third Edition, p.293) has this to say about the above:
+
+> The five lines of this example perform the following steps:
+> 
+> 1. Begin a text object.
+> 2. Set the font and font size to use, installing them as parameters in the text state...
+> 3. Specify a starting position on the page, setting parameters in the text object.
+> 4. Paint the glyphs for a string of characters there.
+> 5. End the text object.
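+
+To avoid scrolling through the whole file looking for these blocks, a
+short script can list them. This is only a sketch under some assumptions:
+the file has already been decompressed with `qpdf --qdf`, the strings use
+the hex form with the `Tj` operator (PDFs can also use `TJ` arrays and
+parenthesized strings, which this ignores), and the font is set inside the
+same `BT`/`ET` block. The function name `list_hex_strings` is made up for
+illustration.
+

```python
import re

# One text object: everything between a BT and the next ET.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A hex string followed by the Tj (show text) operator.
HEX_TJ_RE = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")
# A font selection such as "/Font_0 12 Tf".
FONT_RE = re.compile(rb"/(\S+)\s+[\d.]+\s+Tf")


def list_hex_strings(pdf_bytes):
    """Return (font label, hex string) pairs for each '<...> Tj' found.

    Simplification: only the first Tf inside each BT/ET block is used,
    and the label is None if the block does not set a font itself.
    """
    results = []
    for block in TEXT_OBJECT_RE.finditer(pdf_bytes):
        body = block.group(1)
        font = FONT_RE.search(body)
        font_label = font.group(1).decode("ascii") if font else None
        for match in HEX_TJ_RE.finditer(body):
            results.append((font_label, match.group(1).decode("ascii")))
    return results
```

+
+Read the file in binary mode (`open("out.pdf", "rb").read()`) before
+passing it in, since the PDF still contains binary data such as embedded
+fonts.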
+
+## Actually reading the text
+
+As you can see from the above example, we *still* can't read the text. 
+It is encoded. And if you thought to yourself "look at that hex string,
+I bet it's a bunch of Unicode code points" -- well, I wish we lived in
+a kinder world too. It seems there are a million ways to specify encodings
+in PDFs, including *custom encodings that are embedded in the file itself*.
+Those encodings *do* map to Unicode code points (most of the time?), so that's
+good. Let's assume that the file you're working with does have embedded 
+encodings (because I have no idea how to handle other cases).
+
+### Identifying fonts associated with embedded encodings
+
+Text encodings in PDFs are linked to specific fonts. Information about those
+encodings is embedded in the PDF in ways I don't understand, but there's an 
+existing command line tool that lists the fonts and tells you whether each one
+has such an encoding: `pdffonts` (part of Poppler and Xpdf). Here's an example
+of the output it generates:
+
+    $ pdffonts sample.pdf
+    name                                 type              emb sub uni prob object ID
+    ------------------------------------ ----------------- --- --- --- ---- ---------
+    CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
+    YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0
+
+Here, the relevant fields are "emb" (meaning the font is embedded in
+the PDF) and "uni" (meaning the PDF includes a ToUnicode map that
+translates the font's character codes to Unicode code points). Assuming
+both are set to "yes," we're in luck.
+
+In the text example above, you'll notice the `/Font_0` descriptor. Not
+all fonts in all PDFs will work this way, but in my case, those labels
+lined up in a straightforward way with the listing of fonts above. (So
+`/Font_0` is referring to the font named `CLDQZB+TrebuchetMS,Bold` in the
+above table.)
+
+### Finding the embedded encoding table for the given font
+
+Once you have determined the full name of your text's font (like 
+`CLDQZB+TrebuchetMS,Bold`) you can search for it. In my case it appeared
+several times, but in one particular case, it appeared in a short
+block of commands including one that looked like this:
+
+    /ToUnicode 19 0 R
+
+The `19 0 R` part is a reference to another object in the file: object
+number 19, generation 0, which holds the encoding table. If you then 
+search for `19 0 obj`, you'll find the table. (Or at least that's how
+it worked in my case!)
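+
+The same lookup can be sketched in Python. This assumes the decompressed
+QDF layout, where each object sits between `N G obj` and `endobj`, and
+that the font name and the `/ToUnicode` entry appear in the same object,
+as they did in my file. The function names are made up for illustration.
+

```python
import re

# "N G obj ... endobj" delimits one object in the decompressed file.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\b(.*?)\bendobj", re.DOTALL)
# "/ToUnicode N G R" is a reference to the object holding the table.
TOUNICODE_RE = re.compile(rb"/ToUnicode\s+(\d+)\s+(\d+)\s+R")


def split_objects(pdf_bytes):
    """Map (object number, generation) to the raw bytes of each object."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3)
        for m in OBJ_RE.finditer(pdf_bytes)
    }


def find_tounicode(pdf_bytes, font_name):
    """Return ((number, generation), table bytes) for the named font.

    Returns None if no object mentions both the font name and a
    /ToUnicode reference.
    """
    objects = split_objects(pdf_bytes)
    for body in objects.values():
        if font_name in body:
            ref = TOUNICODE_RE.search(body)
            if ref:
                key = (int(ref.group(1)), int(ref.group(2)))
                return key, objects.get(key)
    return None
```

+
+Called as `find_tounicode(data, b"CLDQZB+TrebuchetMS,Bold")` on the bytes
+of the decompressed file, this should hand back the object ID and the raw
+text of the table, ready for the next step.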
+
+### The encoding table format
+
+The salient part of the encoding table looks like this:
+
+    38 beginbfrange^M
+    <0036><0036><0053>^M
+    <0057><0057><0074>^M
+    <0044><0044><0061>^M
+    <0048><0048><0065>^M
+    <0050><0050><006D>^M
+    ...
+
+If yours looks different, check out the [ToUnicode mapping file tutorial](https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf)
+which describes a bunch of possible variations. In this case, the table is 
+mapping ranges of custom encoding points to Unicode points: each line gives
+the first code in the range, the last code in the range, and the Unicode
+point for the first code -- except these are ranges of just one character.
+So here, the custom point `0036` maps to the Unicode point `0053` -- that
+is, the letter `S`.
+
+To perform this translation in an automated way, I used Python to convert the
+table into a dictionary, and wrote some simple encoding and decoding functions.
+This isn't a Python tutorial, sadly, but if you know Python or any other scripting
+language, you can probably work out a few different ways to solve this part of
+the problem.
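+
+For the curious, here is a minimal sketch of one way to do it. It is not
+the exact code I used. It assumes the table only uses the
+`<first><last><unicode>` form of `bfrange` shown above, that each target
+is a single Unicode code point, and that codes are four hex digits wide.
+Tables with `bfchar` entries or bracketed arrays would need more work.
+The names `parse_bfrange`, `decode_hex` and `encode_text` are my own.
+

```python
import re

# The body of each "beginbfrange ... endbfrange" section.
SECTION_RE = re.compile(r"beginbfrange(.*?)endbfrange", re.DOTALL)
# One range entry: <first code> <last code> <Unicode for first code>.
ENTRY_RE = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def parse_bfrange(table_text):
    """Build a {custom code: character} dict from the bfrange sections."""
    decode_map = {}
    for section in SECTION_RE.findall(table_text):
        for first, last, target in ENTRY_RE.findall(section):
            start, end, dest = int(first, 16), int(last, 16), int(target, 16)
            # A range maps consecutive codes to consecutive code points.
            for offset in range(end - start + 1):
                decode_map[start + offset] = chr(dest + offset)
    return decode_map


def decode_hex(hex_string, decode_map, width=4):
    """Turn a hex string from the PDF into readable text.

    Codes missing from the table become U+FFFD so gaps are visible.
    """
    codes = [
        int(hex_string[i:i + width], 16)
        for i in range(0, len(hex_string), width)
    ]
    return "".join(decode_map.get(code, "\ufffd") for code in codes)


def encode_text(text, decode_map, width=4):
    """Turn readable text into the custom-encoded hex string.

    Raises KeyError if a character has no code in the table.
    """
    encode_map = {char: code for code, char in decode_map.items()}
    return "".join(f"{encode_map[char]:0{width}x}" for char in text)
```

+
+With just the five rows shown above (plus a closing `endbfrange`),
+`decode_hex("003600480057", decode_map)` gives `Set`, and
+`encode_text("Set", decode_map)` gives back `003600480057`. One thing to
+watch for: when `pdffonts` reports "sub" as "yes", the font is a subset,
+so the table probably only covers characters that already appear in the
+document, and `encode_text` will fail on anything else.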
+
+Equipped with my encoder and decoder, I determined the custom-encoded version of
+the text I wanted to replace, wrote the replacement text and custom-encoded it,
+and used find-and-replace to swap them out. The end!
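+
+If you'd rather script the find-and-replace than do it in an editor,
+something like the following sketch should work. It assumes the old text
+appears as a single `<...>` hex string written exactly as you pass it in
+(same letter case, no internal whitespace); text that the PDF splits
+across several strings won't match. The function names are made up for
+illustration.
+

```python
def replace_hex_string(pdf_bytes, old_hex, new_hex):
    """Swap every '<old_hex>' for '<new_hex>' in the raw PDF bytes.

    Returns the edited bytes and the number of replacements made.
    """
    old_token = b"<" + old_hex.encode("ascii") + b">"
    new_token = b"<" + new_hex.encode("ascii") + b">"
    count = pdf_bytes.count(old_token)
    return pdf_bytes.replace(old_token, new_token), count


def edit_file(src_path, dst_path, old_hex, new_hex):
    """Read src_path, replace the hex string, and write dst_path.

    Binary mode matters: the file still holds binary data (embedded
    fonts, for example) that text mode could corrupt.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    edited, count = replace_hex_string(data, old_hex, new_hex)
    with open(dst_path, "wb") as dst:
        dst.write(edited)
    return count
```

+
+For example, `edit_file("out.pdf", "out-edited.pdf", old_hex, new_hex)`
+produces the file that the recompression command from earlier expects. If
+the new string is a different length from the old one, the stream lengths
+recorded in the file no longer match, which is likely one source of the
+errors qpdf reports when recompressing.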
+