@Opfour
Forked from senderle/hand-modify-pdf.md
Created September 4, 2023 21:25
Revisions

  1. @senderle senderle renamed this gist Sep 23, 2020. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  2. @senderle senderle created this gist Sep 23, 2020.
    # So you want to modify the text of a PDF by hand...

    If you, like me, resent every dollar spent on commercial PDF tools,
    you might want to know how to change the text content of a PDF without
    having to pay for Adobe Acrobat or another PDF tool. I didn't see an
    obvious open-source tool that lets you dig into PDF internals, but I
    did discover a few useful facts about how PDFs are structured that
    I think may prove useful to others (or myself) in the future. They
    are recorded here. They are surely not universally applicable --
    the PDF standard is truly Byzantine -- but they worked for my case.

    This guide is Mac-oriented, but the tools are all available via most
    Linux distributions as well.

    ## Viewing compressed text data

    If you open a PDF in a text editor, you'll see some stuff that looks kinda
    readable, in a vague way, but none of it is the actual text of the PDF.
    It turns out that many PDFs store their text data in compressed streams.
    To decompress them, you can use a command line tool called `qpdf`. For Macs,
    there's a [homebrew formula](https://formulae.brew.sh/formula/qpdf).

    Here's a command that decompresses the compressed streams (text included)
    in a given PDF (via [this Stack Overflow post](https://stackoverflow.com/a/11732099/577088)):

    qpdf --qdf --object-streams=disable in.pdf out.pdf

    You can recompress the streams like so:

    qpdf out-edited.pdf out-recompressed.pdf

    This second command generated some errors for me, but the resulting PDF
    was readable using Preview.
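
If you'd rather script the round trip, here's a minimal Python sketch that builds the same two `qpdf` calls shown above. The helper names (`qdf_command`, `recompress_command`, `run`) are made up for illustration, and it assumes `qpdf` is on your PATH.

```python
import shutil
import subprocess


def qdf_command(src, dst):
    """Build the qpdf call that writes an uncompressed, editable copy."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def recompress_command(src, dst):
    """Build the plain qpdf call that rewrites (and recompresses) a file."""
    return ["qpdf", src, dst]


def run(command):
    """Run a command and return its exit code (hypothetical helper)."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")
    return subprocess.run(command).returncode


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe to try.
    print(" ".join(qdf_command("in.pdf", "out.pdf")))
    print(" ".join(recompress_command("out-edited.pdf", "out-recompressed.pdf")))
```

The qpdf manual describes exit status 3 as "succeeded with warnings", which may be what the messages from the second command amount to. qpdf also ships a `fix-qdf` helper meant for repairing hand-edited QDF files; it may be worth running on the edited file before recompressing.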

    ## Finding the text data

    Once you've decompressed the compressed text streams, you can open the
    PDF in a text editor and view them! Except you have to find them. Here's
    what they look like in a basic form:

    BT
    /Font_0 12 Tf
    288 720 Td
    <002a004800570003003600480057> Tj
    ET

    The [PDF Reference](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf)
    (Third Edition, p.293) has this to say about the above:

    > The five lines of this example perform the following steps:
    >
    > 1. Begin a text object.
    > 2. Set the font and font size to use, installing them as parameters in the text state...
    > 3. Specify a starting position on the page, setting parameters in the text object.
    > 4. Paint the glyphs for a string of characters there.
    > 5. End the text object.
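
Scrolling through a big file looking for `BT` gets old fast. Here's a rough Python sketch that pulls the hex strings out of text objects shaped like the example above. It assumes the decompressed layout puts `BT` and `ET` on their own lines and that the text is painted with `Tj`; it ignores other text-showing operators such as `TJ`, so treat it as a starting point. The function name is made up.

```python
import re

# One text object: everything between a BT line and the next ET line.
TEXT_OBJECT = re.compile(rb"^BT[ \t\r]*$(.*?)^ET[ \t\r]*$", re.DOTALL | re.MULTILINE)
# A hex string painted with the Tj operator, e.g. <002a0048> Tj
HEX_TJ = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")


def find_hex_strings(pdf_bytes):
    """Return the hex payload of every `<...> Tj` inside BT ... ET blocks."""
    found = []
    for block in TEXT_OBJECT.finditer(pdf_bytes):
        for match in HEX_TJ.finditer(block.group(1)):
            # Whitespace inside a PDF hex string is insignificant, so drop it.
            found.append(b"".join(match.group(1).split()).decode("ascii"))
    return found


sample = (
    b"BT\n"
    b"/Font_0 12 Tf\n"
    b"288 720 Td\n"
    b"<002a004800570003003600480057> Tj\n"
    b"ET\n"
)
print(find_hex_strings(sample))
```

For a real file, read it in binary mode (`open("out.pdf", "rb").read()`) and pass the bytes in.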

    ## Actually reading the text

    As you can see from the above example, we *still* can't read the text.
    It is encoded. And if you thought to yourself "look at that hex string,
    I bet it's a bunch of Unicode code points" -- well, I wish we lived in
    a kinder world too. It seems there are a million ways to specify encodings
    in PDFs, including *custom encodings that are embedded in the file itself*.
    Those encodings *do* map to Unicode code points (most of the time?), so that's
    good. Let's assume that the file you're working with does have embedded
    encodings (because I have no idea how to handle other cases).

    ### Identifying fonts associated with embedded encodings

    Text encodings in PDFs are linked to specific fonts. Information about those
    encodings is embedded in the PDF in ways I don't understand, but there's an
    existing command line tool that extracts it: `pdffonts`. Here's an example
    of the output it generates:

    $ pdffonts sample.pdf
    name                                 type              emb sub uni prob object ID
    ------------------------------------ ----------------- --- --- --- ---- ---------
    CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
    YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0

    Here, the relevant fields are "emb" (meaning the font is embedded in
    the PDF) and "uni" (meaning the PDF includes an explicit "to Unicode" map
    for that font, rather than leaving you with raw glyph codes). Assuming both
    are set to "yes," we're in luck.

    In the text example above, you'll notice the `/Font_0` descriptor. Not
    all fonts in all PDFs will work this way, but in my case, those labels
    lined up in a straightforward way with the listing of fonts above. (So
    `/Font_0` is referring to the font named `CLDQZB+TrebuchetMS,Bold` in the
    above table.)

    ### Finding the embedded encoding table for the given font

    Once you have determined the full name of your text's font (like
    `CLDQZB+TrebuchetMS,Bold`) you can search for it. In my case it appeared
    several times, but in one particular case, it appeared in a short
    block of commands including one that looked like this:

    /ToUnicode 19 0 R

    Here `19 0 R` is an indirect reference to object number 19 (generation 0),
    which holds the encoding table. If you then search for `19 0 obj`, you'll
    find the table. (Or at least that's how it worked in my case!)
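
The same lookup can be sketched in Python. This assumes a decompressed file where each `N G obj` header and each `endobj` starts its own line; the function names and the sample fragment are hypothetical, modelled on the snippets above rather than copied from a real file.

```python
import re


def find_tounicode_refs(pdf_text):
    """Return (object number, generation) pairs from every /ToUnicode entry."""
    refs = re.findall(r"/ToUnicode\s+(\d+)\s+(\d+)\s+R", pdf_text)
    return [(int(number), int(generation)) for number, generation in refs]


def extract_object(pdf_text, number, generation=0):
    """Return the text between `N G obj` and the next `endobj`, or None."""
    pattern = re.compile(
        rf"^{number}\s+{generation}\s+obj\b(.*?)^endobj\b",
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(pdf_text)
    return match.group(1).strip() if match else None


# Hypothetical fragment in the shape described above.
sample = """9 0 obj
<<
  /BaseFont /CLDQZB+TrebuchetMS,Bold
  /ToUnicode 19 0 R
>>
endobj
19 0 obj
<< /Length 20 0 R >>
stream
1 beginbfrange
<0036><0036><0053>
endbfrange
endstream
endobj
"""

for number, generation in find_tounicode_refs(sample):
    print(extract_object(sample, number, generation))
```

To try it on a real file, reading with `open("out.pdf", encoding="latin-1")` avoids decode errors, since latin-1 accepts every byte value.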

    ### The encoding table format

    The salient part of the encoding table looks like this:

    38 beginbfrange^M
    <0036><0036><0053>^M
    <0057><0057><0074>^M
    <0044><0044><0061>^M
    <0048><0048><0065>^M
    <0050><0050><006D>^M
    ...

    If yours looks different, check out the [ToUnicode mapping file tutorial](https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf),
    which describes a bunch of possible variations. In this case, the table is
    mapping ranges of custom encoding points to Unicode points: each line gives
    the first and last custom code of a range, then the Unicode point for the
    first code. Here every range is just one character long. So the custom point
    `0036` maps to the Unicode point `0053` -- that is, the letter `S`.

    To perform this translation in an automated way, I used Python to convert the
    table into a dictionary, and wrote some simple encoding and decoding functions.
    This isn't a Python tutorial, sadly, but if you know Python or any other scripting
    language, you can probably work out a few different ways to solve this part of
    the problem.
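
For what it's worth, here's one way that conversion could look. It's a sketch, not the script used for this guide. It assumes every code is four hex digits wide and only handles the simple `<first><last><target>` line form shown above; `bfchar` sections and the bracketed array form are skipped. The names are made up.

```python
import re

# One bfrange line: <first code><last code><Unicode value for the first code>
BFRANGE_LINE = re.compile(
    r"<([0-9A-Fa-f]+)>[ \t]*<([0-9A-Fa-f]+)>[ \t]*<((?:[0-9A-Fa-f]{4})+)>"
)


def parse_bfrange(table_text):
    """Build {custom code: Unicode text} from simple bfrange lines."""
    mapping = {}
    for first, last, target in BFRANGE_LINE.findall(table_text):
        start, end = int(first, 16), int(last, 16)
        # ToUnicode targets are UTF-16BE; later codes in a range bump the
        # final character by one each.
        base = bytes.fromhex(target).decode("utf-16-be")
        for offset in range(end - start + 1):
            mapping[start + offset] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping


def decode(hex_string, mapping, width=4):
    """Translate a hex string of fixed-width codes into readable text."""
    codes = [int(hex_string[i:i + width], 16) for i in range(0, len(hex_string), width)]
    # Codes missing from the table show up as U+FFFD instead of being guessed.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


table = """
<0036><0036><0053>
<0057><0057><0074>
<0044><0044><0061>
<0048><0048><0065>
<0050><0050><006D>
"""
mapping = parse_bfrange(table)
print(decode("003600480057", mapping))
```

With only the five table lines excerpted above, `003600480057` comes out as `Set`; codes that aren't in the excerpt (like `002a`) come out as the replacement character until the rest of the table is loaded.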

    Equipped with my encoder and decoder, I determined the custom-encoded version of
    the text I wanted to replace, wrote the replacement text and custom-encoded it,
    and used find-and-replace to swap them out. The end!
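
The encode-and-swap step might look something like this sketch. It assumes four-digit codes written in lowercase hex (match whatever case your file uses) and a mapping dictionary like the one described above; the tiny `mapping` here is filled in by hand from the five excerpted table lines, and all the function names are made up. One catch worth knowing: with a subsetted font ("sub" is "yes" in the `pdffonts` output), the replacement text can only use characters that already have a code in the table, so the encoder refuses anything else.

```python
def invert(mapping):
    """Turn {code: character} into {character: code} for encoding."""
    return {char: code for code, char in mapping.items()}


def encode(text, reverse_mapping, width=4):
    """Encode text as a hex string of custom codes."""
    missing = sorted({ch for ch in text if ch not in reverse_mapping})
    if missing:
        # Characters without a code can't be shown with this font's table.
        raise ValueError(f"no code for characters: {missing!r}")
    return "".join(f"{reverse_mapping[ch]:0{width}x}" for ch in text)


def replace_text(pdf_text, old, new, reverse_mapping):
    """Swap the encoded form of `old` for the encoded form of `new`."""
    old_hex = f"<{encode(old, reverse_mapping)}>"
    new_hex = f"<{encode(new, reverse_mapping)}>"
    return pdf_text.replace(old_hex, new_hex)


# Hand-built from the table excerpt above; a real table has more entries.
mapping = {0x36: "S", 0x57: "t", 0x44: "a", 0x48: "e", 0x50: "m"}
reverse = invert(mapping)

print(replace_text("<003600480057> Tj", "Set", "Sat", reverse))
```

After the swap, save the file and run it back through `qpdf` to recompress it.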