Useful regex during manuscript cleanup

Fix inline spans broken by a space after a word before the closing *
Remove image width and height inherited from images in docx
Wrap all images in an Electric Book figure blockquote
Simplify indentation in lists by reducing space after list marker to one space
Remove non-kramdown markdown ^ around superscripts after numbers
Wrap the opening words of a paragraph in a span
Find an email address
Find URLs
Replace single line breaks, keeping empty lines
Replace ##Close-up double-hash headings### with ## kramdown headings

My usual process for moving a book manuscript from Word to kramdown involves:

1. Convert Word to markdown using Pandoc. To make this easy, I use this batch script in Windows.
2. Run a series of regex search-and-replaces. These vary from job to job, to suit the book. This document lists common ones. I use Sublime Text for this, but these should work in most good text editors (e.g. Atom, Brackets).
3. Manually fix and improve the markdown, referring visually to the source Word document. This takes a human, because many authors use formatting for semantic purposes and that formatting doesn't convert to markdown.
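
For readers without the batch script, the Pandoc step can be sketched in Python. This is only an illustration under assumptions: the function names and file names are hypothetical, `pandoc` is assumed to be installed and on the PATH, and the plain `markdown` output format is a guess at a reasonable starting point, not the settings the batch script uses.

```python
import subprocess
from typing import List


def build_pandoc_command(docx_path: str, md_path: str) -> List[str]:
    """Build the Pandoc command line for a docx-to-markdown conversion."""
    # -f and -t set the input and output formats; -o names the output file.
    return ["pandoc", docx_path, "-f", "docx", "-t", "markdown", "-o", md_path]


def convert(docx_path: str, md_path: str) -> None:
    """Run Pandoc and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pandoc_command(docx_path, md_path), check=True)
```

Calling `convert("manuscript.docx", "manuscript.md")` (hypothetical file names) would write the markdown file ready for the regex pass.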

In each example below, the first line is the find expression and the second line is the replace expression (unless the replace should be left blank to delete content).

Never use these to replace-all in a long manuscript in one automated step. They are for quickly moving through a manuscript, where you visually confirm every replace.
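
If you script these replaces instead of stepping through them in an editor, the same confirm-every-replace habit can be kept. The sketch below is one possible approach, not part of the original workflow: `replace_with_review`, `ask` and the `confirm` callback are hypothetical names, and the pattern in the usage note is the superscript regex from later in this document.

```python
import re
from typing import Callable


def replace_with_review(
    pattern: str,
    replacement: str,
    text: str,
    confirm: Callable[[str, str], bool],
    flags: int = 0,
) -> str:
    """Replace matches one at a time, keeping any that `confirm` rejects."""

    def decide(match: "re.Match[str]") -> str:
        proposed = match.expand(replacement)  # fill in \1, \2, ... for this match
        found = match.group(0)
        return proposed if confirm(found, proposed) else found

    return re.sub(pattern, decide, text, flags=flags)


def ask(found: str, proposed: str) -> bool:
    """Prompt on the terminal; anything other than 'y' skips the replace."""
    return input(f"{found!r} -> {proposed!r}? [y/N] ").strip().lower() == "y"
```

For example, `replace_with_review(r"(\d)\^(th|nd|st|rd)\^", r"\1\2", text, ask)` would show each match and its proposed replacement before changing anything.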

Fix inline spans broken by a space after a word before the closing `*`

This regex finds a very common problem with conversion from MS Word (and similar programs), where italics have been applied not just to a word but also to the space after it. This happens because double-clicking a word in Word selects the trailing space along with the word, so making the selection italic italicises the space too. When this converts to markdown, the markdown syntax breaks, because `*this *` is broken while `*this*` is correct.

(\*+[\w!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)
\1\3\2

This regex finds (1) one or more asterisks, followed by (2) one or more word characters or punctuation marks (except asterisks), followed by (3) one or more whitespace characters, followed by (4) one or more asterisks. The replace simply swaps the whitespace and the final asterisks.
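
To see the swap in action outside an editor, here is a minimal Python sketch using the same find and replace expressions. The function name and the sample sentence are made up for illustration.

```python
import re

# Same find expression as above; a triple-quoted raw string lets it hold both quote marks.
BROKEN_SPAN = re.compile(
    r"""(\*+[\w!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)"""
)


def fix_broken_spans(text: str) -> str:
    """Move whitespace that sits before closing asterisks to after them."""
    return BROKEN_SPAN.sub(r"\1\3\2", text)


sample = "because *this *is broken and **that **too"
print(fix_broken_spans(sample))
# because *this* is broken and **that** too
```

Because the character class contains no whitespace, the pattern only matches when a single word sits between the opening asterisks and the misplaced space. A multi-word span such as `*War and Peace *` is not matched and still needs a manual fix.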

Remove image width and height inherited from images in docx

\{width.+?\}
\n
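
As a quick check of what this pair does, here is a small Python sketch. The function name and sample image line are illustrative, and the sample assumes Pandoc has written the attributes on one line directly after the image.

```python
import re


def strip_image_size(text: str) -> str:
    """Replace each {width=... height=...} attribute block with a line break."""
    # .+? is lazy, so the match stops at the first closing brace.
    return re.sub(r"\{width.+?\}", "\n", text)


sample = '![A cat](media/image1.png){width="3.5in" height="2.1in"}'
print(repr(strip_image_size(sample)))
# '![A cat](media/image1.png)\n'
```

In Python, `.` does not match a line break by default, so an attribute block that wraps onto a second line would be left untouched by this sketch.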

Wrap all images in an Electric Book figure blockquote

\n(!.+)
\n> \1\n{:.figure}\n
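
The effect of this replace is easiest to see on a short sample. The Python sketch below uses the same expressions; the function name and sample text are made up for illustration.

```python
import re


def wrap_figures(text: str) -> str:
    """Turn each line that starts with '!' into a blockquote tagged {:.figure}."""
    return re.sub(r"\n(!.+)", r"\n> \1\n{:.figure}\n", text)


sample = "Some text.\n\n![A cat](images/cat.jpg)\n\nMore text."
print(wrap_figures(sample))
```

The image line becomes `> ![A cat](images/cat.jpg)` followed by `{:.figure}` on its own line. Note that the pattern matches any line beginning with `!`, not only images, and that it needs a preceding line break, so an image on the very first line of a file is not matched.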

Simplify indentation in lists by reducing space after list marker to one space

\n-\s+
\n-

Note the single space at the end of the replace expression, after the hyphen.
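
A small Python sketch of the same replace, with the trailing space written out explicitly in the replacement string. The function name and sample list are illustrative.

```python
import re


def tidy_list_markers(text: str) -> str:
    """Collapse the whitespace after a '-' list marker to a single space."""
    return re.sub(r"\n-\s+", "\n- ", text)


sample = "Shopping:\n-   apples\n-  pears\n- plums"
print(tidy_list_markers(sample))
# Shopping:
# - apples
# - pears
# - plums
```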

Remove non-kramdown markdown `^` around superscripts after numbers

(\d)\^(th|nd|st|rd)\^
\1\2

Note: afterwards, do a manual search for ^, because if in the docx source a following character was mistakenly made superscript, too (e.g. 3^rd)^), this regex won't find it.
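
The sketch below runs the same replace in Python and shows the leftover case described in the note. The function name and sample sentences are made up for illustration.

```python
import re


def clean_ordinals(text: str) -> str:
    """Remove the ^ markers around ordinal suffixes that follow a digit."""
    return re.sub(r"(\d)\^(th|nd|st|rd)\^", r"\1\2", text)


print(clean_ordinals("the 3^rd^ of May and the 21^st^ century"))
# the 3rd of May and the 21st century

# A stray superscripted bracket is not matched, so the carets survive.
leftover = clean_ordinals("for the 3^rd)^ time")
print("^" in leftover)
# True
```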

Wrap the opening words of a paragraph in a span

This particular regex finds the first four words of a paragraph and wraps them in a strong span with a kramdown class attribute of yourclass (i.e. <strong class="yourclass">).

It looks for a line break, then a word, then three more words, each preceded by a space. The replace then wraps all of that in double asterisks followed by a kramdown inline attribute.

To change the number of words it selects, change the 3 in braces, near the end of the regex. It should be one less than the number of words you want to select.

The replace expression is simple enough to edit. E.g. change to one asterisk for an em span, and of course change yourclass to the class you need.

\n([\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
\n**\1\2**{:.yourclass}
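
To make the word count easier to adjust, the Python sketch below builds the same find expression from one reusable piece. The names `WORD`, `wrap_opening_words`, `word_count` and `class_name` are hypothetical, and the sample text is made up.

```python
import re

# One "word": word characters, curly quotes and ASCII punctuation (no asterisks).
WORD = r"""[\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+"""


def wrap_opening_words(text: str, word_count: int = 4, class_name: str = "yourclass") -> str:
    """Wrap the first `word_count` words after each line break in a strong span."""
    extra = word_count - 1  # the number in braces is one less than the word count
    pattern = r"\n(" + WORD + r")((\s" + WORD + r"){" + str(extra) + r"})"
    replacement = r"\n**\1\2**{:." + class_name + "}"
    return re.sub(pattern, replacement, text)


sample = "\nOnce upon a time there was a book.\n\nThe end of the story."
print(wrap_opening_words(sample))
```

With the default of four words, the two paragraphs start with `**Once upon a time**{:.yourclass}` and `**The end of the**{:.yourclass}`. The pattern also matches headings and list items, since `#` and `-` are in the character class, which is another reason to confirm each replace.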

Find an email address

See this post for details:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
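
The character classes in this expression list only upper-case letters, so it needs case-insensitive matching to find typical addresses. A minimal Python sketch, with a made-up function name and sample addresses on reserved example domains:

```python
import re

EMAIL = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"


def find_emails(text: str) -> list:
    """Return every email-like string, matching case-insensitively."""
    return re.findall(EMAIL, text, flags=re.IGNORECASE)


sample = "Write to Jane.Doe@example.com or sales@mail.example.org for details."
print(find_emails(sample))
# ['Jane.Doe@example.com', 'sales@mail.example.org']
```

Without `re.IGNORECASE` the same search finds nothing in this sample, because both addresses have lower-case letters before the `@`.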

Find URLs

URLs are really hard to find. This gist from John Gruber is your best bet.

You can use it to quickly move through a text, replacing where necessary with markdown links, e.g. using a replace like \[\1\](http://\1).

Replace single line breaks, keeping empty lines

This post explains:

(\S.*?)\R(.*?\S)
$1 $2
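
Here is a Python sketch of the same idea. Two things differ from the editor version and are assumptions of this sketch: Python's `re` module has no `\R`, so `\r?\n` stands in for the line break, and group references in the replacement are written `\1 \2` instead of `$1 $2`. The function name and sample text are made up.

```python
import re

# (\S.*?) ends a line on non-blank text, (.*?\S) requires text on the next line,
# so a break followed by an empty line never matches.
SINGLE_BREAK = re.compile(r"(\S.*?)\r?\n(.*?\S)")


def join_wrapped_lines(text: str) -> str:
    """Replace single line breaks with a space, leaving empty lines alone."""
    return SINGLE_BREAK.sub(r"\1 \2", text)


sample = (
    "The quick brown\nfox jumps over\nthe lazy dog.\n"
    "\n"
    "A new paragraph\nstarts here."
)
print(join_wrapped_lines(sample))
# The quick brown fox jumps over the lazy dog.
#
# A new paragraph starts here.
```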

Replace `##Close-up double-hash headings###` with `## kramdown headings`

This finds, at the start of a line, one or more hashes, then a string (hopefully heading text), then another string of hashes before a line ending. It replaces the match with the same number of opening hashes, a space, and the heading text, with no trailing hashes.

\n([#]+)([^#]+)([#]+)\n
\n\1 \2\n
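
A short Python sketch of this replace, using a made-up function name and sample text:

```python
import re


def fix_closed_headings(text: str) -> str:
    """Add a space after the opening hashes and drop the trailing hashes."""
    return re.sub(r"\n([#]+)([^#]+)([#]+)\n", r"\n\1 \2\n", text)


sample = "Intro\n##Chapter One###\nBody text\n\n#Part Two#\nMore"
print(fix_closed_headings(sample))
# Intro
# ## Chapter One
# Body text
#
# # Part Two
# More
```

Each match consumes the line break that ends the heading, so a second heading on the very next line is skipped in the same pass and needs another run.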

r2d2m/regex-manuscript-cleanup.md
