Revisions

  1. @arthurattwell arthurattwell revised this gist Feb 21, 2019. 1 changed file with 15 additions and 0 deletions.
    15 changes: 15 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -15,6 +15,7 @@
    * [Find double quotes inside double quotes in Liquid tag parameters](#find-double-quotes-inside-double-quotes-in-liquid-tag-parameters)
    * [Replace named HTML entities with numeric entities](#replace-named-html-entities-with-numeric-entities)
    * [Fun demo: fix split infinitives](#fun-demo-fix-split-infinitives)
    * [Add YAML frontmatter before main headings](#add-yaml-frontmatter-before-main-headings)

    My usual process for moving a book manuscript from Word to kramdown involves:

    @@ -209,3 +210,17 @@ This will change 'to boldly go' and 'to plainly say' to 'to go boldly' and 'to s
    ```
    \1 \3 \2
    ```

    ## Add YAML frontmatter before main headings

    ```
    ^#\s(.+)
    ```

    ```
    ---
    title: "$1"
    ---
    # $1
    ```
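As a rough illustration of the find-and-replace above outside an editor, here is a minimal Python sketch. The function name `add_frontmatter` and the sample text are made up for this example; Python needs `re.MULTILINE` for `^` to match at the start of every line, and it spells `$1` as `\1`.

```python
import re

# Same find pattern as above. re.MULTILINE makes ^ match at each line start,
# which most text editors do by default.
HEADING = re.compile(r"^#\s(.+)", re.MULTILINE)


def add_frontmatter(markdown):
    """Insert a YAML frontmatter block before each top-level heading."""
    # \1 is Python's equivalent of the $1 in the editor replace string.
    return HEADING.sub('---\ntitle: "\\1"\n---\n# \\1', markdown)


sample = "# Chapter One\n\nSome text.\n\n## A subsection\n"
print(add_frontmatter(sample))
```

Only lines that start with a single `#` followed by whitespace are touched, so `##` subsections are left alone.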
  2. @arthurattwell arthurattwell revised this gist Jan 15, 2019. 1 changed file with 14 additions and 1 deletion.
    15 changes: 14 additions & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -14,6 +14,7 @@
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-close-up-double-hash-headings-with--kramdown-headings)
    * [Find double quotes inside double quotes in Liquid tag parameters](#find-double-quotes-inside-double-quotes-in-liquid-tag-parameters)
    * [Replace named HTML entities with numeric entities](#replace-named-html-entities-with-numeric-entities)
    * [Fun demo: fix split infinitives](#fun-demo-fix-split-infinitives)

    My usual process for moving a book manuscript from Word to kramdown involves:

    @@ -195,4 +196,16 @@ Note that this will replace only one `&nbsp;` in each table. If a table ha
    You can use the same strings for, say `&shy;` if you:

    1. replace `&nbsp;` with `&shy;` in the find string
    2. replace `&#160;` with `&#173;` in the replace string.
    2. replace `&#160;` with `&#173;` in the replace string.

    ## Fun demo: fix split infinitives

    This will change 'to boldly go' and 'to plainly say' to 'to go boldly' and 'to say plainly', but it will not change 'to plant flowers'.

    ```
    (\bto\b)\s*(\b[a-z]+?ly\b)\s*([a-z]+)
    ```

    ```
    \1 \3 \2
    ```
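To see what the demo regex does (and where it goes wrong), here is a small Python sketch. The function name `unsplit` is hypothetical; the pattern and replacement are the ones given above.

```python
import re

# (1) 'to', (2) a lowercase word ending in 'ly', (3) the next lowercase word.
SPLIT_INFINITIVE = re.compile(r"(\bto\b)\s*(\b[a-z]+?ly\b)\s*([a-z]+)")


def unsplit(text):
    """Move the '-ly' word after the verb: 'to boldly go' -> 'to go boldly'."""
    return SPLIT_INFINITIVE.sub(r"\1 \3 \2", text)


for phrase in ("to boldly go", "to plainly say", "to plant flowers", "to apply pressure"):
    print(phrase, "->", unsplit(phrase))
```

The last phrase shows why this is only a demo: any word ending in 'ly' is treated as an adverb, so 'to apply pressure' is rewritten as well.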
  3. @arthurattwell arthurattwell revised this gist Nov 27, 2018. 1 changed file with 9 additions and 1 deletion.
    10 changes: 9 additions & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,7 @@
    # Useful regex during manuscript cleanup

    * [Copy-paste to split book into separate chapter files](#copy-paste-to-split-book-into-separate-chapter-files)
    * [Copy-paste to split book into separate chapter files](#copy-paste-to-split-book-into-separate-chapter-files) (also see [split.sh](https://gist.github.com/arthurattwell/55d168c6b584e901b3f9eaa80ca97063))
    * [Add non-breaking space in range of numbers](#add-non-breaking-space-in-range-of-numbers)
    * [Fix inline spans broken by a space after a word before the closing `*`](#fix-inline-spans-broken-by-a-space-after-a-word-before-the-closing-)
    * [Remove image width and height inherited from images in docx](#remove-image-width-and-height-inherited-from-images-in-docx)
    * [Wrap all images in an Electric Book figure blockquote](#wrap-all-images-in-an-electric-book-figure-blockquote)
    @@ -51,6 +52,13 @@ This will find, cut, open a new file, paste the contents, save them (prompting y

    Then repeat.

    ## Add non-breaking space in range of numbers

    ``` perl
    (\d) (\d)
    \1 \2
    ```

    ## Fix inline spans broken by a space after a word before the closing `*`

    This regex finds a very common problem with conversion from MS Word (and other similar programs), where italics have been applied not just to a word but to the space after the word. This happens in Word when a user double-clicks a word to highlight it, before making it italic. When double-clicking a word in Word, Word also highlights the space after the word. When this converts to markdown, the markdown syntax breaks, because `*this *is broken` while `*this* is correct`.
  4. @arthurattwell arthurattwell revised this gist Jul 4, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -12,7 +12,7 @@
    * [Replace single line breaks, keeping empty lines](#replace-single-line-breaks-keeping-empty-lines)
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-close-up-double-hash-headings-with--kramdown-headings)
    * [Find double quotes inside double quotes in Liquid tag parameters](#find-double-quotes-inside-double-quotes-in-liquid-tag-parameters)
    * [Replace named HTML entities with numeric entities](#replace-named-entities-with-numeric-entities)
    * [Replace named HTML entities with numeric entities](#replace-named-html-entities-with-numeric-entities)

    My usual process for moving a book manuscript from Word to kramdown involves:

  5. @arthurattwell arthurattwell revised this gist Jul 4, 2018. 1 changed file with 24 additions and 0 deletions.
    24 changes: 24 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -12,6 +12,7 @@
    * [Replace single line breaks, keeping empty lines](#replace-single-line-breaks-keeping-empty-lines)
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-close-up-double-hash-headings-with--kramdown-headings)
    * [Find double quotes inside double quotes in Liquid tag parameters](#find-double-quotes-inside-double-quotes-in-liquid-tag-parameters)
    * [Replace named HTML entities with numeric entities](#replace-named-entities-with-numeric-entities)

    My usual process for moving a book manuscript from Word to kramdown involves:

    @@ -164,3 +165,26 @@ E.g. you may need to debug Liquid tags like `{% include figure markdown="This is
    ```
    \n*\s*\w+=".+".+"
    ```

    ## Replace named HTML entities with numeric entities

    EPUB3 does not allow named HTML entities (e.g. `&nbsp;`), only numeric ones (`&#160;`). That's a pity because named entities are easier for humans to remember. Where kramdown converts markdown to HTML, kramdown by default replaces both entities with actual unicode characters. But kramdown doesn't reach into block-level elements in HTML islands (actual HTML code inside your markdown file), unless you add the attribute `markdown="1"` to the element's tag.

    In tables, this is a PITA, because you'd have to add `markdown="1"` to every `<td>` that contained a named entity (you can't apply the attribute to the parent `<table>`) and hope that processing its content as markdown won't have unexpected side effects.

    The sensible solution is to just replace named entities in tables with numeric entities. This most often happens with `&nbsp;` and `&shy;`. Here is the regex for that. This will find every table with an `&nbsp;` in it and replace that entity with the numeric equivalent.

    Note that this will replace only one `&nbsp;` in each table. If a table has more than one `&nbsp;` in it, you will have to run this find-and-replace again for each one.

    ```
    (?s)(<table((?!</?table>).)*)&nbsp;(((?!</?table>).)*</table>)
    ```

    ```
    \1&#160;\3
    ```

    You can use the same strings for, say `&shy;` if you:

    1. replace `&nbsp;` with `&shy;` in the find string
    2. replace `&#160;` with `&#173;` in the replace string.
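If you would rather script the repeated passes, the following Python sketch applies the same find and replace strings in a loop until nothing changes. The function name `replace_nbsp_in_tables` and the sample HTML are assumptions for illustration only.

```python
import re

# Same find string as above. Group 1 is greedy, so each pass lands on the
# last remaining &nbsp; in a table and replaces one entity per table.
NBSP_IN_TABLE = re.compile(
    r"(?s)(<table((?!</?table>).)*)&nbsp;(((?!</?table>).)*</table>)"
)


def replace_nbsp_in_tables(html):
    """Repeat the find-and-replace until no table contains &nbsp; any more."""
    previous = None
    while previous != html:
        previous = html
        html = NBSP_IN_TABLE.sub(r"\1&#160;\3", html)
    return html


sample = (
    "<p>a&nbsp;b</p>"
    "<table><tr><td>1&nbsp;kg</td><td>2&nbsp;kg</td></tr></table>"
)
print(replace_nbsp_in_tables(sample))
```

The `&nbsp;` in the paragraph is left untouched, because the pattern only matches between `<table` and `</table>`.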
  6. @arthurattwell arthurattwell revised this gist Jun 7, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -121,7 +121,7 @@ The replace regex is simple enough to edit. E.g. change to one asterisk for an `
    **Remember to turn on the 'Preserve case' option in your editor before using this**

    ```
    \n([\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
    \n([\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
    \n**\1\2**{:.yourclass}
    ```

  7. @arthurattwell arthurattwell revised this gist Jan 23, 2018. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -11,6 +11,7 @@
    * [Find URLs](#find-urls)
    * [Replace single line breaks, keeping empty lines](#replace-single-line-breaks-keeping-empty-lines)
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-close-up-double-hash-headings-with--kramdown-headings)
    * [Find double quotes inside double quotes in Liquid tag parameters](#find-double-quotes-inside-double-quotes-in-liquid-tag-parameters)

    My usual process for moving a book manuscript from Word to kramdown involves:

  8. @arthurattwell arthurattwell revised this gist Jan 23, 2018. 1 changed file with 8 additions and 0 deletions.
    8 changes: 8 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -155,3 +155,11 @@ This finds, at the start of a line, one or more hashes, then a string (hopefully
    \n([#]+)([^#]+)([#]+)\n
    \n\1 \2\n
    ```

    ## Find double quotes inside double quotes in Liquid tag parameters

    E.g. you may need to debug Liquid tags like `{% include figure markdown="This is a "figure"" %}`

    ```
    \n*\s*\w+=".+".+"
    ```
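A quick way to try this pattern against a few lines is sketched below in Python. The helper name `has_suspect_quotes` and the sample tags are made up for the example.

```python
import re

# Same find pattern as above: a parameter name, '=', then at least three
# double quotes with something between each of them.
SUSPECT_QUOTES = re.compile(r'\n*\s*\w+=".+".+"')


def has_suspect_quotes(line):
    """Return True if the line looks like it has quotes nested inside quotes."""
    return SUSPECT_QUOTES.search(line) is not None


broken = '{% include figure markdown="This is a "figure"" %}'
clean = '{% include figure markdown="This is a figure" %}'
two_params = '{% include figure image="cat.jpg" caption="A cat" %}'

for tag in (broken, clean, two_params):
    print(has_suspect_quotes(tag), tag)
```

Note that a tag with two correctly quoted parameters also matches, so treat every hit as something to inspect, not as a confirmed error.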
  9. @arthurattwell arthurattwell revised this gist Nov 8, 2017. 1 changed file with 13 additions and 0 deletions.
    13 changes: 13 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -53,6 +53,8 @@ Then repeat.

    This regex finds a very common problem with conversion from MS Word (and other similar programs), where italics have been applied not just to a word but to the space after the word. This happens in Word when a user double-clicks a word to highlight it, before making it italic. When double-clicking a word in Word, Word also highlights the space after the word. When this converts to markdown, the markdown syntax breaks, because `*this *is broken` while `*this* is correct`.

    ### Single-word search

    Note: this regex only finds single-word instances of this problem, not phrases. E.g. it will fix `We watched *Oliver *today`, but not `We watched *Oliver Twist *today`.

    ```
    @@ -62,6 +64,17 @@ Note: this regex only finds single-word instances of this problem, not phrases.

    This regex (1) finds one or more asterisks followed by (2) any word character or punctuation (except asterisks), followed by (3) one or more spaces, followed by (4) one or more asterisks. The replace simply switches the space and the final asterisks.

    ### Phrase search

    This regex is more powerful and finds the same problem but in words or phrases. We haven't tested it a lot, so don't use it for global replaces: eyeball every change it makes.

    ```
    ((?<=\s)|(?<=^))(\*+[\w !"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)
    \2\4\3
    ```

    This regex works the same way as single-word search, except that it allows spaces in the matching phrase, and looks for the presence of either a beginning of line or another space before the first asterisk.
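For testing the phrase search on sample lines before using it on a manuscript, here is a hedged Python sketch. The function name `fix_phrase_spans` is hypothetical, and `re.MULTILINE` is added so that `^` matches at the start of each line as it does in most editors.

```python
import re

# Same find pattern as above, in a raw triple-quoted string so that the
# quote characters inside the character class need no extra escaping.
PHRASE = re.compile(
    r'''((?<=\s)|(?<=^))(\*+[\w !"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)''',
    re.MULTILINE,
)


def fix_phrase_spans(text):
    """Move the space that sits before closing asterisks to after them."""
    return PHRASE.sub(r"\2\4\3", text)


print(fix_phrase_spans("We watched *Oliver Twist *today"))
print(fix_phrase_spans("*Oliver Twist *is long"))
```

Lines that already have well-formed spans, such as `see *one* and *two* here`, are left as they are in this sketch.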

    ## Remove image width and height inherited from images in docx

    ```
  10. @arthurattwell arthurattwell revised this gist Nov 8, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -53,7 +53,7 @@ Then repeat.

    This regex finds a very common problem with conversion from MS Word (and other similar programs), where italics have been applied not just to a word but to the space after the word. This happens in Word when a user double-clicks a word to highlight it, before making it italic. When double-clicking a word in Word, Word also highlights the space after the word. When this converts to markdown, the markdown syntax breaks, because `*this *is broken` while `*this* is correct`.

    Note: this regex only finds single-word instances of this problem not phrases. E.g. is will fix `We watched *Oliver *today`, not `We watched *Oliver Twist *today`.
    Note: this regex only finds single-word instances of this problem, not phrases. E.g. it will fix `We watched *Oliver *today`, but not `We watched *Oliver Twist *today`.

    ```
    (\*+[\w!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)
  11. @arthurattwell arthurattwell revised this gist Nov 8, 2017. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -53,6 +53,8 @@ Then repeat.

    This regex finds a very common problem with conversion from MS Word (and other similar programs), where italics have been applied not just to a word but to the space after the word. This happens in Word when a user double-clicks a word to highlight it, before making it italic. When double-clicking a word in Word, Word also highlights the space after the word. When this converts to markdown, the markdown syntax breaks, because `*this *is broken` while `*this* is correct`.

    Note: this regex only finds single-word instances of this problem not phrases. E.g. is will fix `We watched *Oliver *today`, not `We watched *Oliver Twist *today`.

    ```
    (\*+[\w!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)
    \1\3\2
  12. @arthurattwell arthurattwell revised this gist Nov 1, 2017. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -102,6 +102,8 @@ To change the number of words it selects, change the `3` in braces, near the end

    The replace regex is simple enough to edit. E.g. change to one asterisk for an `em` span, and of course change `yourclass` to the class you need.

    **Remember to turn on the 'Preserve case' option in your editor before using this**

    ```
    \n([\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’‘!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
    \n**\1\2**{:.yourclass}
  13. @arthurattwell arthurattwell revised this gist Dec 7, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -22,7 +22,7 @@ In each example below, the first line is *find*, the second line *replace* (unle

    **Never use these to replace-all in a long manuscript in one automated step. They are for quickly moving through a manuscript, where you visually confirm every replace.**

    # Copy-paste to split book into separate chapter files
    ## Copy-paste to split book into separate chapter files

    I usually do this at the end, once I've cleaned up the markdown for a whole book in one file. First I make sure I've put YAML frontmatter markers – two lines of `---` – at the start of every file-to-be. It doesn't matter if they contain YAML or not.

  14. @arthurattwell arthurattwell revised this gist Dec 7, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -1,6 +1,6 @@
    # Useful regex during manuscript cleanup

    *
    * [Copy-paste to split book into separate chapter files](#copy-paste-to-split-book-into-separate-chapter-files)
    * [Fix inline spans broken by a space after a word before the closing `*`](#fix-inline-spans-broken-by-a-space-after-a-word-before-the-closing-)
    * [Remove image width and height inherited from images in docx](#remove-image-width-and-height-inherited-from-images-in-docx)
    * [Wrap all images in an Electric Book figure blockquote](#wrap-all-images-in-an-electric-book-figure-blockquote)
  15. @arthurattwell arthurattwell revised this gist Dec 7, 2016. 1 changed file with 28 additions and 0 deletions.
    28 changes: 28 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -1,5 +1,6 @@
    # Useful regex during manuscript cleanup

    *
    * [Fix inline spans broken by a space after a word before the closing `*`](#fix-inline-spans-broken-by-a-space-after-a-word-before-the-closing-)
    * [Remove image width and height inherited from images in docx](#remove-image-width-and-height-inherited-from-images-in-docx)
    * [Wrap all images in an Electric Book figure blockquote](#wrap-all-images-in-an-electric-book-figure-blockquote)
    @@ -21,6 +22,33 @@ In each example below, the first line is *find*, the second line *replace* (unle

    **Never use these to replace-all in a long manuscript in one automated step. They are for quickly moving through a manuscript, where you visually confirm every replace.**

    # Copy-paste to split book into separate chapter files

    I usually do this at the end, once I've cleaned up the markdown for a whole book in one file. First I make sure I've put YAML frontmatter markers – two lines of `---` – at the start of every file-to-be. It doesn't matter if they contain YAML or not.

    ```
    (?s)^(---)$.+?^(---)$.+?(?=^---$)
    ```

    Explanation:

    * `(?s)` says this regex will match newline characters when we say 'match anything'.
    * `^(---)$` matches `---` if it starts and ends a line, i.e. it's the only thing on the line.
    * `.+?` matches one or more of anything else, non-greedily, i.e. until it finds what matches next.
    * `^(---)$` again, matches `---` on its own on a line.
    * `.+?` again, matches one or more of anything else.
    * `(?=^---$)` says stop when you see three hyphens on their own line again (i.e. the next doc's frontmatter).

    In short, it selects the content from one YAML frontmatter block (two lines of three hyphens, which may or may not have YAML between them) until the next YAML block begins.

    So if you have a file of markdown and you want to split it into separate files, first add the two `---`s at the start of each piece of content you want in a new file, then use this regex to select it and the content, up till the next YAML block. Note it will not find the last YAML block and its content.

    In Sublime Text in Windows, F3 will select the next found text. So to find and create a new file, I press: F3 > Ctrl X > Ctrl N > Ctrl V > Ctrl S [save to filename] > Ctrl W. You can get pretty quick at that, especially if you hold down Ctrl.

    This will find, cut, open a new file, paste the contents, save them (prompting you for a filename), and close the new file.

    Then repeat.
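To check what the regex will select before doing the cut-and-paste routine, you could run something like this Python sketch. The name `find_chunks` and the three-chapter sample are invented for illustration; `re.MULTILINE` is needed in Python so that `^` and `$` match at line boundaries.

```python
import re

# Same pattern as above. (?s) lets '.' cross newlines; re.MULTILINE makes
# ^ and $ match at the start and end of each line.
CHUNK = re.compile(r"(?s)^(---)$.+?^(---)$.+?(?=^---$)", re.MULTILINE)


def find_chunks(book):
    """Return each frontmatter block plus its content, up to the next block."""
    return [match.group(0) for match in CHUNK.finditer(book)]


book = (
    "---\ntitle: One\n---\nChapter one.\n\n"
    "---\n---\nChapter two.\n\n"
    "---\n---\nChapter three.\n"
)
chunks = find_chunks(book)
print(len(chunks))
```

As noted above, the last block is not found, because nothing follows it for the lookahead to stop at.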

    ## Fix inline spans broken by a space after a word before the closing `*`

    This regex finds a very common problem with conversion from MS Word (and other similar programs), where italics have been applied not just to a word but to the space after the word. This happens in Word when a user double-clicks a word to highlight it, before making it italic. When double-clicking a word in Word, Word also highlights the space after the word. When this converts to markdown, the markdown syntax breaks, because `*this *is broken` while `*this* is correct`.
  16. @arthurattwell arthurattwell revised this gist Nov 17, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -9,7 +9,7 @@
    * [Find an email address](#find-an-email-address)
    * [Find URLs](#find-urls)
    * [Replace single line breaks, keeping empty lines](#replace-single-line-breaks-keeping-empty-lines)
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-closeup-double-hash-headings-with-kramdown-headings)
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-close-up-double-hash-headings-with--kramdown-headings)

    My usual process for moving a book manuscript from Word to kramdown involves:

  17. @arthurattwell arthurattwell revised this gist Nov 17, 2016. 1 changed file with 10 additions and 0 deletions.
    10 changes: 10 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -9,6 +9,7 @@
    * [Find an email address](#find-an-email-address)
    * [Find URLs](#find-urls)
    * [Replace single line breaks, keeping empty lines](#replace-single-line-breaks-keeping-empty-lines)
    * [Replace `##Close-up double-hash headings###` with `## kramdown headings`](#replace-closeup-double-hash-headings-with-kramdown-headings)

    My usual process for moving a book manuscript from Word to kramdown involves:

    @@ -100,3 +101,12 @@ You can use it to quickly move through a text, replacing where necessary with ma
    (\S.*?)\R(.*?\S)
    $1 $2
    ```

    ## Replace `##Close-up double-hash headings###` with `## kramdown headings`

    This finds, at the start of a line, one or more hashes, then a string (hopefully heading text), then another string of hashes before a line ending. It replaces it with the same number of hashes at the start, a space before the heading text, and no trailing hashes.

    ```
    \n([#]+)([^#]+)([#]+)\n
    \n\1 \2\n
    ```
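Here is a small Python sketch of the same find and replace, useful for trying it on a snippet first. The function name `fix_close_up_headings` and the sample text are assumptions for the example.

```python
import re

# Same find pattern as above: newline, leading hashes, heading text,
# trailing hashes, newline.
CLOSE_UP = re.compile(r"\n([#]+)([^#]+)([#]+)\n")


def fix_close_up_headings(text):
    """Add a space after the leading hashes and drop the trailing ones."""
    return CLOSE_UP.sub(r"\n\1 \2\n", text)


sample = "Intro\n##Close-up heading###\nBody text\n"
print(fix_close_up_headings(sample))
```

Because the pattern starts with `\n`, a heading on the very first line of a file is not matched.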
  18. @arthurattwell arthurattwell revised this gist Nov 16, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -74,7 +74,7 @@ To change the number of words it selects, change the `3` in braces, near the end
    The replace regex is simple enough to edit. E.g. change to one asterisk for an `em` span, and of course change `yourclass` to the class you need.

    ```
    \n([\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
    \n([\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
    \n**\1\2**{:.yourclass}
    ```

  19. @arthurattwell arthurattwell revised this gist Nov 16, 2016. 1 changed file with 6 additions and 4 deletions.
    10 changes: 6 additions & 4 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -67,13 +67,15 @@ Note: afterwards, do a manual search for `^`, because if in the docx source a fo

    This particular regex finds the first four words of a paragraph and wraps them in a `strong` span with a kramdown class attribute of `yourclass` (i.e. `<strong class="yourclass">`).

    To change the number of words it selects, change the `4` in braces, about halfway through the regex. It should be one less than the number of words you want to select.
    It looks for a line break, then a word, then three words preceded by a space. Then the replace wraps that all in double asterisks with a kramdown inline attribute.

    The replace regex is simple enough to edit. E.g. change to one asterisk for an `em` span, and of course change `myclass` to the class you need.
    To change the number of words it selects, change the `3` in braces, near the end of the regex. It should be one less than the number of words you want to select.

    The replace regex is simple enough to edit. E.g. change to one asterisk for an `em` span, and of course change `yourclass` to the class you need.

    ```
    \n([\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+\s){4}[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+
    \n**\1**{:.myclass}
    \n([\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)((\s[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+){3})
    \n**\1\2**{:.yourclass}
    ```
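The following Python sketch builds the find pattern from one shared character class, which may make the "one word, then three more" structure easier to see. The names `WORD`, `OPENING` and `wrap_opening_words` are made up for this example.

```python
import re

# One 'word': word characters plus the punctuation listed in the pattern above.
WORD = r'''[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+'''

# A newline, one word, then three more words each preceded by whitespace.
OPENING = re.compile(r"\n(" + WORD + r")((\s" + WORD + r"){3})")


def wrap_opening_words(text):
    """Wrap the first four words after a line break in a classed strong span."""
    return OPENING.sub(r"\n**\1\2**{:.yourclass}", text)


sample = "\nOnce upon a time there was a book."
print(wrap_opening_words(sample))
```

Changing the `3` changes how many words are wrapped, always one more than the number in braces.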

    ## Find an email address
  20. @arthurattwell arthurattwell revised this gist Oct 19, 2016. 1 changed file with 13 additions and 3 deletions.
    16 changes: 13 additions & 3 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -1,5 +1,15 @@
    # Useful regex during manuscript cleanup

    * [Fix inline spans broken by a space after a word before the closing `*`](#fix-inline-spans-broken-by-a-space-after-a-word-before-the-closing-)
    * [Remove image width and height inherited from images in docx](#remove-image-width-and-height-inherited-from-images-in-docx)
    * [Wrap all images in an Electric Book figure blockquote](#wrap-all-images-in-an-electric-book-figure-blockquote)
    * [Simplify indentation in lists by reducing space after list marker to one space](#simplify-indentation-in-lists-by-reducing-space-after-list-marker-to-one-space)
    * [Remove non-kramdown markdown `^` around superscripts after numbers](#remove-non-kramdown-markdown--around-superscripts-after-numbers)
    * [Wrap the opening words of a paragraph in a span](#wrap-the-opening-words-of-a-paragraph-in-a-span)
    * [Find an email address](#find-an-email-address)
    * [Find URLs](#find-urls)
    * [Replace single line breaks, keeping empty lines](#replace-single-line-breaks-keeping-empty-lines)

    My usual process for moving a book manuscript from Word to kramdown involves:

    1. Convert Word to markdown using Pandoc. To make this easy, I use [this batch script](https://gist.github.com/arthurattwell/44713ec1a870c075eb5e8d7c3ef600ee) in Windows.
    @@ -66,21 +76,21 @@ The replace regex is simple enough to edit. E.g. change to one asterisk for an `
    \n**\1**{:.myclass}
    ```

    ### Find an email address
    ## Find an email address

    [See this post for details](http://www.regular-expressions.info/email.html):

    ```
    \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
    ```

    ### Find URLs
    ## Find URLs

    URLs are really hard to find. [This gist from John Gruber](https://gist.github.com/gruber/8891611) is your best bet.

    You can use it to quickly move through a text, replacing where necessary with markdown links, e.g. using a replace like `\[\1\](http://\1)`.

    ### Replace single line breaks, keeping empty lines
    ## Replace single line breaks, keeping empty lines

    [This post explains](http://stackoverflow.com/questions/10464735/remove-single-line-breaks-keep-empty-lines):

  21. @arthurattwell arthurattwell revised this gist Oct 19, 2016. 1 changed file with 23 additions and 0 deletions.
    23 changes: 23 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -65,3 +65,26 @@ The replace regex is simple enough to edit. E.g. change to one asterisk for an `
    \n([\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+\s){4}[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+
    \n**\1**{:.myclass}
    ```

    ### Find an email address

    [See this post for details](http://www.regular-expressions.info/email.html):

    ```
    \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
    ```
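The pattern only lists uppercase letters, so it relies on a case-insensitive search. A minimal Python sketch, with a hypothetical `find_emails` helper and made-up addresses, could look like this:

```python
import re

# Same pattern as above. re.IGNORECASE is needed because the character
# classes only list uppercase letters.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)


def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL.findall(text)


sample = "Write to jane.doe@example.com or INFO@EXAMPLE.ORG."
print(find_emails(sample))
```

In an editor, the equivalent is to turn off the case-sensitive option before searching.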

    ### Find URLs

    URLs are really hard to find. [This gist from John Gruber](https://gist.github.com/gruber/8891611) is your best bet.

    You can use it to quickly move through a text, replacing where necessary with markdown links, e.g. using a replace like `\[\1\](http://\1)`.

    ### Replace single line breaks, keeping empty lines

    [This post explains](http://stackoverflow.com/questions/10464735/remove-single-line-breaks-keep-empty-lines):

    ```
    (\S.*?)\R(.*?\S)
    $1 $2
    ```
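Python's `re` module does not support `\R`, so the sketch below swaps in an explicit alternation of line endings. That substitution, the function name `unwrap_lines` and the sample text are assumptions for this example; the rest follows the find and replace above.

```python
import re

# (?:\r\n|\n|\r) stands in for \R, which Python's re module does not accept.
SINGLE_BREAK = re.compile(r"(\S.*?)(?:\r\n|\n|\r)(.*?\S)")


def unwrap_lines(text):
    """Join lines separated by a single line break, keeping empty lines."""
    # \1 and \2 are Python's spelling of $1 and $2.
    return SINGLE_BREAK.sub(r"\1 \2", text)


sample = "first line\nsecond line\n\nnew paragraph\n"
print(repr(unwrap_lines(sample)))
```

The empty line between paragraphs survives because the pattern needs a non-space character on both sides of the line break.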
  22. @arthurattwell arthurattwell revised this gist Oct 19, 2016. 1 changed file with 13 additions and 0 deletions.
    13 changes: 13 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -52,3 +52,16 @@ Note the space at the end of the replace expression.
    ```

    Note: afterwards, do a manual search for `^`, because if in the docx source a following character was mistakenly made superscript, too (e.g. `3^rd)^`), this regex won't find it.

    ## Wrap the opening words of a paragraph in a span

    This particular regex finds the first five words of a paragraph and wraps them in a `strong` span with a kramdown class attribute of `myclass` (i.e. `<strong class="myclass">`).

    To change the number of words it selects, change the `4` in braces, about halfway through the regex. It should be one less than the number of words you want to select.

    The replace regex is simple enough to edit. E.g. change to one asterisk for an `em` span, and of course change `myclass` to the class you need.

    ```
    \n([\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+\s){4}[\w’!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+
    \n**\1**{:.myclass}
    ```
  23. @arthurattwell arthurattwell created this gist Oct 11, 2016.
    54 changes: 54 additions & 0 deletions regex-manuscript-cleanup.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,54 @@
    # Useful regex during manuscript cleanup

    My usual process for moving a book manuscript from Word to kramdown involves:

    1. Convert Word to markdown using Pandoc. To make this easy, I use [this batch script](https://gist.github.com/arthurattwell/44713ec1a870c075eb5e8d7c3ef600ee) in Windows.
    2. Run a series of regex search-and-replaces. These vary from job to job, to suit the book. This document lists common ones. I use Sublime Text for this, but these should work in most good text editors (e.g. Atom, Brackets).
    3. Manually fix and improve the markdown referring visually to the source Word document. This takes a human, because many authors use formatting for semantic purposes and that formatting doesn't convert to markdown.

    In each example below, the first line is *find*, the second line *replace* (unless replace should be blank to delete content).

    **Never use these to replace-all in a long manuscript in one automated step. They are for quickly moving through a manuscript, where you visually confirm every replace.**

    ## Fix inline spans broken by a space after a word before the closing `*`

    This regex finds a very common problem with conversion from MS Word (and other similar programs), where italics have been applied not just to a word but to the space after the word. This happens in Word when a user double-clicks a word to highlight it, before making it italic. When double-clicking a word in Word, Word also highlights the space after the word. When this converts to markdown, the markdown syntax breaks, because `*this *is broken` while `*this* is correct`.

    ```
    (\*+[\w!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)
    \1\3\2
    ```

    This regex (1) finds one or more asterisks followed by (2) any word character or punctuation (except asterisks), followed by (3) one or more spaces, followed by (4) one or more asterisks. The replace simply switches the space and the final asterisks.
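A small Python sketch of the same swap, for trying the pattern on sample sentences. The function name `fix_single_word_spans` is hypothetical; the pattern and replacement are the ones given above.

```python
import re

# Same find pattern as above, in a raw triple-quoted string so the quote
# characters inside the character class need no extra escaping.
SINGLE_WORD = re.compile(
    r'''(\*+[\w!"\#$%&'()+,\-./:;<=>?@\[\\\]^_`{|}~]+)(\s+)(\*+)'''
)


def fix_single_word_spans(text):
    """Swap the trailing space and the closing asterisks."""
    return SINGLE_WORD.sub(r"\1\3\2", text)


print(fix_single_word_spans("We watched *Oliver *today"))
print(fix_single_word_spans("We watched *Oliver Twist *today"))
```

The second line is printed unchanged, since the character class contains no space and so cannot span two words.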

    ## Remove image width and height inherited from images in docx

    ```
    \{width.+?\}
    \n
    ```

    ## Wrap all images in an Electric Book figure blockquote

    ```
    \n(!.+)
    \n> \1\n{:.figure}\n
    ```
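As a sketch of what this replace produces, here is the same pair in Python. The function name `wrap_figures` and the sample markdown are invented for illustration.

```python
import re

# Same find pattern as above: a line break followed by a line starting with '!'.
IMAGE_LINE = re.compile(r"\n(!.+)")


def wrap_figures(text):
    """Turn each line starting with '!' into a blockquote tagged {:.figure}."""
    return IMAGE_LINE.sub(r"\n> \1\n{:.figure}\n", text)


sample = "Text.\n\n![A cat](cat.jpg)\n\nMore.\n"
print(wrap_figures(sample))
```

The pattern matches any line that starts with `!`, not only image syntax, which is one more reason to confirm each replace by eye.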

    ## Simplify indentation in lists by reducing space after list marker to one space

    ```
    \n-\s+
    \n-
    ```

    Note the space at the end of the replace expression.

    ## Remove non-kramdown markdown `^` around superscripts after numbers

    ```
    (\d)\^(th|nd|st|rd)\^
    \1\2
    ```

    Note: afterwards, do a manual search for `^`, because if in the docx source a following character was mistakenly made superscript, too (e.g. `3^rd)^`), this regex won't find it.
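The behaviour described in the note can be checked with a short Python sketch. The function name `strip_ordinal_carets` is hypothetical; the pattern and replacement are the ones above.

```python
import re

# Same find pattern as above: a digit, then an ordinal suffix wrapped in carets.
ORDINAL = re.compile(r"(\d)\^(th|nd|st|rd)\^")


def strip_ordinal_carets(text):
    """Remove the carets around ordinal suffixes such as 3^rd^."""
    return ORDINAL.sub(r"\1\2", text)


print(strip_ordinal_carets("the 3^rd^ edition"))
print(strip_ordinal_carets("the 3^rd)^ edition"))
```

The second example is left unchanged, which is why the manual search for `^` is still needed afterwards.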