@raymelon
Forked from aembleton/docx2md.md
Created December 8, 2017 02:06

Revisions

  1. @aembleton aembleton revised this gist Aug 11, 2015. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions docx2md.md
    Original file line number Diff line number Diff line change
    @@ -4,6 +4,9 @@

    A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

    ## Installing Pandoc
    On a Mac, you can install pandoc with [Homebrew](http://brew.sh/) by running `brew install pandoc`.

    ## The Solution

    As it turns out, there are several open-source tools that allow for conversion between file types. [Pandoc](johnmacfarlane.net/pandoc/) is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." Pandoc can convert from markdown into .docx, and it also works in the other direction.
  2. @aembleton aembleton revised this gist Aug 11, 2015. 1 changed file with 3 additions and 8 deletions.
    11 changes: 3 additions & 8 deletions docx2md.md
    @@ -1,22 +1,17 @@
    # Converting a Word Document to Markdown in Two Moves
    # Converting a Word Document to Markdown in One Move

    ## The Problem

    A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

    ## The Solution

    As it turns out, there are several open-source tools that allow for conversion between file types. [Pandoc](johnmacfarlane.net/pandoc/) is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, it doesn't work in the other direction.

    Then I found [unoconv](http://dag.wieers.com/home-made/unoconv/). This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

    But, by using unconv and pandoc in combination, you can get a pretty clean output. And, the best part is that it retains footnotes and other key syntax (italics, etc.)
    As it turns out, there are several open-source tools that allow for conversion between file types. [Pandoc](johnmacfarlane.net/pandoc/) is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." Pandoc can convert from markdown into .docx, and it also works in the other direction.

    ## Example

    Say you have the Council Rules in a Word Document named "test.docx." [(For a real-life example, visit http://github.com/vzvenyach/Council_Rules/).](http://github.com/vzvenyach/Council_Rules/) Now, you run the following at the command line:

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
    pandoc -f docx -t markdown -o test.md test.docx

    Out is a beautiful markdown file. Admittedly, there's a bit of junk at the top with the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html)
  3. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion docx2md.md
    @@ -10,7 +10,7 @@ As it turns out, there are several open-source tools that allow for conversion b

    Then I found [unoconv](http://dag.wieers.com/home-made/unoconv/). This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

    But, by using unconv and pandoc in combination, you can get a pretty clean output.
    But, by using unconv and pandoc in combination, you can get a pretty clean output. And, the best part is that it retains footnotes and other key syntax (italics, etc.)

    ## Example

  4. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion docx2md.md
    @@ -19,4 +19,4 @@ Say you have the Council Rules in a Word Document named "test.docx." [(For a rea
    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html

    Out is a beautiful markdown file. Admittedly, there's a bit of junk at the top with the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html).
    Out is a beautiful markdown file. Admittedly, there's a bit of junk at the top with the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html)
  5. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion docx2md.md
    @@ -14,5 +14,9 @@ But, by using unconv and pandoc in combination, you can get a pretty clean outpu

    ## Example

    Say you have the Council Rules in a Word Document named "test.docx." [(For a real-life example, visit http://github.com/vzvenyach/Council_Rules/).](http://github.com/vzvenyach/Council_Rules/) Now, you run the following at the command line:

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
    pandoc -f html -t markdown -o test.md test.html

    Out is a beautiful markdown file. Admittedly, there's a bit of junk at the top with the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. [In the end, here's my nicely rendered version of the Rules.](http://vzvenyach.github.io/Council_Rules/Rules.html).
  6. vzvenyach revised this gist Nov 2, 2013. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions docx2md.md
    @@ -14,5 +14,5 @@ But, by using unconv and pandoc in combination, you can get a pretty clean outpu

    ## Example

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
  7. vzvenyach created this gist Nov 2, 2013.
    18 changes: 18 additions & 0 deletions docx2md.md
    @@ -0,0 +1,18 @@
    # Converting a Word Document to Markdown in Two Moves

    ## The Problem

    A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

    ## The Solution

    As it turns out, there are several open-source tools that allow for conversion between file types. [Pandoc](johnmacfarlane.net/pandoc/) is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, it doesn't work in the other direction.

    Then I found [unoconv](http://dag.wieers.com/home-made/unoconv/). This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

    But, by using unconv and pandoc in combination, you can get a pretty clean output.

    ## Example

    unoconv -f html test.docx
    pandoc -f html -t markdown -o test.md test.html
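
The one-step command from the Aug 11, 2015 revision (`pandoc -f docx -t markdown -o test.md test.docx`) handles a single file. If you have a whole folder of Word documents, a small wrapper along these lines might help. This is only a sketch: `md_name` and `docx2md_all` are hypothetical helper names made up for illustration, not part of pandoc, and the only pandoc flags used are the ones already shown in the gist.

```shell
#!/bin/sh
# Hypothetical helper: derive the Markdown file name from a .docx file name
# by stripping the .docx suffix and appending .md.
md_name() {
    printf '%s\n' "${1%.docx}.md"
}

# Hypothetical helper: convert every .docx file in a directory
# (default: the current directory) with the same pandoc flags
# used in the gist's one-step example.
docx2md_all() {
    dir="${1:-.}"
    for src in "$dir"/*.docx; do
        # If no .docx files exist, the pattern stays unexpanded; skip it.
        [ -e "$src" ] || continue
        pandoc -f docx -t markdown -o "$(md_name "$src")" "$src"
    done
}
```

After sourcing the file, running `docx2md_all .` should leave a `.md` file next to each `.docx` in the current directory, assuming pandoc is installed and on your `PATH`.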