@vsujeesh
Forked from danmou/onenote_export.py
Created June 8, 2023 15:37

Revisions

  1. @danmou danmou revised this gist Jan 9, 2020. No changes.
  2. @danmou danmou revised this gist Jun 17, 2019. 1 changed file with 7 additions and 0 deletions.
    7 changes: 7 additions & 0 deletions onenote_export.py
    @@ -32,6 +32,13 @@
    # (This does not give any third parties access to your data, as long as you don't share the client id
    # and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.

    ## Note
    # Microsoft limits how many requests you can make within a given time period. Therefore, if you have many
    # notes you might eventually see messages like this in the terminal: "Too many requests, waiting 20s and
    # trying again." This is not a problem, but it means the entire process can take a while. Also, the login
    # session can expire after a while, which results in a TokenExpiredError. If this happens, simply reload
    # `http://localhost:5000` and the script will continue (skipping the files it already downloaded).

    client_id = '...'
    secret = '...'
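
The note above describes a fixed 20 s wait when requests are throttled. As an alternative, here is a minimal sketch that waits for the interval the server suggests, assuming a 429 response carries a `Retry-After` header holding a number of seconds (Microsoft Graph documents this header for throttled requests). The names `retry_delay`, `get_with_retry`, and the `fetch` callable are hypothetical and not part of the script.

```python
import time


def retry_delay(headers, default=20.0):
    """Return the wait in seconds suggested by a Retry-After header.

    Falls back to `default` when the header is missing or not a number.
    """
    value = headers.get('Retry-After')
    try:
        delay = float(value)
    except (TypeError, ValueError):
        return default
    return delay if delay >= 0 else default


def get_with_retry(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until it stops returning status 429.

    `fetch` is any callable returning an object with `status_code` and
    `headers` attributes (an assumption; a bound `session.get` would fit).
    Raises RuntimeError after `max_attempts` throttled responses.
    """
    for _ in range(max_attempts):
        resp = fetch(url)
        if resp.status_code != 429:
            return resp
        sleep(retry_delay(resp.headers))
    raise RuntimeError(f'Still throttled after {max_attempts} attempts: {url}')
```

Capping the number of attempts is a design choice for the sketch; the script itself retries indefinitely.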

  3. @danmou danmou revised this gist Jun 17, 2019. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions onenote_export.py
    @@ -215,6 +215,7 @@ def main_logic():
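                        # Write UTF-8 explicitly so non-ASCII note content does not depend on the OS default.
                        with open(out_html, "w", encoding="utf-8") as f:
                            f.write(content)

        print("Done!")
        return flask.render_template_string('<html><head><title>Done</title></head><body><p><b>Done</b></p></body></html>')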

    if __name__ == "__main__":
  4. @danmou danmou revised this gist Jun 17, 2019. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions onenote_export.py
    @@ -32,6 +32,9 @@
    # (This does not give any third parties access to your data, as long as you don't share the client id
    # and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.

    client_id = '...'
    secret = '...'

    import os
    import random
    import re
    @@ -47,8 +50,6 @@
    import msal
    from requests_oauthlib import OAuth2Session

    client_id = '...'
    secret = '...'
    output_path = Path('output')
    graph_url = 'https://graph.microsoft.com/v1.0'
    authority_url = 'https://login.microsoftonline.com/common'
  5. @danmou danmou revised this gist Jun 17, 2019. 1 changed file with 20 additions and 6 deletions.
    26 changes: 20 additions & 6 deletions onenote_export.py
    @@ -2,21 +2,35 @@
    # This Python script exports all the OneNote notebooks linked to your Microsoft account to HTML files.

    ## Output
    # The notebooks will each become a subdirectory of the `output` folder, with further subdirectories for the sections within each notebook and the pages within each section. Each page is a directory containing the HTML file `main.html` and two directories `images` and `attachments` (if necessary) for the images and attachments. Any sub-pages will be subdirectories within this one.
    # The notebooks will each become a subdirectory of the `output` folder, with further subdirectories
    # for the sections within each notebook and the pages within each section. Each page is a directory
    # containing the HTML file `main.html` and two directories `images` and `attachments` (if necessary)
    # for the images and attachments. Any sub-pages will be subdirectories within this one.

    ## Setup
    # In order to run the script, you must first do the following:
    # 1. Go to https://aad.portal.azure.com/ and log in with your Microsoft account.
    # 2. Select "Azure Active Directory" and then "App registrations" under "Manage".
    # 3. Select "New registration". Choose any name, set "Supported account types" to "Accounts in any organizational directory and personal Microsoft accounts" and under "Redirect URI", select Web and enter `http://localhost:5000/getToken`. Register.
    # 3. Select "New registration". Choose any name, set "Supported account types" to "Accounts in any
    # organizational directory and personal Microsoft accounts" and under "Redirect URI", select Web
    # and enter `http://localhost:5000/getToken`. Register.
    # 4. Copy "Application (client) ID" and paste it as `client_id` below in this script.
    # 5. Select "Certificates & secrets" under "Manage". Press "New client secret", choose a name and confirm.
    # 5. Select "Certificates & secrets" under "Manage". Press "New client secret", choose a name and
    # confirm.
    # 6. Copy the client secret and paste it as `secret` below in this script.
    # 7. Select "API permissions" under "Manage". Press "Add a permission", scroll down and select OneNote, choose "Delegated permissions" and check "Notes.Read" and "Notes.Read.All". Press "Add permissions".
    # 8. Make sure you have Python 3.7 (or newer) installed and install the dependencies using the command `pip install flask msal requests_oauthlib`.
    # 7. Select "API permissions" under "Manage". Press "Add a permission", scroll down and select OneNote,
    # choose "Delegated permissions" and check "Notes.Read" and "Notes.Read.All". Press "Add
    # permissions".
    # 8. Make sure you have Python 3.7 (or newer) installed and install the dependencies using the command
    # `pip install flask msal requests_oauthlib`.

    ## Running
    # In a terminal, navigate to the directory where this script is located and run it using `python onenote_export.py`. This will start a local web server on port 5000. In your browser navigate to http://localhost:5000 and log in to your Microsoft account. The first time you do it, you will also have to accept that the app can read your OneNote notes. (This does not give any third parties access to your data, as long as you don't share the client id and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.
    # In a terminal, navigate to the directory where this script is located and run it using
    # `python onenote_export.py`. This will start a local web server on port 5000.
    # In your browser navigate to http://localhost:5000 and log in to your Microsoft account.
    # The first time you do it, you will also have to accept that the app can read your OneNote notes.
    # (This does not give any third parties access to your data, as long as you don't share the client id
    # and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.

    import os
    import random
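
The directory layout described in the Output section of the README comments can be illustrated with plain `pathlib`. This is a minimal sketch, assuming each page carries `order`, `level`, and `title` fields as in the script's page loop. The helper `page_dirs` and the sample pages are hypothetical.

```python
from pathlib import Path


def page_dirs(section_dir, pages):
    """Map each page to its output directory.

    A level-0 page sits directly in the section directory; a page at
    level n is nested inside the most recent page at level n - 1.
    """
    parents = {}  # level -> directory of the latest page seen at that level
    result = []
    for page in sorted(pages, key=lambda p: p['order']):
        name = f"{page['order']}_{page['title']}"
        level = page['level']
        base = section_dir if level == 0 else parents[level - 1]
        parents[level] = base / name
        result.append(parents[level])
    return result


# Hypothetical sample: one top-level page with a sub-page, then another page.
pages = [
    {'order': 1, 'level': 1, 'title': 'Details'},
    {'order': 0, 'level': 0, 'title': 'Overview'},
    {'order': 2, 'level': 0, 'title': 'Appendix'},
]
for directory in page_dirs(Path('output') / 'Notebook' / 'Section', pages):
    print(directory / 'main.html')
```

The sub-page `1_Details` ends up inside `0_Overview`, while `2_Appendix` returns to the section directory.
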
  6. @danmou danmou revised this gist Jun 17, 2019. 1 changed file with 23 additions and 7 deletions.
    30 changes: 23 additions & 7 deletions onenote_export.py
    @@ -1,3 +1,23 @@
    ### README
    # This Python script exports all the OneNote notebooks linked to your Microsoft account to HTML files.

    ## Output
    # The notebooks will each become a subdirectory of the `output` folder, with further subdirectories for the sections within each notebook and the pages within each section. Each page is a directory containing the HTML file `main.html` and two directories `images` and `attachments` (if necessary) for the images and attachments. Any sub-pages will be subdirectories within this one.

    ## Setup
    # In order to run the script, you must first do the following:
    # 1. Go to https://aad.portal.azure.com/ and log in with your Microsoft account.
    # 2. Select "Azure Active Directory" and then "App registrations" under "Manage".
    # 3. Select "New registration". Choose any name, set "Supported account types" to "Accounts in any organizational directory and personal Microsoft accounts" and under "Redirect URI", select Web and enter `http://localhost:5000/getToken`. Register.
    # 4. Copy "Application (client) ID" and paste it as `client_id` below in this script.
    # 5. Select "Certificates & secrets" under "Manage". Press "New client secret", choose a name and confirm.
    # 6. Copy the client secret and paste it as `secret` below in this script.
    # 7. Select "API permissions" under "Manage". Press "Add a permission", scroll down and select OneNote, choose "Delegated permissions" and check "Notes.Read" and "Notes.Read.All". Press "Add permissions".
    # 8. Make sure you have Python 3.7 (or newer) installed and install the dependencies using the command `pip install flask msal requests_oauthlib`.

    ## Running
    # In a terminal, navigate to the directory where this script is located and run it using `python onenote_export.py`. This will start a local web server on port 5000. In your browser navigate to http://localhost:5000 and log in to your Microsoft account. The first time you do it, you will also have to accept that the app can read your OneNote notes. (This does not give any third parties access to your data, as long as you don't share the client id and secret you created on the Azure portal). After this, go back to the terminal to follow the progress.

    import os
    import random
    import re
    @@ -13,8 +33,8 @@
    import msal
    from requests_oauthlib import OAuth2Session

    client_id = '<...>'
    secret = '<...>'
    client_id = '...'
    secret = '...'
    output_path = Path('output')
    graph_url = 'https://graph.microsoft.com/v1.0'
    authority_url = 'https://login.microsoftonline.com/common'
    @@ -139,9 +159,6 @@ def download_attachment(tag_match):
    @app.route("/getToken")
    def main_logic():
        code = flask.request.args['code']
        # state = flask.request.args['state']
        # if state != flask.session['state']:
        #     raise ValueError("State does not match")

        token = application.acquire_token_by_authorization_code(code, scopes=scopes,
                                                                redirect_uri=redirect_uri)
    @@ -182,8 +199,7 @@ def main_logic():
                        content = download_attachments(graph_client, content, out_dir)
                        # Write UTF-8 explicitly so non-ASCII note content does not depend on the OS default.
                        with open(out_html, "w", encoding="utf-8") as f:
                            f.write(content)
        # return flask.render_template('display.html', result=[])
        # return flask.render_template_string(content.text)

        return flask.render_template_string('<html><head><title>Done</title></head><body><p><b>Done</b></p></body></html>')

    if __name__ == "__main__":
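
This revision drops the commented-out `state` comparison from the `/getToken` handler. If the check were enabled instead, it could look like the minimal sketch below, assuming the value stored at login and the value returned on the redirect are both available as strings. `state_matches` is a hypothetical helper, not part of the script.

```python
import hmac
import uuid


def state_matches(expected, received):
    """Return True only when both values are present and identical.

    hmac.compare_digest compares in constant time, which avoids leaking
    how many leading characters matched.
    """
    if not expected or not received:
        return False
    return hmac.compare_digest(expected.encode('utf-8'), received.encode('utf-8'))


saved = str(uuid.uuid4())           # what /login would store in the session
print(state_matches(saved, saved))  # True
print(state_matches(saved, None))   # False
```
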
  7. @danmou danmou created this gist Jun 17, 2019.
    190 changes: 190 additions & 0 deletions onenote_export.py
    @@ -0,0 +1,190 @@
    import os
    import random
    import re
    import shutil
    import string
    import time
    import uuid
    from html.parser import HTMLParser
    from pathlib import Path
    from xml.etree import ElementTree

    import flask
    import msal
    from requests_oauthlib import OAuth2Session

    client_id = '<...>'
    secret = '<...>'
    output_path = Path('output')
    graph_url = 'https://graph.microsoft.com/v1.0'
    authority_url = 'https://login.microsoftonline.com/common'
    scopes = ['Notes.Read', 'Notes.Read.All']
    redirect_uri = 'http://localhost:5000/getToken'

    app = flask.Flask(__name__)
    app.debug = True
    app.secret_key = os.urandom(16)

    application = msal.ConfidentialClientApplication(
        client_id,
        authority=authority_url,
        client_credential=secret
    )


    @app.route("/")
    def main():
        resp = flask.Response(status=307)
        resp.headers['location'] = '/login'
        return resp


    @app.route("/login")
    def login():
        auth_state = str(uuid.uuid4())
        flask.session['state'] = auth_state
        authorization_url = application.get_authorization_request_url(scopes, state=auth_state,
                                                                      redirect_uri=redirect_uri)
        resp = flask.Response(status=307)
        resp.headers['location'] = authorization_url
        return resp


    def get_json(graph_client, url, params=None):
        values = []
        next_page = url
        while next_page:
            resp = get(graph_client, next_page, params=params)
            if resp is None:
                # get() returns None on a 500 error, so there is no JSON to read.
                raise RuntimeError(f'Server error while fetching {next_page}')
            resp = resp.json()
            if 'value' not in resp:
                raise RuntimeError(f'Invalid server response: {resp}')
            values += resp['value']
            next_page = resp.get('@odata.nextLink')
        return values


    def get(graph_client, url, params=None):
        while True:
            resp = graph_client.get(url, params=params)
            if resp.status_code == 429:
                # We are being throttled due to too many requests.
                # See https://docs.microsoft.com/en-us/graph/throttling
                print(' Too many requests, waiting 20s and trying again.')
                time.sleep(20)
            elif resp.status_code == 500:
                # In my case, one specific note page consistently gave this status
                # code when trying to get the content. The error was "19999:
                # Something failed, the API cannot share any more information
                # at the time of the request."
                print(' Error 500, skipping this page.')
                return None
            else:
                resp.raise_for_status()
                return resp


    def download_attachments(graph_client, content, out_dir):
        image_dir = out_dir / 'images'
        attachment_dir = out_dir / 'attachments'
        # if image_dir.exists():
        #     shutil.rmtree(image_dir)

        class MyHTMLParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                self.attrs = {k: v for k, v in attrs}

        def generate_html(tag, props):
            element = ElementTree.Element(tag, attrib=props)
            return ElementTree.tostring(element, encoding='unicode')

        def download_image(tag_match):
            # <img width="843" height="218.5" src="..." data-src-type="image/png" data-fullres-src="..." data-fullres-src-type="image/png" />
            parser = MyHTMLParser()
            parser.feed(tag_match[0])
            props = parser.attrs
            image_url = props.get('data-fullres-src', props['src'])
            image_type = props.get('data-fullres-src-type', props['data-src-type']).split("/")[-1]
            file_name = ''.join(random.choice(string.ascii_lowercase) for _ in range(10)) + '.' + image_type
            resp = get(graph_client, image_url)
            if resp is None:
                # get() returns None on a 500 error; leave the tag unchanged.
                return tag_match[0]
            img = resp.content
            print(f' Downloaded image of {len(img)} bytes.')
            image_dir.mkdir(exist_ok=True)
            with open(image_dir / file_name, "wb") as f:
                f.write(img)
            props['src'] = "images/" + file_name
            props = {k: v for k, v in props.items() if 'data-fullres-src' not in k}
            return generate_html('img', props)

        def download_attachment(tag_match):
            # <object data-attachment="Trig_Cheat_Sheet.pdf" type="application/pdf" data="..." style="position:absolute;left:528px;top:139px" />
            parser = MyHTMLParser()
            parser.feed(tag_match[0])
            props = parser.attrs
            data_url = props['data']
            file_name = props['data-attachment']
            if (attachment_dir / file_name).exists():
                print(f' Attachment {file_name} already downloaded; skipping.')
            else:
                resp = get(graph_client, data_url)
                if resp is None:
                    # get() returns None on a 500 error; leave the tag unchanged.
                    return tag_match[0]
                data = resp.content
                print(f' Downloaded attachment {file_name} of {len(data)} bytes.')
                attachment_dir.mkdir(exist_ok=True)
                with open(attachment_dir / file_name, "wb") as f:
                    f.write(data)
            props['data'] = "attachments/" + file_name
            return generate_html('object', props)

        content = re.sub(r"<img .*?\/>", download_image, content, flags=re.DOTALL)
        content = re.sub(r"<object .*?\/>", download_attachment, content, flags=re.DOTALL)
        return content


    @app.route("/getToken")
    def main_logic():
        code = flask.request.args['code']
        # state = flask.request.args['state']
        # if state != flask.session['state']:
        #     raise ValueError("State does not match")

        token = application.acquire_token_by_authorization_code(code, scopes=scopes,
                                                                redirect_uri=redirect_uri)
        graph_client = OAuth2Session(token=token)

        notebooks = get_json(graph_client, f'{graph_url}/me/onenote/notebooks')
        print(f'Got {len(notebooks)} notebooks.')
        for nb in notebooks:
            nb_name = nb["displayName"]
            print(f'Opening notebook {nb_name}')
            sections = get_json(graph_client, nb['sectionsUrl'])
            print(f' Got {len(sections)} sections.')
            for sec in sections:
                sec_name = sec["displayName"]
                print(f' Opening section {sec_name}')
                pages = get_json(graph_client, sec['pagesUrl'] + '?pagelevel=true')
                print(f' Got {len(pages)} pages.')
                pages = sorted([(page['order'], page) for page in pages])
                # level_dirs[n] holds the directory of the most recent page at level n.
                level_dirs = [None]*4
                for order, page in pages:
                    level = page['level']
                    page_title = f'{order}_{page["title"]}'
                    print(f' Opening page {page_title}')
                    if level == 0:
                        out_dir = output_path / nb_name / sec_name / page_title
                    else:
                        out_dir = level_dirs[level - 1] / page_title
                    level_dirs[level] = out_dir
                    out_html = out_dir / 'main.html'
                    if out_html.exists():
                        print(' HTML file already exists; skipping this page')
                        continue
                    out_dir.mkdir(parents=True, exist_ok=True)
                    response = get(graph_client, page['contentUrl'])
                    if response is not None:
                        content = response.text
                        print(f' Got content of length {len(content)}')
                        content = download_attachments(graph_client, content, out_dir)
                        # Write UTF-8 explicitly so non-ASCII note content does not depend on the OS default.
                        with open(out_html, "w", encoding="utf-8") as f:
                            f.write(content)
        # return flask.render_template('display.html', result=[])
        # return flask.render_template_string(content.text)
        return flask.render_template_string('<html><head><title>Done</title></head><body><p><b>Done</b></p></body></html>')

    if __name__ == "__main__":
        app.run()
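
The script uses notebook, section, and page titles directly as directory names. Titles containing characters such as `/` or `:` may not map cleanly onto a single directory, so a sanitizing step could be applied before building `out_dir`. The following is a minimal sketch, assuming `_` as the replacement character and a 100-character cap. `safe_name` is a hypothetical helper, not part of the script.

```python
import re

# Characters that act as path separators or are rejected by Windows file systems.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(title, replacement='_', max_length=100):
    """Return a version of `title` usable as a single directory name."""
    name = _UNSAFE.sub(replacement, title)[:max_length]
    # Windows also rejects names that end in a space or a dot.
    name = name.strip(' .')
    return name or 'untitled'
```

It could then be used as `output_path / safe_name(nb_name) / safe_name(sec_name) / safe_name(page_title)`. Note that two different titles can collapse to the same sanitized name.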