
@jimjkelly
Created July 22, 2013 15:43
This gist shows how to properly handle encoding issues in Python 2.x
# All data coming across the internet is encoded in some character
# encoding. This could be ASCII, UTF-8, UTF-16, Shift-JIS, etc. To
# properly handle data, you need to know its encoding. Thankfully, on
# the web the de facto standard seems to be moving towards UTF-8.
#
# To deal with data safely, decode this encoded data (referred to in
# the Python world as a byte string) from its encoding to the generic
# unicode data type, which Python can work with safely regardless of
# the original encoding. Let's pretend we have some data foo that we
# have just read in from the internet:
bar = foo.decode('utf-8')
# bar is now safe to work with - no UnicodeDecodeErrors! When working
# with hard-coded text strings, it's always good to write them like
# this so they are unicode and not byte strings:
hello = u'hello' # good!
goodbye = 'goodbye' # bad!
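# A minimal sketch of the decode step and the byte string/unicode
# distinction above. The sample bytes and variable names are made up
# for illustration; the b'' and u'' prefixes keep it runnable on
# Python 2.6+ and Python 3.3+ alike.

```python
# Illustrative input: UTF-8 bytes for "cafe" with an accented e,
# standing in for data read from a socket or file.
raw = b'caf\xc3\xa9'
text = raw.decode('utf-8')    # unicode in Python 2, str in Python 3

# Four characters but five bytes: the accented e takes two bytes in UTF-8.
n_chars = len(text)
n_bytes = len(raw)

# Decoding with the wrong codec fails loudly instead of corrupting data.
try:
    raw.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Passing 'replace' as the error handler swaps undecodable bytes for
# U+FFFD instead of raising. Here the bytes are Latin-1, not UTF-8.
lossy = b'caf\xe9'.decode('utf-8', 'replace')
```

# Whether to fail or to replace on bad input is a judgment call; failing
# is usually the safer default when you control where the data comes from.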
# The other thing you need to know is that when you send data out
# of your program, you need to *encode* it from its unicode
# representation to an encoding. Once again, UTF-8 is always
# a fine choice:
print bar.encode('utf-8')
with open('output.txt', 'w') as fp:
    fp.write(bar.encode('utf-8'))
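# As an alternative to calling encode() by hand, io.open (available
# since Python 2.6) can do the encoding on write. This is only a
# sketch: the sample text and the temporary path are assumptions made
# so the example is self-contained.

```python
import io
import os
import tempfile

bar = u'na\xefve caf\xe9'     # illustrative unicode text

# Hypothetical location; a temp directory keeps the example side-effect free.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# io.open encodes on write, so unicode goes in and UTF-8 bytes land on disk.
with io.open(path, 'w', encoding='utf-8') as fp:
    fp.write(bar)

# Reading back in binary mode shows the raw encoded bytes.
with io.open(path, 'rb') as fp:
    on_disk = fp.read()
```

# The bytes on disk are the same as bar.encode('utf-8'), so either
# approach keeps the encoding step at the edge of the program.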
# And that's basically it - the key is to know that at the edges
# of your program, i.e. as data is brought in or sent out, you should
# be decoding/encoding, and only working with unicode internally.
# It's a bit clunky, but once you get used to it and act in the
# manner above, it's nice because it's all very deliberate.
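# The whole pattern can be sketched as one small function. The name
# shout and its behaviour are hypothetical, chosen only to show the
# decode / work / encode order described above.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode on the way in, encode on the way out."""
    text = raw_bytes.decode(encoding)   # inbound edge: bytes -> unicode
    text = text.upper() + u'!'          # internal work touches unicode only
    return text.encode(encoding)        # outbound edge: unicode -> bytes


result = shout(b'caf\xc3\xa9')
```

# Because the uppercasing happens on unicode, the accented e is handled
# as one character rather than as two unrelated bytes.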