Python 2.7 strings and text

In Python 2.7, the unicode type stores an abstract sequence of code points, whereas the str type stores a sequence of raw bytes. A Unicode encoding (UTF-8, UTF-7, UTF-16, UTF-32, etc.) maps code points (unicode) to sequences of bytes (str) and back again, and is used to convert between the two.
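As a small illustration of that mapping, the sketch below encodes one short unicode value with three encodings and decodes each result back. The variable names are made up for this example, and the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
# Two code points, written with an escape so the source file encoding does not matter.
text = u'\u2200x'                      # FOR ALL sign followed by the letter x

utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')
utf32_bytes = text.encode('utf-32-le')

# Different encodings give byte sequences of different lengths...
lengths = (len(text), len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))

# ...but each one decodes back to the same code points.
round_trips = (
    utf8_bytes.decode('utf-8') == text,
    utf16_bytes.decode('utf-16-le') == text,
    utf32_bytes.decode('utf-32-le') == text,
)
```

Here the two code points become 4 bytes in UTF-8, 4 bytes in UTF-16 and 8 bytes in UTF-32, which is why the bytes alone are ambiguous until you know which encoding produced them.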

Converting between strings and unicode

str.decode converts raw bytes to unicode using a supplied encoding. It is called decode because the raw bytes are seen as an encoded version of the pure unicode code points. Conversely, unicode.encode converts unicode code points to a str using a supplied encoding. It is called encode because the unicode code points are seen as being encoded into a stream of raw bytes.

e.g. encoding unicode to a string

# coding: utf-8
u = u'∀x ∃y P(x,y)'

# Encode the code points to UTF-8 bytes, then write the bytes out.
s = u.encode('utf-8')
with open('written.txt', 'w') as f:
    f.write(s)

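The # coding: utf-8 comment, which must appear on the first or second line of the source file, tells the Python interpreter which encoding was used to store the file, and therefore how to read the non-ASCII characters in the unicode literal.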

e.g. decoding a string to unicode

# Read the raw bytes, then decode them to unicode before using them.
with open('Unicode.txt', 'r') as f:
    s = f.read()
print s.decode('utf-8')

Type coercion

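When a str and a unicode instance are combined in an expression, Python converts the str to unicode first. It assumes the bytes in the str are ASCII (the default encoding), so the conversion raises UnicodeDecodeError if the str contains any non-ASCII bytes.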

>>> s = 'This is a str'
>>> type(s)
<type 'str'>
>>> u = u'This is unicode'
>>> type(u)
<type 'unicode'>
>>> type( s + u )
<type 'unicode'>
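
The ASCII assumption matters as soon as the str holds non-ASCII bytes. The sketch below imitates the implicit conversion with an explicit decode('ascii'). That is an approximation, chosen so the snippet also runs on Python 3, where mixing the two types is a TypeError instead. In Python 2.7 the implicit step uses sys.getdefaultencoding(), which is 'ascii' unless it has been changed.

```python
raw = b'caf\xc3\xa9'        # UTF-8 bytes for u'caf\u00e9'; a plain str in Python 2.7

try:
    raw.decode('ascii')     # roughly what Python 2.7 attempts for raw + u'!'
    failed = False
except UnicodeDecodeError:
    failed = True           # byte 0xc3 is outside the ASCII range

# Decoding explicitly with the right encoding avoids the problem.
combined = raw.decode('utf-8') + u'!'
```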

Immutability

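str and unicode are immutable: once created, they cannot be changed. An operation that appears to modify a string, such as +=, creates a new object and rebinds the name to it.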

>>> a = u'This is unicode'
>>> b = a
>>> b is a
True
>>> b += u' b'
>>> b is a
False
>>> a
u'This is unicode'
>>> b
u'This is unicode b'
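
A more direct way to see the same thing is to try to assign to a single position. This is a minimal sketch with made-up variable names; it should behave the same on Python 2.7 and Python 3.

```python
text = u'This is unicode'

try:
    text[0] = u't'          # item assignment is not supported on strings
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string means building a new object from pieces of the old one.
lowered = u't' + text[1:]
```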

Interning

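The built-in intern function supports str but not unicode. Interning guarantees that equal strings share a single object, so they can be compared by identity.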

>>> s1 = 'foo!'
>>> s2 = 'foo!'
>>> s1 is s2
False
>>> s1 = intern('foo!')
>>> s2 = intern('foo!')
>>> s1 is s2
True
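
Whether two equal strings happen to share an object without interning is an implementation detail of the interpreter, so only the interned comparison can be relied on. The sketch below shows this in a form that also runs on Python 3, where intern moved to the sys module; intern_str is a name invented for this example. In Python 2.7, passing a unicode value to intern raises TypeError.

```python
import sys

try:
    intern_str = intern         # built-in in Python 2.7
except NameError:
    intern_str = sys.intern     # moved to the sys module in Python 3

# Build two equal strings at run time.
part = 'foo'
s1 = part + '!'
s2 = part + '!'

same_value = (s1 == s2)                          # equality always holds
same_object = intern_str(s1) is intern_str(s2)   # interned copies are one object
```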

Text manipulation

In general, unicode should be used for text manipulation (finding substrings, splitting on word boundaries, etc.) and str should be used for I/O. Decode to unicode as early as possible, perform all text manipulation on the unicode, and then encode as late as possible, preferably just before the I/O operation.
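
The advice above can be sketched as a function that takes bytes in, works on unicode in the middle, and hands bytes back. The function name shout_words, the file name and the sample text are all invented for this example, and UTF-8 is assumed as the encoding at both ends.

```python
import io
import os
import tempfile

def shout_words(data):
    """Take UTF-8 bytes and return UTF-8 bytes with every word upper-cased."""
    text = data.decode('utf-8')                  # decode as early as possible
    words = [w.upper() for w in text.split()]    # all manipulation on unicode
    return u' '.join(words).encode('utf-8')      # encode as late as possible

# Write a throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with io.open(path, 'wb') as f:
    f.write(u'na\u00efve caf\u00e9'.encode('utf-8'))

# Bytes only exist at the I/O boundary.
with io.open(path, 'rb') as f:
    result = shout_words(f.read())
```

Because the upper-casing runs on unicode, the accented letters are treated as letters rather than as individual bytes of a multi-byte sequence.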

String literals

String literals can be enclosed in either single or double quotes. The choice is arbitrary, but a literal delimited by single quotes can contain double quotation marks without escaping them, and vice versa.

e.g.

a = "this has a single quote ' "
b = 'this has a double quote " '
c = "this has an escaped double quote \" "

Special characters are introduced by escaping them with a backslash, within both single and double quoted literals, e.g. \n, \', \" and \\.
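
Each escape sequence stands for a single character in the resulting string, which the following sketch checks with len. The variable names are only for illustration.

```python
newline = '\n'            # one character: line feed
single = 'it\'s'          # escaped single quote inside single quotes
double = "say \"hi\""     # escaped double quotes inside double quotes
backslash = '\\'          # one character: a literal backslash

lengths = [len(newline), len(single), len(double), len(backslash)]
```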

By default, string literals are created as instances of str. To follow the advice that text manipulation should always be done with unicode, you can make a string literal a unicode instance by prefixing it with a u, e.g.

u1 = u'This is a unicode string literal.'

Raw string literal

A raw string does not treat \ as an escape character, so the raw string r'a\nb' has a length of 4, whereas the ordinary string 'a\nb' has a length of 3. You can specify that a string literal is treated as a raw string by prefixing it with an r (or ur for a raw unicode literal), e.g.

>>> s1 = ur'x\nx'
>>> s1
u'x\\nx'
>>> s2 = u'x\nx'
>>> s2
u'x\nx'

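When a raw string is displayed in the console, each \ is shown escaped as \\. This is only how the value is displayed; the string itself still contains a single backslash.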
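
The lengths quoted above, and the difference between a string's contents and its console representation, can be checked with a short sketch. It uses the plain r prefix rather than ur so that it also runs on Python 3, where the ur prefix is not accepted.

```python
raw = r'a\nb'        # backslash and n are kept as two separate characters
cooked = 'a\nb'      # \n is a single line-feed character

raw_len = len(raw)
cooked_len = len(cooked)

raw_repr = repr(raw)                  # what the console shows: backslash doubled
same_as_escaped = (raw == 'a\\nb')    # a raw literal equals its escaped spelling
```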