2017-9-28 Python 编码之禅

情景描述:

平时工作中经常碰到编码、解码、乱码……类似的问题不胜其烦,如街边小广告一般异常讨厌,需要花时间好好整理一番,“一”绝后患。

str(s)与unicode(s)

str(s)和unicode(s)是两个工厂方法,分别返回str字符串对象和unicode字符串对象;
str(s)是s.encode(‘ascii’)的简写;
unicode(s)是s.decode(‘ascii’)的简写;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
str
str(object='')
str(object=b'', encoding='utf-8', errors='strict')

object - object whose informal representation is to be returned
encoding - Defaults of UTF-8. Encoding of the given object
errors - response when decoding fails. There are six types of error response:
strict - default response which raises a UnicodeDecodeError exception on failure
ignore - ignores the unencodable unicode from the result
replace - replaces the unencodable unicode to a question mark ?
xmlcharrefreplace - inserts XML character reference instead of unencodable unicode
backslashreplace - inserts a \uNNNN espace sequence instead of unencodable unicode
namereplace - inserts a \N{...} escape sequence instead of unencodable unicode


>>> s3 = u"你好"
>>> s3
u'\u4f60\u597d'
>>> str(s3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

上面s3是unicode类型的字符串,str(s3)相当于是执行s3.encode(‘ascii’)因为“你好”两个汉字不能用ascii码来表示,所以就报错了,指定正确的编码:s3.encode(‘gbk’)或者s3.encode("utf-8")就不会出现这个问题了。

类似的unicode有同样的错误:
>>> s4 = "你好"
>>> unicode(s4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

unicode(s4)等效于s4.decode(‘ascii’),因此要正确的转换就要正确指定其编码s4.decode(‘gbk’)或者s4.decode("utf-8")。

###

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
In [20]: '中文'
Out[20]: '\xe4\xb8\xad\xe6\x96\x87'

In [21]: u'中文'
Out[21]: u'\u4e2d\u6587'

In [29]: print '中文'
中文

In [30]: print u'中文'
中文

In [34]: print '\u4e2d\u6587'
\u4e2d\u6587

In [35]: print u'\u4e2d\u6587'
中文

In [26]: u'中文'.encode('gb2312')
Out[26]: '\xd6\xd0\xce\xc4'

In [27]: u'中文'.encode('gbk')
Out[27]: '\xd6\xd0\xce\xc4'

In [28]: u'中文'.encode('utf8')
Out[28]: '\xe4\xb8\xad\xe6\x96\x87'

问题:

1
2
3
4
5
6
7
In [41]: '中文'.encode('utf8')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-41-94bb800b6371> in <module>()
----> 1 '中文'.encode('utf8')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

解决办法:

1
2
3
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

支持小徐?賞一賞!