2013-01-03

Authors in a Markov matrix Part 2 (11): Appendix


Appendix A: Unicode and Python 2.7.x


This time I developed Python programs, using Python 2.7.3. Handling Unicode was needed to process the web pages — not only the Japanese and German pages, but also the English ones, because some of the English authors' names contain accented characters. In the early development stage, I was bothered by UnicodeDecodeError and UnicodeEncodeError exceptions. Here I will explain what they are, why they are raised, and how to handle them.

How does Unicode encode characters?


As far as I understand, Unicode uses two maps to encode characters. I hadn't known this until I worked on this project. My understanding was that there are many kinds of Unicode, like UTF-8, UTF-16, and UTF-32, but this was a misunderstanding. Unicode is the coding system that assigns numbers to characters, and UTF-8 is one way to encode that Unicode data — one of the mapping methods, or transformation formats. UTF-8 itself is not Unicode (the Universal Character Set). This is cumbersome.

  • Unicode: a map from numbers (code points) to characters
  • UTF-X:   a map from Unicode code points to a specific byte representation

Unicode itself defines a map from numbers to character descriptions. This is a bijective map. For example, 0x0061 'a'; LATIN SMALL LETTER A is an entry of the map. In this example, the number 0x0061 is called a ``code point,'' and the description of this number is `` 'a'; LATIN SMALL LETTER A.'' Using a map from the description to a font, we can see the letter `a'. The rendered shape of the description is called a glyph. Since Unicode's map is a bijection, we can also say that `a' maps to the code point 0x0061.
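Both directions of this map can be checked directly in Python. A small sketch, written so it runs on both Python 2.7 and 3.x:

```python
# ord() gives the code point of a character; unichr() (merged into
# chr() in Python 3) gives the character at a code point -- the two
# directions of the Unicode bijection.
import sys
if sys.version_info[0] >= 3:
    unichr = chr  # Python 3 has no separate unichr

print(hex(ord(u'a')))   # -> 0x61, the code point of 'a'
print(unichr(0x0061))   # -> a, the character at code point 0x0061
```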

This mapping between descriptions and code points is Unicode: a character is represented by a number. But these code points are usually not what gets stored. Here ``usually'' means that Unicode-encoded text does not normally contain the raw code points. In most cases, the code points are converted to UTF-X (UCS (Universal Character Set) Transformation Format X). There are several UTFs, for example, UTF-8, and UTF-16 with endian information. The Unicode-encoded text is converted to one of the UTFs, and that binary form is what is saved to disk.

This conversion is so common that I misunderstood there to be many Unicodes, which I thought odd, since ``uni'' means ``one.'' The concept of Unicode as I understood it was that you can use all characters, no matter which language you are writing in — you can even mix characters from any languages. If there were several kinds of Unicode, this coding system would not make any sense. That understanding was wrong: Unicode itself is one map, but when you use this coding system, there are several storage formats, and these formats are the UTF-X. Why is it so complicated? Keeping all the characters in the world needs space. If someone switches from ASCII to Unicode, a file could suddenly become four times larger. This is a dilemma: you want a big character set, but you don't want your files to grow. To solve this dilemma, the second mapping, UTF-X, was introduced.
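The variable-length nature of UTF-8 — the resolution of that dilemma — is easy to observe. A small sketch, runnable on both Python 2.7 and 3.x:

```python
# UTF-8 keeps ASCII characters at one byte each, while non-ASCII
# characters take more; a fixed-width format like UTF-32 always
# spends four bytes per character.
print(len(u'abc'.encode('utf-8')))       # -> 3 (one byte per ASCII char)
print(len(u'\xe4'.encode('utf-8')))      # -> 2 (U+00E4 takes two bytes)
print(len(u'\xe4'.encode('utf-32-be')))  # -> 4 (fixed four bytes per char)
```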

I learned this from [3].

Python 2.7.x's Unicode representation


Python 2.7.x has two built-in data types for representing strings: the unicode type and the str type. Both can hold printable strings; however, str is more suited to ASCII characters, even though it can hold any 8-bit binary data. Each type has an encoding or decoding method to convert between the two [2]. However, this conversion sometimes causes a problem when you print unicode data. Figure 8 shows the relationship between the str type, the unicode type, and the encode/decode methods.

Figure 8: Relationship between Unicode type and 8-bit str type in Python 2.7.x.

The unicode type in Python 2.7.x has a method encode() and the str type has a method decode(); we can convert each type to the other via these methods. However, some encodings cannot be applied to certain byte sequences, since not every byte sequence is valid in every encoding. For example, the `ascii' codec rejects any byte whose 8th bit is set. When we specify the error handling mode `strict' for an encoding, the encoding or decoding method may raise an exception: UnicodeEncodeError from the encode method, and UnicodeDecodeError from the decode method. This is a bit cumbersome, so let me show you some examples.
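For instance, the bytes that UTF-8 produces for a non-ASCII character are exactly the kind of sequence the `ascii' codec rejects. A sketch runnable on both Python 2.7 and 3.x (where the 8-bit type is str/bytes):

```python
# b'\xc3\xa4' is u'\xe4' ('a' with umlaut) encoded in UTF-8; these
# bytes are valid UTF-8 but not valid ASCII, so a strict 'ascii'
# decode raises UnicodeDecodeError.
data = b'W\xc3\xa4chter'
assert data.decode('utf-8') == u'W\xe4chter'  # the right codec round-trips
try:
    data.decode('ascii')  # default error handling is 'strict'
except UnicodeDecodeError:
    print('UnicodeDecodeError raised')
```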

First we define a unicode type string.
uc = u'Wächter'
print type(uc)
-> <type 'unicode'>

Let's encode this to a str type with the `utf-8' encoding.
s = uc.encode('utf-8', 'ignore')
print type(s)
-> <type 'str'>

Python 2.7's print statement accepts the str type by default, but not the unicode type; therefore, unicode data given to the print statement will first be encoded.
print uc
-> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
uc contains a character that is invalid in the ASCII codec; therefore an exception is raised. Please note that the error is an encoding error.

However, we can avoid this by encoding explicitly; the second argument (`ignore') tells encode() to skip any characters that cannot be encoded.
print uc.encode('utf-8', 'ignore')
-> Wächter
If your terminal accepts the utf-8 encoding, you can see the unicode character.

There is a more complicated case, in which encoded str type data is implicitly decoded back, as in the following:
print u'{0}'.format(uc.encode('utf-8', 'ignore'))
-> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Here, the format method receives str type data, but it is the unicode type's format method, which accepts only unicode data. uc.encode() produces str type data, which does not fit the unicode format method, so a decode method is implicitly called (with the default `ascii' codec) to produce unicode data before the format method runs. That implicit decode cannot handle the non-ASCII bytes, so the UnicodeDecodeError exception is raised. I was puzzled by this exception — why is it not an encode error, when it seems only the encode method is called? But there is actually a hidden decode call in this code. To avoid it, we can encode after the format method is called, as follows:
print u'{0}'.format(uc).encode('utf-8', 'ignore')
-> Wächter
This is a subtle issue; however, we cannot ignore it if we want to write code that handles Unicode.
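One defensive pattern is to normalize everything to one type before formatting, so the implicit ascii decode can never be triggered. A small sketch (to_utf8 is my own hypothetical helper, not a standard function; it runs on both Python 2.7 and 3.x):

```python
def to_utf8(x):
    # Return UTF-8 encoded 8-bit data, whether x is already bytes
    # (str in Python 2) or text (unicode in Python 2, str in Python 3).
    if isinstance(x, bytes):
        return x
    return x.encode('utf-8', 'ignore')

print(to_utf8(u'W\xe4chter'))      # text is encoded to UTF-8 bytes
print(to_utf8(b'W\xc3\xa4chter'))  # bytes pass through unchanged
```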


Appendix B: Contribution to Wikipedia


We found some mistakes in Wikipedia's author lists as a side effect of this experiment, and we have contributed updates to the lists in Wikipedia.

We needed to generate an adjacency matrix and perform eigenanalysis on it. This analysis requires the eigenvectors to be independent, but it is almost impossible to get such a well-behaved matrix in our problem setting, because a few problems are hard to avoid: a page that has no link to any other author, a page that no other page links to, and duplicated links from the root page. The PageRank algorithm expects this kind of singularity in the adjacency matrix and gives us a solution to this issue. In our problem setting, we can easily detect the last issue, link duplication from the root page.
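To illustrate the first of those problems, here is a minimal PageRank sketch on a toy graph of my own (not our actual author data), where one page is ``dangling'' — it links to nobody — and the algorithm still produces a sensible ranking by spreading that page's rank uniformly:

```python
def pagerank(links, n, d=0.85, iters=100):
    """links: dict mapping node -> list of outgoing links; nodes are 0..n-1.
    d is the usual damping factor."""
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for i in range(n):
            out = links.get(i, [])
            if not out:
                # dangling page: spread its rank over every page
                for j in range(n):
                    new[j] += d * r[i] / n
            else:
                for j in out:
                    new[j] += d * r[i] / len(out)
        r = new
    return r

# Page 2 has no outgoing links (dangling); pages 0 and 1 both link to it.
ranks = pagerank({0: [1, 2], 1: [2]}, 3)
print(ranks)  # the ranks sum to 1, and page 2 collects the most rank
```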

I am happy to contribute to Wikipedia.


References

[2] Brené Brown, The power of vulnerability,
http://www.ted.com/talks/brene_brown_on_vulnerability.html

[3] Python documentation 2.7.3, Unicode HOWTO, http://docs.python.org/2/howto/unicode.html

