
Posts

Showing posts from January, 2013

Authors in a Markov matrix Part 2 (11): Appendix

Appendix A: Unicode and Python 2.7.x For this project I wrote Python programs, using Python 2.7.3. I needed to handle Unicode to process web pages, not only Japanese and German pages but also English ones, because some English authors have accented characters in their names. In the early development stage, I was troubled by UnicodeDecodeError and UnicodeEncodeError exceptions. Here I will explain what they are, why they are raised, and how to handle them. How does Unicode encode characters? As far as I understand, Unicode uses two maps to encode characters, though this depends on how you look at the coding system. I did not know this until I worked on this research. I used to think that there are many kinds of Unicode, such as UTF-8, UTF-16, and UTF-32, but this was a misunderstanding. Unicode is a system that assigns a code to each character, and UTF-8 is a way to encode that Unicode data as bytes. UTF-8 is one of the mapping methods, or transformation formats, and UTF-8 is not Unicode (Universal ch...
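The two maps, and the two exceptions, can be seen in a few lines. This is only a minimal sketch, and it rests on some assumptions: it is written for Python 3 (the post uses Python 2.7.3, where `str` and `unicode` behave differently but the same exceptions occur), and the name "Brontë" is an illustrative example, not taken from the project data.

```python
# Sketch for Python 3 (assumption: the post itself targets Python 2.7.3).
# "Bront\u00eb" is an illustrative name with one accented character.
name = "Bront\u00eb"

# Map 1: character -> Unicode code point (an abstract number).
code_point = ord("\u00eb")  # 0xEB

# Map 2: code point -> bytes, here using the UTF-8 transformation format.
utf8_bytes = name.encode("utf-8")  # the accented letter becomes two bytes

# Decoding UTF-8 bytes with the wrong codec raises UnicodeDecodeError.
try:
    utf8_bytes.decode("ascii")
    decode_error = None
except UnicodeDecodeError as err:
    decode_error = type(err).__name__

# Encoding a non-ASCII character with the ASCII codec raises UnicodeEncodeError.
try:
    name.encode("ascii")
    encode_error = None
except UnicodeEncodeError as err:
    encode_error = type(err).__name__

# Decoding with the codec that produced the bytes round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")
```

The usual way to avoid both errors is to decode bytes once when reading a page, work with text internally, and encode once when writing.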

Authors in a Markov matrix Part 2 (10) Experimental results: Which author do people find most inspiring?

Conclusion To find out which author people find most inspiring, we used the link structure of Wikipedia. First we extracted the link structure and created the adjacency matrix, then we applied an eigenanalysis method, also known as PageRank, to answer the first question. We showed the results for German, English, and Japanese authors. We also compared the same category (authors) across different data sources, i.e., different language editions of Wikipedia. We can see interesting similarities and differences. Personally, one of the authors was surprised that Winston Churchill and Isaac Newton have high ranking scores. He didn't know that Winston Churchill won the Nobel Prize in Literature. Computational literature Recently, I have been using a mathematical, or information-scientific, approach to understand literature and languages. This approach has a huge limitation, but on the other hand, it gives me some measurable values. Brené...
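The pipeline described above (adjacency matrix, then eigenanalysis) can be sketched with a plain power iteration. This is not the authors' program: the function name `pagerank`, the damping factor 0.85, the iteration count, and the three-page toy graph are all assumptions made for illustration.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power iteration on a 0/1 adjacency matrix (row i links to column j).

    The damping factor and iteration count are assumed defaults,
    not values taken from the posts.
    """
    n = len(adjacency)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = [(1.0 - damping) / n] * n
        for i, row in enumerate(adjacency):
            out_degree = sum(row)
            if out_degree == 0:
                # A page without outgoing links spreads its rank evenly.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j, linked in enumerate(row):
                    if linked:
                        new_rank[j] += damping * rank[i] / out_degree
        rank = new_rank
    return rank


# Hypothetical toy graph: three author pages.
links = [
    [0, 1, 0],  # page 0 links to page 1
    [0, 0, 1],  # page 1 links to page 2
    [0, 1, 0],  # page 2 links to page 1
]
scores = pagerank(links)
```

In this toy graph, page 1 is linked from both other pages and ends up with the highest score, while page 0, which nobody links to, keeps only the random-jump share.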

Authors in a Markov matrix Part 2 (9) Experimental results: Which author do people find most inspiring?

This time we follow up with a discussion of the results. No link found problem We have the impression that a fair number of Japanese author links have no target page in German Wikipedia. We didn't check the exact numbers, but we looked into several pages while debugging the program. In a typical case, a page that mentions 良寛 (Ryōkan) links to Ryokan, or Sōseki links to Seseki, and so on. These special characters are often omitted, so the link target is not found. Cross reference between Wikipedias It was relatively easy to make a cross-reference list between the English and German Wikipedia results, since these Wikipedias write author names the same way, i.e., using the Latin character set. However, Japanese Wikipedia uses Japanese characters for author names. For example, Lewis Carroll is ルイス・キャロル in Japanese Wikipedia. Japanese Wikipedia also has the information in Latin characters, but the Wiki page ke...
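One possible workaround for the omitted-macron case is to compare page titles and link targets after stripping diacritics. This is a hedged sketch only and is not something the posts say was done; the helper name `strip_diacritics` and the tiny title index are hypothetical, and the code is written for Python 3.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed (e.g. o-macron becomes o)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical index: stripped title -> real page title.
page_titles = ["Ry\u014dkan"]
index = {strip_diacritics(title): title for title in page_titles}

# A link written without the macron still finds the page.
resolved = index.get(strip_diacritics("Ryokan"))
```

This only helps when the link differs from the title by diacritics alone; a misspelled link would still go unresolved.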

Authors in a Markov matrix Part 2 (8) Experimental results: Which author do people find most inspiring?

Wikipedia's Category problem The category problem is this: we expect a given category to list certain authors, but the actual Wikipedia category does not contain them. This causes some data to be missing. We found three interesting cases, described in the following subsections. We did no additional processing for this problem. For example, "Shakespeare does not exist as an English writer in the Japanese Wikipedia." Since we did nothing about this, our results have no Shakespeare in the English author rank table for Japanese Wikipedia. We tried to obtain the data as automatically as possible, since this is just our Sunday hobby research project, and we didn't spend much time fine-tuning for these problems. But these cases are not intuitive (e.g., Shakespeare is not an English author in Japanese Wikipedia), so how to automatically fill this gap between Wikipedia's categorization and our intuition is future work. No Shakespeare in t...