2012-12-26

Authors in a Markov matrix Part 2 (1) Experimental results: Which author do people find most inspiring?


This is Part 2 of the article: the experimental results. In the previous articles, I talked about the question, ``Which author do people find most inspiring?'' From now on, I would like to talk about an answer.

Analyzing relationships between authors

Author graph generation method

We apply eigenanalysis to Japanese, English, and German authors to find out which author people find most inspiring in literature, in the sense of author network topology. First, we need to generate an author graph that represents the relationships between authors. Of course, we could generate such a graph by hand, i.e., by researching a lot of documents about authors. However, the number of famous Japanese authors may be more than 1000. This is just our Sunday hobby project, and we don't have enough time to do that.

Fortunately, nowadays we can use crowd knowledge. The natural solution seems to be to use the information in Wikipedia. We can generate an adjacency matrix from Wikipedia's link structure and then apply eigenanalysis to analyze the relationships between authors.
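
As a small illustration of the adjacency matrix idea, here is a minimal sketch in Python. It assumes the links have already been extracted into a dictionary that maps each author page to the pages it links to. The function name `build_adjacency` and the toy author names are hypothetical, and this is not necessarily how our actual program is organized.

```python
def build_adjacency(links):
    """Build a 0/1 adjacency matrix from a link dictionary.

    links: dict mapping an author page title to the titles it links to
           (assumed to be extracted from Wikipedia beforehand).
    Returns (authors, matrix) where matrix[i][j] == 1 means
    "the page of authors[i] links to the page of authors[j]".
    """
    authors = sorted(links)
    index = {name: i for i, name in enumerate(authors)}
    n = len(authors)
    matrix = [[0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            # Keep only links between author pages, and skip self links.
            if target in index and target != source:
                matrix[index[source]][index[target]] = 1
    return authors, matrix


# Toy data (hypothetical, not taken from Wikipedia).
toy_links = {
    "Author A": ["Author B", "Author C", "Some city"],
    "Author B": ["Author C"],
    "Author C": ["Author A"],
}
authors, adjacency = build_adjacency(toy_links)
for name, row in zip(authors, adjacency):
    print(name, row)
```

Links to pages that are not author pages (here "Some city") are simply ignored in this sketch, so the matrix contains only author-to-author links.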

Assumption of this experiment

We assume the link structure of author pages in Wikipedia represents the relationships between authors.
This is a debatable assumption. We return to the first question from Part 1 of this article: ``What are the relationships between authors?'' We define the relationships between authors as those given by the link structure of Wikipedia. The intuition behind this assumption is that when a Wikipedia writer made a link between authors, the writer thought there was some relationship between them. If this assumption cannot be accepted, the following experiment has no meaning. That is why we sometimes say, ``in the sense of Wikipedia link structure, ...'' in this article. So far, we believe this is a good method for finding the relationships between authors, and we don't have a better idea for tackling this problem. When a better method is found, we can discuss this assumption again.

Based on this assumption, we will construct an adjacency matrix from the link structure of Wikipedia and analyze it by eigenanalysis.
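
To give a rough picture of what eigenanalysis of a link structure can look like, here is a minimal sketch in Python. It assumes one common choice: each page's outgoing links are normalized to sum to one, which gives a column-stochastic Markov matrix, and the dominant eigenvector is approximated by power iteration. The function names and the toy matrix are hypothetical, and the exact normalization used in our experiment may differ from this sketch.

```python
def markov_matrix(adjacency):
    """Turn a 0/1 adjacency matrix into a column-stochastic matrix.

    adjacency[i][j] == 1 means page i links to page j.
    Column i of the result holds the probabilities of moving from page i.
    A page without outgoing links is assumed to jump anywhere uniformly.
    """
    n = len(adjacency)
    m = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(adjacency):
        out_degree = sum(row)
        for j, value in enumerate(row):
            m[j][i] = value / out_degree if out_degree else 1.0 / n
    return m


def dominant_eigenvector(m, iterations=1000, tol=1e-12):
    """Approximate the dominant eigenvector by power iteration."""
    n = len(m)
    v = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(m[r][c] * v[c] for c in range(n)) for r in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]  # keep the entries summing to 1
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


# Toy graph (hypothetical): A -> B, A -> C, B -> C, C -> A.
toy_adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
rank = dominant_eigenvector(markov_matrix(toy_adjacency))
print(rank)
```

In this toy graph, the entries for A and C come out larger than the entry for B, because B is linked only from A, which splits its weight between two pages.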

The advantages and disadvantages of this method are:

Advantages:


  1. Data size: We can use a relatively large amount of digital data
  2. Correctness: Wikipedia pages are public and have undergone some review
  3. Quality: We can expect some meaning in the link structure, since these pages are written by humans

Disadvantages:


  1. Error possibility: There could be errors in the link structure
  2. Wikipedia writer bias: Some Wikipedia writers may introduce bias depending on their preferences
  3. Wikipedia edit guideline bias: Wikipedia's editing guidelines may cause some kind of bias

The most attractive advantage for us is the availability of a large amount of data. If we tried to construct an adjacency matrix of Japanese authors by hand, we would need to read a huge amount of literature and extract the relationships. Even if we were fortunate enough to find a book describing the author relationships, we would still need to convert the data into a form that can be processed digitally.

Disadvantage 1 cannot be avoided with any data source, though Wikipedia may have more errors than an academically reviewed data source. Disadvantage 2 is part of the nature of Wikipedia, and we cannot avoid this kind of bias. However, another part of Wikipedia's nature is that a page is not written by only one person, so we hope this bias is not so severe. Disadvantage 3 needs some explanation. What is the edit guideline bias? The Wikipedia edit guidelines recommend adding some specific links to a page, and this may cause some bias. We will see such an example in the results section. However, how to define bias is a difficult problem. Even if we think there is some bias in the pages, others may not see any. We need to be careful when adjusting for bias, since the adjustment may re-interpret the Wikipedia data. In a sense, it could act as a filter of the observer: we may introduce our own bias into Wikipedia's link structure in the name of removing bias. Still, this is our hobby project, and as long as we write down what kind of operation we performed on the link structure data, we think it is fine. Whenever we alter the link structure, we will mention it.

Here, we can understand all the disadvantages as kinds of link connection error. Such errors are difficult to detect if we can use only one data source. Of course, what the correct connection is lies beyond the scope of this article; we have defined the connections to be those in Wikipedia. But do we have only one data source? No. We have several data sources: Wikipedia provides the same kind of data in its other language editions. For example, Japanese authors are also listed in the English Wikipedia. Of course, Japanese author data is richer in the Japanese Wikipedia than in other language editions. Moreover, many Japanese author pages in the English Wikipedia might be just translated from the Japanese Wikipedia. This suggests that the data are not independent. If an English page is an exact translation of the corresponding Japanese page, then the same link error would be there, and in that case we cannot detect it. However, as far as we can see, these data are not totally independent, but they are not all exact translations either, i.e., we see some dependency. We should take care with the dependency of the data, but we think we can still use these data for validation to some extent.
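
One simple way to look at the agreement between two language editions is to compare their link sets directly. The following is a minimal sketch in Python. It assumes the author names of both editions have already been mapped to common identifiers. The function name `compare_link_sets` and the toy data are hypothetical, and the overlap ratio is just one possible measure, not necessarily the validation we actually perform.

```python
def compare_link_sets(links_a, links_b):
    """Compare the author links of two Wikipedia language editions.

    links_a, links_b: dicts mapping an author identifier to the
    identifiers it links to (assumed already matched across editions).
    Returns (shared, only_a, only_b, overlap), where overlap is the
    fraction of all observed links that appear in both editions.
    """
    edges_a = {(s, t) for s, targets in links_a.items() for t in targets}
    edges_b = {(s, t) for s, targets in links_b.items() for t in targets}
    shared = edges_a & edges_b
    only_a = edges_a - edges_b
    only_b = edges_b - edges_a
    union = edges_a | edges_b
    overlap = len(shared) / len(union) if union else 1.0
    return shared, only_a, only_b, overlap


# Toy data (hypothetical, not taken from Wikipedia).
links_ja = {"A": ["B", "C"], "B": ["C"]}
links_en = {"A": ["B"], "B": ["C"], "C": ["A"]}
shared, only_ja, only_en, overlap = compare_link_sets(links_ja, links_en)
print("shared:", sorted(shared))
print("only in ja:", sorted(only_ja))
print("only in en:", sorted(only_en))
print("overlap:", overlap)
```

A link that appears in only one edition is a candidate for a closer look, though it is not necessarily an error. An overlap close to 1 may also simply mean that one page is a translation of the other, which is the dependency mentioned above.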
