Skip to main content

Posts

Showing posts from December, 2012

Authors in a Markov matrix Part 2 (7) Experimental results: Which author do people find most inspiring?

We have seen the eigenanalysis results of the authors in Wikipedia. I found it is interesting just going through the result tables, and thinking why this person is there. For instance, I was surprised Winston Churchill is in the high rank in the English literature. But a few of my friends pointed me out that he is the only the Nobel prize winner of literature and the prime minister of Britain. Following some articles, I would like to discuss about the results. Discussion Matrix rank Table 3 shows that the matrix is not full rank even we removed sink rank pages and out going link only pages. This means there are some groups. These group inside there are connection by links, but between these groups have no links. It is interesting to analyze these groups, but, this will be a future work. Japanese Wikipedia template bias Our first PageRank result of Japanese Wikipedia surprised us. Because, Sōseki Natume, Ryūnosuke Akutagawa, Yukio Mishima, Ōgai Mori are all under 100 rank. Ge

Authors in a Markov matrix Part 2 (6): The result of German authors

The result of Japanese authors Table 10: Japanese author rank result in de wikipedia. Table 11: Japanese author rank result in en wikipedia. Table 12: Japanese author rank result in ja wikipedia. Next we will discuss about the results.

Authors in a Markov matrix Part 2 (5): The result of German authors

The result of English authors Table 7: English author rank result in de wikipedia. Table 8:English author rank result in en wikipedia.   Table 9: English author rank result in ja wikipedia. (Our page rank implementation can only find 29 valid authors.)

Authors in a Markov matrix Part 2 (4): The result of German authors

Three articles from this, we will show the PageRank(Eigen analysis) result. The result of German authors Table 4: German author rank result in de wikipedia. ``en'' is en wikipedia's rank result. Table 5: German author rank result in en wikipedia. Table 6: German author rank result in ja wikipedia. (Our page rank implementation can only find 31 valid authors.)

Authors in a Markov matrix Part 2 (3) Experimental results: Which author do people find most inspiring?

In this article, I will explain about our implementation and how the adjacency matrix looks like. Implementation We implemented the following programs: Link_Vector_Extractor: Generate the author vector from the data. Graph_Extractor: Generate the adjacency matrix from the data and the author vector. Page_Rank: Compute page rank. Remapper: Re-map the author vector according to the page rank result. Our experiment is done with the computer CPU: Intel(R) Core(TM) DUO CPU P8400, 2 Cores, OS: 64bit Linux 3.2.0.32, Kubuntu 12.04. We used the following software: Python 2.7.3, Beautiful Soup 4.0.2, matlab R2006a, octave 3.2.4. Adjacency matrix Figures 2, 3, 4 show the adjacency matrices. In these Figures, blue points represent the connection. Figure 2: Adjacency matrices. Top to bottom: German authors in de. wikipedia.org, en.wikipedia.org, ja.wikipedia.org. Figure 3: Adjacency matrices. Top to bottom: English authors in de. wikipedia.org, en.wikipedia.org, ja.wikipedia

Authors in a Markov matrix Part 2 (2) Experimental results: Which author do people find most inspiring?

Experiment The Wikipedia data we have used is shown in Table 1. Fortunately, every Wikipedia has a list of the authors for each language, we use the list as the root page. We downloaded the author pages identified by the root page. We cared the server load for the download, we downloaded data by 15 sec/page to not overload the server. The page ``石原慎太郎'' in the Japanese Wikipedia was the only compressed page, we expanded the page when we run our analysis tool. We had a choice of the root pages, for instance, we used Liste_britischer_Schriftsteller for English author list in the German Wikipedia instead of Liste_englischsprachiger_Schriftsteller. Which root page we chose is in Table 1. There is no reason these list should be chosen, they are just our choice in this experiment. All the files were downloaded 2012-5-30 for this experiment. Table 1: Experimental data set

Authors in a Markov matrix Part 2 (1) Experimental results: Which author do people find most inspiring?

This is the part 2 of the article, experimental results. Until the last article, I talked about the question, ``Which author do people find most inspiring?'' From now on, I would like to talk about an answer. Analyzing relationships between authors Author graph generation method We apply eigenanalysis on Japanese, English, and German authors to find  out which author do people find most inspiring in the literature in a sense of author network topology. First we need to generate an author graph that represents the relationships between authors. Of course we could generate such graph by hand, i.e., researching a lot of documents about authors. However, the number of famous Japanese authors maybe more than 1000. This is just our Sunday fun hobby project, we don't have enough time to do that. Fortunately, nowadays we can use cloud knowledge. The natural solution seems to be using the information of Wikipedia. We can generate an adjacency matrix from the Wikipedia's

Authors in a Markov matrix: Which author do people find most inspiring? (24)

Google PageRank PageRank [2] is a method to compute a ranking of web pages. Sergey Brin and Lawrence Page modeled the adjacency relationship of web pages via link and analyzed the importance of web pages using eigenanalysis. They proposed to use eigenvector as a criterion of the web search. Soon, they established a company called ``Google'' that used this method for their search engine. In their paper, the basic idea had already been proposed by Jon Kleinberg, who considered the web environment can be analyzed by eigenanalysis. Eigenanalysis itself has been studied from 18 century and it is fundamental in linear algebra. Then, the PageRank is not new and these two are just lucky? I don't think so. Maybe many people thought eigenanalysis to use web search at that time, but, these two really implemented this simple and solid idea. Also the size of web is huge and it is not trivial to compute the eigenvector of the web matrix. I think there is a huge difference between just

Authors in a Markov matrix: Which author do people find most inspiring? (23)

Again, at which station am I? In the last section, I explained eigenanalysis using inhabitant change of two cities. Now I would like to extend this method involving more objects instead of two. Let's back to the Berlin S-Bahn station graph of Figure 7. Graph example 2. Each node is a train station. Same as moving around cities, people can be moving around the stations. Let's assume a person choose the next station equal possibility. By this assumption, how the people moving around is defined by the connection of the stations. Namely, the people's staying possibility at each station depends on the train station topology. We can generate the matrix that represents how the people moving around from the station adjacency matrix. If a station connected two stations, the possibility of each connection is used is \(\frac{1}{2}\) each since we assume a person choose the next station equal possible. If a station connected two stations, the possibility of each connection is

Authors in a Markov matrix: Which author do people find most inspiring? (22)

How to compute the eigenvectrors? Here I don't explain the details of algorithm to compute the eigenvectors and eigenvalues. Since there are some free software available to compute them. I use octave's function eig()  in this article. Let's compute the eigenvectors and eigenvalues of the matrix \(M\). octave:34> [L V] = eig(M) L =    0.83205  -0.70711    0.55470   0.70711 V = Diagonal Matrix    1.00000         0          0   0.50000 By the way, I told you eigenvalue seems equivalent to the matrix. Yes, this is correct, however, one number can not exactly represent a matrix. If it is possible, why we need a matrix at all? A usual matrix has multiple eigenvalues and eigenvectors. But the number of elements of a matrix is \(n^2\) and its eigenvalues is \(n\). So, eigenvalues are still simpler. The eigenvalues of matrix \(M\) are 1 and 0.5, corresponding eigenvectors are (0.83205, 0.55470) and (-0.70711, 0.70711). Here we ignore the eigenvalue 0.5 since eigenvect

Authors in a Markov matrix: Which author do people find most inspiring? (21)

Here we don't want to consider what is the result of applying a matrix to a vector, but we would like to consider what the property of a matrix. How all the vectors behave by a matrix, not only one specific vector. Because the combination of inhabitants of Berlin and Potsdam. Even the total inhabitants is fixed, i.e., 1,000,000 then, we can make 1,000,001 different vectors. We can also change the total inhabitants to any number. However, you have already seen, this matrix can be described two numbers (= one vector). This is the reason eigenanalysis is an important idea in linear algebra. Because you can only think just two numbers, instead of millions in this example. When Berlin and Potsdam's inhabitants started by any number, enough years later, the inhabitants becomes specific numbers. The vector of inhabitants of Berlin 600, Potsdam 400 is a special vector of this matrix and therefore, it has a name. It is called an eigenvector. Figure 12 shows this vector on Figure 1

Authors in a Markov matrix: Which author do people find most inspiring? (20)

I sometimes find a mathematical insight in great literature or music. Bach's music has a lot of mathematical pattern, Haiku or some kind of lyrics also have a formal pattern. Some of Souseki Natume also have mathematical insight. These people sublimate these patterns to arts. I feel sometimes happy that I can see these aspect. Let me show you one example. Do you know ``Meijinden (A master's story)'' by Atusi Nakajima? I like his ``Deshi (a student)'' and ``Riryou'', but my best story from him is Meijinden. A master of archery want to be the master of masters. One point he finally met the master of masters. He shot 20 birds down with one arrow. The master said to him, ``I see you can shoot an arrow. But that is just a shoot of shooting. You seem not to know a shoot of non-shooting.'' and the master shot 20 birds down without any arrows. The shooting world is limited if one uses an arrow, what is the substance of shooting? If we want to over the

Authors in a Markov matrix: Which author do people find most inspiring? (19)

In the last article, we saw the number of inhabitants became one specific value. I raised a question, does this result depend on the initial state? In other words, are these conversing numbers the property of the matrix only or the property of both the matrix and the initial population vector? Hypothesis 3: If the initial state is different, the result may change. For example, we started Berlin 900, Potsdam 100. If we started Berlin 0, Potsdam 1000, the result maybe not the same. We can compute this by octave again. Set the initial state Berlin 0, Potsdam 1000. octave:10> p = [0 1000]'; octave:11> M * p    300    700 octave:12> M^2 * p    450.00    550.00 octave:13> M^3 * p    525.00    475.00 octave:14> M^10 * p    599.41    400.59 octave:15> M^100 * p    600.00    400.00 Surprisingly, the results are the same even if we changed the initial state.  Figure 12 shows many different initial states and how they are changed after many years

Authors in a Markov matrix: Which author do people find most inspiring? (18)

Let's back to the question, how does the number of inhabitants looks like many years later. Before we perform the calculation, let's make some hypotheses and check them out. Why do we make some hypotheses? Since it is the fun of mathematics. If I predict something based on mathematics, then I can check it out by calculation. If my prediction is correct, that's the fun. It is like my plan succeeded in my chess game. One thing is clear, the total number is 1000 since no one is born or die and these people always stay in one of the cities. Hypothesis 1: One day, all the inhabitants move to Berlin since the ratio of staying Berlin (0.8) is larger than that of the Potsdam (0.7). This hypothesis seems not correct. The amount of outgoing people is 20% of the inhabitants of Berlin, when Berlin inhabitants increased, the total amount of people is larger than the incoming amount. For example, When Berlin's inhabitants is 900, 20% of it is 180, on the other hand, Potsdam

Authors in a Markov matrix: Which author do people find most inspiring? (17)

Last time, we introduced Markov matrix as an extension of adjacency matrix. Let's assume \(M\) has the following form. \begin{eqnarray} M &=& \left[ \begin{array}{cc} 0.8 & 0.3 \\ 0.2 & 0.7 \\ \end{array} \right] \label{eq:berlin_potsdam_mat} \end{eqnarray} The meaning of each element of this specific example is: \begin{eqnarray*} M &=& \left[ \begin{array}{cc} \mbox{Stay Berlin} & \mbox{P $\rightarrow$ B} \\ \mbox{B $\rightarrow$ P} & \mbox{Stay Potsdam} \\ \end{array} \right] \end{eqnarray*} Where \(\mbox{B $\rightarrow$ P}\) is a ratio of moving people from Berlin to Potsdam and \(\mbox{P $\rightarrow$ B}\) is a ratio of from Potsdam to Berlin. The ratio is against to the current inhabitant.  In Equation of \(M\), 80% of inhabitants Berlin will stay in Berlin after one year. 20% people will move from Berlin to Potsdam. As you see, the total sum of column is 1 (0.8 + 0.2 = 1.0) since we consider a

Authors in a Markov matrix: Which author do people find most inspiring? (16)

Markov matrix The train station example shows that applying an adjacency matrix to a stations vector can tell us which station can we reach from any starting station. Let's think one step further, we can look into the quantity side of the movement in stead of possible side of the movement. There is a city called Potsdam 40 minutes away from Berlin by a train (S-Bahn). These two cities are close and a lot f movement of people between them. Some move Potsdam to Berlin and others move other way around. The adjacency matrix of the two cities is the following. \begin{eqnarray*} \begin{array}{|c|c|} \hline \mbox{Station} & \mbox{1 step} \\ \hline \mbox{Weinmeisterstr} & 1 \\ \mbox{Alexanderplatz} & 0 \\ \mbox{Hackescher Markt} & 0 \\ \mbox{Jannowitzbruecke} & 0 \\ \hline \end{array} \end{eqnarray*} Let me show you what is the meaning of columns and rows of the matrix. \begin{eqnarray*} \begin{array}{ccc} & \mbox{Berlin} &am

Authors in a Markov matrix: Which author do people find most inspiring? (15)

Eigenanalysis At which station am I? An adjacency matrix represents graph topology (how the nodes are connected). However, a matrix can not only represent the connections, but also can be applied to a vector and can generate a new vector. We saw the adjacency matrix can generate a new station vector. We can continue this computation a bit more. Why do we this? I want to show you a bit interesting stuffs. Let's assume we are first at Weinmeisterstr. We repeat visiting to next station or staying the station. Each computation result shows how many possible paths are there to reach each station. Let's see how this number goes. This is initial state, we are at Weinmeisterstr station. \begin{eqnarray*} \begin{array}{|c|c|} \hline \mbox{Station} & \mbox{1 step} \\ \hline \mbox{Weinmeisterstr} & 1 \\ \mbox{Alexanderplatz} & 0 \\ \mbox{Hackescher Markt} & 0 \\ \mbox{Jannowitzbruecke} & 0 \\ \hline \end{array} \end{eqnarray*} Second

Authors in a Markov matrix: Which author do people find most inspiring? (14)

Let's continue the story of matrix calculation from the last time. It's a bit cumbersome to write the same adjacency matrix every time, let's call this station connection matrix \(M_{bahn}\) since Bahn means train in German. \begin{eqnarray*} M_{bahn} &= & \left[ \begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \end{array} \right] \end{eqnarray*} We can compute neighbor stations by applying \(M_{bahn}\) to the station vector. If we apply the \(M_{bahn}\) twice to the the station vector, you can get the reachable station by two hops. The resulting number means how many possible paths to reach the station. Let's apply the \(M_{bahn}\) twice to the station vector that means one is at the Weinmeisterstr. \begin{eqnarray*} (M_{bahn})^2 \left[ \begin{array}{c} 1 \\ 0 \\ 0 \\ 0 \\ \end{array} \right] &=& \left[ \begi

Authors in a Markov matrix: Which author do people find most inspiring? (13)

Last time I forgot to tell you about the diagonal element of the matrix. In the last article, the diagonal elements of the adjacency matrix are zero. That means you can not stay the station. However, I think each station is connected to itself.  It seems not so reasonable if someone want to stay the station, one always need to move one station and come back. When you want to force such move, then, this adjacency matrix is useful. However, if I want to go to Alexanderplatz from Alexanderplatz, which means I want to stay at Alexanderplatz, I just want to stay the station. If we allow to stay at a station, the adjacency matrix should be the following. \begin{eqnarray*} \begin{array}{ccccc} & \mbox{Wein.} & \mbox{Alex.} & \mbox{Hack.} & \mbox{Jann.} \\ \begin{array}{c} \\ \mbox{Wein.}\\ \mbox{Alex.}\\ \mbox{Hack.}\\ \mbox{Jann.}\\ \end{array} & \left[ \begin{array}{c} 1 \\ 1 \\ 0 \\ 0 \\ \end{array} \right. &

A pattern girl

D is a bit noisy. She sometimes made a story and talked out laud in the class. However, I respect a person who can make a new story. So, I don't want to discourage her to do that. On the other hand, she disturbs other children. So, I try to ask her to concentrate the problem. Last time, D practiced plus calculation with Zahlenhaus. In the beginning, Zahlenhaus has plus and minus examples. Figure 1 shows a Zahlenhaus example. Zahlenhaus: 7 = 4 + 3 A practice example is the following: \begin{eqnarray*}  4 + 3 &=&  \\  3 + 4 &=&  \\  7 - 3 &=&  \\  7 - 4 &=&  \\ \end{eqnarray*} Another typical practice example is the following: \begin{eqnarray*}  1 + 6 &=&  \\  6 + 1 &=&  \\  7 - 1 &=&  \\  7 - 6 &=&  \\ \end{eqnarray*} This practice book tries to explain that the plus calculation is commutative, i.e., we can add numbers in different order, i.e., \(1 + 6 = 6 + 1\). For the minus calculation, we

Authors in a Markov matrix: Which author do people find most inspiring? (12)

Adjacency matrix and topology of graph Last time, I explained about what is an adjacency matrix. This time I would like to show the relationships between an adjacency matrix and the corresponding graph. An adjacency matrix shows how the nodes are connected. When we care only the existence of connection, we call such mathematics topology . Sometimes only the connection is important even in every day life. For example, we usually only care the city train connection, the real distance on the map is less important. Two stations might be less than 300m away, but if there is no train connection between these two stations, sometimes you think these trains are totally separated stations. The map of   Liniennetz found in the web site of Berlin city transportation is such example. You can see the ring-bahn as a nearly perfect ellipse on the map, but, if you check the real map, it is not the same shape. The train map is focusing on the connection between stations. This is a kind of topolo