Skip to main content

Posts

Showing posts with the label linear algebra

Authors in a Markov matrix Part 2 (11): Appendix

Appendix A: Unicode and Python 2.7.x This time I develop python programs. I use python 2.7.3. Handling Unicode was needed to process web pages, not only for Japanese and German web pages, but also for English pages. Because some of the English authors have accent characters.  In the early development stage, I was bothered UnicodeDecodeError and UnicodeEncodeError exceptions. Here I will explain what they are, why they raised, and how to handle them. How the Unicode encodes characters? As far as I understand, Unicode uses two maps to encode characters. This depends on how you understand this coding system. I hadn't known this until I worked on this research. My understanding was that there are many kind of Unicode, like UTF-8, UTF-16, UTF-32. But this was my misunderstanding. UTF-8 is how to encode the Unicode data and Unicode is an encoding system how to encode characters. UTF-8 is one of the mapping methods, or transformation formats and UTF-8 is not Unicode (Universal ch...

Authors in a Markov matrix Part 2 (10) Experimental results: Which author do people find most inspiring?

Conclusion To find out that which author do people find most inspiring, we used the link structure of Wikipedia. First we extracted the link structure of Wikipedia and create the adjacency matrix, then we apply an eigenanalysis method, which is also called PageRank, to answer the first question. We showed the results of German, English, and Japanese authors.  We also compared the same category (authors), but between the different data source, i.e., different language Wikipedia. We can see the interesting similarity and also difference.  Personally, one of the authors was surprised me that Winston Churchill and Issac Newton have a high ranking score. He didn't know Winston Churchill is the Nobel Prize winner of the literature. Computational literature Recently, I use a mathematical approach or an information scientific approach to understand literature and languages. This approach has a huge limitation, but on the other hand, it gives me some measureable values. Brené...

Authors in a Markov matrix Part 2 (9) Experimental results: Which author do people find most inspiring?

This time is a follow up discussion of the result. No link found problem We have an impression there are some amount of Japanese author links that have no reference page in German Wikipedia. We didn't check the exact numbers, but while we debugged the program, we looked into several pages. A typical no link reference case is, for instance, a page mentioned about 良寛 (Ryōkan) has a link to Ryokan, or Sōseki link to Seseki, and so on. These special characters are often omitted, this causes no link reference found. Cross reference between Wikipedia It was relatively easy to make a cross reference list between English and German Wikipedia results since these Wikipedias share how to write the author names, i.e., using the Latin character set. However, Japanese Wikipedias uses Japanese characters for the author's name. For example, Lowis Carroll is ルイス・キャロル in Japanese Wikipedia. In Japanese Wikipedia has the information also in Latin characters, but, the Wiki page ke...

Authors in a Markov matrix Part 2 (8) Experimental results: Which author do people find most inspiring?

Wikipedia's Category problem The category problem here is: we expect a specific category has some expected authors on the list, but the actual Wikipedia's category doesn't have the authors we expected. This causes some data missing. There are three interesting cases we found in the following subsections. We didn't do any additional process for this problem. For example, ``Shakespeare does not exist as an English writer in the Japanese Wikipedia.'' Since we did nothing for this, there is no Shakespeare in the English author rank table in Japanese Wikipedia in our result. We tried to obtain the data as automatic as possible since this is just our Sunday hobby research project. We didn't spend much time for the fine tuning of these problem. But these are not intuitive (e.g., Shakespeare is not an English author in Japanese Wikipedia.), so how to automatically fill this gap between Wikipedia sense and our intuition is the future work. No Shakespeare in t...

Authors in a Markov matrix Part 2 (7) Experimental results: Which author do people find most inspiring?

We have seen the eigenanalysis results of the authors in Wikipedia. I found it is interesting just going through the result tables, and thinking why this person is there. For instance, I was surprised Winston Churchill is in the high rank in the English literature. But a few of my friends pointed me out that he is the only the Nobel prize winner of literature and the prime minister of Britain. Following some articles, I would like to discuss about the results. Discussion Matrix rank Table 3 shows that the matrix is not full rank even we removed sink rank pages and out going link only pages. This means there are some groups. These group inside there are connection by links, but between these groups have no links. It is interesting to analyze these groups, but, this will be a future work. Japanese Wikipedia template bias Our first PageRank result of Japanese Wikipedia surprised us. Because, Sōseki Natume, Ryūnosuke Akutagawa, Yukio Mishima, Ōgai Mori are all under 100 rank. Ge...

Authors in a Markov matrix Part 2 (6): The result of German authors

The result of Japanese authors Table 10: Japanese author rank result in de wikipedia. Table 11: Japanese author rank result in en wikipedia. Table 12: Japanese author rank result in ja wikipedia. Next we will discuss about the results.

Authors in a Markov matrix Part 2 (5): The result of German authors

The result of English authors Table 7: English author rank result in de wikipedia. Table 8:English author rank result in en wikipedia.   Table 9: English author rank result in ja wikipedia. (Our page rank implementation can only find 29 valid authors.)

Authors in a Markov matrix Part 2 (4): The result of German authors

Three articles from this, we will show the PageRank(Eigen analysis) result. The result of German authors Table 4: German author rank result in de wikipedia. ``en'' is en wikipedia's rank result. Table 5: German author rank result in en wikipedia. Table 6: German author rank result in ja wikipedia. (Our page rank implementation can only find 31 valid authors.)

Authors in a Markov matrix Part 2 (3) Experimental results: Which author do people find most inspiring?

In this article, I will explain about our implementation and how the adjacency matrix looks like. Implementation We implemented the following programs: Link_Vector_Extractor: Generate the author vector from the data. Graph_Extractor: Generate the adjacency matrix from the data and the author vector. Page_Rank: Compute page rank. Remapper: Re-map the author vector according to the page rank result. Our experiment is done with the computer CPU: Intel(R) Core(TM) DUO CPU P8400, 2 Cores, OS: 64bit Linux 3.2.0.32, Kubuntu 12.04. We used the following software: Python 2.7.3, Beautiful Soup 4.0.2, matlab R2006a, octave 3.2.4. Adjacency matrix Figures 2, 3, 4 show the adjacency matrices. In these Figures, blue points represent the connection. Figure 2: Adjacency matrices. Top to bottom: German authors in de. wikipedia.org, en.wikipedia.org, ja.wikipedia.org. Figure 3: Adjacency matrices. Top to bottom: English authors in de. wikipedia.org, en.wikipedia.org, ja.wikipedia...

Authors in a Markov matrix Part 2 (2) Experimental results: Which author do people find most inspiring?

Experiment The Wikipedia data we have used is shown in Table 1. Fortunately, every Wikipedia has a list of the authors for each language, we use the list as the root page. We downloaded the author pages identified by the root page. We cared the server load for the download, we downloaded data by 15 sec/page to not overload the server. The page ``石原慎太郎'' in the Japanese Wikipedia was the only compressed page, we expanded the page when we run our analysis tool. We had a choice of the root pages, for instance, we used Liste_britischer_Schriftsteller for English author list in the German Wikipedia instead of Liste_englischsprachiger_Schriftsteller. Which root page we chose is in Table 1. There is no reason these list should be chosen, they are just our choice in this experiment. All the files were downloaded 2012-5-30 for this experiment. Table 1: Experimental data set

Authors in a Markov matrix Part 2 (1) Experimental results: Which author do people find most inspiring?

This is the part 2 of the article, experimental results. Until the last article, I talked about the question, ``Which author do people find most inspiring?'' From now on, I would like to talk about an answer. Analyzing relationships between authors Author graph generation method We apply eigenanalysis on Japanese, English, and German authors to find  out which author do people find most inspiring in the literature in a sense of author network topology. First we need to generate an author graph that represents the relationships between authors. Of course we could generate such graph by hand, i.e., researching a lot of documents about authors. However, the number of famous Japanese authors maybe more than 1000. This is just our Sunday fun hobby project, we don't have enough time to do that. Fortunately, nowadays we can use cloud knowledge. The natural solution seems to be using the information of Wikipedia. We can generate an adjacency matrix from the Wikipedia's ...

Authors in a Markov matrix: Which author do people find most inspiring? (24)

Google PageRank PageRank [2] is a method to compute a ranking of web pages. Sergey Brin and Lawrence Page modeled the adjacency relationship of web pages via link and analyzed the importance of web pages using eigenanalysis. They proposed to use eigenvector as a criterion of the web search. Soon, they established a company called ``Google'' that used this method for their search engine. In their paper, the basic idea had already been proposed by Jon Kleinberg, who considered the web environment can be analyzed by eigenanalysis. Eigenanalysis itself has been studied from 18 century and it is fundamental in linear algebra. Then, the PageRank is not new and these two are just lucky? I don't think so. Maybe many people thought eigenanalysis to use web search at that time, but, these two really implemented this simple and solid idea. Also the size of web is huge and it is not trivial to compute the eigenvector of the web matrix. I think there is a huge difference between just...

Authors in a Markov matrix: Which author do people find most inspiring? (23)

Again, at which station am I? In the last section, I explained eigenanalysis using inhabitant change of two cities. Now I would like to extend this method involving more objects instead of two. Let's back to the Berlin S-Bahn station graph of Figure 7. Graph example 2. Each node is a train station. Same as moving around cities, people can be moving around the stations. Let's assume a person choose the next station equal possibility. By this assumption, how the people moving around is defined by the connection of the stations. Namely, the people's staying possibility at each station depends on the train station topology. We can generate the matrix that represents how the people moving around from the station adjacency matrix. If a station connected two stations, the possibility of each connection is used is \(\frac{1}{2}\) each since we assume a person choose the next station equal possible. If a station connected two stations, the possibility of each connection is...

Authors in a Markov matrix: Which author do people find most inspiring? (22)

How to compute the eigenvectrors? Here I don't explain the details of algorithm to compute the eigenvectors and eigenvalues. Since there are some free software available to compute them. I use octave's function eig()  in this article. Let's compute the eigenvectors and eigenvalues of the matrix \(M\). octave:34> [L V] = eig(M) L =    0.83205  -0.70711    0.55470   0.70711 V = Diagonal Matrix    1.00000         0          0   0.50000 By the way, I told you eigenvalue seems equivalent to the matrix. Yes, this is correct, however, one number can not exactly represent a matrix. If it is possible, why we need a matrix at all? A usual matrix has multiple eigenvalues and eigenvectors. But the number of elements of a matrix is \(n^2\) and its eigenvalues is \(n\). So, eigenvalues are still simpler. The eigenvalues of matrix \(M\) are 1 and 0.5, corresponding eigenvectors are (0.83205, 0...

Authors in a Markov matrix: Which author do people find most inspiring? (21)

Here we don't want to consider what is the result of applying a matrix to a vector, but we would like to consider what the property of a matrix. How all the vectors behave by a matrix, not only one specific vector. Because the combination of inhabitants of Berlin and Potsdam. Even the total inhabitants is fixed, i.e., 1,000,000 then, we can make 1,000,001 different vectors. We can also change the total inhabitants to any number. However, you have already seen, this matrix can be described two numbers (= one vector). This is the reason eigenanalysis is an important idea in linear algebra. Because you can only think just two numbers, instead of millions in this example. When Berlin and Potsdam's inhabitants started by any number, enough years later, the inhabitants becomes specific numbers. The vector of inhabitants of Berlin 600, Potsdam 400 is a special vector of this matrix and therefore, it has a name. It is called an eigenvector. Figure 12 shows this vector on Figure 1...

Authors in a Markov matrix: Which author do people find most inspiring? (20)

I sometimes find a mathematical insight in great literature or music. Bach's music has a lot of mathematical pattern, Haiku or some kind of lyrics also have a formal pattern. Some of Souseki Natume also have mathematical insight. These people sublimate these patterns to arts. I feel sometimes happy that I can see these aspect. Let me show you one example. Do you know ``Meijinden (A master's story)'' by Atusi Nakajima? I like his ``Deshi (a student)'' and ``Riryou'', but my best story from him is Meijinden. A master of archery want to be the master of masters. One point he finally met the master of masters. He shot 20 birds down with one arrow. The master said to him, ``I see you can shoot an arrow. But that is just a shoot of shooting. You seem not to know a shoot of non-shooting.'' and the master shot 20 birds down without any arrows. The shooting world is limited if one uses an arrow, what is the substance of shooting? If we want to over the...

Authors in a Markov matrix: Which author do people find most inspiring? (19)

In the last article, we saw the number of inhabitants became one specific value. I raised a question, does this result depend on the initial state? In other words, are these conversing numbers the property of the matrix only or the property of both the matrix and the initial population vector? Hypothesis 3: If the initial state is different, the result may change. For example, we started Berlin 900, Potsdam 100. If we started Berlin 0, Potsdam 1000, the result maybe not the same. We can compute this by octave again. Set the initial state Berlin 0, Potsdam 1000. octave:10> p = [0 1000]'; octave:11> M * p    300    700 octave:12> M^2 * p    450.00    550.00 octave:13> M^3 * p    525.00    475.00 octave:14> M^10 * p    599.41    400.59 octave:15> M^100 * p    600.00    400.00 Surprisingly, the results are the same even if we changed the initial ...

Authors in a Markov matrix: Which author do people find most inspiring? (18)

Let's back to the question, how does the number of inhabitants looks like many years later. Before we perform the calculation, let's make some hypotheses and check them out. Why do we make some hypotheses? Since it is the fun of mathematics. If I predict something based on mathematics, then I can check it out by calculation. If my prediction is correct, that's the fun. It is like my plan succeeded in my chess game. One thing is clear, the total number is 1000 since no one is born or die and these people always stay in one of the cities. Hypothesis 1: One day, all the inhabitants move to Berlin since the ratio of staying Berlin (0.8) is larger than that of the Potsdam (0.7). This hypothesis seems not correct. The amount of outgoing people is 20% of the inhabitants of Berlin, when Berlin inhabitants increased, the total amount of people is larger than the incoming amount. For example, When Berlin's inhabitants is 900, 20% of it is 180, on the other hand, Potsdam...

Authors in a Markov matrix: Which author do people find most inspiring? (17)

Last time, we introduced Markov matrix as an extension of adjacency matrix. Let's assume \(M\) has the following form. \begin{eqnarray} M &=& \left[ \begin{array}{cc} 0.8 & 0.3 \\ 0.2 & 0.7 \\ \end{array} \right] \label{eq:berlin_potsdam_mat} \end{eqnarray} The meaning of each element of this specific example is: \begin{eqnarray*} M &=& \left[ \begin{array}{cc} \mbox{Stay Berlin} & \mbox{P $\rightarrow$ B} \\ \mbox{B $\rightarrow$ P} & \mbox{Stay Potsdam} \\ \end{array} \right] \end{eqnarray*} Where \(\mbox{B $\rightarrow$ P}\) is a ratio of moving people from Berlin to Potsdam and \(\mbox{P $\rightarrow$ B}\) is a ratio of from Potsdam to Berlin. The ratio is against to the current inhabitant.  In Equation of \(M\), 80% of inhabitants Berlin will stay in Berlin after one year. 20% people will move from Berlin to Potsdam. As you see, the total sum of column is 1 (0.8 + 0.2 = 1.0) since we consider a...

Authors in a Markov matrix: Which author do people find most inspiring? (16)

Markov matrix The train station example shows that applying an adjacency matrix to a stations vector can tell us which station can we reach from any starting station. Let's think one step further, we can look into the quantity side of the movement in stead of possible side of the movement. There is a city called Potsdam 40 minutes away from Berlin by a train (S-Bahn). These two cities are close and a lot f movement of people between them. Some move Potsdam to Berlin and others move other way around. The adjacency matrix of the two cities is the following. \begin{eqnarray*} \begin{array}{|c|c|} \hline \mbox{Station} & \mbox{1 step} \\ \hline \mbox{Weinmeisterstr} & 1 \\ \mbox{Alexanderplatz} & 0 \\ \mbox{Hackescher Markt} & 0 \\ \mbox{Jannowitzbruecke} & 0 \\ \hline \end{array} \end{eqnarray*} Let me show you what is the meaning of columns and rows of the matrix. \begin{eqnarray*} \begin{array}{ccc} & \mbox{Berlin} ...