
Can we measure the complexity of natural language by an entropy-based compression method? (5)


Entropy of a document

When I talked with Joerg, I recalled my time as a bachelor student. At that time, I could not write a paper in English directly. Therefore, I first wrote a manuscript in Japanese and then translated it into English. The sizes of the two TeX files differed; however, when I compressed these files, I realized the compressed file sizes were similar. I found it interesting, but I did not think further about it. It was around 1996, so I think I used the ``compress'' program.

At the Gruenkohl Party, I recalled this story again. I also realized that I had translated a few articles into three different languages. For example, Haruki Murakami's Catalunya Prize speech of 2011-6-11. Figure 1 shows the compression results for the same content in different languages and with different encoding schemes.

Figure 1. The compressed sizes of documents with the same content in three languages. Although the original document size depends on the language and the encoding method, the compressed sizes become similar.

The raw text size differs depending on the language and the encoding. However, an entropy-based compression tool shows that the entropy of the information depends neither on the language nor on the encoding scheme. We use bzip2 version 1.0.5 as the entropy-based compression tool. You can download each original document at original documents, so you can also check the results. If you are interested, you can also translate the text into your own language and bzip2 it. If you get a result for another language, please let me know.
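If you prefer a script to the command line, here is a minimal sketch of the comparison in Python. This is not the original procedure (I simply ran the bzip2 command on the files); Python's bz2 module uses the same compression algorithm, and the file names below are hypothetical placeholders for the downloaded documents.

```python
# Minimal sketch: compare raw and bzip2-compressed sizes of the same
# content in several languages. File names are hypothetical placeholders.
import bz2

files = ["speech_ja_utf8.txt", "speech_en.txt", "speech_de.txt"]

for path in files:
    with open(path, "rb") as f:
        raw = f.read()
    compressed = bz2.compress(raw, compresslevel=9)  # same algorithm as the bzip2 command
    print(f"{path}: raw {len(raw)} bytes, compressed {len(compressed)} bytes")
```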

I chose this document since the translations are not my work alone. I asked native-speaker friends to help with the translation. I explained the Japanese contents to my friends, and they chose the words and the sentence structure. If I did this alone, my non-native vocabulary or grammatical structure might bias the translation. These translations therefore have fewer problems than my own translations.

Let's look at the details a bit more. Figure 2 shows how two encoding schemes affect the file size even though the contents and the language are the same. UTF-8 needs three bytes for one Kanji, but EUC needs two bytes, so the UTF-8 encoded file is significantly larger. But if we compress the differently encoded Japanese files, the sizes become almost the same. This is expected: the contents are exactly the same and only the byte mapping differs, so the entropy of the files is the same. However, when the text is translated into English and German, the compressed file sizes also become similar. This is an interesting result to me.
Figure 2. The compressed sizes of the Japanese document in two different encoding schemes: EUC and UTF-8. EUC encodes one character in two bytes, but UTF-8 encodes one character in three bytes. Yet the bzip2-compressed sizes become similar.
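The encoding comparison can also be reproduced with a short sketch: re-encode the same Japanese text in both schemes and compress each. Again, the file name is a hypothetical placeholder, and Python's bz2 module stands in for the bzip2 command.

```python
# Minimal sketch: the same Japanese text in EUC-JP and UTF-8 has different
# raw sizes but nearly the same bzip2-compressed size.
import bz2

with open("speech_ja_utf8.txt", encoding="utf-8") as f:  # hypothetical file name
    text = f.read()

utf8_bytes = text.encode("utf-8")   # typically three bytes per Kanji
euc_bytes = text.encode("euc_jp")   # typically two bytes per Kanji

for name, data in [("UTF-8", utf8_bytes), ("EUC-JP", euc_bytes)]:
    print(f"{name}: raw {len(data)} bytes, compressed {len(bz2.compress(data))} bytes")
```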
