2012-02-02

Can we measure the complexity of natural language by an entropy-based compression method? (5)


Entropy of a document

When I talked with Joerg, I recalled my time as a bachelor's student. At that time, I could not write a paper directly in English. Therefore, I first wrote the manuscript in Japanese and then translated it into English. The sizes of the two TeX files differed; however, when I compressed them, I noticed that the compressed file sizes were similar. I found this interesting, but I did not think about it further. It was around 1996, so I think I used the ``compress'' program.

At the Gruenkohl Party, I recalled this story again. I also realized that I have translated a few articles into three different languages, for example, Haruki Murakami's Catalunya Prize speech of 2011-6-11. Figure 1 shows the compressed sizes of the same content in different languages and with different encoding schemes.

Figure 1. Compressed sizes of documents with the same content in three languages. The original document size depends on the language and the encoding method, but the compressed sizes become similar.

The raw text size differs depending on the language and the encoding. However, the entropy-based compression tool suggests that the entropy of the information depends on neither the language nor the encoding scheme. We use bzip2 version 1.0.5 as the entropy-based compression tool. You can download each original document at original documents, so you can check the results yourself. If you are interested, you can also translate the text into your own language and bzip2 it. If you get a result for another language, please let me know.
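If you would like to repeat the measurement without the command-line tool, here is a minimal sketch using Python's standard bz2 module. This is not the script I used for the figures; the function names are my own for illustration, and I assume that compression level 9 behaves like the default setting of the bzip2 command, so the numbers should be close to, but not necessarily identical to, the ones from bzip2 1.0.5.

```python
import bz2
from pathlib import Path


def raw_and_compressed_size(data: bytes) -> tuple[int, int]:
    """Return (raw size, bzip2-compressed size) in bytes."""
    # compresslevel=9 is assumed to correspond to the bzip2 command's default.
    return len(data), len(bz2.compress(data, compresslevel=9))


def report(paths: list[str]) -> None:
    """Print the raw and compressed size of each file in paths."""
    for path in paths:
        # Read the file as bytes so that the encoding on disk is preserved.
        raw, packed = raw_and_compressed_size(Path(path).read_bytes())
        print(f"{path}: raw={raw} bytes, bzip2={packed} bytes")
```

Call report() with the list of document files you downloaded or translated, one file per language and encoding, and compare the compressed sizes.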

I chose this document since it is not my work alone. I asked native-speaker friends to help with the translation. I explained the Japanese content to my friends, and they chose the words and the structure. If I had done this alone, my non-native vocabulary and grammatical structure might have biased the translation. These translations therefore have fewer problems than my other translations.

Let's look at this in a bit more detail. Figure 2 shows that the encoding scheme affects the file size even when the content and the language are the same. UTF-8 needs three bytes for one Kanji, but EUC needs two bytes. Therefore, the UTF-8 encoded file is significantly larger. But if we compress the differently encoded Japanese files, the sizes become almost the same. This is expected: the content is exactly the same and only the mapping differs, so the entropy of the files is the same. However, even when the text is translated into English and German, the compressed file sizes are similar. This is an interesting result to me.

Figure 2. Compressed sizes of the Japanese document in two different encoding schemes: EUC and UTF-8. EUC encodes one character in two bytes, but UTF-8 encodes one character in three bytes. Yet, the bzip2 compressed sizes become similar.
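
The byte counts per character can be checked with a few lines of Python. This is only a sketch: the sample sentence is made up for illustration and is not part of the speech, and I assume Python's "euc_jp" codec corresponds to the EUC encoding of my files. Because the sample is one short sentence repeated many times, it compresses far better than real text, so please use the actual documents if you want to compare the compressed sizes seriously.

```python
import bz2

# Hypothetical sample: every character here is Kanji, Kana, or Japanese
# punctuation, so each takes two bytes in EUC-JP and three bytes in UTF-8.
SAMPLE = "日本語の文章を圧縮する実験です。" * 200


def sizes(text: str, encoding: str) -> tuple[int, int]:
    """Return (encoded size, bzip2-compressed size) in bytes."""
    data = text.encode(encoding)
    return len(data), len(bz2.compress(data))


for encoding in ("euc_jp", "utf-8"):
    raw, packed = sizes(SAMPLE, encoding)
    print(f"{encoding}: raw={raw} bytes, bzip2={packed} bytes")
```

The raw sizes differ by a factor of 1.5 for such text, while the compressed sizes depend on how well bzip2 models each byte mapping.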