
Can we measure the complexity of natural language by an entropy-based compression method? (6)


Conclusion

When we write an article in different languages, the length of the document differs even when the content is the same. But if we compress these files with an entropy-based compression algorithm, they become almost the same size. This holds even if we write the article in German, which has a complex grammatical structure, or in Japanese, which has a completely different character system. From this observation, I have a hypothesis: "The complexity of natural languages is more or less the same." Of course, this article tested only one document in only three languages, so it cannot be a proof of the hypothesis. Still, I am interested in the result.
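
The measurement behind this observation can be sketched as follows. This is only a minimal sketch, not the exact procedure used in this series: it assumes UTF-8 encoding and uses Python's `zlib` (DEFLATE, which combines LZ77 with Huffman entropy coding) as a stand-in compressor. The function names and the sample sentences are hypothetical placeholders; a real test needs full-length documents, because very short strings hardly compress at all.

```python
import zlib


def compressed_size(text, encoding="utf-8"):
    """Return the size in bytes of `text` after DEFLATE compression."""
    return len(zlib.compress(text.encode(encoding), 9))


def size_report(versions):
    """Map each language label to (raw byte count, compressed byte count)."""
    report = {}
    for label, text in versions.items():
        raw = len(text.encode("utf-8"))
        report[label] = (raw, compressed_size(text))
    return report


if __name__ == "__main__":
    # Placeholder sentences only (assumption): not the article's test document.
    versions = {
        "en": "This is a placeholder sentence, not the article's test document.",
        "de": "Dies ist ein Platzhaltersatz, nicht das Testdokument des Artikels.",
        "ja": "これはプレースホルダーの文であり、記事のテスト文書ではありません。",
    }
    for label, (raw, packed) in size_report(versions).items():
        print(label, raw, packed)
```

Swapping in another compressor (for example `bz2.compress` or `lzma.compress` from the standard library) only requires changing the one call inside `compressed_size`.
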

We need more experiments, but I already have some ideas for applications if this hypothesis holds.

  1. Comparison of news articles: Assume the news source is in English and it is translated into Japanese. If we compress these two articles and the compressed sizes differ by more than 50%, I would doubt the quality of the translation. I have sometimes seen translated news whose content was quite different from the source. Another interesting application is comparing Wikipedia pages across languages.
  2. Comparison of article authors: How much does the entropy differ between articles written by one person? Maybe the entropies are similar. We cannot directly compare the sizes of two different documents, but what about the compression ratio? For example, we could compare the compression ratio of Francis Bacon's documents with that of William Shakespeare's documents. However, good authors might be able to imitate other people's styles, so this comparison might be difficult.
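
Both ideas above reduce to a few lines of code. The following is a hedged sketch, not a tested method: it assumes UTF-8 text and `zlib` as the compressor, and it interprets "differs by more than 50%" as the size gap relative to the smaller compressed size, which is my own reading. All function names are hypothetical.

```python
import zlib


def compressed_size(data):
    """Size in bytes of `data` (a bytes object) after DEFLATE compression."""
    return len(zlib.compress(data, 9))


def relative_size_difference(text_a, text_b):
    """Gap between the two compressed sizes, relative to the smaller one."""
    a = compressed_size(text_a.encode("utf-8"))
    b = compressed_size(text_b.encode("utf-8"))
    # zlib output is never empty, so min(a, b) is always positive.
    return abs(a - b) / min(a, b)


def translation_looks_suspicious(source, translation, threshold=0.5):
    """Idea 1: flag a pair whose compressed sizes differ by more than 50%."""
    return relative_size_difference(source, translation) > threshold


def compression_ratio(text):
    """Idea 2: compressed size divided by raw size, comparable across documents."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a compression ratio for empty text")
    return compressed_size(raw) / len(raw)
```

A flagged pair is only a hint that the two articles may carry different amounts of content; it says nothing about which one is wrong. Likewise, the compression ratio depends on document length and on the compressor, so two authors should be compared on texts of similar size.
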

These are all my hypotheses, but I am pretty sure some people know more about this. For example, computer game developers are always concerned with how to fit all the data into a device with limited memory, so the data must be compressed. Big game companies are now international and have translated a lot of content. I believe these people have some knowledge about the entropy of content. If you know something, please leave a comment.

My friend Daniel L. also pointed out that phone signal compression may depend on the language. I once read an article saying that NTT Docomo uses an interesting basis for frequency decomposition and lossy compression. I do not recall whether it is especially suited to Japanese sounds. Such language-dependent algorithms (or language-dependent parameters) may be an interesting topic.
