
Can we measure the complexity of natural language by an entropy-based compression method? (6)


Conclusion

When we write an article in different languages, the length of the document differs even though the contents are the same. But if we compress these files with an entropy-based compression algorithm, they become almost the same size, even if the article is written in German, which has a complex grammatical structure, or in Japanese, which uses a completely different character system. From this observation, I have a hypothesis: ``The complexity of natural languages is more or less the same.'' Of course, this article tested only one document and only three languages, so this is by no means a proof of the hypothesis. But still, I find the result interesting.
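The measurement itself is easy to reproduce. Here is a minimal sketch using Python's standard zlib module; the sample sentences are my own hypothetical stand-ins for a real parallel document, so only the method is illustrated, not the result:

```python
import zlib

# Hypothetical parallel texts: the "same" content in three languages.
# A real experiment would use a full translated document.
texts = {
    "English":  "The quick brown fox jumps over the lazy dog. " * 50,
    "German":   "Der schnelle braune Fuchs springt ueber den faulen Hund. " * 50,
    "Japanese": "素早い茶色の狐がのろまな犬を飛び越える。" * 50,
}

for lang, text in texts.items():
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, 9)  # level 9 = best compression
    print(f"{lang:9s} raw={len(raw):6d} bytes  "
          f"compressed={len(compressed):6d} bytes  "
          f"ratio={len(compressed) / len(raw):.3f}")
```

The raw byte counts differ considerably (UTF-8 encodes Japanese characters in three bytes each), which is exactly why the comparison should be made on the compressed sizes.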

We need more experiments, but I already have some ideas for applications if this hypothesis holds.

  1. Comparison of news articles: Assume a news source is in English and it is translated into Japanese. If we compress the two articles and the compressed sizes differ by more than 50%, I would suspect the quality of the translation. Sometimes I have seen translated news whose content is quite different from the original. Another interesting application is comparing Wikipedia pages across languages.
  2. Comparison of article authors: How much does the entropy differ between articles written by the same person? Maybe the entropy is similar. We cannot directly compare the sizes of two different documents, but how about the compression ratio? For example, we could compare the compression ratio of Francis Bacon's documents with that of William Shakespeare's documents. However, a good author might be able to imitate another person's style, so this comparison might be difficult.
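The author-comparison idea in item 2 could be sketched as follows. The helper and the sample quotations are my own illustrative choices (a real experiment would use full texts, e.g. from Project Gutenberg); the point is that the compression ratio normalizes away document length:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundancy."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# Hypothetical author samples, repeated to give the compressor
# something to work with. Real texts would be far longer.
bacon = "Knowledge is power. Reading maketh a full man. " * 40
shakespeare = "To be, or not to be, that is the question. " * 40

print(f"Bacon       ratio: {compression_ratio(bacon):.3f}")
print(f"Shakespeare ratio: {compression_ratio(shakespeare):.3f}")
```

Because the ratio is dimensionless, it can be compared across documents of different lengths, which raw compressed size cannot.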
These are all my hypotheses, but I am pretty sure some people know more about this. For example, computer game developers always care about how to fit all the data into a limited memory device, so the data must be compressed. Big game companies are now international and translate a lot of content; I believe these people have some knowledge about the entropy of that content. If someone knows something, please leave a comment.

My friend Daniel L. also pointed out that phone signal compression may depend on the language. I once read an article saying that NTT Docomo uses an interesting basis for frequency decomposition and lossy compression, but I do not recall whether it is especially suited to Japanese sounds. Such language-dependent algorithms (or language-dependent parameters) may be an interesting topic.
