Can we measure the complexity of natural language by an entropy-based compression method? (6)


When we write an article in different languages, the length of the document differs even though the content is the same. But if we compress these files with an entropy-based compression algorithm, they become almost the same size. This holds even if we write it in German, which has a complex grammatical structure, or in Japanese, which has a completely different writing system. From this observation, I have a hypothesis: ``The complexity of natural languages is more or less the same.'' Of course, this article tested only one document in only three different languages, so this is by no means a proof of the hypothesis. Still, I find the result interesting.
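
The comparison described above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: it uses bzip2 from the standard library as the compressor and UTF-8 as the encoding, and the function names `compressed_size` and `size_report` are hypothetical. Both choices affect the numbers, and very short texts are dominated by the compressor's fixed overhead.

```python
import bz2


def compressed_size(text, encoding="utf-8"):
    """Return the size in bytes of `text` after bzip2 compression.

    Assumption: bzip2 and UTF-8 are illustrative choices only.
    """
    return len(bz2.compress(text.encode(encoding)))


def size_report(translations):
    """Map each language label to (raw bytes, compressed bytes).

    `translations` is a dict such as {"en": "...", "de": "...", "ja": "..."}
    holding the same document in several languages.
    """
    report = {}
    for lang, text in translations.items():
        raw = len(text.encode("utf-8"))
        report[lang] = (raw, compressed_size(text))
    return report
```

Calling `size_report` with the English, German, and Japanese versions of one document gives the raw and compressed sizes side by side, which is the comparison the hypothesis rests on.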

We need more experiments, but I now have some ideas for applications if this hypothesis holds.

  1. Comparison of news articles: Assume the news source is in English and it is translated into Japanese. If we compress these two articles and the compressed sizes differ by more than 50\%, I would doubt the quality of the translation. I have sometimes seen translated news whose content was quite different from the original. Another interesting application is comparing Wikipedia pages between languages.
  2. Comparison of article authors: How much does the entropy differ between articles written by one person? Maybe the entropies are similar. We cannot compare the compressed sizes of two different documents directly, but what about the compression ratio? For example, we can compare the compression ratio of Francis Bacon's documents with that of William Shakespeare's documents. However, good authors might be able to imitate other writers, so this comparison might be difficult.
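
Both ideas in the list above can be sketched as follows. This is a rough illustration, not a tested method: bzip2 and UTF-8 are my assumptions, the function names are hypothetical, and I read "differs by more than 50\%" as the gap between the two compressed sizes relative to the larger one, which is only one possible reading.

```python
import bz2


def compressed_size(text):
    """Size in bytes of `text` after bzip2 compression (UTF-8 assumed)."""
    return len(bz2.compress(text.encode("utf-8")))


def relative_difference(text_a, text_b):
    """Gap between the compressed sizes, as a fraction of the larger one.

    Assumption: this is one way to interpret "differs by 50%".
    """
    a = compressed_size(text_a)
    b = compressed_size(text_b)
    return abs(a - b) / max(a, b)


def looks_suspicious(source, translation, threshold=0.5):
    """Flag a translation whose compressed size is far from the source's."""
    return relative_difference(source, translation) > threshold


def compression_ratio(text):
    """Compressed bytes divided by raw bytes, for comparing different documents."""
    raw = len(text.encode("utf-8"))
    if raw == 0:
        raise ValueError("cannot compute a ratio for an empty text")
    return compressed_size(text) / raw
```

For the author comparison, one would compute `compression_ratio` over several texts per author and look at how the ratios cluster. Texts of similar length should be used, because the ratio of a short text is inflated by the compressor's fixed overhead.
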
These are all my hypotheses, but I am pretty sure some people know more about this. For example, game developers are always concerned with how to fit all the data into a device with limited memory, so the data must be compressed. Big game companies have now become international and translate a lot of content. I believe these people have some knowledge about the entropy of content. If anyone knows something, please leave a comment.

My friend Daniel L. also pointed out that phone signal compression may depend on the language. I once read an article saying that NTT Docomo uses an interesting basis for frequency decomposition and lossy compression. I do not recall whether it is especially suited to Japanese sounds or not. Such language-dependent algorithms (or language-dependent parameters) may be an interesting topic.
