Tagxedo Blog's Blog

Posted Sept. 10, 2010

Language is Hard, Fun

I have made two enhancements to the way CJK (Chinese, Japanese, Korean) languages are handled.

First, for Chinese, I implemented a better word-analysis algorithm, which gives more accurate results. The image above was made, using the new algorithm, from a few articles about the actor 甄子丹 and his new film 精武風雲.陳真, in which he reprises a role played by 李小龍 (Bruce Lee). All single-character words have been stripped (similar to how common words are removed, but done manually). I hope you agree, regardless of whether you understand Chinese, that the word analysis engine did a good job.
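To make the single-character filter concrete, here is a minimal Python sketch. It is not Tagxedo's code: the function name `count_words`, the `min_length` parameter, and the sample word list are all made up for illustration, and the list merely stands in for whatever a Chinese word segmenter would produce.

```python
from collections import Counter

def count_words(words, min_length=2):
    """Count word frequencies, dropping words shorter than min_length characters."""
    return Counter(w for w in words if len(w) >= min_length)

# Hypothetical segmenter output (illustrative only, not real Tagxedo data).
segmented = ["甄子丹", "的", "新", "電影", "甄子丹", "李小龍", "電影", "是"]
counts = count_words(segmented)

for word, freq in counts.most_common():
    print(word, freq)
```

Only the multi-character words survive, so particles such as 的 never reach the word cloud.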

Second, I now treat Korean and Japanese more like Latin-based languages. Korean, for example, has a relatively small alphabet, and its words are space-delimited. Previously, Korean words were analyzed as if they were Chinese. This is akin to analyzing the word "Mississippi" as if each letter were a "word", leading to the conclusion that "iss" is a word with frequency 2, "p" and "i" are words with frequency 2, and "M" is a word with frequency 1. In other words, complete nonsense.
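The difference between the two approaches can be sketched in a few lines of Python. This is a simplified illustration, not Tagxedo's implementation: the per-character counter below is a cruder version of the old behavior (the actual Chinese analysis is not shown in this post and may group characters differently), and both function names are invented for the example.

```python
from collections import Counter

def count_per_character(text):
    """Treat every non-space character as its own 'word' (the mistaken approach)."""
    return Counter(ch for ch in text if not ch.isspace())

def count_space_delimited(text):
    """Treat each whitespace-separated token as one word."""
    return Counter(text.split())

english = "Mississippi"
korean = "한국어 단어 한국어"  # made-up sample: three space-delimited words

print(count_per_character(english))    # fragments the single word into letters
print(count_space_delimited(english))  # keeps "Mississippi" whole
print(count_per_character(korean))     # fragments Korean words into syllables
print(count_space_delimited(korean))   # counts whole Korean words
```

Splitting on spaces keeps "Mississippi" and each Korean word intact, which is what a word cloud needs.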

In the case of Japanese, Hiragana and Katakana will be handled like Latin-based alphabets, but Kanji will be handled like, yup, Chinese characters. I am not sure whether Japanese words are delimited. If any Japanese speaker is interested in improving Tagxedo, please help me out.
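One way to tell the three Japanese scripts apart is by Unicode block. The sketch below is an assumption about how that could be done, not a description of what Tagxedo does. The names `script_of` and `script_runs` are hypothetical, and the ranges cover only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks. Note that the resulting runs are not words: Japanese is normally written without spaces, so real word segmentation would need more than this.

```python
from itertools import groupby

def script_of(ch):
    """Classify a character by Unicode block (basic blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def script_runs(text):
    """Split text into maximal runs of characters that share one script."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]

# Made-up sample mixing Kanji, Katakana, and Hiragana.
runs = script_runs("東京タワーはたかい")
for script, run in runs:
    print(script, run)
```

Each run could then be routed to the matching analysis: Kanji runs to the Chinese-style analyzer, kana runs to the alphabet-style one.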