Sunday, April 08, 2007

Foreigners Typing Chinese

(updated on Thursday 19 April 2007)

A recent study at Hong Kong University using magnetic brain imaging shows that native Chinese speakers reading Chinese characters generate a pattern of brain activity related to spatial processing, believed to arise because each visual Chinese character maps to a syllable of speech. The same brain activity occurs when they read English, whereas native English speakers reading English show a different pattern. The researchers believe this suggests that native Chinese readers apply their Chinese-reading capability to English, associating written English syllables directly with sounds, and are less able to apply the rules that native English readers use to convert sequences of individual letters into sounds.

Many years ago, a teacher of Chinese told me the only way foreigners can learn to write Chinese is the way Chinese children learn at school: by practising writing each character many times until they know it. But as I learnt to read Chinese, I found the simplest written characters, the ones that weren't composed of others, easiest to recognize, and the most complex ones hardest.

Most learning material requires us to learn the commonly used characters first. When I learnt "我是一个人。这个人是我的爱人。", I'd have no trouble remembering the 一 and 人, both atomic characters. Characters made up of only two components, such as 个 (人 above 丨) and 的 (白 left of 勺), were easy if I already knew the meaning of their constituent components, 人、丨、白、and 勺. Other two-component characters, such as 是 (日 over 正), 这 (辶 around the bottom-left of 文), and 我 (手 before and joined to 戈), were more difficult when I hadn't already learnt 日、正、手、and 戈. And a character like 爱 (爪 over 冂 over 十 around the top-left of 又) was very difficult to recognize when reading. Further complicating the issue, the choice of what's an atomic component in Chinese is arbitrary, as Chinese characters can be continually sub-divided in a tree-like structure all the way down to individual strokes, eg, 正 is 一 over 止, 止 is 丄 around the bottom-left of 卜, 文 is 亠 over 乂, 亠 is 丶 over 一.
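This tree-like decomposition can be sketched as a recursive lookup. Here's a minimal illustration in Python (not the author's actual data or tooling), using only the example analyses above; the configuration labels are my own informal names:

```python
# A sketch of the tree-like decomposition described above. The table uses
# this post's own example analyses; the configuration labels are informal.
DECOMP = {
    "正": ("over", "一", "止"),
    "止": ("bottom-left", "丄", "卜"),
    "文": ("over", "亠", "乂"),
    "亠": ("over", "丶", "一"),
}

def strokes(ch):
    """Recursively expand a character into atomic pieces (those with no entry)."""
    if ch not in DECOMP:
        return [ch]
    _, first, second = DECOMP[ch]
    return strokes(first) + strokes(second)

print(strokes("正"))  # ['一', '丄', '卜']
```

The point of the sketch is that "atomic" is just "has no entry in the table": move 止 or 亠 in or out of the table and the set of atoms changes, which is the arbitrariness described above.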

I'm wondering if the converse of the Hong Kong University study applies to native English speakers reading Chinese. When we read Chinese, do we use the same brain activity as when we read English? If so, that may be why I "spell out" the components of Chinese characters I don't know. A native Chinese reader may see the character 爱 as a single shape, but I see 4 distinct "letters" (爪冂十又) in that character. For those of us whose first reading language was alphabetic, perhaps the easiest way to learn to read Chinese is to learn the components first. In primary school we practised writing each letter of the alphabet before we practised writing words, and maybe we should learn all the components of Chinese before writing characters.

In the future, many Westerners will need to learn Chinese characters for various reasons, primarily to remain competitive in information-based industries. Because I believe the Chinese will one day use simplified Chinese characters in computer programs to gain the advantage that comes from writing terser code, I've been motivated to study how foreigners can best learn Chinese characters so they can also benefit from writing such terser code. I needed to analyse the constituents of Chinese characters.

There were many issues to consider when choosing which characters to include for analysis. Simplified characters are used in Mainland China; complex ones in Hong Kong, Taiwan, and Chinatowns. My experience is that someone who needs to learn one type of Chinese characters often needs to learn the other type as well. Additionally, Unicode treats characters from both the Chinese and Japanese languages as one script, the "Unified CJK Characters", or "Unihan" characters. The Unicode Consortium considered that encoding simplified Chinese, complex Chinese, Japanese, and the other East Asian ideographic scripts separately would produce too many duplicate encodings of the same character, and since the characters are all descended from one ancient source, the consortium unified the scripts using some guidelines. These guidelines provided that characters with the same meaning and abstract shape be unified into one, irrespective of source script and ignoring typeface differences. (There were exceptions for the 20,000 most common CJK characters, such as not unifying two characters that are distinct in a single source script.)

I chose the 20,923 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block. This seemed the most natural place to draw the boundary, as foreigners who learn Chinese characters for programming would likely progress to using other CJK characters. I entered decompositions of those characters, aided by various internet listings. For each decomposition, I recorded a "Decomposition Configuration" and up to two constituent components. Only pictorial configurations were used, not semantic ones, because the decompositions are intended for foreigners when they first start to learn CJK characters, before they're familiar with the meanings of characters. Where characters had typeface differences I used the form in the Unicode spec reference listing. When there was more than one possible configuration, I selected one based on how I thought a fellow foreigner would analyse the character. In future, I may add alternative configurations to the data.
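The record shape described (one configuration plus up to two components) might look like this; the field names and "atomic" marker are my own invention, as the post doesn't give the actual data layout:

```python
from collections import namedtuple

# Hypothetical record layout: one pictorial configuration plus up to two
# constituent components per character (field names are my own invention).
Decomposition = namedtuple("Decomposition", "char config first second")

records = [
    Decomposition("的", "left-of", "白", "勺"),  # 白 left of 勺
    Decomposition("个", "above",   "人", "丨"),  # 人 above 丨
    Decomposition("人", "atomic",  None, None),  # no constituents
]

# Characters that decompose no further are candidate atomic components.
atoms = [r.char for r in records if r.config == "atomic"]
print(atoms)  # ['人']
```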

I had to "create" a few thousand "characters" to cater for decomposition components that aren't themselves among my collected characters. (Though I could have found many of them in the CJK extension A and B blocks, I wanted to keep those out of scope.) To represent these extra characters in the data, I sometimes used a multi-character sequence, sometimes a user-defined glyph. I've avoided using them in this blog entry because my user-defined fonts can't be displayed here; instead I rely on verbal descriptions such as "the bottom part of 节".

With this data, I've been able to perform a good deal of analysis programmatically. I've used the Groovy programming language, similar in function to the more popular Perl, Python, PHP, and Ruby "scripting" languages. The two key issues regarding the data were choosing the decomposition configurations and choosing the basic atomic components.
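The post doesn't show the Groovy analysis itself. As one hypothetical example of the kind of question such data can answer, this Python sketch counts how often each component appears as a direct constituent, using only the handful of examples from this post; component frequency is one plausible input to choosing an atomic set:

```python
from collections import Counter

# Toy version of the analysis: count how often each component appears as a
# direct constituent. Data is just the example decompositions in this post.
decompositions = {
    "个": ("人", "丨"), "的": ("白", "勺"), "是": ("日", "正"),
    "这": ("辶", "文"), "我": ("手", "戈"), "正": ("一", "止"),
}

counts = Counter(part for parts in decompositions.values() for part in parts)
print(counts.most_common(3))
```

Over the full 20,935-character data set, the same one-liner would reveal which components recur often enough to deserve their own keystroke.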

My main purpose is to generate an optimal collection of atomic components, and keyboard assignments for them, that are intuitive for foreigners first learning to type CJK characters pictorially. Asian IMEs generally target one particular script (simplified Chinese, traditional Chinese, Japanese, or Korean), not all CJK characters, and are intended for those who already intimately know the written language of that script, eg, the popular Wubizixing is for native readers of simplified Chinese, not for foreigners.

Wubizixing enables characters to be entered using at most four keystrokes by representing each character by its first three and last components only. Westerners are used to typing full words, and wouldn't want to type my previous sentence as "wubg enas chas to be entd usig at most four keys by repg chas by the firt thre and last coms only". Wubizixing also makes commonly-used groups of components available as one keystroke, such as 鱼 (instead of 勹, 田, and 一), 早 (instead of 日 and 十), and 四 (instead of 囗 and 儿). As a result, Wubizixing uses almost 200 components. The popular Taiwanese Cangjie input method is similar. But Westerners aren't used to typing common sequences of letters, such as 'tion', 'str', and 'th', with one keystroke. For foreigners, the number of keystrokes for a character should be related to its complexity, not its frequency of use. There's another input method, Wubihua, used on mobile phones in mainland China, which uses only 5 components, the basic strokes 一丨丿丶乙. This is too low-level for a foreigner, akin to typing a keystroke for each stroke of a capital Latin letter. I'm looking for a set of components at a level between these 5 strokes and Wubizixing's 200 components.
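Wubizixing's four-key cap, as described above, can be sketched like this (the component list in the example is illustrative; real Wubi codes also depend on key-zone rules not shown here):

```python
# Wubizixing caps every character at four keystrokes by keeping only the
# first three components and the last one (a sketch of that rule only;
# real Wubi codes involve further key-zone conventions).
def wubi_keys(components):
    if len(components) <= 4:
        return components
    return components[:3] + components[-1:]

print(wubi_keys(["爪", "冂", "十", "又"]))        # 4 components: unchanged
print(wubi_keys(["日", "十", "一", "止", "龰"]))  # 5 components: first 3 + last
```

This is exactly the truncation a Westerner would find unnatural: the dropped middle components are unrecoverable from the keystrokes, like the abbreviated sentence above.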

Although I've only analysed 20,000 of the 70,000 CJK characters presently in Unicode, my set of atomic components must be useful for entering those other CJK characters also. There are another 10,000 Korean-only characters, which are built from 24 Korean components. If I match the keyboard assignments of my CJK components to the standard Korean keyboard as much as possible, I could cover 80,000 of the 100,000 Unicode characters with one seamless pictorial-based input method.
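For comparison, Unicode's precomposed Korean syllables are already generated arithmetically from their component letters (jamo), which is what makes a component-level keyboard mapping natural there. A Python sketch of the standard composition formula from the Unicode spec:

```python
# Unicode Hangul syllables are composed arithmetically from jamo indices:
# 19 initial consonants x 21 vowels x 28 finals (0 = no final) = 11,172
# syllables, starting at U+AC00.
def hangul(initial, medial, final=0):
    return chr(0xAC00 + (initial * 21 + medial) * 28 + final)

print(hangul(0, 0))  # 가 : first initial (ㄱ) + first vowel (ㅏ)
```

No such formula exists for CJK ideographs, of course; the decomposition data collected above is an attempt to supply by hand what Korean gets by arithmetic.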

I don't want to rush through this final stage of the task, as it's important I get this part right.