Friday, June 29, 2007

Pictorial Analysis of CJK Characters

(republished portion)

One day programmers will use all the Unicode tokens in their programs, not just the ASCII ones. CJK characters make up 70% of Unicode tokens, and a programmer who doesn't know a character's sound in some Asian language, or its meaning, must enter it by its pictorial representation. I've been analysing the pictorial structure of the 20,000 or so CJK characters in the Unicode CJK Unified Ideographs block with a view to making them easy for Westerners to type.

Basic Constituents

The Chinese often categorize their characters and components based on the first stroke, depending on whether it's horizontal (一), vertical (丨), left-leaning (丿), right-leaning (丶), or bending (eg, 乙). But I saw many more basic strokes than that.

I saw the non-bending basic strokes of equal length as being on a circle:
  1. slightly upwards from horizontal stroke (the bottom upwards stroke in 扌)
  2. horizontal stroke (the common 一)
  3. slightly downwards from horizontal stroke (the bottom right stroke of 之)
  4. perfect right-leaning diagonal stroke (the right side of 八)
  5. vertical stroke (the common 丨)
  6. slightly left-leaning from vertical stroke (丿, the left side of 厂)
  7. perfect left-leaning diagonal stroke (the left side of 八)
  8. almost level left-leaning stroke (top of 禾)
The only difference between the first and last ones is the direction of the stroke, and a foreigner just starting to learn characters would consider them the same stroke. A stroke at one point on the circle often transforms into a stroke next to it (eg, the horizontal stroke of 子 changes from type 2 into type 1 in 孙).
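
The circle of eight strokes can be sketched in Python as follows. This is only an illustration, not the program used for the analysis: the names `Stroke`, `neighbours` and `can_transform` are made up here, and treating "next to it on the circle" as the test for a likely transformation is an assumption drawn from the paragraph above.

```python
from enum import IntEnum


class Stroke(IntEnum):
    """The eight non-bending strokes, numbered as in the list above."""
    SLIGHTLY_UP = 1        # bottom upwards stroke in 扌
    HORIZONTAL = 2         # the common 一
    SLIGHTLY_DOWN = 3      # bottom right stroke of 之
    RIGHT_DIAGONAL = 4     # right side of 八
    VERTICAL = 5           # the common 丨
    SLIGHTLY_LEFT = 6      # 丿, the left side of 厂
    LEFT_DIAGONAL = 7      # left side of 八
    ALMOST_LEVEL_LEFT = 8  # top of 禾


def neighbours(stroke):
    """Return the strokes just before and just after `stroke` on the circle."""
    n = len(Stroke)
    before = Stroke((stroke - 2) % n + 1)  # 1 wraps back to 8
    after = Stroke(stroke % n + 1)         # 8 wraps forward to 1
    return before, after


def can_transform(a, b):
    """Assumed rule: a stroke may turn into either of its neighbours."""
    return b in neighbours(a)


# The horizontal stroke of 子 (type 2) becomes type 1 in 孙.
print(can_transform(Stroke.HORIZONTAL, Stroke.SLIGHTLY_UP))
```

Because the strokes sit on a circle, types 1 and 8 count as neighbours too, which matches the remark that they differ only in stroke direction.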

The perfect right-leaning and left-leaning strokes can shorten into a short right-leaning dot (丶) and a short left-leaning dot (top of 白) respectively. These two short dots can often transform into each other. There's also a longer right-leaning dot (eg, the right-leaning stroke of the top half of 爻) that's only a slightly shortened form of the perfect right-leaning stroke.

Some of these 8 strokes also have variants with hooks:
  • the bottom of 冫 - a variant of 1 above
  • the stem of 戈 - a variant of 4
  • the bottom of 丁 - a variant of 5
  • the stem of 乄 - a variant of 7
The other basic strokes in CJK characters are distinguished by how many times they bend, and the bending direction.

Strokes that bend once downwards:
  • the top-right surrounding part of 司 (the stem of 犭 is a variant)
  • the bottom-right part of 片
  • 乛 - a variant of each of the above two
  • the rightmost part of 又
  • the bottom of 辶
  • 乀 - a rare character
  • 乁 - a rare character
Strokes that bend once rightwards:
  • the bottom-left of 亾
  • 乚 (including the bottom of 心 in some fonts where it slopes before hooking)
  • right part of inside of 四
  • bottom-right of 鼠
  • bottom of 饣 - a variant of each of the above ones
  • main part of 厶
  • leftmost part of 女
  • central horizontal stroke of 牙
Strokes that bend twice, first downwards then rightwards:
  • stroke from top-left to bottom-right of 卍
  • rightmost stroke of top half of 殳
  • rightmost stroke of 九
  • bottom of 气
  • bottom of 讠 - a variant of each of the above ones
Strokes that bend twice, first rightwards then downwards:
  • bottom half of 丂
  • stroke from top-left to bottom-right of 卐
  • central stem of 专
Strokes that bend three times, first downwards, second rightwards, and then downwards:
  • right side of 乃
  • central stem of 及
  • rightmost stroke of 郑
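
Since the bending strokes are distinguished by how many times they bend and in which directions, each group above can be keyed by its sequence of bends. The sketch below is one hypothetical way to record that in Python; `BendGroup`, `CATALOGUE` and `find` are invented names, and only a couple of the examples from each list are copied in.

```python
from dataclasses import dataclass

DOWN, RIGHT = "down", "right"


@dataclass(frozen=True)
class BendGroup:
    bends: tuple     # bend directions, in the order the stroke makes them
    examples: tuple  # where the text says such a stroke can be seen


CATALOGUE = [
    BendGroup((DOWN,), ("top-right surrounding part of 司", "rightmost part of 又")),
    BendGroup((RIGHT,), ("乚", "main part of 厶")),
    BendGroup((DOWN, RIGHT), ("rightmost stroke of 九", "bottom of 气")),
    BendGroup((RIGHT, DOWN), ("bottom half of 丂", "central stem of 专")),
    BendGroup((DOWN, RIGHT, DOWN), ("right side of 乃", "central stem of 及")),
]


def find(bends):
    """Return every group whose bend sequence matches `bends` exactly."""
    return [group for group in CATALOGUE if group.bends == tuple(bends)]


for group in find([DOWN, RIGHT]):
    print(len(group.bends), "bends:", ", ".join(group.examples))
```

The number of bends is just the length of the sequence, so it needs no separate field.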

Components Transformed

When analysing the CJK characters into constituent components, I found that sometimes one component was transformed into another, and other times two components were joined together in some way.

I related pairs of similarly-shaped components to each other with a special transformation. Examples are: 子 and 孑, 勺 and 夕, 己 and 已, 千 and 干, 壬 and 王, 日 and 曰, 土 and 士, 刀 and the bottom of 节.

Another transformation is to repeat a certain component a number of times in a certain shape:
  • twice across (从夶朋林奻)
  • twice downwards (多昌畕)
  • three in a triangle (晶众姦森)
  • three across (巛州川)
  • three down (perhaps, the topright-surrounding component in 司 when constructing 為)
  • four in a square (叕朤燚)
  • four across (the 丨 in 卌)
  • four down (perhaps, the 一 when constructing 隹)
Some components reflect another across (eg the components of 北 and of 非, and 爿片) or downwards (eg the components of 忽 according to some).

Some characters are best analysed as outlines of another (凹 of 凵, 凸 of 丄).

Components Joined Together

Components can be joined together in many ways.

The most common join configuration is across, the second most common is downwards. The same two CJK components can sometimes be arranged both across and downwards to form different characters, eg 叭只, 略畧, 杠杢, 杍李, 峒峝, 叻另, and 呐呙. A handful of components join diagonally (eg 以; also, the part of 亥 under the 亠 is 丩 diagonally joined to 人). When two components join downwards, they can touch (eg 示去卡且丘元早光兄支).

A common configuration is where one component surrounds another somehow:
  • on two sides at the top left (厷厄右后)
  • on two sides at the bottom left (亾这迎廷咫尫爬)
  • on two sides at the top right (句匂勾可司匃)
  • on three sides around the top (网闪用瓜同)
  • on three sides around the bottom (凶鼎函)
  • on three sides around the left (区匜匹)
  • on three sides around the right (the left side of 臦, the smaller one on the right of 龜)
  • on all sides (回囚囟)
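
Unicode already has Ideographic Description Characters (U+2FF0 to U+2FFB) for layouts much like the across, downwards and surrounding joins above. The sketch below pairs the two up; the pairing is an assumption made for illustration rather than part of the analysis, `IDC` and `describe` are made-up names, and the surround-on-the-right case is left out because the original IDC range has no character for it.

```python
# Join configuration -> Ideographic Description Character.
# The pairing is an assumption made for this sketch.
IDC = {
    "across": "\u2FF0",                # ⿰ left to right
    "downwards": "\u2FF1",             # ⿱ above to below
    "surround all sides": "\u2FF4",    # ⿴ full surround
    "surround top": "\u2FF5",          # ⿵ surround from above
    "surround bottom": "\u2FF6",       # ⿶ surround from below
    "surround left": "\u2FF7",         # ⿷ surround from left
    "surround top left": "\u2FF8",     # ⿸ surround from upper left
    "surround top right": "\u2FF9",    # ⿹ surround from upper right
    "surround bottom left": "\u2FFA",  # ⿺ surround from lower left
}


def describe(config, first, second):
    """Build a description sequence: the IDC followed by its two parts."""
    return IDC[config] + first + second


# 叭 and 只 use the same two components, joined across and downwards.
print(describe("across", "口", "八"))
print(describe("downwards", "口", "八"))
# 囚 is 人 surrounded on all sides.
print(describe("surround all sides", "囗", "人"))
```

A description sequence like this is plain text, so the same two components arranged differently (叭只, 杍李) differ only in the leading character.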
A character can be inserted between the parts of another, either across (eg 衎衒衍 are each 行 with another character, 干玄氵 respectively, between) or downwards (eg 裒衷衰 are each 衣 with another, 臼中母 respectively, between).

When two components join, two strokes can be molded together into one stroke, either across (我 is 手 before 戈), downwards (缶 is 午 over 山; 里 is 田 over 土; 重 is 千 over 里), repeating down (岀), or surrounding (飛).

Two components can be threaded together (申 is 曰 threaded with 丨), overlap in various ways (肉民包世氏冉丑內西), or be within each other (夷來乘坐爽兆臾幽巫吏束夾噩承乖).

Components can be modified by a single stroke of some sort in some position (圡太主凡玉叉弋勺, 生午牛, 必才少).

It was straightforward to relate these transformations and join configurations together using an inheritance hierarchy when programming.
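
As a rough idea of what such a hierarchy might look like, here is a minimal Python sketch. It is not the original program: every class name (`Node`, `Component`, `Transformation`, `Repeat`, `Join`, `Across`, `Downwards`, `Surround`) and the wording produced by `describe` are assumptions chosen for illustration.

```python
class Node:
    """Base of the hypothetical hierarchy: anything that can be described."""
    def describe(self):
        raise NotImplementedError


class Component(Node):
    """A component taken as given, identified by its glyph."""
    def __init__(self, glyph):
        self.glyph = glyph

    def describe(self):
        return self.glyph


class Transformation(Node):
    """One source component changed into a new shape."""
    def __init__(self, source):
        self.source = source


class Repeat(Transformation):
    """The source repeated `count` times in a named arrangement."""
    def __init__(self, source, count, arrangement):
        super().__init__(source)
        self.count = count
        self.arrangement = arrangement

    def describe(self):
        return f"{self.count} x {self.source.describe()} {self.arrangement}"


class Join(Node):
    """Two nodes combined; subclasses only supply the relation word."""
    relation = "joined to"

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def describe(self):
        return f"({self.first.describe()} {self.relation} {self.second.describe()})"


class Across(Join):
    relation = "beside"


class Downwards(Join):
    relation = "over"


class Surround(Join):
    relation = "around"


tree, child = Component("木"), Component("子")
print(Across(tree, child).describe())               # analysis of 杍
print(Downwards(tree, child).describe())            # analysis of 李
print(Repeat(tree, 3, "in a triangle").describe())  # analysis of 森
```

Because transformations and joins share the `Node` base, they nest freely: a `Join` can take a `Repeat` as one of its parts, and a `Repeat` can take a `Join` as its source.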
