Friday, June 29, 2007

Pictorial Analysis of CJK Characters

(republished portion)

One day programmers will use all the Unicode tokens in their programs, not just the ASCII ones. CJK characters make up about 70% of Unicode's tokens, and programmers who don't know a character's sound in some Asian language, or its meaning, must enter it by its pictorial representation. I've been analysing the pictorial structure of the 20,000 or so CJK characters in the Unicode CJK Unified Ideographs block with a view to making them easy for Westerners to type.
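For reference, Java, and hence Groovy, can already tell us whether a character belongs to that block, which is handy when scanning text for ideographs:

  // test whether the first character of a string is a CJK unified ideograph
  def isCjk = { String s ->
      Character.UnicodeBlock.of(s.charAt(0)) ==
          Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
  }
  assert isCjk('中')
  assert !isCjk('A')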

Basic Constituents

The Chinese often categorize their characters and components by the first stroke: whether it's horizontal (一), vertical (丨), left-leaning (丿), right-leaning (丶), or bending (eg, 乙). But I saw many more basic strokes than that.

I saw the non-bending basic strokes of equal length as being on a circle:
  1. slightly upwards from horizontal stroke (the bottom upwards stroke in 扌)
  2. horizontal stroke (the common 一)
  3. slightly downwards from horizontal stroke (the bottom right stroke of 之)
  4. perfect right-leaning diagonal stroke (the right side of 八)
  5. vertical stroke (the common 丨)
  6. slightly left-leaning from vertical stroke (丿, the left side of 厂)
  7. perfect left-leaning diagonal stroke (the left side of 八)
  8. almost level left-leaning stroke (top of 禾)
The only difference between the first and last of these is the direction of the stroke; a foreigner first learning characters would consider them the same stroke. A stroke at one point on the circle often transforms into a stroke next to it (eg, the horizontal stroke of 子, type 2, becomes a type 1 stroke in 孙).

The perfect right-leaning and left-leaning strokes can each shorten into a short right-leaning dot (丶) and a short left-leaning dot (top of 白) respectively. These two short dots can often transform into each other. There's also a longer right-leaning dot (eg, the right-leaning stroke of the top half of 爻) that's only a slightly shortened form of the perfect right-leaning stroke.

Some of these 8 strokes also have variants with hooks:
  • the bottom of 冫 - a variant of 1 above
  • the stem of 戈 - a variant of 4
  • the bottom of 丁 - a variant of 5
  • the stem of 乄 - a variant of 7
The other basic strokes in CJK characters are distinguished by how many times they bend, and the bending direction.

Strokes that bend once downwards:
  • the top-right surrounding part of 司 (the stem of 犭 is a variant)
  • the bottom-right part of 片
  • 乛 - a variant of each of the above two
  • the rightmost part of 又
  • the bottom of 辶
  • 乀 - a rare character
  • 乁 - a rare character
Strokes that bend once rightwards:
  • the bottom-left of 亾
  • 乚 (including the bottom of 心 in some fonts where it slopes before hooking)
  • right part of inside of 四
  • bottom-right of 鼠
  • bottom of 饣 - a variant of each of the above ones
  • main part of 厶
  • leftmost part of 女
  • central horizontal stroke of 牙
Strokes that bend twice, first downwards then rightwards:
  • stroke from top left to bottom right of 卍
  • rightmost stroke of top half of 殳
  • rightmost stroke of 九
  • bottom of 气
  • bottom of 讠- a variant of each of the above ones
Strokes that bend twice, first rightwards then downwards:
  • bottom half of 丂
  • stroke from top left to bottom right of 卐
  • central stem of 专
Strokes that bend three times, first downwards, second rightwards, and then downwards:
  • right side of 乃
  • central stem of 及
  • rightmost stroke of 郑
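Here's a minimal Groovy sketch of how these bend-based categories might be recorded as data; the class and property names are my own invention:

  class BendingStroke {
      String example   // a character exhibiting the stroke
      List bends       // bend directions in order, eg ['down', 'right']
      int getBendCount() { bends.size() }
  }

  // the bottom stroke of 气 bends first downwards, then rightwards
  def bottomOfQi = new BendingStroke(example: '气', bends: ['down', 'right'])
  assert bottomOfQi.bendCount == 2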

Components Transformed

When analysing the CJK characters into constituent components, sometimes one component was transformed into another; at other times, two components were joined together in some way.

I related pairs of similarly-shaped components to each other with a special transformation. Examples are: 子 and 孑, 勺 and 夕, 己 and 已, 千 and 干, 壬 and 王, 日 and 曰, 土 and 士, 刀 and the bottom of 节.

Another transformation is to repeat a certain component a number of times in a certain shape:
  • twice across (从夶朋林奻)
  • twice downwards (多昌畕)
  • three in a triangle (晶众姦森)
  • three across (巛州川)
  • three down (perhaps, the top-right surrounding part of 司 when constructing 為)
  • four in a square (叕朤燚)
  • four across (the 丨 in 卌)
  • four down (perhaps, the 一 when constructing 隹)
Some components are reflections of another across (eg the components of 北 and of 非, and 爿片) or downwards (eg the components of 忽, according to some).

Some characters are best analysed as outlines of another (凹 of 凵, 凸 of 丄).


Components Joined Together

Components can be joined together in many ways.

The most common join configuration is across; the second most common is downwards. The same two CJK components can sometimes be arranged both across and downwards to form different characters, eg 叭只, 略畧, 杠杢, 杍李, 峒峝, 叻另, and 呐呙. A handful of components join diagonally (eg 以; also, the part of 亥 under the 亠 is 丩 diagonally joined to 人). When two components join downwards, they can touch (eg 示去卡且丘元早光兄支).

A common configuration is where one component surrounds another somehow:
  • on two sides at the top left (厷厄右后)
  • on two sides at the bottom left (亾这迎廷咫尫爬)
  • on two sides at the top right (句匂勾可司匃)
  • on three sides around the top (网闪用瓜同)
  • on three sides around the bottom (凶鼎函)
  • on three sides around the left (区匜匹)
  • on three sides around the right (the left side of 臦, the smaller one on the right of 龜)
  • on all sides (回囚囟)
A component can be inserted between the parts of another, either across (eg 衎衒衍 are 行 with another component like 干玄氵 between) or downwards (eg 裒衷衰 are 衣 with another like 臼中母 between).

When two components join, two strokes can be molded together into one stroke, either across (我 is 手 before 戈), downwards (缶 is 午 over 山; 里 is 田 over 土; 重 is 千 over 里), repeating down (岀), or surrounding (飛).

Two components can be threaded together (申 is 曰 threaded with 丨), overlap in various ways (肉民包世氏冉丑內西), or be within each other (夷來乘坐爽兆臾幽巫吏束夾噩承乖).

Components can be modified by a single stroke of some sort in some position (圡太主凡玉叉弋勺, 生午牛, 必才少).

It was straightforward to relate these transformations and join configurations together using an inheritance hierarchy when programming.
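To give the flavour, here's a rough Groovy sketch of such a hierarchy; the class names and properties are illustrative only, not the actual program:

  abstract class Analysis { }

  class Transformed extends Analysis {   // eg 孑 related to 子
      String source
  }
  class Repeated extends Analysis {      // eg 林 as 木 twice across
      String component
      int times
      String shape                       // 'across', 'down', 'triangle', 'square'
  }
  class Joined extends Analysis {        // eg 叭 as 口 joined across to 八
      String first, second
      String configuration               // 'across', 'down', 'surrounding', ...
  }

  def lin = new Repeated(component: '木', times: 2, shape: 'across')
  assert lin instanceof Analysis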

Saturday, June 23, 2007

Regex Too Terse

The purpose of making programming languages terser is so they'll be more readable. But regexes are too terse. To make them readable, we need to make them more verbose. The ?x flag lets us format them more easily, but their syntax is so different from the languages they're embedded within that they still stick out, requiring mental effort to digest. JVM languages like Groovy aren't stuck with the terse syntax just because Java has it: just as Groovy generates JVM bytecodes, it could generate the terse regex syntax too. What would a more verbose yet readable regex syntax for Groovy look like?

For starters, we wouldn't need to embed the regex expression inside slashes / /, as the syntax would be mixable with Groovy's. Perhaps it could be generated by a builder. A small amount of syntax could remain the same. The alternation operator | acts like Groovy's | operator. The option operator ? has a parallel in Groovy's ?. operator. We could keep escaped control characters, and change the meaning of \b from word boundary to backspace as in Groovy, so we'd have '\t\n\b\f\r' instead of /\t\n\010\f\r/.

For character classes, we could use Groovy's sequence syntax, ('a'..'j') instead of /[a-j]/, or [*'a'..'z', *'A'..'Z', '_', *'0'..'9'] instead of /[a-zA-Z_0-9]/. We could use !['a','c','e'] instead of /[^ace]/. Pre-defined classes could have special variable names within the regex builder context, eg, ws for /\s/, digit for /\d/, and word for /\w/. We could even define our own character classes, eg, def hexDigit = [*'0'..'9', *'A'..'F', *'a'..'f'], or def notDigit = !digit for /\D/.
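To show such classes needn't stay hypothetical, here's a small Groovy sketch that renders a list of single characters into today's character-class notation (escaping of metacharacters is elided for brevity):

  def charClass = { List chars -> '[' + chars.join('') + ']' }

  def hexDigit = [*('0'..'9'), *('A'..'F'), *('a'..'f')]
  assert 'B' ==~ charClass(hexDigit)        // matches one hex digit
  assert !('G' ==~ charClass(hexDigit))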

For groups, parentheses are sensible, so 'a'+('b'|'c')+'d' as new syntax for /a(b|c)d/, but groups should be non-capturing by default. For capturing groups, we can use variable names instead of numbers, ie, 'a'+(bc='b'|'c')+'d'+bc+'e' instead of /a(b|c)d\1e/.

For the wildcard, perhaps replace the dot with an underscore, as in 'a'+_+'c' instead of /a.c/. For the repetition operators, we could use sequences, so a new special syntax 'a'*(0..) + 'b'*(1..) instead of /a*b+/, and 'a'*(3..5) instead of /a{3,5}/.
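Groovy's operator overloading suggests the repetition syntax is already implementable, at least for bounded ranges, since the open-ended (0..) would be new syntax. A sketch, with the class name Rx invented for illustration:

  class Rx {
      String pat
      Rx plus(Rx other)       { new Rx(pat: pat + other.pat) }               // sequencing via +
      Rx or(Rx other)         { new Rx(pat: "(?:$pat|${other.pat})") }       // alternation via |
      Rx multiply(IntRange r) { new Rx(pat: "(?:$pat){${r.from},${r.to}}") } // repetition via *
      String toString()       { pat }
  }
  def rx = { String s -> new Rx(pat: s) }

  def p = rx('a') * (3..5)
  assert p.toString() == '(?:a){3,5}'
  assert 'aaaa' ==~ p.pat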

Flags could be indicated by names heading a closure, eg, caseInsignificant{ 'aBc'*(1..)+'DeFg' } instead of /(?i:(aBc)+DeFg)/. Lazy and possessive operators could be indicated by such names, eg, lazy{ 'abc'*(0..) } instead of /(abc)*?/, and possessive{ 'def'*(1..) } instead of /(def)++/.

Lookarounds could also be shown by names, after{'a'} instead of (?=a), !after{'b'} instead of (?!b), and before{'c'} instead of (?<=c). The pre-defined anchors would have special variable names, eg, wordBoundary instead of /\b/, lineStart instead of /^/, and lineEnd instead of /$/. And we could define our own anchors, eg, def sentenceEnd = before{['.','?','!']}.
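These names are easy to prototype as closures that paste together today's syntax; the implementations below are my own guesses at what a builder might generate:

  def caseInsignificant = { Closure body -> "(?i:${body()})" }
  def after             = { Closure body -> "(?=${body()})"  }    // lookahead
  def before            = { Closure body -> "(?<=${body()})" }    // lookbehind

  assert 'ABC' ==~ caseInsignificant { 'abc' }
  assert 'foobar' =~ ('foo' + after { 'bar' })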

I thought of this replacement syntax off the top of my head. It's just an idea for a RegexBuilder for Groovy. We could have Groovy statements interacting with the regex syntax, just like other builders do, so we could capture information that would normally be lost in the regex backtracking. Maybe regex functions normally outside the regex string, such as text replacement, could also be done within the builder syntax.

So instead of regex syntax being so terse it's unreadable, and sticking out like a sore thumb from the cool Groovy syntax, it could be made more verbose so it's easily readable, and mixes nicely with other Groovy syntax.

Programming in Unicode

(republished portion)

Unicode, used by both Java and Windows, now has 100,000 characters, a collection of alphabetic letters, CJK (unified Chinese, Japanese, and Korean) characters, digits, symbols, punctuation, etc. But computer programs are still written using a mere 100 tokens, the ASCII characters. It's difficult to key in other characters, and programmers don't know other alphabets. But in a few years, using all the Unicode characters in programs may be standard. Math is a language that uses many more tokens than programming languages, both dedicated symbols and letters from many alphabets. Math can describe concepts extremely tersely, since the greater the range of tokens a language has, the terser it can be written. Yet programming is limited to the 100 tokens on the keyboard. Many people can type those 100 characters faster than they can write them, but can write thousands more they can't type.

Committed programmers are continually looking for ways to make programs terser, yet still readable. They choose languages and tools that enable such tersity, so programming languages evolved: the 2GL (assembler), the 3GL, and the visual 4GL. But 4GLs were limited in their scalability and readability. It's easier to write a program in a 4GL than a 3GL, but more difficult to read and debug it. So some used IDEs, supplementing 3GLs with visual aids. Others looked for a more productive language, so terser languages, such as Perl, Python, and Ruby, became popular. Regular expressions are a successful attempt at tersity, now used by many languages, but many consider them unreadable. The K programming language, used by financial businesses, could be the tersest language ever invented. It only uses ASCII symbols, but overloads them profusely. However, the price is the inability to give different precedences to the operators, so everything unbracketed is evaluated from the right. The tersity of present-day programming languages is derived from maximizing the use of grammar, the different ways tokens can be combined. The same 100 tokens are used.

Perhaps adding the many Unicode symbols to programming languages would enable terser programs to be written. Operator overloading in C++ was a similar attempt at tersity. Programmers could define meanings for some combinations of the 35 ASCII symbols. Although programs became terser, they were more difficult to understand because of the unpredictable meaning of these symbols in different code contexts, and the feature was eventually left out of Java. The problem wasn't with operator overloading itself, but with the uncontrolled association of meanings with each operator. Eventually certain meanings would have become generally accepted, the others falling into disuse, but this would have taken many years, with too many incompatible uses produced in the meantime. If there was such a problem with a few dozen operators, what hope would there be for the hundreds of unused Unicode symbols? If programmers were allowed to overload them with any meaning, the increase in program tersity would be at the cost of readability. Although some Unicode symbols will have an obvious meaning, such as some math symbols, most would have no meaning that could be transferred easily to the programming context. To retain readability of programs in a terse language, the meanings of the Unicode symbols would have to be carefully controlled by the custodians of that language. They would activate new Unicode symbols at a gradual pace only, with control of their meanings, after carefully considering existing use of the symbols.

Programming languages do, however, already allow Unicode characters in some parts of their programs. The contents of strings and comments can use any Unicode character. User-defined names can use all the alphabetic letters and CJK characters, and because there already exist agreed meanings for combinations of these, derived from their respective natural languages, we can increase tersity while keeping readability. But the core of the language, the grammar keywords and symbols, and names in supplied libraries, still only use ASCII characters. Perhaps some programmers use non-Latin characters wherever they can in their programs. A browse through the computer shelves of a typical bookshop in mainland China suggests they only do so for comments and contents of strings, not for user-defined names.
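In fact, user-defined names in any alphabet already work, since Groovy, like Java, accepts any Unicode letters in identifiers:

  def 面积 = { 宽, 高 -> 宽 * 高 }    // 'area' as a closure of width and height
  assert 面积(3, 4) == 12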

Programmers from cultures not using the Latin alphabet won't be motivated to use their own alphabets in user-defined names when they don't with pre-supplied names, such as keywords or standard libraries. Often, most of the names in a program are from libraries standard to the language. To trigger the widespread use of Unicode characters from non-ASCII alphabets in programs, the pre-supplied names must also be in those alphabets. And this could easily be done. The grammar of a language and its vocabulary are two different concepts. A programming language grammar could conceivably have many vocabulary options. Almost all programming languages only have English. Other vocabularies could be based on other natural languages. A Spanish vocabulary plugged into a certain programming language would have Spanish names for the keywords, modules, classes, methods, variables, etc.

Computer software nowadays is internationalized, as are webpages, and most programming languages enable internationalized software. But the languages themselves and their libraries are not internationalized. An internationalized programming language would enable a fully functional program to be written totally in a natural language of one's choice. Not only could all user-defined names be specified in any alphabet, but also all keywords and names in the standard libraries would be available in many natural languages. Ideally, when a software supplier ships a library, they'll specify the public names in many languages in the definition. But this is likely to have a slow uptake, so languages must make it easy to translate a library incrementally from one natural language into another. Some languages let programmers use mixins and interceptors to do this to various degrees. And some could conceivably allow a preprocessor, pluggable lexer, or closures to internationalize the keywords. But full foreign-language support must be a declared aim of a language's development.
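For instance, Groovy's metaclass can already add a translated alias to a library method one name at a time; the Spanish name longitud here is just my example:

  String.metaClass.longitud = { -> delegate.length() }

  assert 'hola'.longitud() == 4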

Internationalized programming languages are presently rare, but they will follow the trend of the software they're used to write. Soon enough, most programming languages will be internationalized. The first libraries to be translated will probably be the core Java class libraries, and the first language translated into will probably be simplified Chinese.