Thursday, April 23, 2009

Interactional function of English and Groovy

Michael A.K. Halliday writes in his 1970 paper “Language Structure and Language Function” that we should analyze language in terms of its use, considering both its structure and function in doing so. He found that the vast number of options embodied in language combine into three relatively independent components, each corresponding to a basic function of language: representational (a.k.a. ideational), interactional (a.k.a. interpersonal), and textual. Within each component, the networks of options are closely interconnected; between components, the connections are few.

For natural language, the representational component represents our experience both of the outside world and of the consciousness within us. The representational similarities between natural and computer languages are the most easily noticed:
mary.have(new Lamb(size: LITTLE))
lamb.fleece.color = Color.SNOW_WHITE
synchronized{ place -> mary.go(place); lamb.go(place) }

Computer languages' increasing use of abstraction over the years was no doubt based on the representational component of natural languages, giving rise to the functional and object-oriented paradigms. The ideas represented in computer language must be more precise than those in natural language.

Interactional component of English
The interactional component of language involves its producer and receiver(s), and the relationship between them. For natural language, there are one or more human receivers; for computer language, there are one or more electronic producers and/or receivers as well as the human one(s).

In English, the interactional component accounts for:

  • many adverbs of opinion, e.g. “That's an incredibly interesting piece of code!”

  • interjections within a clause, e.g. “I'm hoping to, er, well, go back sometime”, or even in the middle of words, e.g. “abso-bloomin'-lutely”

  • expressions of politeness we prepend to English sentences, e.g. “Are you able to...” in front of “Tell me the time”

  • the hundreds of different attitudinal intonations we overlay onto our speech, e.g. “Dunno!” (can you native English speakers hear that intonation?)

  • the mood, whether indicative e.g. “He's gone.”, interrogative e.g. “Is she there?”, imperative e.g. “Go now!”, or exclamative e.g. “How clever!”

  • the modal structure of English grammar, i.e. verbal phrases have certainty e.g. “I might see him”, ability e.g. “I can see her”, allowability e.g. “Can he do that?”, polarity e.g. “They didn't know”, and/or tense e.g. “We did make it”

Natural language offers many choices regarding how closely to intertwine the interactional component with the representational.
An example of closely intertwined reported speech: She said that she had already visited her brother, that the day before she'd been with her teacher, and that at that moment she was shopping with her friend.
And using quoted speech to reduce the tangling between the interactional and representational components: She said, "I've already visited my brother, yesterday I was with my teacher, and right now I'm shopping with my friend."
Another example of keeping the two components disjoint: I'm going to tell the following story exactly as she told it, the way she said it, not how I'd say it...

The original human languages of long ago, just like chimpanzee communication today, were perhaps mainly interactional, with the representational component slowly added on afterwards.

Interactional component of computer languages
For computer languages, the interactional component determines how people interact with the program, and how other programs interact with it. As in natural languages, the interactional component came first, with representational abstractions added on later. Many have tried to create a representational-only computer language; perhaps the most successful is Haskell. But Haskell's creators went to great trouble to tack on the minimally required interactional component, that of input/output: they introduced monads to add I/O capability onto the “purer” underlying functional-paradigm core. Perhaps some functional-paradigm language creators don't appreciate the centrality of the interactional component in language.
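The same discipline Haskell enforces with monads can be sketched in any language by convention: keep the representational computation pure and push all interaction to the program's edge. A minimal sketch in Java (the class and method names here are hypothetical, chosen for illustration):

```java
import java.util.List;

public class PureVsIO {
    // Representational: a pure function with no interaction with the outside world
    static int sumOfSquares(List<Integer> xs) {
        return xs.stream().mapToInt(x -> x * x).sum();
    }

    // Interactional: all input/output is confined to the entry point
    public static void main(String[] args) {
        int result = sumOfSquares(List.of(1, 2, 3));
        System.out.println(result); // prints 14
    }
}
```

Haskell's type system makes this separation mandatory rather than conventional: a function without IO in its type simply cannot perform I/O.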

Siobhán Clarke et al. write about the tyranny of the dominant decomposition:
Current object-oriented design methods suffer from the “tyranny of the dominant decomposition” problem, where the dominant decomposition dimension is by object. As a result, designs are caught in the middle of a significant structural misalignment between requirements and code. The units of abstraction and decomposition of object-oriented designs align well with object-oriented code, as both are written in the object-oriented paradigm, and focus on interfaces, classes and methods. However, requirements specifications tend to relate to major concepts in the end user domain, or capabilities like synchronisation, persistence, and failure handling, etc., all of which are unsuited to the object-oriented paradigm.

The object-paradigm is a representational one. The other user-domain capabilities are interactional ones, either human-to-computer or computer-to-computer. Some examples:
  • I/O actions, i.e. between computer and human/s

  • logging, i.e. between processor and recording medium

  • persistence, database access, i.e. between computer and storage unit/s

  • security, i.e. between computer and certain humans only

  • execution performance, i.e. how to maximize use of computing resources

  • entry point to program, i.e. between processor and external scheduler

  • concurrency, synchronization, i.e. between two processors in one computer

  • distribution, i.e. between two geographically separated computers

  • exceptions, failure handling, i.e. between results of different human-expected certainties

  • testing, i.e. interaction between two different external humans

These capabilities are often interwoven into the programming code, just as mood and modality are overlaid onto all the finite verbal phrases in English. And just as in English, where various interactional functions can be disentangled from the representation functions, e.g. quoted speech above, so also in computer languages, such user-domain capabilities can be extracted as system-wide aspects in aspect-oriented programming.

AspectJ is a well-known attempt to let each aspect use the same syntax as the base language. But the idea of limited AOP is much older; often a different syntax is used for each different user-domain capability.
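The extraction of an interactional capability from representational code can be sketched without AspectJ's weaving, using a JDK dynamic proxy as a minimal stand-in for a logging aspect (the `Greeter` interface and `withLogging` helper are hypothetical names for illustration):

```java
import java.lang.reflect.Proxy;

interface Greeter { String greet(String name); }

public class LoggingAspect {
    // Wrap any interface implementation so every call is logged,
    // without touching the representational code itself
    @SuppressWarnings("unchecked")
    static <T> T withLogging(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[]{ iface },
            (proxy, method, args) -> {
                System.out.println("calling " + method.getName()); // the cross-cutting concern
                return method.invoke(target, args);                // the original behaviour
            });
    }

    public static void main(String[] args) {
        Greeter plain = name -> "Hello, " + name;           // representational core
        Greeter logged = withLogging(Greeter.class, plain);  // interactional aspect layered on
        System.out.println(logged.greet("Mary"));
    }
}
```

A true AspectJ aspect goes further, letting a single pointcut apply the same advice across the whole codebase rather than one wrapped object at a time.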

I've already blogged about some aspects of the textual component of English and Groovy. Whereas the other two components of language exist for reasons independent of the medium itself, the textual component comes into being because the other two components exist, referring both to them and to itself. The textual component ensures every degree of freedom available in the medium is utilized.

In computer languages, the textual component is often called “syntactic sugar”. Computer language designers often scorn the use of lots of syntactic sugar, but natural language designers, i.e. the speakers of natural languages, use all the syntactic sugar available in the communication medium. Programming language designers should do the same. In the DLR-targeted Groovy I'm working on, I'm focusing on this aspect of the Groovy language.

Thursday, February 12, 2009

The Rise of Unicode

The next version of Unicode will be 5.2, the latest revision of a unified character set now containing over 100,000 current tokens. One notable addition in 5.2 will be the Egyptian hieroglyphs, the earliest known system of human writing. Perhaps they will mark Unicode's coming of age, it being another huge step in representing language with graphical symbols. Let's look at a consolidated short history of writing systems, courtesy of various Wikipedia pages, to see Unicode's rise in perspective...

Egyptian hieroglyphs were invented around 4000-3000 BC. The earliest type of hieroglyph was the logogram, where a common noun (such as sun or mountain) is represented by a simple picture. These existing hieroglyphs were then used as phonograms, to denote more abstract ideas with the same sound. Later, these were modified by extra trailing hieroglyphs, called semagrams, to clarify their meaning in context. About 5000 Egyptian hieroglyphs existed by Roman times. When papyrus replaced stone tablets, the hieroglyphs were simplified to accommodate the new medium, sometimes losing their resemblance to the original picture.

The idea of such hieroglyphic writing quickly spread to Sumeria, and eventually to ancient China. The ancient Egyptian and Sumerian hieroglyphs are no longer used, but modern Chinese characters are descended directly from the ancient Chinese ones. Because Chinese characters spread to Japan and ancient Korea, they're now called CJK characters. By looking at such CJK characters, we can get some idea of how Egyptian hieroglyphs worked. Many CJK characters were originally pictures, such as 日 for sun, 月 for moon, 田 for field, 水 for water, 山 for mountain, 女 for woman, and 子 for child. Some pictures have meanings composed of other meanings, such as 女 (woman) and 子 (child) combining into 好, meaning good. About 80% of Chinese characters are phonetic, consisting of two parts, one semantic, the other primarily phonetic, e.g. 土 sounds like tu, and 口 means mouth, so 吐 also sounds like tu, and means to spit (with the mouth). The phonetic part of many phonetic characters often also provides secondary semantics to the character, e.g. the phonetic 土 (in 吐) means ground, where the spit ends up.

Eventually in Egypt, a set of 24 hieroglyphs called uniliterals evolved, each denoting one consonant sound in ancient Egyptian speech, though they were probably only used for transliterating foreign names. This idea was copied by the Phoenicians by 1200 BC, and their symbols spread around the Middle East into various other languages' writing systems, having a major social effect. It's the basis of almost all alphabets used in the world today, except the CJK characters. These Phoenician symbols for consonants were copied by the ancient Hebrews and for Arabic, but when the Greeks copied them, they adapted the symbols of unused consonants for vowel sounds, becoming the first writing system to represent both consonants and vowels.

Over time, cursive versions of letters evolved for the Latin, Greek, and Cyrillic alphabets so people could write them easily on paper. They used either the block or the cursive letters, but not both, in one document. The Carolingian minuscule became the standard cursive script for the Latin alphabet in Europe from 800 AD. Soon after, it became common to mix block (uppercase) and cursive (lowercase) letters in the same document. The most common system was to capitalize the first letter of each sentence and of each noun. Chinese characters have only one case, but that may change soon. Simplified characters were invented in 1950s mainland China, replacing the more complex characters still used in Hong Kong, Taiwan, and Western countries. Nowadays in mainland China, though, both complex and simplified Chinese are sometimes used in the same document, the complex ones for more formal parts of the document. Perhaps one day complex characters will sometimes mix with simplified ones in the same sentence, turning Chinese into another two-case writing system.

Punctuation was popularized in Europe around the same time as cursive letters. Punctuation is chiefly used to indicate stress, pause, and tone when reading aloud. Underlining is a common way of indicating stress. In English, the comma, semicolon, colon, and period (,;:.) indicate pauses of varying degrees, though nowadays only the comma and period are used much in writing. The question mark (?) replaces the period to indicate a question, of either rising or falling tone; the exclamation mark (!) indicates a sharp falling tone.

The idea of separating words with a special mark also began with the Phoenicians. Irish monks began using spaces in 600-700 AD, and this quickly spread throughout Europe. Nowadays, the CJK languages are the only major languages not using some form of word separation. Until recently, the Chinese didn't recognize the concept of word in their language, only of (syllabic) character.

The bracketing function of spoken English is usually performed by saying something at a higher or lower pitch, between two pauses. At first, only the pauses were shown in writing, perhaps by pairs of commas. Hyphens might replace spaces between words to show which ones are grouped together. Eventually, explicit bracketing symbols were introduced at the beginning and end of the bracketed text. Sometimes the same symbol was used to show both the beginning and the end, such as pairs of dashes to indicate appositives, and pairs of quotes, either single or double, to indicate speech. Sometimes different paired symbols were used, such as parentheses ( and ). In the 1700's, Spanish introduced inverted ? and ! at the beginning of clauses, in addition to the right-way-up ones at the end, to bracket questions and exclamations. Paragraphs are another bracketing technique, being indicated by indentation.

Around 1050, movable-type printing was invented in China. Instead of carving an entire page on one block as in block printing, each character was on a separate tiny block. These were fastened together into a plate to reflect a page of a book, and after printing, the plate was broken up and the characters reused. But because thousands of characters needed to be stored and manipulated, making movable-type printing difficult, it never replaced block printing in China. For European alphabets, though, fewer than a hundred letters and symbols needed to be manipulated, which was much easier. So when movable-type printing reached Europe, the printing revolution began.

With printing, a new type of language matured, one that couldn't be spoken very well, only written: the language of mathematics. Mathematics, unlike natural languages, needs to be precisely represented. Natural languages are very expressive, but can also be quite vague. Numbers were represented by many symbols in ancient Egypt and Sumeria, but these had been reduced to a mere 10 by the Renaissance. From then on, though, mathematics started requiring many more symbols than merely two cases of 26 letters, 10 digits, and some operators. Many symbols were imported from other alphabets, different fonts were introduced for Latin letters, and many more symbols were invented to accommodate the requirements of writing mathematics. Mathematical symbols are now almost standardized throughout the world. Many other symbol systems, such as those for chemistry, music, and architecture, also require precise representation. Existing writing systems changed to utilize the extra expressiveness that came with movable-type printing. Underlining in handwriting was supplemented with bolding and italics. Parentheses were supplemented with brackets [] and curlies {}.

Fifty years ago, yet another type of language arose, for specifying algorithms: computer languages. The first computer languages were easy to parse, requiring little backtracking, but the most popular syntax, that of C and its descendants, requires more complex logic and greater resources to parse. Most programming languages used a small repertoire of letters, digits, punctuation, and symbols, being limited by the keyboard. Other languages, most notably APL, attempted to use many more, but this never became popular. Unlike mathematics, computer languages relied on parsing syntax, rather than a large variety of tokens, to represent algorithms. Computer programs generally copied natural language writing systems, using letters, numbers, bracketing, separators, punctuation, and symbols in similar ways. One notable innovation of computer languages, though, is camel case, popularized for names in C-like language syntaxes.

The natural language that spread around the world in modern times, English, doesn't use a strict pronunciation-spelling correspondence, perhaps one of the many reasons it spread so rapidly. English writing therefore caters for people who speak English with widely differing vowel sounds and stress, pause, and tone patterns. In this way, English words are a little like Chinese ideographs. As Asian economies developed, techniques for quickly entering large-character-set natural languages were invented, known as IME's (input method editors). But these Asian countries still use English for computer programming.

Around 1990, Unicode was born, unifying the character sets of the world. Initially there was only room for about 60,000 tokens in Unicode, so the CJK characters of China, Japan, and Korea were unified to save space. Unicode is also bidirectional, catering to Arabic and Hebrew. Top-down scripts such as Mongolian and traditional Chinese can be simulated with left-to-right or right-to-left direction by using a special sideways font. However, Unicode didn't become very popular until its UTF-8 encoding was invented about ten years ago, allowing backwards compatibility with ASCII. Another benefit of UTF-8 is that there's now room for about one million characters in the Unicode character set, allowing less commonly used scripts such as Egyptian hieroglyphs to be encoded.
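The ASCII compatibility and the extra room both fall out of UTF-8's variable-length design: ASCII stays one byte, most CJK characters take three, and anything beyond the Basic Multilingual Plane (such as the Egyptian hieroglyphs at U+13000 and up) takes four. A small sketch in Java (the class name is arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // An ASCII letter keeps its single-byte encoding (backwards compatible)
        byte[] a = "A".getBytes(StandardCharsets.UTF_8);
        // A CJK character such as 日 (sun, U+65E5) needs three bytes
        byte[] sun = "日".getBytes(StandardCharsets.UTF_8);
        // A character outside the Basic Multilingual Plane, U+13000
        // (an Egyptian hieroglyph), needs four bytes
        byte[] hiero = "\uD80C\uDC00".getBytes(StandardCharsets.UTF_8);
        System.out.println(a.length + " " + sun.length + " " + hiero.length); // prints 1 3 4
    }
}
```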

Many programming languages have recently adopted different policies for using Unicode tokens in names and operators. The display tokens in Unicode are divided into various categories and subcategories, mirroring their use in natural language writing systems. Examples of such subcategories are: uppercase letters (Lu), lowercase ones (Ll), digits (Nd), non-spacing combining marks, e.g. accents (Mn), spacing combining marks, e.g. Eastern vowel signs (Mc), enclosing marks (Me), invisible separators that take up space (Zs), math symbols (Sm), currency symbols (Sc), start bracketing punctuation (Ps), end bracketing (Pe), initial quote (Pi), final quote (Pf), and connector punctuation, e.g. underscore (Pc).
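These general categories are directly queryable in languages with Unicode-aware standard libraries; Java, for instance, exposes them through `Character.getType`. A minimal sketch:

```java
public class CategoryDemo {
    public static void main(String[] args) {
        // Character.getType reports the Unicode general category of a character
        System.out.println(Character.getType('A') == Character.UPPERCASE_LETTER);       // Lu
        System.out.println(Character.getType('a') == Character.LOWERCASE_LETTER);       // Ll
        System.out.println(Character.getType('7') == Character.DECIMAL_DIGIT_NUMBER);   // Nd
        System.out.println(Character.getType('+') == Character.MATH_SYMBOL);            // Sm
        System.out.println(Character.getType('$') == Character.CURRENCY_SYMBOL);        // Sc
        System.out.println(Character.getType('(') == Character.START_PUNCTUATION);      // Ps
        System.out.println(Character.getType('_') == Character.CONNECTOR_PUNCTUATION);  // Pc
        // all of the above print true
    }
}
```

A language designer deciding which Unicode tokens may appear in identifiers or operators would typically whitelist by these categories rather than by individual code points.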

For it to become popular to use a greater variety of Unicode tokens in computer programs, there must be a commonly available IME for their entry with keyboards. Sun's Fortress provides keystroke sequences for entering mathematical symbols in programs, but leaves it vague whether the Unicode tokens or the ASCII keys used to enter them are the true tokens in the program text. And of course there must be a commonly available font representing every token. Perhaps because of the large number of CJK characters, and the recent technological development of mainland China, a large number of programmers may one day suddenly begin using them in computer programming to make their programs terser.

Language representation using graphical symbols has taken many huge leaps in history: Egyptian hieroglyphs to represent speech around 5000 years ago, an alphabet to represent consonant and vowel sounds by the Phoenicians and Greeks around 3000 years ago, movable-type printing in Europe around 500 years ago, and the unification of the world's alphabets and symbols into Unicode a mere 20 years ago. And who knows what the full impact of this latest huge leap will be?