Saturday, November 22, 2008

Stress and Unstress in Computer Languages

Computer languages could learn a few things from natural languages in their design...

Natural Language
Many natural languages, such as English, make a distinction between stressed and unstressed words. In general, nouns, verbs, and adjectives (including adverbs ending in -ly) are stressed, while grammar words are unstressed.

For example: “I walked the spotty dog to the shop, quickly bought some bread, and returned home”. (I've bolded the syllables we stress during speech in this and following examples.)

We stress the nouns (dog, shop, bread, home), adjectives (spotty, quick), and verbs (walk, buy, return), and don't stress the grammar words (I, the, to, -ly, some, and). (Note: In Transformational Grammar, adverbs ending in -ly are considered to be a specific inflectional form of the corresponding adjectives.)

Examples of unstressed grammar words in English are conjunctions (and, or, but), conjunctive adverbs (while, because), pronouns (this, you, which), determiners (any, his), auxiliary verbs (is, may), prepositions (to, on, after), and other unclassed words (existential there, infinitive to), as well as many inflectional morphemes (-s, -'s, -ing, -ly).

Verbs are often only half-stressed instead of fully stressed, and prepositions half-stressed instead of unstressed, depending on the surrounding context, e.g. “The teacher saw the book behind the desk.” (Here, I've bold-italicized the half-stressed words.)

English has a clear distinction between grammar words and lexical words (nouns, adjectives/adverbs, and verbs) in speech.

Many languages distinguish between lexical and grammar words in their writing systems. German capitalizes the first letter of each noun. (Danish stopped doing this in 1948, and English in the 1700s.) Japanese uses Chinese characters (kanji) for nouns and the stems of many adjectives and verbs, and its kana syllabaries for grammar words and inflectional endings.

When using grammar words in a lexical capacity, we stress them when speaking, e.g. “I put an 'is' followed by an 'on', before the 'desk' with a 'the' before it, to make a predicate.” And when writing, we put the grammar words we're using as lexical ones inside quotes.

Using stress and unstress to separate lexical and grammar words enables English, and probably all natural languages, to be self-referential.


Computer Languages
Virtually every computer language differentiates between lexical words and grammar words.

Assembler and Cobol used indentation and leading keywords to distinguish different types of statements, and space and comma to separate items. As in many languages after them, their limited set of keywords couldn't be used for user-defined names. Fortran introduced a simple infix expression syntax for math calculations, using special symbols (+ - * etc) for infix operators ranked by precedence, and ( ) for bracketing. Lisp removed the indentation and keywords completely, making everything use bracketing, with space for separation, and a prefix syntax. APL removed the precedences, but introduced many more symbols for the operators. The experimentation continued until C became widespread.

C uses 3 different types of symbols for bracketing, ( ) [ ] { }. C++, Java, and C# added < > for bracketing. C uses space and , ; . for separators, and a large number of operators, organized via a complex precedence system. Java has 53 keywords; C# has 77.

The lexical words of computer languages are clear. Classes and variables are nouns. Functions and methods are verbs. Keywords beginning a statement are imperative verbs, and in some languages are indistinguishable from functions. Modifiers, interfaces, and annotations are adjectives/adverbs. The operators (+ - * / % etc) bear a similarity to prepositions, some of them (+= -= *= etc), to verbs. And I'd suggest the tokens used for bracketing and separators are clear examples of grammar words in computer languages, being similar to conjunctions and conjunctive adverbs.

In general, computer languages use some characters (e.g. A-Z a-z 0-9 _) for naming lexical words, and others (e.g. symbols and punctuation) for grammar. Occasionally there are exceptions, such as new and instanceof in Java, which are grammar words spelled with letters. Some computer languages use other means. Perl and PHP put a sigil before variable names ($ in PHP; $, @, or % in Perl), so a variable name can never collide with a keyword. This is similar to capitalizing all nouns in German. C# allows @ before any lexical word, but only requires it before those which double as keywords. This is similar to quoting grammar words to use them as lexical ones in English.
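The collision between keywords and names can be checked directly. The following is a minimal sketch in Python, chosen here only for illustration; the helper name usable_as_name is hypothetical, while keyword.iskeyword and str.isidentifier are part of the standard library.

```python
import keyword

def usable_as_name(word: str) -> bool:
    """True if `word` is a valid Python identifier and not a reserved keyword."""
    return word.isidentifier() and not keyword.iskeyword(word)

# "class" and "while" are grammar words in Python; "new" and "instanceof"
# are grammar words in Java but ordinary lexical words in Python.
for word in ["dog", "class", "new", "instanceof", "while"]:
    print(f"{word!r}: usable as a name = {usable_as_name(word)}")
```

Python has no escape comparable to the C# @ prefix; the usual convention is to append an underscore (class_) when a keyword is wanted as a name.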

Newer programming languages have different ways to use Unicode characters in names and operators. The displayable characters in Unicode fall into six basic categories: letters (L), marks (M), numbers (N), symbols (S), punctuation (P), and separators (Z). Python 3.0 names can begin with any Unicode letter (L), letter number (Nl, in N), or the underscore (in P); subsequent characters can also be combining marks (in M), digits (in N), and connector punctuation (in P). Scala names can begin with an upper- or lowercase Unicode letter (in L), the underscore (in P), or the dollar sign (in S); subsequent characters can also be certain other letters (in L), letter numbers (in N), and digits (in N). Scala operators can include math and other symbols (in S). Almost all languages have the same format for numbers, beginning with a digit (in N), perhaps with letters (in L) as subsequent characters.
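The Unicode categories and the Python naming rule can be inspected with the standard library. This is a small sketch, assuming Python 3; the helper major_category and the sample characters are my own choices for illustration. unicodedata.category returns a two-letter code whose first letter is the basic category.

```python
import unicodedata

def major_category(ch: str) -> str:
    """Return the first letter of the Unicode general category, e.g. 'L' for 'Lu'."""
    return unicodedata.category(ch)[0]

# Sample characters, one or two per basic category (chosen for illustration).
samples = {
    "Latin letter": "a",
    "Greek letter": "\u03bb",            # lambda
    "combining acute accent": "\u0301",
    "decimal digit": "7",
    "letter number": "\u3007",           # ideographic number zero
    "underscore": "_",
    "dollar sign": "$",
    "plus sign": "+",
    "space": " ",
}

for label, ch in samples.items():
    starts = ch.isidentifier()             # can the character begin a name?
    continues = ("a" + ch).isidentifier()  # can it follow the first character?
    print(f"{label:24} {unicodedata.category(ch)}  "
          f"starts a name: {starts}  continues a name: {continues}")
```

The combining accent and the digit are rejected as the first character of a name but accepted after it, which matches the start and subsequent-character rule described above.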

Perhaps the easiest way to distinguish between lexical and grammar words in GrerlVy is to use Unicode letters (L), marks (M), and numbers (N) exclusively for lexical words, and symbols (S), punctuation (P), and separators (Z) exclusively for grammar words. Of course, we still have a difficulty with the borderline case: infix operators and prefix methods, which correspond roughly to prepositions and verbs, the half-stressed words in English. I'm still thinking about that one.
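To see what that rule would do to ordinary source text, here is a rough sketch in Python. It is not GrerlVy's implementation; the names word_class and split_words are hypothetical, and the "other" class for control characters (Unicode category C, which includes newline and tab) is my own assumption, since the rule above doesn't mention them.

```python
import unicodedata
from itertools import groupby

LEXICAL = {"L", "M", "N"}   # letters, marks, numbers
GRAMMAR = {"S", "P", "Z"}   # symbols, punctuation, separators

def word_class(ch: str) -> str:
    """Classify one character by the first letter of its Unicode category."""
    major = unicodedata.category(ch)[0]
    if major in LEXICAL:
        return "lexical"
    if major in GRAMMAR:
        return "grammar"
    return "other"  # assumption: category C (controls etc.) is left unclassified

def split_words(source: str) -> list[tuple[str, str]]:
    """Group consecutive characters of the same class into runs."""
    return [(cls, "".join(run)) for cls, run in groupby(source, key=word_class)]

print(split_words("total+=price*2"))
# [('lexical', 'total'), ('grammar', '+='), ('lexical', 'price'), ('grammar', '*'), ('lexical', '2')]
```

One consequence of applying the rule strictly is that the underscore, being connector punctuation (in P), counts as grammar, so a name like my_var would split into three pieces. The sketch also merges adjacent symbols into a single run and makes no attempt to tell operators from brackets or separators.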
