Wednesday, April 09, 2008

Syntactic Rank in English and Groovy

There are many similarities between natural languages and computer languages. Let's look at one aspect of language, syntactic rank, and compare it in English and Groovy. The analysis of English is only a basic one, just enough to allow a comparison with Groovy.

Rank in English
In English, structures larger than a sentence differ between written and spoken English: writing has paragraphs and speech has exchanges. Sentences are strung together into a cohesive sequence by pronoun linking, transition phrases, etc. Syntax kicks in from the sentence level down, defining 5 ranks: (1) sentence, (2) clause, (3) phrase, (4) word, (5) morpheme. Items at one rank are composed of items from the next rank down, with the morpheme being the lowest, atomic, rank.

A sentence can be simple, consisting of only one clause, or more complex. For example, this compound-complex sentence:
The batter hit the ball hard, but, because the wind blew it his way, the out-fielder caught it easily.
consists of 3 clauses, in a tree-like manner:
(the batter hit the ball hard)   //compound constituent
but
( because
  (the wind blew it his way)   //complex constituents
  (the out-fielder caught it easily)
)

A clause consists of phrases (sometimes called groups). For example:
Surprisingly, the batter has slammed the ball out of the pitch.
has the tree structure:
surprisingly   //adverb phrase
( the batter   //noun phrase
  ( has slammed   //verb phrase
    the ball   //noun phrase
    (out of the pitch)   //prepositional phrase
  )
)

A very common clause structure is a noun phrase followed by a predicate. For example:
The beekeeper became a mountain climber.
is divided into:
the beekeeper   //noun phrase at the head, called a subject
became a mountain climber
    //predicate, which can be further broken down

A phrase consists of words. For example, this noun phrase:
A big bright red truck
has structure:
a
big
(bright red) //not a word, but an example of rank-shifting
truck
The (bright red) isn't a word, but another phrase used as a word. This is called rank-shifting. Sometimes, a rank can be shifted more than one place. For example:
the pay-as-you-earn tax
has a clause shifted to the position of a word.


Phrases nest to many levels more easily than items at other ranks do. For example, a noun phrase embellished with adjectives and prepositional phrases:
the big thick book with the silky red cover on the bookshelf by the fireplace
has structure:
( the (big (thick book)) ( with ( the (silky (red cover)) ) ) )
on ( the bookshelf ( by ( the fireplace ) ) )

Finally, words consist of morphemes. Some morphemes are lexical, others are grammatical. For example, the word undiscerningly has structure:
( un
  ( discern   //only one lexical morpheme
    ing)
)
ly

By contrast, the compound word beekeeper has two lexical morphemes, bee and keep, followed by the grammatical suffix er. Morphemes are the atomic structure in English grammar.
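To make the lexical/grammatical split concrete, here is a small sketch in Groovy that records each word as a list of tagged morphemes. The map layout and the names words and lexicalCount are my own invention for this illustration, not a standard linguistic notation. It splits beekeeper as bee + keep + er, which still gives two lexical morphemes.

```groovy
// Each word is a list of morphemes, each tagged as lexical or grammatical.
// The 'form' and 'kind' keys are labels made up for this sketch.
def words = [
  undiscerningly: [
    [form: 'un',      kind: 'grammatical'],
    [form: 'discern', kind: 'lexical'],
    [form: 'ing',     kind: 'grammatical'],
    [form: 'ly',      kind: 'grammatical']
  ],
  beekeeper: [
    [form: 'bee',  kind: 'lexical'],
    [form: 'keep', kind: 'lexical'],
    [form: 'er',   kind: 'grammatical']
  ]
]

// Count the lexical morphemes in a word.
def lexicalCount = { word -> words[word].count { it.kind == 'lexical' } }

assert lexicalCount('undiscerningly') == 1
assert lexicalCount('beekeeper') == 2

// Joining the morphemes back together recovers the word.
assert words.beekeeper*.form.join('') == 'beekeeper'
```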


Rank in Groovy
Like English, Groovy is best analyzed as having 5 ranks: (1) top-level, (2) statement, (3) expression, (4) path, (5) primary. It could be useful to match up the ranks of Groovy with those of English.

A top-level is a class, interface, or enum definition, standalone method definition or statement, or package or import statement. For example:
def mean(a, b){
  def c= a + b
  c / 2
}
It could correspond to a sentence in English. Class and method definitions consist of statements, just as sentences consist of clauses. A standalone statement can be a top-level, just as a simple sentence consists of only one clause.
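As a rough illustration of several kinds of top-level item sitting side by side in one script, here is a minimal sketch. The class Pair and its total method are made up for this example; only mean comes from the text above.

```groovy
// An import statement is itself a top-level item.
import groovy.transform.ToString

// A class definition is another.
@ToString
class Pair {
  def left
  def right
  def total() { left + right }
}

// A standalone method definition.
def mean(a, b) {
  def c = a + b
  c / 2
}

// Standalone statements, each one a top-level item on its own.
def p = new Pair(left: 3, right: 5)
println p
println mean(p.left, p.right)
```

Each of the five items here stands at the top rank, much as five sentences would stand next to each other in a paragraph.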

A statement is of various types, e.g. if, while, try, break, expression, or block. A statement in Groovy could correspond to a clause in English.
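The statement types named above can be seen together in one small method. This is only a sketch; the method name classify and its logic are invented to fit the statement types in, not taken from any real program.

```groovy
def classify(n) {
  def log = []                  // declaration with an initializer
  if (n < 0) {                  // if statement; the braces form a block statement
    log << 'negative'           // expression statement
  }
  def i = 0
  while (true) {                // while statement
    if (i >= n) break           // break statement
    i++
  }
  log << 'counted to ' + i
  try {                         // try statement
    log << 10.intdiv(n)
  } catch (ArithmeticException e) {
    log << 'division by zero'
  }
  log                           // the last expression is the return value
}

assert classify(2) == ['counted to 2', 5]
assert classify(0) == ['counted to 0', 'division by zero']
```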

A common kind of expression statement is the assignment, e.g.:
def b= c + ( d * e )
This could correspond to the subject-predicate style of clause, where b is the subject and the rest is the predicate.

A block statement could correspond to the complex portion within a compound-complex sentence.

An expression consists of path structures, just as a phrase consists of words.

One such structure is the closure, which itself consists of statements, entities of a higher rank. This is rank-shifting in Groovy, e.g.:
def c= {
  it= it * 2
  println it
  it * 3
}(7)
Compare this with the rank-shifted compound sentence inside the relative clause:
The mountain range, of which the tallest was there and its peak needed knocking off, loomed before them.

Expressions enable the deepest nesting, just as phrases do in English. For example:
a + ( b * ( -c - d ** e ) / (z= f1 - (f2 * f3)) % g ) +
  ( ( !h1 || h2 ) ? i : -j )
In general, the unary operators have the highest precedence, followed by the left-associative binary operators, then the ternary operator, then the right-associative binary operators such as assignment.
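A few assertions can check this general ordering. This is a sketch covering only some common operators, not the full precedence table, and the variable names t, x and y are invented for it.

```groovy
// Unary ! binds tighter than binary ||.
assert (!true || true) == ((!true) || true)

// Binary arithmetic operators group from the left.
assert 2 + 3 * 4 == 14        // * binds tighter than +
assert 10 - 4 - 3 == 3        // (10 - 4) - 3, not 10 - (4 - 3)
assert 100 / 10 / 5 == 2      // (100 / 10) / 5, not 100 / (10 / 5)

// The ternary operator binds more loosely than ||.
def t = false || true ? 'yes' : 'no'
assert t == 'yes'             // parsed as (false || true) ? 'yes' : 'no'

// Assignment binds most loosely and groups from the right.
def x
def y
x = y = 2 + 3                 // parsed as x = (y = (2 + 3))
assert x == 5 && y == 5
```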

Each path structure consists of a head followed by path elements, e.g. arguments, subscripts, closures, or names following a path operator (. ?. *. .& .@). For example:
callMet(d, e, 2.71, "abcdefg")[7].&fin()?.gun()
  .g2{->}.g3(1, "b"){->}.@hun.&@iun*.jun[a,b,c]
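The path above only shows the syntax. For something that actually runs, here is a sketch that uses each kind of path element on its own. The class Book, its members, and the variables shelf, missing and shoutRef are all invented for this example.

```groovy
class Book {
  String title
  List<String> tags = []
  def shout() { title.toUpperCase() }
}

def shelf = [
  new Book(title: 'dune', tags: ['sf']),
  new Book(title: 'emma', tags: ['novel', 'classic'])
]

// Head shelf, then a subscript, then a name after .
assert shelf[0].title == 'dune'

// ?. gives null instead of failing when the head is null.
def missing = null
assert missing?.title == null

// *. applies the name to every element of a list.
assert shelf*.title == ['dune', 'emma']

// .& takes a method pointer, which then accepts an argument list.
def shoutRef = shelf[1].&shout
assert shoutRef() == 'EMMA'

// .@ reads the field directly, bypassing the getter.
assert shelf[0].@title == 'dune'

// A closure as a path element, then several elements chained together.
assert shelf.collect { it.tags.size() } == [1, 2]
assert shelf.findAll { it.tags.size() > 1 }*.shout()[0]?.toLowerCase() == 'emma'
```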

Rank-shifting is far more likely within Groovy path structures than within English words. Path structures are ultimately composed of primaries.

Primaries such as identifiers, operators, literals, numbers, and strings are the atomic structures of Groovy grammar, just as morphemes are the atomic structure of English. The operators ( + * % || && ) and literals (true, null, this) are like grammatical morphemes, while identifiers are like lexical morphemes.


Summary
The matching between Groovy and English syntactic rank isn't perfect, especially for the lower ranks, but does go a fair way. Groovy copies its 5-rank structure from other programming languages, such as Java and C++. I suspect the matching 5-rank structure makes it easier for English speakers to write and read programs in these languages. Perhaps people who don't already know Java or C++ would benefit from seeing comparisons with English grammar when they learn Groovy.

2 comments:

Shawn Hartsock said...

Very interesting, I don't know much about linguistics but I've been learning more about the topic lately. Since a language can constrict what we may express, I've been wondering if when designing programming languages we aren't artificially constricting what we can do with a modern computer.

So, does this mean that modern programming languages are easier to use for English speakers? (this goes deeper than the use of English keywords it matches language structure.) And if so, does it mean that we are limiting our programming expressiveness by staying so close to English?

The rank matching explanation goes a long way to explaining the popularity of certain languages... and the difficulty some concepts see in adoption.

However, the code samples aren't particularly readable. I've found the PLEAC project very helpful: http://pleac.sourceforge.net/

Gavin Grover said...

You said: "So, does this mean that modern programming languages are easier to use for English speakers? (this goes deeper than the use of English keywords it matches language structure.) And if so, does it mean that we are limiting our programming expressiveness by staying so close to English?"

I suspect programmers do conceptualize their programming language within the mindset of their own natural language, just as it's generally accepted that second natural language learners are influenced by their first natural language, especially the grammar.

Even a common English-language concept, the pronoun, isn't represented as fully in programming languages. I only know of the magic underscore in Python, but English has a much greater variety. We use temporary variables too much in programming. See section 5 here.

I'll take more care with the examples in my next post on the subject :-)