Friday, June 20, 2008

Word Classes in English and Groovy

When I was in primary school, I learnt that English had 8 parts of speech: nouns, verbs, adjectives, adverbs, pronouns, conjunctions, prepositions, and articles. Nowadays linguists call them word classes. Since working in Tesol, I've learnt that words in English are better classified as falling somewhere along a continuum, with conjunctions, the most grammatical words, at one end, and proper nouns, the most lexical, at the other.

We'll take a quick look at these word classes in English grammar, then look at the similar concept in the Groovy Language. (Note: The English grammar is very simple, and based on what I remember from personal reading, not academic study, so I don't guarantee total correctness).

Word Classes in English

The most grammatical words in English are and, or, and not, the same operators as in propositional logic. and and or can be used to join together most words anywhere further along the continuum. Most obvious are the lexical words, e.g:
  the book, the pen, and the pad (nouns)
  black and blue (adjectives)
  slowly and carefully (adverbs)
  to stop, listen, and sing (verbs)
  to put up or shut up (phrasal verbs)

Also, multiword lexical forms, such as phrases and clauses, can be similarly joined:
  the house, dark blue and three storeys high, ... (adjectival phrases)
  The batter hit the ball and the fielder caught it. (clauses)

But more grammatical words at the same position on the continuum can be joined:
  your performance is over and above expectations (prepositions)
  I could and should go (auxiliary verbs)
  They were and are studying (different type of auxiliary verbs)
  this and that (pronouns)

Incidentally, the continuum can have more than one type of multiword form at the same position, such as adverbials and prepositional phrases:
  They walked, very silently and with great care, ...

and and or are 2 of only 7 conjunctions in English, memorized by the acronym FANBOYS: for, and, nor, but, or, yet, and so. But and and or are more grammatical than the other five conjunctions, and can be used to join the others together, e.g:
  It was difficult, yet and so I tried.

The propositional logic operators are the most grammatical words in English.

Next along the continuum are proforms, words that take the place of other more lexical words. In English, the most common type of proform is the pronoun, e.g. he, she, this, which also has determiner form, e.g. his, hers. For example:
  The dog chased the cat, but lost it. (pronoun: it)
  The dog escaped from the goat, but lost its collar. (determiner: its)

Other word classes and multiword forms have proforms. For example, pro-verb do/did:
  I enjoyed the film, and so did the ushers.
Gap for pro-verb:
  We found the south exit, and the other team, the north exit.
Pro-adjective such:
  We experienced a humid day, and also such a night.
Pro-adverb thus:
  Swiftly the Italians played; thus also did the Brazilians.
Proform for multiword adverbial so:
  The programmers finished totally on time; so did the testers.

Next are a large number of miscellaneous words between grammatical and lexical, which some call particles. Examples are interjections, articles (a/an/the), phrasal verb particles, conjunctive adverbs, sentence connectors, verb auxiliaries, not, only, infinitive's to, etc.

English, and I guess every natural language, is really a mess, and the particles are a way of categorizing the messy stuff.

Prepositions and Verbs
The first lexical word class along the continuum is the prepositions. In Hallidayan Functional Grammar, they're considered to be reduced verbs. Some examples: under, over, through, in. There are also multiword prepositional groups, e.g: up to, out of, with respect to, in lieu of.

Further along the continuum are the verbs, e.g. listen, write, walk. Verbs can be multiword, such as phrasal verbs, e.g. put up, shut up, prepositional phrasal verbs, e.g. get on with, put up with, and verb groups, e.g: will be speaking, has walked, to have gotten on with.

Adjectives and Nouns
Next along the continuum are adjectives, e.g. black, blacker, blackest. In Chomskian Transformational Grammar, adverbs ending in -ly are considered to be the same as adjectives, only modified at the surface level, e.g. slowly, slower, slowest.

Adjectives/adverbs can be multiword, e.g:
  The building is three storeys high. (adjectival phrase)
  That cat walks incredibly slowly. (adverb word group)

Next are common nouns, both count nouns, e.g. pen, pens, and mass nouns, e.g. coffee, hope. Nouns can be built into noun phrases, e.g. the long dark blue pen.

Just as verbs and prepositions are related, so are nouns and adjectives. Abstract ideas often only differ grammatically, e.g:
  Jack is very hopeful.
  Jack has much hope.
  Jack has many hopes.

At the lexical end of the grammar-lexis continuum are proper nouns. These can be phrases we construct from other words, e.g. the Speaker's Tavern, foreign words, e.g. pyjamas, fooyung, or even invented words, e.g. Kodak, Pepsi.

The largest word class in English are the nouns, then the adjectives, then verbs. When new words enter English, they're usually nouns. Some will become adjectives and maybe verbs, but very few ever move further along the continuum towards the grammar end. Although English has many Norman words from 800 or 900 years ago, very few are prepositions, and all the other more grammatical words came from Anglo-Saxon.

Perhaps all natural languages have a word class continuum with prepositional logic words at one end, and definable nouns at the other.

Word Classes in Groovy
Groovy uses both symbols and alphanumeric keywords for grammar, both lexed and parsed grammar. Groovy builds on Java, and hence C++ and C, for its tokens.

Bracketing and Separators
Perhaps the most grammatical along the continuum are the various bracketing symbols. Some have different tokens for opening and closing, e.g:
  /* */ ( ) [ ] { } < >
while others use the same token for both, e.g:
  """ ''' " ' /
There's no corresponding word class in English because English uses prosody (tone, stress, pause, etc) rather than words for the bracketing function.

Next along Groovy's continuum could be separators, e.g:
  , ; : ->
We can use , and ; for lists of elements, similar to and and or in English.

Groovy has a very limited repertoire of pronouns, only this and it.

Verbs and Prepositions
Perhaps operators are like English prepositions, e.g:
  == != > >= < <= <=>
  .. ..< ?: ? : . .@ ?. *. .& ++ -- + - * / % **
  & | ^ ! ~ << >> >>> && || =~ ==~

while some operators are almost like verbs, e.g:
  = += -= *= /= %= **= &= |= ^= <<= >>= >>>=

Some operators are represented by keywords in Groovy, viz. prepositions, an adjective, and a multiword noun-preposition, i.e:
  in as new instanceof

Verbs in indicative form are used in definitions, e.g:
throws extends implements
... .*

The most common verb form is the imperative, e.g:
  switch, do, try, catch, assert, return, throw
  break, continue, import, def, goto

though sometimes English adverbs are used as commands in Groovy, e.g:
  if, else, while, for, finally
Also used for this are nouns, e.g:
  case, default
and symbols, e.g:
  \ // #! $

Nouns and Adjectives
Groovy uses English adjectives for adjectival functions in Groovy, e.g:
  public, protected, private, abstract, final, static
  transient, volatile, strictfp, synchronized, native, const

Groovy has many built-in Groovy common nouns, e.g:
  class, interface, enum, package
  super, true, false, null
  25.49f, \u004F, 0x7E, 123e7

Some of them can also be used like adjectives, e.g:
  boolean, char, byte, short, int, long, float, double, void
are nouns (types) that can precede other nouns (variables), like Toy in A Toy Story.

We can define our own Groovy proper nouns using letters, digits, underscore, and dollar sign, e.g:
  MY_NAME, closure$17, αβγδε

Using @, we can also define our own Groovy adjectives.

Because Groovy is syntactically derived from Java, and hence from C++ and C, it, like English, is a little messy in its choice of tokens.

Notice also the different emphasis of word classes between English and Groovy, e.g:
  • Groovy uses tokens for bracketing while English uses non-token cues
  • English uses far more proforms than Groovy, which forces us to use temporary variables a lot
  • English uses Huffman coding by shortening common words like prepositions, while Groovy retains instanceof and implements

Conclusion: The Unicode Future
Unicode divides its tokens into different categories: letters (L), marks (M), separators (Z), symbols (S), numbers (N), punctuation (P), and other (C). Within each are various sub-categories. I'm looking at how best to use all Unicode characters (not just CJK ones) when extending a Java-like language such as Groovy with more tokens. The more tokens a language has, the terser it can be written while retaining clarity. Unicode's now a standard, so perhaps programmers will be more motivated to learn them than when APL was released. And modern IME's enable all tokens to be entered easily, for example, my ideas for the Latin-1 characters. Such a Unicode-based grammar must be backwards-compatible, Huffman coded, and easy to enter in the keyboard.

No comments: