Chapter 3
Lexical Structure
The organization of this chapter parallels the chapter on Lexical Structure in the Java Language Specification (second edition), which begins as follows:
This chapter specifies the lexical structure of the Java programming language.
Programs are written in Unicode (§3.1, JLS), but lexical translations are provided (§3.2, JLS) so that Unicode escapes (§3.3, JLS) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§3.4, JLS) to support the different conventions of existing host systems while maintaining consistent line numbers.
The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (§3.5, JLS), which are white space (§3.6, JLS), comments (§3.7, JLS), and tokens. The tokens are the identifiers (§3.8, JLS), keywords (§3.9, JLS), literals (§3.10, JLS), separators (§3.11, JLS), and operators (§3.12, JLS) of the syntactic grammar.
3.1 Unicode
(Cf. JLS §3.1.)
(Unchanged.)
3.2 Lexical Translations
(Cf. JLS §3.2.)
See the following sections for changes in the lexical translation phases.
Here are some examples that suggest some of the changes that pertain to newlines and string literals:
println 'Say "Hi", world.' //friendly suggestion
{print "Magic="; println "cafebabe"}
print "foo"+
"tball"; println " heroes"
println ("bar"
+(buff? "bells": "bies"))
String how /* sneaky
newline */ = "real"
println "value of \$how.length is '$how.length'."
println "It's been $how.";
3.3 Unicode Escapes
(Cf. JLS §3.3.)
TO DO: Some implementations do not support '\uXXXX' escapes, except in strings. Mandate them anyway?
3.4 Line Terminators
(Cf. JLS §3.4.)
In both Groovy and Java, a line terminator is either an ASCII CR or an ASCII LF character, or both (in that order).
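The definition above can be illustrated with a small sketch in Python. It is not part of the specification; the names LINE_TERMINATOR, split_lines, and count_line_terminators are hypothetical. The only point it makes is that a CR immediately followed by an LF counts as one terminator, not two.

```python
import re

# CR LF is listed first so that the pair is consumed as a single terminator.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def split_lines(source):
    """Split source text at every line terminator (CR, LF, or CR LF)."""
    return LINE_TERMINATOR.split(source)


def count_line_terminators(source):
    """Count line terminators, treating CR LF as one."""
    return len(LINE_TERMINATOR.findall(source))
```

Note that an LF followed by a CR (the reverse order) is matched as two separate terminators, which is consistent with the phrase "in that order".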
In Groovy, some line terminators are syntactically significant. As a result, the lexical grammar differs from that of Java.
Inside a few specific contexts (comments and brackets) line terminators are deemed to be insignificant and are classified as white space, as in Java.
Other line terminators are called _significant newlines_, and are reclassified as the token SignificantNewline.
This reclassification is part of the third lexical transformation phase. See §3.11.1.
3.5 Input Elements and Tokens
(Cf. JLS §3.5.)
There are two major differences from Java in Groovy tokenization.
Some line terminators are reclassified (during tokenization) as significant newlines (§3.11.1), which are tokens presented to the grammar.
Also, string constructor expressions allow string literals to be interrupted by Groovy expressions, which compute non-constant parts of a desired string.
3.6 White Space
(Cf. JLS §3.6.)
Retain WhiteSpace: LineTerminator, but recall that some line terminators are transformed into significant newlines.
3.7 Comments
(Cf. JLS §3.7.)
Comment syntaxes are the same as in Java.
Note that the newline terminating an "end of line" comment can be significant.
In addition, if the first non-space character of a Groovy program is the ASCII sharp sign ('#'), the whole line is treated as a comment. In other words, the program is treated exactly as if two slash characters ('//') were inserted before the sharp sign.
This unusual rule makes it easier to write Groovy scripts on some systems.
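The leading-sharp rule can be sketched as a source-to-source rewrite, following the "as if two slash characters were inserted" wording above. This is an illustrative Python sketch, not part of the specification. The function name comment_out_leading_sharp is hypothetical, and the sketch assumes that "space" means only the ASCII space character; whether tabs or line terminators may also precede the sharp sign is not stated here.

```python
def comment_out_leading_sharp(source):
    """Insert '//' before a '#' that is the first non-space character."""
    i = 0
    # Assumption: only ASCII spaces are skipped, not tabs or newlines.
    while i < len(source) and source[i] == " ":
        i += 1
    if i < len(source) and source[i] == "#":
        return source[:i] + "//" + source[i:]
    return source
```

For example, a script whose first line is an interpreter directive (the path below is only illustrative) would be handled as follows:

```python
script = "#!/usr/bin/env groovy\nprintln 'hi'"
rewritten = comment_out_leading_sharp(script)
# rewritten now starts with "//#!", so the first line lexes as an end-of-line comment.
```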
3.8 Identifiers
(Cf. JLS §3.8.)
Groovy identifiers differ from Java identifiers in that the ASCII dollar character '$' is not a legal identifier character. This restriction applies in practice only to the spelling of unqualified names, since Groovy provides a way to use any Unicode string whatever as a member name or command name.
(The dollar sign is sometimes used internally by Groovy to mangle non-Java identifiers which must be converted to Java names. For this reason, it would be confusing to allow unescaped dollar signs as Groovy identifier constituents.)
3.9 Keywords
(Cf. JLS §3.9.)
The following words are keywords in Groovy but not in Java:
The following words are keywords in Java and Groovy, but are currently illegal in Groovy:
3.10 Literals
(Cf. JLS §3.10.)
Groovy extends the set of literal syntaxes to support a wider variety of strings, including strings with non-constant parts.
It also supports literal constants of type BigInteger and BigDecimal.
3.10.1 Integer Literals
(Cf. JLS §3.10.1.)
The production IntegerTypeSuffix: g G is added, allowing BigInteger constants.
TO DO: 123i allowed? Other literal syntaxes?
Because numbers are objects in Groovy, numeric literals are not allowed to begin or end with a decimal point.
Java floating point literals such as .01, 1.f, and 1.e10 must be padded with zero digits, as in 0.01, 1.0f, and 1.0e10.
assert '123' == 123.toString
assert '1.23' == 1.23.toString
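One way to read the padding rule is that every decimal point in a numeric literal must have a digit on both sides, so that a dot following a number can be taken as a member access. The Python sketch below checks that reading against the examples given above. It is an interpretation, not a normative definition, and the name has_padded_decimal_point is hypothetical.

```python
import re

# A "bare" point is one with no digit immediately before it or immediately after it.
_BARE_POINT = re.compile(r"(?<![0-9])\.|\.(?![0-9])")


def has_padded_decimal_point(literal):
    """True if every '.' in the literal has a digit on both sides."""
    return _BARE_POINT.search(literal) is None
```

Under this reading, .01, 1.f, and 1.e10 are rejected, while 0.01, 1.0f, and 1.0e10 are accepted.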
3.10.2 Floating-Point Literals
(Cf. JLS §3.10.2.)
The production FloatTypeSuffix: g G is added, allowing BigDecimal constants.
3.10.3 Boolean Literals
(Cf. JLS §3.10.3.)
(No change.)
3.10.4 Character Literals
(Cf. JLS §3.10.4.)
Groovy has no CharacterLiteral token. All literals with character data in them denote strings. Constant strings of unit length serve in the place of character literals, since they coerce properly to character constants.
3.10.5 String Literals
(Cf. JLS §3.10.5.)
Groovy string literals have a syntax inspired by other scripting languages. A string literal may be delimited by either single or double quotes. String literals with double quotes may incorporate substring substitution expressions, while singly-quoted string literals are always constants.
If a double-quoted string contains an unescaped dollar sign, it is more properly called a string constructor, since it evaluates to a non-constant string, whose contents depend on expressions following the dollar signs.
Indeed, a string constructor is really an expression composed of a number of tokens, only some of which are string literals.
Independently, the quote marks may be tripled, allowing the string to span multiple lines. If the quote marks are used singly, the string may not contain a line terminator.
Regardless of the spelling of a line terminator found inside a string literal or constructor, it is always equivalent to an escaped newline '\n'.
(Note: The grammar of single-quoted and triple-quoted strings differs only in their processing of line terminators.
We express this by means of a grammatical parameter LT, which is true if line terminators are allowed.)
StringLiteral:
    ' (CStringCharacter[LT=false])* '
    " (DStringCharacter[LT=false])* "
    ''' (CStringCharacter[LT=true])* '''
    """ (DStringCharacter[LT=true])* """

CStringCharacter[LT]:
    InputCharacter but not a closing quote or \
    EscapeSequence[LT]
    LineTerminator when(LT)

DStringCharacter[LT]:
    InputCharacter but not a closing quote or \ or $
    EscapeSequence[LT]
    LineTerminator when(LT)

EscapeSequence[LT]: (add the following new productions)
    \ $
    \ LineTerminator when(LT)
Note that a triple-quoted string may contain isolated single or double occurrences of its close-quote.
For example:
assert """x""" == 'x'
assert """""" == ''
assert '''''"''' == "''" + '"'
3.10.5.1 String Constructors
If a double-quoted string contains an unescaped dollar character ('$'), then it is not a string literal, but rather a _string constructor_, which is a sequence of tokens that comprise an expression for a string containing constant and non-constant parts.
Each unescaped dollar character must be followed by a _value part_, which is a Groovy expression or statement.
The value part is parsed and evaluated as if it had occurred in expression parentheses.
Apart from such dollars and value parts, every other character between the opening and closing string quotes is treated literally, as in the case of normal string literals.
Thus, a string constructor expression consists of alternating literal parts and value parts.
During the tokenization phase of translation, the end of a literal part is determined by the occurrence of an unescaped dollar character, or (at the end) by the appropriate closing quote.
The end of a value part is determined either by eagerly parsing a series of dot-separated identifiers, or by parsing a compound block, with its balanced curly braces.
String x = "X"
assert x == "$x"
assert x == "${x}"
assert "XoX" == "${x}o$x"
assert '$'+'x' == "\$x"
assert "1" == "$x.length"
assert "1." == "$x.length."
assert "1()" == "$x.length()"
assert "1+2" == "$x.length+2"
assert "x . length" == "$x . length"
assert "zXY;" == """z${
String y = "Y"
("$x$y") };"""
The process of parsing an embedded block may be viewed as an approximate parse which attends only to the balancing of curly brace tokens.
(Some implementations may be able to perform string constructor tokenization and parsing in one coroutined pass, but the specification does not require this.)
The tokenization of a string constructor must follow the format of a GStringLexicalForm.
This is not a production of the Groovy expression grammar, but rather a syntax which must govern the separation of literal parts from expressions.
(Note: The term GString refers to a class of string-like object created by a string constructor expression.)
GStringLexicalForm[LT]:
    GStringStart[LT] GStringValuePart[LT]
    (GStringMiddle[LT] GStringValuePart[LT])*
    GStringEnd[LT]

GStringStart[LT]:
    " (DStringCharacter[LT])* when(!LT)
    """ (DStringCharacter[LT])* when(LT)

GStringMiddle[LT]:
    (DStringCharacter[LT])*

GStringEnd[LT]:
    (DStringCharacter[LT])* " when(!LT)
    (DStringCharacter[LT])* """ when(LT)

GStringValuePart[LT]:
    $ RawIdentifier ( . RawIdentifier )*
    $ { (GStringToken[LT])* }

GStringLiteralPart[LT]:
    GStringStart[LT]
    GStringMiddle[LT]
    GStringEnd[LT]

GStringToken[LT]:
    InputElement but not { } LineTerminator
    { GStringToken[LT]* }
    GStringLexicalForm[LT=false]
    GStringLexicalForm[LT=true] when(LT)
    LineTerminator when(LT)
In this way, string constructors are recognized lexically as a complex of GStringLiteralParts and other tokens, according to the grammar of GStringLexicalForm, which does not recognize statement or expression structure, except to balance brackets.
In the resulting mix of tokens, the various literal parts are left as-is, for later parsing as a StringConstructorExpression by the grammar.
Each RawIdentifier is reinterpreted as a normal Identifier, or as a keyword, if its spelling is recognized as such.
At that point, the dollar signs are insignificant, but are left in the grammar for clarity.
Token:
    GStringLiteralPart[LT=false]
    GStringLiteralPart[LT=true]

StringConstructorExpression[LT]:
    GStringStart[LT] $ Statement
    ( GStringMiddle[LT] $ Statement )*
    GStringEnd[LT]
Identifiers and dots after a dollar sign are parsed into GStringTokens eagerly, even though their characters could also be validly parsed as string characters inside a following GStringValuePart.
Note that a dot is taken to be part of an embedded expression only if it is followed by a letter or underscore.
Also, a whitespace character always interrupts the eager parsing of a name.
When eagerly parsing an identifier after a dollar, keyword recognition is not performed.
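The separation of literal parts from value parts described above can be sketched as follows. This is an illustrative Python sketch, not a conforming tokenizer: it takes the text between the quotes of a double-quoted string, and it balances curly braces by counting characters, without tokenizing the embedded block (so braces inside nested strings or comments would confuse it). The names split_gstring, scan_identifier, and the "lit" and "val" tags are hypothetical, and Python's notion of a letter is used as a stand-in for the identifier character classes.

```python
def is_ident_start(ch):
    return ch.isalpha() or ch == "_"


def is_ident_part(ch):
    return ch.isalnum() or ch == "_"


def scan_identifier(body, i):
    """Given body[i] starts an identifier, return the index just past it."""
    j = i + 1
    while j < len(body) and is_ident_part(body[j]):
        j += 1
    return j


def split_gstring(body):
    """Split a double-quoted string body into alternating literal and value parts."""
    parts, literal = [], []
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if ch == "\\" and i + 1 < n:
            # Escapes, including \$, stay in the literal part unprocessed.
            literal.append(body[i:i + 2])
            i += 2
        elif ch == "$":
            parts.append(("lit", "".join(literal)))
            literal = []
            i += 1
            if i < n and body[i] == "{":
                # Compound block: attend only to the balancing of curly braces.
                depth, start = 1, i + 1
                i += 1
                while i < n and depth > 0:
                    if body[i] == "{":
                        depth += 1
                    elif body[i] == "}":
                        depth -= 1
                    i += 1
                if depth != 0:
                    raise ValueError("unbalanced braces in value part")
                parts.append(("val", body[start:i - 1]))
            elif i < n and is_ident_start(body[i]):
                # Eager parse: a dot joins the name only if an identifier follows it.
                start = i
                i = scan_identifier(body, i)
                while i + 1 < n and body[i] == "." and is_ident_start(body[i + 1]):
                    i = scan_identifier(body, i + 1)
                parts.append(("val", body[start:i]))
            else:
                raise ValueError("dollar sign must be followed by a value part")
        else:
            literal.append(ch)
            i += 1
    parts.append(("lit", "".join(literal)))
    return parts
```

Applied to the body of "$x.length()", the sketch yields an empty literal part, the value part x.length, and the literal part "()", which agrees with the assertion "1()" == "$x.length()" given earlier. No keyword recognition is attempted, in keeping with the rule above.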
The following program fragment may not be accepted as a valid Groovy program, because the token 'int' may not follow a dot '.':
String x = "", y = "$x.int"
A string constructor expression which is introduced with a single double-quote character must occur all on one line.
It is an error for the value parts of such an expression to contain line terminators of any sort.
The following program fragment may not be accepted as a valid Groovy program, unless the double quotes are tripled:
println "Hello, ${
'world'}."
Reference:
http://archive.groovy.codehaus.org/jsr/threads/iakbeiefedohmiddhked
3.10.6 Escape Sequences for Character and String Literals
(Cf. JLS §3.10.6.)
(Same as in Java.)
In all strings, the escape sequence '\$' is legal, and stands for the ASCII dollar character.
(Such an escaped dollar does not introduce a string constructor value part.)
3.10.7 The Null Literal
(Cf. JLS §3.10.7.)
(Unchanged.)
3.11 Separators
(Cf. JLS §3.11.)
Add Separator: SignificantNewline to the Java grammar.
3.11.1 Significant Newlines
As tokenization proceeds from left to right, some line terminators are reclassified as significant newlines, which then serve as alternatives to semicolons in the grammar of statements.
As in Java, a LineTerminator token can occur inside a TraditionalComment, and is always insignificant.
A line terminator inside a string literal is always lexically significant. A line terminator which occurs within an expression embedded in a string constructor may also be significant.
Otherwise, a LineTerminator is deemed to be insignificant if it is enclosed in parentheses or square bracket tokens, but not (in a certain precise sense) more closely in curly braces.
Specifically, the left context of a LineTerminator, viewed in isolation, is converted to tokens.
All tokens other than unmatched separators are removed, leaving (if the program is well-formed) a sequence of left parentheses, left square brackets, and left curly braces.
The LineTerminator is regarded as significant if and only if the resulting sequence of separators is either empty, or ends in a curly brace.
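The bracket rule just stated can be sketched with a stack. The Python sketch below is illustrative only: it assumes the left context has already been converted to a list of token spellings (so brackets inside strings and comments never appear), and it covers only the bracket-context rule, not the later grammatical discarding of newlines. The names unmatched_openers and newline_is_significant are hypothetical.

```python
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {")", "]", "}"}


def unmatched_openers(left_context_tokens):
    """Remove every token except left separators that are still unmatched."""
    stack = []
    for tok in left_context_tokens:
        if tok in OPENERS:
            stack.append(tok)
        elif tok in CLOSERS:
            if not stack or OPENERS[stack[-1]] != tok:
                raise ValueError("ill-formed bracket nesting")
            stack.pop()
    return stack


def newline_is_significant(left_context_tokens):
    """Significant iff the unmatched sequence is empty or ends in a curly brace."""
    stack = unmatched_openers(left_context_tokens)
    return not stack or stack[-1] == "{"
```

For instance, the left context println ( x leaves a single unmatched parenthesis, so a following line terminator is insignificant, while println ( { x leaves a sequence ending in a curly brace, so the line terminator is significant.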
Significant newlines are specifically allowed to occur inside non-parenthesized expressions and in end-of-line comments.
(As in Java, no newline of any sort may occur within plain string quotes. The meaning of newlines in strings is described in §3.10.5.)
A significant newline is lexically significant, in that the grammar must somehow account for it in the token stream. However, in many places in the grammar, significant newlines are accepted but discarded.
Many tokens can be regarded as "leaning rightward" (i.e., they are linguistically proclitic) because they clearly require additional tokens to complete a statement.
For example, a colon or plus sign never ends a statement, but always requires further tokens.
This notion is formalized in the Groovy grammar, where significant newlines are allowed and discarded after such rightward-leaning tokens.
Such tokens include prefix and infix expression operators; comma, colon, and dot; and certain keywords (such as "throws") which contribute to declaration syntax.
In a few cases, tokens lean rightward only in some contexts. For example, the ++ operator leans rightward only if it is a prefix operator.
These rules provide for easy continuation of long statements or expressions onto multiple lines, without a need to explicitly escape the intermediate line terminators.
Generally, expression statements will terminate at newlines if the expression is complete, even if further tokens on the next line would add to the expression.
If a programmer breaks a long expression by inserting a newline after an operator, the Groovy and Java languages will agree on the statement continuation.
Following Java formatting habits, a programmer may also insert a line-breaking newline before an operator. In such cases, the Groovy parser will detect the error, since the following line fragment, even if it happens to parse as an expression, will amount to an illegal expression statement (§14.8).
If there is doubt about a specific expression, enclose it in parentheses to make clear the intended grouping, and disable significant newlines.
3.11.2 Grammatical Significance of Newlines
For simplicity in subsequent chapters, the following two grammar rules define all occurrences of significant newlines.
NLS:
    (SignificantNewline)*

SEP:
    ';' NLS
    SignificantNewline (';')? NLS
Wherever the grammar allows a semicolon separator token, the grammar also accepts any number of significant newlines instead of or in addition to the semicolon. This is always indicated in subsequent chapters by an occurrence of the SEP nonterminal.
Syntactically insignificant (and optional) newlines are indicated by occurrences of the NLS nonterminal. There are no other uses of the SignificantNewline element.
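The NLS and SEP productions can be transcribed directly into two small matching functions. The Python sketch below is illustrative only and assumes a token list in which the string "NL" stands in for a SignificantNewline token; the names NL, match_nls, and match_sep are hypothetical.

```python
NL = "NL"  # stand-in for a SignificantNewline token


def match_nls(tokens, i):
    """NLS: zero or more significant newlines. Always succeeds."""
    while i < len(tokens) and tokens[i] == NL:
        i += 1
    return i


def match_sep(tokens, i):
    """SEP: ';' NLS, or SignificantNewline (';')? NLS.

    Returns the index just past the separator, or None if no SEP starts at i.
    """
    if i < len(tokens) and tokens[i] == ";":
        return match_nls(tokens, i + 1)
    if i < len(tokens) and tokens[i] == NL:
        i += 1
        if i < len(tokens) and tokens[i] == ";":
            i += 1
        return match_nls(tokens, i)
    return None
```

A statement-list parser built on this sketch would call match_sep between statements and match_nls wherever the grammar marks newlines as optional.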
Unlike Java, but like Pascal, the semicolon functions as a statement separator, not a statement terminator. A statement just before an enclosing right bracket is terminated with or without a final separator token. As in certain scripting languages (such as sh and awk), semicolons and significant newlines are interchangeable as statement separators.
println x //SIG
println x; //insig
println x /*… insig
...*/ + y
println x + //insig
y
println (x //insig
+ y)
println ({ x //SIG
y })
3.12 Operators
(Cf. JLS §3.12.)
The following productions are added:
The original of this specification is at http://docs.codehaus.org/display/GroovyJSR.