123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417 |
- Lexical Analysis
- ================
- Encoding
- --------
- All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other
- encodings are not supported. Any of the standard platform line termination
- sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
- form using the ASCII sequence CR LF (return followed by linefeed), or the old
- Macintosh form using the ASCII CR (return) character. All of these forms can be
- used equally, regardless of platform.
- Indentation
- -----------
- Nim's standard grammar describes an `indentation sensitive`:idx: language.
- This means that all the control structures are recognized by indentation.
- Indentation consists only of spaces; tabulators are not allowed.
- The indentation handling is implemented as follows: The lexer annotates the
- following token with the preceding number of spaces; indentation is not
- a separate token. This trick allows parsing of Nim with only 1 token of
- lookahead.
- The parser uses a stack of indentation levels: the stack consists of integers
- counting the spaces. The indentation information is queried at strategic
- places in the parser but ignored otherwise: The pseudo terminal ``IND{>}``
- denotes an indentation that consists of more spaces than the entry at the top
- of the stack; ``IND{=}`` an indentation that has the same number of spaces. ``DED``
- is another pseudo terminal that describes the *action* of popping a value
- from the stack, ``IND{>}`` then implies to push onto the stack.
- With this notation we can now easily define the core of the grammar: A block of
- statements (simplified example)::
- ifStmt = 'if' expr ':' stmt
- (IND{=} 'elif' expr ':' stmt)*
- (IND{=} 'else' ':' stmt)?
- simpleStmt = ifStmt / ...
- stmt = IND{>} stmt ^+ IND{=} DED # list of statements
- / simpleStmt # or a simple statement
- Comments
- --------
- Comments start anywhere outside a string or character literal with the
- hash character ``#``.
- Comments consist of a concatenation of `comment pieces`:idx:. A comment piece
- starts with ``#`` and runs until the end of the line. The end of line characters
- belong to the piece. If the next line only consists of a comment piece with
- no other tokens between it and the preceding one, it does not start a new
- comment:
- .. code-block:: nim
- i = 0 # This is a single comment over multiple lines.
- # The scanner merges these two pieces.
- # The comment continues here.
- `Documentation comments`:idx: are comments that start with two ``##``.
- Documentation comments are tokens; they are only allowed at certain places in
- the input file as they belong to the syntax tree!
- Multiline comments
- ------------------
- Starting with version 0.13.0 of the language Nim supports multiline comments.
- They look like:
- .. code-block:: nim
- #[Comment here.
- Multiple lines
- are not a problem.]#
- Multiline comments support nesting:
- .. code-block:: nim
- #[ #[ Multiline comment in already
- commented out code. ]#
- proc p[T](x: T) = discard
- ]#
- Multiline documentation comments also exist and support nesting too:
- .. code-block:: nim
- proc foo =
- ##[Long documentation comment
- here.
- ]##
- Identifiers & Keywords
- ----------------------
- Identifiers in Nim can be any string of letters, digits
- and underscores, beginning with a letter. Two immediate following
- underscores ``__`` are not allowed::
- letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff'
- digit ::= '0'..'9'
- IDENTIFIER ::= letter ( ['_'] (letter | digit) )*
- Currently any Unicode character with an ordinal value > 127 (non ASCII) is
- classified as a ``letter`` and may thus be part of an identifier but later
- versions of the language may assign some Unicode characters to belong to the
- operator characters instead.
- The following keywords are reserved and cannot be used as identifiers:
- .. code-block:: nim
- :file: ../keywords.txt
- Some keywords are unused; they are reserved for future developments of the
- language.
- Identifier equality
- -------------------
- Two identifiers are considered equal if the following algorithm returns true:
- .. code-block:: nim
- proc sameIdentifier(a, b: string): bool =
- a[0] == b[0] and
- a.replace("_", "").toLowerAscii == b.replace("_", "").toLowerAscii
- That means only the first letters are compared in a case sensitive manner. Other
- letters are compared case insensitively within the ASCII range and underscores are ignored.
- This rather unorthodox way to do identifier comparisons is called
- `partial case insensitivity`:idx: and has some advantages over the conventional
- case sensitivity:
- It allows programmers to mostly use their own preferred
- spelling style, be it humpStyle or snake_style, and libraries written
- by different programmers cannot use incompatible conventions.
- A Nim-aware editor or IDE can show the identifiers as preferred.
- Another advantage is that it frees the programmer from remembering
- the exact spelling of an identifier. The exception with respect to the first
- letter allows common code like ``var foo: Foo`` to be parsed unambiguously.
- Historically, Nim was a fully `style-insensitive`:idx: language. This meant that
- it was not case-sensitive and underscores were ignored and there was no even a
- distinction between ``foo`` and ``Foo``.
- String literals
- ---------------
- Terminal symbol in the grammar: ``STR_LIT``.
- String literals can be delimited by matching double quotes, and can
- contain the following `escape sequences`:idx:\ :
- ================== ===================================================
- Escape sequence Meaning
- ================== ===================================================
- ``\n`` `newline`:idx:
- ``\r``, ``\c`` `carriage return`:idx:
- ``\l`` `line feed`:idx:
- ``\f`` `form feed`:idx:
- ``\t`` `tabulator`:idx:
- ``\v`` `vertical tabulator`:idx:
- ``\\`` `backslash`:idx:
- ``\"`` `quotation mark`:idx:
- ``\'`` `apostrophe`:idx:
- ``\`` '0'..'9'+ `character with decimal value d`:idx:;
- all decimal digits directly
- following are used for the character
- ``\a`` `alert`:idx:
- ``\b`` `backspace`:idx:
- ``\e`` `escape`:idx: `[ESC]`:idx:
- ``\x`` HH `character with hex value HH`:idx:;
- exactly two hex digits are allowed
- ================== ===================================================
- Strings in Nim may contain any 8-bit value, even embedded zeros. However
- some operations may interpret the first binary zero as a terminator.
- Triple quoted string literals
- -----------------------------
- Terminal symbol in the grammar: ``TRIPLESTR_LIT``.
- String literals can also be delimited by three double quotes
- ``"""`` ... ``"""``.
- Literals in this form may run for several lines, may contain ``"`` and do not
- interpret any escape sequences.
- For convenience, when the opening ``"""`` is followed by a newline (there may
- be whitespace between the opening ``"""`` and the newline),
- the newline (and the preceding whitespace) is not included in the string. The
- ending of the string literal is defined by the pattern ``"""[^"]``, so this:
- .. code-block:: nim
- """"long string within quotes""""
- Produces::
- "long string within quotes"
- Raw string literals
- -------------------
- Terminal symbol in the grammar: ``RSTR_LIT``.
- There are also raw string literals that are preceded with the
- letter ``r`` (or ``R``) and are delimited by matching double quotes (just
- like ordinary string literals) and do not interpret the escape sequences.
- This is especially convenient for regular expressions or Windows paths:
- .. code-block:: nim
- var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab
- To produce a single ``"`` within a raw string literal, it has to be doubled:
- .. code-block:: nim
- r"a""b"
- Produces::
- a"b
- ``r""""`` is not possible with this notation, because the three leading
- quotes introduce a triple quoted string literal. ``r"""`` is the same
- as ``"""`` since triple quoted string literals do not interpret escape
- sequences either.
- Generalized raw string literals
- -------------------------------
- Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``,
- ``GENERALIZED_TRIPLESTR_LIT``.
- The construct ``identifier"string literal"`` (without whitespace between the
- identifier and the opening quotation mark) is a
- generalized raw string literal. It is a shortcut for the construct
- ``identifier(r"string literal")``, so it denotes a procedure call with a
- raw string literal as its only argument. Generalized raw string literals
- are especially convenient for embedding mini languages directly into Nim
- (for example regular expressions).
- The construct ``identifier"""string literal"""`` exists too. It is a shortcut
- for ``identifier("""string literal""")``.
- Character literals
- ------------------
- Character literals are enclosed in single quotes ``''`` and can contain the
- same escape sequences as strings - with one exception: `newline`:idx: (``\n``)
- is not allowed as it may be wider than one character (often it is the pair
- CR/LF for example). Here are the valid `escape sequences`:idx: for character
- literals:
- ================== ===================================================
- Escape sequence Meaning
- ================== ===================================================
- ``\r``, ``\c`` `carriage return`:idx:
- ``\l`` `line feed`:idx:
- ``\f`` `form feed`:idx:
- ``\t`` `tabulator`:idx:
- ``\v`` `vertical tabulator`:idx:
- ``\\`` `backslash`:idx:
- ``\"`` `quotation mark`:idx:
- ``\'`` `apostrophe`:idx:
- ``\`` '0'..'9'+ `character with decimal value d`:idx:;
- all decimal digits directly
- following are used for the character
- ``\a`` `alert`:idx:
- ``\b`` `backspace`:idx:
- ``\e`` `escape`:idx: `[ESC]`:idx:
- ``\x`` HH `character with hex value HH`:idx:;
- exactly two hex digits are allowed
- ================== ===================================================
- A character is not an Unicode character but a single byte. The reason for this
- is efficiency: for the overwhelming majority of use-cases, the resulting
- programs will still handle UTF-8 properly as UTF-8 was specially designed for
- this. Another reason is that Nim can thus support ``array[char, int]`` or
- ``set[char]`` efficiently as many algorithms rely on this feature. The `Rune`
- type is used for Unicode characters, it can represent any Unicode character.
- ``Rune`` is declared in the `unicode module <unicode.html>`_.
- Numerical constants
- -------------------
- Numerical constants are of a single type and have the form::
- hexdigit = digit | 'A'..'F' | 'a'..'f'
- octdigit = '0'..'7'
- bindigit = '0'..'1'
- HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )*
- DEC_LIT = digit ( ['_'] digit )*
- OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )*
- BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )*
- INT_LIT = HEX_LIT
- | DEC_LIT
- | OCT_LIT
- | BIN_LIT
- INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8'
- INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16'
- INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32'
- INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64'
- UINT_LIT = INT_LIT ['\''] ('u' | 'U')
- UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8'
- UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16'
- UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32'
- UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64'
- exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )*
- FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent)
- FLOAT32_SUFFIX = ('f' | 'F') ['32']
- FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX
- | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX
- FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D'
- FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX
- | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX
- As can be seen in the productions, numerical constants can contain underscores
- for readability. Integer and floating point literals may be given in decimal (no
- prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal
- (prefix ``0x``) notation.
- There exists a literal for each numerical type that is
- defined. The suffix starting with an apostrophe ('\'') is called a
- `type suffix`:idx:. Literals without a type suffix are of the type ``int``,
- unless the literal contains a dot or ``E|e`` in which case it is of
- type ``float``. For notational convenience the apostrophe of a type suffix
- is optional if it is not ambiguous (only hexadecimal floating point literals
- with a type suffix can be ambiguous).
- The type suffixes are:
- ================= =========================
- Type Suffix Resulting type of literal
- ================= =========================
- ``'i8`` int8
- ``'i16`` int16
- ``'i32`` int32
- ``'i64`` int64
- ``'u`` uint
- ``'u8`` uint8
- ``'u16`` uint16
- ``'u32`` uint32
- ``'u64`` uint64
- ``'f`` float32
- ``'d`` float64
- ``'f32`` float32
- ``'f64`` float64
- ``'f128`` float128
- ================= =========================
- Floating point literals may also be in binary, octal or hexadecimal
- notation:
- ``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64``
- is approximately 1.72826e35 according to the IEEE floating point standard.
- Literals are bounds checked so that they fit the datatype. Non base-10
- literals are used mainly for flags and bit pattern representations, therefore
- bounds checking is done on bit width, not value range. If the literal fits in
- the bit width of the datatype, it is accepted.
- Hence: 0b10000000'u8 == 0x80'u8 == 128, but, 0b10000000'i8 == 0x80'i8 == -1
- instead of causing an overflow error.
- Operators
- ---------
- Nim allows user defined operators. An operator is any combination of the
- following characters::
- = + - * / < >
- @ $ ~ & % |
- ! ? ^ . : \
- These keywords are also operators:
- ``and or not xor shl shr div mod in notin is isnot of``.
- `=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they
- are used for other notational purposes.
- ``*:`` is as a special case treated as the two tokens `*`:tok: and `:`:tok:
- (to support ``var v*: T``).
- Other tokens
- ------------
- The following strings denote other tokens::
- ` ( ) { } [ ] , ; [. .] {. .} (. .)
- The `slice`:idx: operator `..`:tok: takes precedence over other tokens that
- contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok:
- and not the two tokens `{.`:tok:, `.}`:tok:.
|