Lexical Analysis
================

Encoding
--------

All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other
encodings are not supported. Any of the standard platform line termination
sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
form using the ASCII sequence CR LF (return followed by linefeed), or the old
Macintosh form using the ASCII CR (return) character. All of these forms can be
used equally, regardless of platform.

Indentation
-----------

Nim's standard grammar describes an `indentation sensitive`:idx: language.
This means that all the control structures are recognized by indentation.
Indentation consists only of spaces; tabulators are not allowed.

The indentation handling is implemented as follows: The lexer annotates the
following token with the preceding number of spaces; indentation is not
a separate token. This trick allows parsing of Nim with only 1 token of
lookahead.

The parser uses a stack of indentation levels: the stack consists of integers
counting the spaces. The indentation information is queried at strategic
places in the parser but ignored otherwise: The pseudo terminal ``IND{>}``
denotes an indentation that consists of more spaces than the entry at the top
of the stack; ``IND{=}`` an indentation that has the same number of spaces. ``DED``
is another pseudo terminal that describes the *action* of popping a value
from the stack, ``IND{>}`` then implies to push onto the stack.

With this notation we can now easily define the core of the grammar: A block of
statements (simplified example)::

  ifStmt = 'if' expr ':' stmt
           (IND{=} 'elif' expr ':' stmt)*
           (IND{=} 'else' ':' stmt)?

  simpleStmt = ifStmt / ...

  stmt = IND{>} stmt ^+ IND{=} DED  # list of statements
       / simpleStmt                 # or a simple statement

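The stack discipline described above can be sketched in a few lines. This is
only an illustration under simplifying assumptions, not the compiler's
implementation: the names ``IndKind`` and ``indentEvents`` are hypothetical,
every line is assumed to start a statement, and a dedent to a width that was
never pushed is not reported as an error.

.. code-block:: nim

  type
    IndKind = enum
      indGreater  # IND{>}: deeper than the top of the stack, push
      indEqual    # IND{=}: same depth as the top of the stack
      ded         # DED: pop one level off the stack

  proc indentEvents(widths: openArray[int]): seq[IndKind] =
    ## Maps the leading-space count of each line to pseudo terminals.
    result = @[]
    var stack = @[0]              # indentation levels that are still open
    for w in widths:
      if w > stack[^1]:
        stack.add w
        result.add indGreater
      else:
        while w < stack[^1]:      # close every level deeper than w
          discard stack.pop()
          result.add ded
        result.add indEqual

  doAssert indentEvents([0, 2, 2, 0]) ==
    @[indEqual, indGreater, indEqual, ded, indEqual]

The real parser does not produce such a list; it queries the annotated
indentation only where the grammar asks for it.
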
Comments
--------

Comments start anywhere outside a string or character literal with the
hash character ``#``.
Comments consist of a concatenation of `comment pieces`:idx:. A comment piece
starts with ``#`` and runs until the end of the line. The end of line characters
belong to the piece. If the next line only consists of a comment piece with
no other tokens between it and the preceding one, it does not start a new
comment:

.. code-block:: nim

  i = 0     # This is a single comment over multiple lines.
    # The scanner merges these two pieces.
    # The comment continues here.

`Documentation comments`:idx: are comments that start with two hash
characters, ``##``. Documentation comments are tokens; they are only allowed
at certain places in the input file as they belong to the syntax tree!

Multiline comments
------------------

Starting with version 0.13.0 of the language, Nim supports multiline comments.
They look like:

.. code-block:: nim

  #[Comment here.
  Multiple lines
  are not a problem.]#

Multiline comments support nesting:

.. code-block:: nim

  #[  #[ Multiline comment in already
     commented out code. ]#
  proc   p[T](x: T) = discard
  ]#

Multiline documentation comments also exist and support nesting too:

.. code-block:: nim

  proc foo =
    ##[Long documentation comment
    here.
    ]##

Identifiers & Keywords
----------------------

Identifiers in Nim can be any string of letters, digits
and underscores, beginning with a letter. Two immediately following
underscores ``__`` are not allowed::

  letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff'
  digit ::= '0'..'9'
  IDENTIFIER ::= letter ( ['_'] (letter | digit) )*

Currently any Unicode character with an ordinal value > 127 (non ASCII) is
classified as a ``letter`` and may thus be part of an identifier but later
versions of the language may assign some Unicode characters to belong to the
operator characters instead.

The following keywords are reserved and cannot be used as identifiers:

.. code-block:: nim
   :file: ../keywords.txt

Some keywords are unused; they are reserved for future developments of the
language.

Identifier equality
-------------------

Two identifiers are considered equal if the following algorithm returns true:

.. code-block:: nim

  import strutils  # provides replace and toLowerAscii

  proc sameIdentifier(a, b: string): bool =
    a[0] == b[0] and
      a.replace("_", "").toLowerAscii == b.replace("_", "").toLowerAscii

That means only the first letters are compared in a case sensitive manner. Other
letters are compared case insensitively within the ASCII range and underscores are ignored.

This rather unorthodox way to do identifier comparisons is called
`partial case insensitivity`:idx: and has some advantages over the conventional
case sensitivity:

It allows programmers to mostly use their own preferred
spelling style, be it humpStyle or snake_style, and libraries written
by different programmers cannot use incompatible conventions.
A Nim-aware editor or IDE can show the identifiers as preferred.
Another advantage is that it frees the programmer from remembering
the exact spelling of an identifier. The exception with respect to the first
letter allows common code like ``var foo: Foo`` to be parsed unambiguously.

Historically, Nim was a fully `style-insensitive`:idx: language. This meant that
it was not case-sensitive, underscores were ignored, and there was not even a
distinction between ``foo`` and ``Foo``.

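The effect of the rule can be seen directly in ordinary code. The following is
a small sketch; the names ``myCounter``, ``Point`` and ``point`` are made up
for illustration, and a compiler configured with strict style checking may
report the mixed spellings.

.. code-block:: nim

  var myCounter = 1
  my_counter = 2            # same identifier: underscores are ignored
  doAssert mycounter == 2   # same identifier: case after the first letter is ignored

  type Point = object
    x: int

  var point = Point(x: 3)   # ``point`` and ``Point`` differ in their first letter
  doAssert point.x == 3
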
String literals
---------------

Terminal symbol in the grammar: ``STR_LIT``.

String literals can be delimited by matching double quotes, and can
contain the following `escape sequences`:idx:\ :

==================  ===================================================
Escape sequence     Meaning
==================  ===================================================
``\n``              `newline`:idx:
``\r``, ``\c``      `carriage return`:idx:
``\l``              `line feed`:idx:
``\f``              `form feed`:idx:
``\t``              `tabulator`:idx:
``\v``              `vertical tabulator`:idx:
``\\``              `backslash`:idx:
``\"``              `quotation mark`:idx:
``\'``              `apostrophe`:idx:
``\`` '0'..'9'+     `character with decimal value d`:idx:;
                    all decimal digits directly
                    following are used for the character
``\a``              `alert`:idx:
``\b``              `backspace`:idx:
``\e``              `escape`:idx: `[ESC]`:idx:
``\x`` HH           `character with hex value HH`:idx:;
                    exactly two hex digits are allowed
==================  ===================================================

Strings in Nim may contain any 8-bit value, even embedded zeros. However
some operations may interpret the first binary zero as a terminator.

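A few of the escape sequences from the table, written as assertions. This is
an illustrative sketch only; the particular strings are arbitrary.

.. code-block:: nim

  doAssert "a\tb".len == 3        # \t is a single tabulator character
  doAssert "\x41\66" == "AB"      # hex value 41 and decimal value 66
  doAssert "\e" == "\x1B"         # escape [ESC]
  doAssert "\c\l" == "\x0D\x0A"   # carriage return followed by line feed
  doAssert "a\0b".len == 3        # an embedded zero is part of the string
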
Triple quoted string literals
-----------------------------

Terminal symbol in the grammar: ``TRIPLESTR_LIT``.

String literals can also be delimited by three double quotes
``"""`` ... ``"""``.
Literals in this form may run for several lines, may contain ``"`` and do not
interpret any escape sequences.
For convenience, when the opening ``"""`` is followed by a newline (there may
be whitespace between the opening ``"""`` and the newline),
the newline (and the preceding whitespace) is not included in the string. The
ending of the string literal is defined by the pattern ``"""[^"]``, so this:

.. code-block:: nim

  """"long string within quotes""""

Produces::

  "long string within quotes"

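The rules above can be checked with a short sketch; the variable name ``text``
and the sample strings are arbitrary.

.. code-block:: nim

  # The fourth quote on each side belongs to the string content.
  doAssert """"long string within quotes"""" == "\"long string within quotes\""

  # Escape sequences are not interpreted: backslash and 't' stay two characters.
  doAssert """a\tb""".len == 4

  # The newline directly after the opening quotes is not part of the string.
  let text = """
  first
  second"""
  doAssert text[0] != '\x0A' and text[0] != '\x0D'
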
Raw string literals
-------------------

Terminal symbol in the grammar: ``RSTR_LIT``.

There are also raw string literals that are preceded with the
letter ``r`` (or ``R``) and are delimited by matching double quotes (just
like ordinary string literals) and do not interpret the escape sequences.
This is especially convenient for regular expressions or Windows paths:

.. code-block:: nim

  var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab

To produce a single ``"`` within a raw string literal, it has to be doubled:

.. code-block:: nim

  r"a""b"

Produces::

  a"b

``r""""`` is not possible with this notation, because the three leading
quotes introduce a triple quoted string literal. ``r"""`` is the same
as ``"""`` since triple quoted string literals do not interpret escape
sequences either.

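As a small sketch, the following assertions compare raw literals with their
escaped equivalents; the literals themselves are only examples.

.. code-block:: nim

  doAssert r"\t" == "\\t"                    # backslash and 't', not a tabulator
  doAssert r"C:\texts\text.txt".len == 17    # every backslash counts as a character
  doAssert r"a""b" == "a\"b"                 # a doubled quote yields one quote
  doAssert r"""a\tb""" == """a\tb"""         # the r prefix changes nothing here
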
Generalized raw string literals
-------------------------------

Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``,
``GENERALIZED_TRIPLESTR_LIT``.

The construct ``identifier"string literal"`` (without whitespace between the
identifier and the opening quotation mark) is a
generalized raw string literal. It is a shortcut for the construct
``identifier(r"string literal")``, so it denotes a procedure call with a
raw string literal as its only argument. Generalized raw string literals
are especially convenient for embedding mini languages directly into Nim
(for example regular expressions).

The construct ``identifier"""string literal"""`` exists too. It is a shortcut
for ``identifier("""string literal""")``.

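The equivalence can be illustrated with any procedure that takes a single
string. The helper ``shout`` below is hypothetical and exists only for this
sketch.

.. code-block:: nim

  import strutils

  proc shout(s: string): string =
    ## Hypothetical helper: returns its argument in upper case.
    s.toUpperAscii

  doAssert shout"a\tb" == shout(r"a\tb")       # same call, two spellings
  doAssert shout"a\tb" == "A\\TB"              # the backslash is kept literally
  doAssert shout"""say "hi"""" == "SAY \"HI\"" # triple quoted form
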
Character literals
------------------

Character literals are enclosed in single quotes ``''`` and can contain the
same escape sequences as strings - with one exception: `newline`:idx: (``\n``)
is not allowed as it may be wider than one character (often it is the pair
CR/LF for example). Here are the valid `escape sequences`:idx: for character
literals:

==================  ===================================================
Escape sequence     Meaning
==================  ===================================================
``\r``, ``\c``      `carriage return`:idx:
``\l``              `line feed`:idx:
``\f``              `form feed`:idx:
``\t``              `tabulator`:idx:
``\v``              `vertical tabulator`:idx:
``\\``              `backslash`:idx:
``\"``              `quotation mark`:idx:
``\'``              `apostrophe`:idx:
``\`` '0'..'9'+     `character with decimal value d`:idx:;
                    all decimal digits directly
                    following are used for the character
``\a``              `alert`:idx:
``\b``              `backspace`:idx:
``\e``              `escape`:idx: `[ESC]`:idx:
``\x`` HH           `character with hex value HH`:idx:;
                    exactly two hex digits are allowed
==================  ===================================================

A character is not a Unicode character but a single byte. The reason for this
is efficiency: for the overwhelming majority of use-cases, the resulting
programs will still handle UTF-8 properly as UTF-8 was specially designed for
this. Another reason is that Nim can thus support ``array[char, int]`` or
``set[char]`` efficiently as many algorithms rely on this feature. The ``Rune``
type is used for Unicode characters; it can represent any Unicode character.
``Rune`` is declared in the `unicode module <unicode.html>`_.

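The difference between bytes and Unicode characters shows up as soon as a
string holds non-ASCII text. The sketch below assumes only ``runeLen`` from the
``unicode`` module; the names ``vowels`` and ``s`` are arbitrary.

.. code-block:: nim

  import unicode

  doAssert '\x41' == 'A' and '\65' == 'A'   # hex and decimal escapes
  doAssert ord('\t') == 9

  const vowels = {'a', 'e', 'i', 'o', 'u'}  # a set[char], one bit per byte value
  doAssert 'e' in vowels

  let s = "\xC3\xA9"        # the UTF-8 encoding of U+00E9
  doAssert s.len == 2       # two chars, that is two bytes
  doAssert s.runeLen == 1   # but a single Unicode character
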
Numerical constants
-------------------

Numerical constants are of a single type and have the form::

  hexdigit = digit | 'A'..'F' | 'a'..'f'
  octdigit = '0'..'7'
  bindigit = '0'..'1'
  HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )*
  DEC_LIT = digit ( ['_'] digit )*
  OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )*
  BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )*

  INT_LIT = HEX_LIT
          | DEC_LIT
          | OCT_LIT
          | BIN_LIT

  INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8'
  INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16'
  INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32'
  INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64'

  UINT_LIT = INT_LIT ['\''] ('u' | 'U')
  UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8'
  UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16'
  UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32'
  UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64'

  exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )*
  FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent)
  FLOAT32_SUFFIX = ('f' | 'F') ['32']
  FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX
              | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX
  FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D'
  FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX
              | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX

As can be seen in the productions, numerical constants can contain underscores
for readability. Integer and floating point literals may be given in decimal (no
prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal
(prefix ``0x``) notation.

There exists a literal for each numerical type that is
defined. The suffix starting with an apostrophe ('\'') is called a
`type suffix`:idx:. Literals without a type suffix are of the type ``int``,
unless the literal contains a dot or ``E|e`` in which case it is of
type ``float``. For notational convenience the apostrophe of a type suffix
is optional if it is not ambiguous (only hexadecimal floating point literals
with a type suffix can be ambiguous).

The type suffixes are:

=================    =========================
  Type Suffix        Resulting type of literal
=================    =========================
  ``'i8``            int8
  ``'i16``           int16
  ``'i32``           int32
  ``'i64``           int64
  ``'u``             uint
  ``'u8``            uint8
  ``'u16``           uint16
  ``'u32``           uint32
  ``'u64``           uint64
  ``'f``             float32
  ``'d``             float64
  ``'f32``           float32
  ``'f64``           float64
  ``'f128``          float128
=================    =========================

Floating point literals may also be in binary, octal or hexadecimal
notation:
``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64``
is approximately 1.72826e35 according to the IEEE floating point standard.

Literals are bounds checked so that they fit the datatype. Non base-10
literals are used mainly for flags and bit pattern representations, therefore
bounds checking is done on bit width, not value range. If the literal fits in
the bit width of the datatype, it is accepted.
Hence: ``0b10000000'u8 == 0x80'u8 == 128``, but ``0b10000000'i8 == 0x80'i8 == -128``
(the two's complement reading of that bit pattern) instead of causing an
overflow error.

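The notations and type suffixes can be tried out with a handful of assertions.
This is a sketch with arbitrarily chosen values.

.. code-block:: nim

  doAssert 1_000_000 == 1000000     # underscores are only for readability
  doAssert 0b1010 == 10             # binary
  doAssert 0o17 == 15               # octal
  doAssert 0xFF == 255              # hexadecimal

  doAssert 10 is int                # no suffix, no dot, no exponent
  doAssert 1.5 is float
  doAssert 1e3 is float
  doAssert 7'i16 is int16
  doAssert 2'f32 is float32
  doAssert 2'd is float64

  doAssert 0x80'u8 == 128'u8        # fits in 8 bits
  doAssert 0xFF'i8 == -1'i8         # all eight bits set, read as a signed byte
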
Operators
---------

Nim allows user defined operators. An operator is any combination of the
following characters::

       =     +     -     *     /     <     >
       @     $     ~     &     %     |
       !     ?     ^     .     :     \

These keywords are also operators:
``and or not xor shl shr div mod in notin is isnot of``.

`=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they
are used for other notational purposes.

``*:`` is as a special case treated as the two tokens `*`:tok: and `:`:tok:
(to support ``var v*: T``).

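A user defined operator is declared like a procedure whose name is written in
backticks. The operator ``+-`` below is hypothetical and only serves as a
minimal sketch.

.. code-block:: nim

  proc `+-`(a, b: int): (int, int) =
    ## Hypothetical operator: returns the sum and the difference.
    (a + b, a - b)

  let r = 5 +- 3
  doAssert r == (8, 2)
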
Other tokens
------------

The following strings denote other tokens::

       `   (    )     {    }     [    ]    ,  ;   [.    .]  {.   .}  (.  .)

The `slice`:idx: operator `..`:tok: takes precedence over other tokens that
contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok:
and not the two tokens `{.`:tok:, `.}`:tok:.