123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223 |
- PEG syntax and semantics
- ========================
- A PEG (Parsing expression grammar) is a simple deterministic grammar, that can
- be directly used for parsing. The current implementation has been designed as
- a more powerful replacement for regular expressions. UTF-8 is supported.
- The notation used for a PEG is similar to that of EBNF:
- =============== ============================================================
- notation meaning
- =============== ============================================================
- ``A / ... / Z`` Ordered choice: Apply expressions `A`, ..., `Z`, in this
- order, to the text ahead, until one of them succeeds and
- possibly consumes some text. Indicate success if one of
- expressions succeeded. Otherwise do not consume any text
- and indicate failure.
- ``A ... Z`` Sequence: Apply expressions `A`, ..., `Z`, in this order,
- to consume consecutive portions of the text ahead, as long
- as they succeed. Indicate success if all succeeded.
- Otherwise do not consume any text and indicate failure.
- The sequence's precedence is higher than that of ordered
- choice: ``A B / C`` means ``(A B) / Z`` and
- not ``A (B / Z)``.
- ``(E)`` Grouping: Parenthesis can be used to change
- operator priority.
- ``{E}`` Capture: Apply expression `E` and store the substring
- that matched `E` into a *capture* that can be accessed
- after the matching process.
- ``$i`` Back reference to the ``i``th capture. ``i`` counts from 1.
- ``$`` Anchor: Matches at the end of the input. No character
- is consumed. Same as ``!.``.
- ``^`` Anchor: Matches at the start of the input. No character
- is consumed.
- ``&E`` And predicate: Indicate success if expression `E` matches
- the text ahead; otherwise indicate failure. Do not consume
- any text.
- ``!E`` Not predicate: Indicate failure if expression E matches the
- text ahead; otherwise indicate success. Do not consume any
- text.
- ``E+`` One or more: Apply expression `E` repeatedly to match
- the text ahead, as long as it succeeds. Consume the matched
- text (if any) and indicate success if there was at least
- one match. Otherwise indicate failure.
- ``E*`` Zero or more: Apply expression `E` repeatedly to match
- the text ahead, as long as it succeeds. Consume the matched
- text (if any). Always indicate success.
- ``E?`` Zero or one: If expression `E` matches the text ahead,
- consume it. Always indicate success.
- ``[s]`` Character class: If the character ahead appears in the
- string `s`, consume it and indicate success. Otherwise
- indicate failure.
- ``[a-b]`` Character range: If the character ahead is one from the
- range `a` through `b`, consume it and indicate success.
- Otherwise indicate failure.
- ``'s'`` String: If the text ahead is the string `s`, consume it
- and indicate success. Otherwise indicate failure.
- ``i's'`` String match ignoring case.
- ``y's'`` String match ignoring style.
- ``v's'`` Verbatim string match: Use this to override a global
- ``\i`` or ``\y`` modifier.
- ``i$j`` String match ignoring case for back reference.
- ``y$j`` String match ignoring style for back reference.
- ``v$j`` Verbatim string match for back reference.
- ``.`` Any character: If there is a character ahead, consume it
- and indicate success. Otherwise (that is, at the end of
- input) indicate failure.
- ``_`` Any Unicode character: If there is an UTF-8 character
- ahead, consume it and indicate success. Otherwise indicate
- failure.
- ``@E`` Search: Shorthand for ``(!E .)* E``. (Search loop for the
- pattern `E`.)
- ``{@} E`` Captured Search: Shorthand for ``{(!E .)*} E``. (Search
- loop for the pattern `E`.) Everything until and exluding
- `E` is captured.
- ``@@ E`` Same as ``{@} E``.
- ``A <- E`` Rule: Bind the expression `E` to the *nonterminal symbol*
- `A`. **Left recursive rules are not possible and crash the
- matching engine.**
- ``\identifier`` Built-in macro for a longer expression.
- ``\ddd`` Character with decimal code *ddd*.
- ``\"``, etc Literal ``"``, etc.
- =============== ============================================================
- Built-in macros
- ---------------
- ============== ============================================================
- macro meaning
- ============== ============================================================
- ``\d`` any decimal digit: ``[0-9]``
- ``\D`` any character that is not a decimal digit: ``[^0-9]``
- ``\s`` any whitespace character: ``[ \9-\13]``
- ``\S`` any character that is not a whitespace character:
- ``[^ \9-\13]``
- ``\w`` any "word" character: ``[a-zA-Z0-9_]``
- ``\W`` any "non-word" character: ``[^a-zA-Z0-9_]``
- ``\a`` same as ``[a-zA-Z]``
- ``\A`` same as ``[^a-zA-Z]``
- ``\n`` any newline combination: ``\10 / \13\10 / \13``
- ``\i`` ignore case for matching; use this at the start of the PEG
- ``\y`` ignore style for matching; use this at the start of the PEG
- ``\skip`` pat skip pattern *pat* before trying to match other tokens;
- this is useful for whitespace skipping, for example:
- ``\skip(\s*) {\ident} ':' {\ident}`` matches key value
- pairs ignoring whitespace around the ``':'``.
- ``\ident`` a standard ASCII identifier: ``[a-zA-Z_][a-zA-Z_0-9]*``
- ``\letter`` any Unicode letter
- ``\upper`` any Unicode uppercase letter
- ``\lower`` any Unicode lowercase letter
- ``\title`` any Unicode title letter
- ``\white`` any Unicode whitespace character
- ============== ============================================================
- A backslash followed by a letter is a built-in macro, otherwise it
- is used for ordinary escaping:
- ============== ============================================================
- notation meaning
- ============== ============================================================
- ``\\`` a single backslash
- ``\*`` same as ``'*'``
- ``\t`` not a tabulator, but an (unknown) built-in
- ============== ============================================================
- Supported PEG grammar
- ---------------------
- The PEG parser implements this grammar (written in PEG syntax)::
- # Example grammar of PEG in PEG syntax.
- # Comments start with '#'.
- # First symbol is the start symbol.
- grammar <- rule* / expr
- identifier <- [A-Za-z][A-Za-z0-9_]*
- charsetchar <- "\\" . / [^\]]
- charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]"
- stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" /
- "'" ("\\" . / [^'])* "'")
- builtin <- "\\" identifier / [^\13\10]
- comment <- '#' @ \n
- ig <- (\s / comment)* # things to ignore
- rule <- identifier \s* "<-" expr ig
- identNoArrow <- identifier !(\s* "<-")
- prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@'
- literal <- ig identifier? '$' [0-9]+ / '$' / '^' /
- ig identNoArrow /
- ig charset /
- ig stringlit /
- ig builtin /
- ig '.' /
- ig '_' /
- (ig "(" expr ig ")")
- postfixOpr <- ig '?' / ig '*' / ig '+'
- primary <- prefixOpr* (literal postfixOpr*)
- # Concatenation has higher priority than choice:
- # ``a b / c`` means ``(a b) / c``
- seqExpr <- primary+
- expr <- seqExpr (ig "/" expr)*
- **Note**: As a special syntactic extension if the whole PEG is only a single
- expression, identifiers are not interpreted as non-terminals, but are
- interpreted as verbatim string:
- .. code-block:: nim
- abc =~ peg"abc" # is true
- So it is not necessary to write ``peg" 'abc' "`` in the above example.
- Examples
- --------
- Check if `s` matches Nim's "while" keyword:
- .. code-block:: nim
- s =~ peg" y'while'"
- Exchange (key, val)-pairs:
- .. code-block:: nim
- "key: val; key2: val2".replacef(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1")
- Determine the ``#include``'ed files of a C file:
- .. code-block:: nim
- for line in lines("myfile.c"):
- if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws
- comment <- '/*' @ '*/' / '//' .*
- ws <- (comment / \s+)* """:
- echo matches[0]
- PEG vs regular expression
- -------------------------
- As a regular expression ``\[.*\]`` matches the longest possible text between
- ``'['`` and ``']'``. As a PEG it never matches anything, because a PEG is
- deterministic: ``.*`` consumes the rest of the input, so ``\]`` never matches.
- As a PEG this needs to be written as: ``\[ ( !\] . )* \]`` (or ``\[ @ \]``).
- Note that the regular expression does not behave as intended either: in the
- example ``*`` should not be greedy, so ``\[.*?\]`` should be used instead.
- PEG construction
- ----------------
- There are two ways to construct a PEG in Nim code:
- (1) Parsing a string into an AST which consists of `Peg` nodes with the
- `peg` proc.
- (2) Constructing the AST directly with proc calls. This method does not
- support constructing rules, only simple expressions and is not as
- convenient. Its only advantage is that it does not pull in the whole PEG
- parser into your executable.
|