123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277 |
- Licence of the PCRE library
- ===========================
- PCRE is a library of functions to support regular expressions whose
- syntax and semantics are as close as possible to those of the Perl 5
- language.
- | Written by Philip Hazel
- | Copyright (c) 1997-2005 University of Cambridge
- ----------------------------------------------------------------------
- Redistribution and use in source and binary forms, with or without
- modification, are permitted provided that the following conditions are met:
- * Redistributions of source code must retain the above copyright notice,
- this list of conditions and the following disclaimer.
- * Redistributions in binary form must reproduce the above copyright
- notice, this list of conditions and the following disclaimer in the
- documentation and/or other materials provided with the distribution.
- * Neither the name of the University of Cambridge nor the names of its
- contributors may be used to endorse or promote products derived from
- this software without specific prior written permission.
- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
- AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
- LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
- INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
- CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
- POSSIBILITY OF SUCH DAMAGE.
- Regular expression syntax and semantics
- =======================================
- As the regular expressions supported by this module are enormous,
- the reader is referred to http://perldoc.perl.org/perlre.html for the
- full documentation of Perl's regular expressions.
- Because the backslash ``\`` is a meta character both in the Nim
- programming language and in regular expressions, it is strongly
- recommended that one uses the *raw* strings of Nim, so that
- backslashes are interpreted by the regular expression engine:
- ```nim
- r"\S" # matches any character that is not whitespace
- ```
- A regular expression is a pattern that is matched against a subject string
- from left to right. Most characters stand for themselves in a pattern, and
- match the corresponding characters in the subject. As a trivial example,
- the pattern::
- The quick brown fox
- matches a portion of a subject string that is identical to itself.
- The power of regular expressions comes from the ability to include
- alternatives and repetitions in the pattern. These are encoded in
- the pattern by the use of metacharacters, which do not stand for
- themselves but instead are interpreted in some special way.
- There are two different sets of metacharacters: those that are recognized
- anywhere in the pattern except within square brackets, and those that are
- recognized in square brackets. Outside square brackets, the metacharacters
- are as follows:
- ============== ============================================================
- meta character meaning
- ============== ============================================================
- ``\`` general escape character with several uses
- ``^`` assert start of string (or line, in multiline mode)
- ``$`` assert end of string (or line, in multiline mode)
- ``.`` match any character except newline (by default)
- ``[`` start character class definition
- ``|`` start of alternative branch
- ``(`` start subpattern
- ``)`` end subpattern
- ``{`` start min/max quantifier
- ``?`` extends the meaning of ``(``
- | also 0 or 1 quantifier (equal to ``{0,1}``)
- | also quantifier minimizer
- ``*`` 0 or more quantifier (equal to ``{0,}``)
- ``+`` 1 or more quantifier (equal to ``{1,}``)
- | also "possessive quantifier"
- ============== ============================================================
- Part of a pattern that is in square brackets is called a "character class".
- In a character class the only metacharacters are:
- ============== ============================================================
- meta character meaning
- ============== ============================================================
- ``\`` general escape character
- ``^`` negate the class, but only if the first character
- ``-`` indicates character range
- ``[`` POSIX character class (only if followed by POSIX syntax)
- ``]`` terminates the character class
- ============== ============================================================
- The following sections describe the use of each of the metacharacters.
- Backslash
- ---------
- The `backslash`:idx: character has several uses. Firstly, if it is followed
- by a non-alphanumeric character, it takes away any special meaning that
- character may have. This use of backslash as an escape character applies
- both inside and outside character classes.
- For example, if you want to match a ``*`` character, you write ``\*`` in
- the pattern. This escaping action applies whether or not the following
- character would otherwise be interpreted as a metacharacter, so it is always
- safe to precede a non-alphanumeric with backslash to specify that it stands
- for itself. In particular, if you want to match a backslash, you write ``\\``.
- Non-printing characters
- -----------------------
- A second use of backslash provides a way of encoding non-printing characters
- in patterns in a visible manner. There is no restriction on the appearance of
- non-printing characters, apart from the binary zero that terminates a pattern,
- but when a pattern is being prepared by text editing, it is usually easier to
- use one of the following escape sequences than the binary character it
- represents::
- ============== ============================================================
- character meaning
- ============== ============================================================
- ``\a`` alarm, that is, the BEL character (hex 07)
- ``\e`` escape (hex 1B)
- ``\f`` formfeed (hex 0C)
- ``\n`` newline (hex 0A)
- ``\r`` carriage return (hex 0D)
- ``\t`` tab (hex 09)
- ``\ddd`` character with octal code ddd, or backreference
- ``\xhh`` character with hex code hh
- ============== ============================================================
- After ``\x``, from zero to two hexadecimal digits are read (letters can be in
- upper or lower case). In UTF-8 mode, any number of hexadecimal digits may
- appear between ``\x{`` and ``}``, but the value of the character code must be
- less than 2^31 (that is, the maximum hexadecimal value is 7FFFFFFF). If
- characters other than hexadecimal digits appear between ``\x{`` and ``}``, or
- if there is no terminating ``}``, this form of escape is not recognized.
- Instead, the initial ``\x`` will be interpreted as a basic hexadecimal escape,
- with no following digits, giving a character whose value is zero.
- After ``\0`` up to two further octal digits are read. In both cases, if there
- are fewer than two digits, just those that are present are used. Thus the
- sequence ``\0\x\07`` specifies two binary zeros followed by a BEL character
- (code value 7). Make sure you supply two digits after the initial zero if
- the pattern character that follows is itself an octal digit.
- The handling of a backslash followed by a digit other than 0 is complicated.
- Outside a character class, PCRE reads it and any following digits as a
- decimal number. If the number is less than 10, or if there have been at least
- that many previous capturing left parentheses in the expression, the entire
- sequence is taken as a back reference. A description of how this works is
- given later, following the discussion of parenthesized subpatterns.
- Inside a character class, or if the decimal number is greater than 9 and
- there have not been that many capturing subpatterns, PCRE re-reads up to
- three octal digits following the backslash, and generates a single byte
- from the least significant 8 bits of the value. Any subsequent digits stand
- for themselves. For example:
- ============== ============================================================
- example meaning
- ============== ============================================================
- ``\040`` is another way of writing a space
- ``\40`` is the same, provided there are fewer than 40 previous
- capturing subpatterns
- ``\7`` is always a back reference
- ``\11`` might be a back reference, or another way of writing a tab
- ``\011`` is always a tab
- ``\0113`` is a tab followed by the character "3"
- ``\113`` might be a back reference, otherwise the character with
- octal code 113
- ``\377`` might be a back reference, otherwise the byte consisting
- entirely of 1 bits
- ``\81`` is either a back reference, or a binary zero followed by
- the two characters "8" and "1"
- ============== ============================================================
- Note that octal values of 100 or greater must not be introduced by a leading
- zero, because no more than three octal digits are ever read.
- All the sequences that define a single byte value or a single UTF-8 character
- (in UTF-8 mode) can be used both inside and outside character classes. In
- addition, inside a character class, the sequence ``\b`` is interpreted as the
- backspace character (hex 08), and the sequence ``\X`` is interpreted as the
- character "X". Outside a character class, these sequences have different
- meanings (see below).
- Generic character types
- -----------------------
- The third use of backslash is for specifying `generic character types`:idx:.
- The following are always recognized:
- ============== ============================================================
- character type meaning
- ============== ============================================================
- ``\d`` any decimal digit
- ``\D`` any character that is not a decimal digit
- ``\s`` any whitespace character
- ``\S`` any character that is not a whitespace character
- ``\w`` any "word" character
- ``\W`` any "non-word" character
- ============== ============================================================
- Each pair of escape sequences partitions the complete set of characters into
- two disjoint sets. Any given character matches one, and only one, of each pair.
- These character type sequences can appear both inside and outside character
- classes. They each match one character of the appropriate type. If the
- current matching point is at the end of the subject string, all of them fail,
- since there is no character to match.
- For compatibility with Perl, ``\s`` does not match the VT character (code 11).
- This makes it different from the POSIX "space" class. The ``\s`` characters
- are HT (9), LF (10), FF (12), CR (13), and space (32).
- A "word" character is an underscore or any character less than 256 that is
- a letter or digit. The definition of letters and digits is controlled by
- PCRE's low-valued character tables, and may vary if locale-specific matching
- is taking place (see "Locale support" in the pcreapi page). For example,
- in the "fr_FR" (French) locale, some character codes greater than 128 are
- used for accented letters, and these are matched by ``\w``.
- In UTF-8 mode, characters with values greater than 128 never match ``\d``,
- ``\s``, or ``\w``, and always match ``\D``, ``\S``, and ``\W``. This is true
- even when Unicode character property support is available.
- Simple assertions
- -----------------
- The fourth use of backslash is for certain `simple assertions`:idx:. An
- assertion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The use of
- subpatterns for more complicated assertions is described below. The
- backslashed assertions are::
- ============== ============================================================
- assertion meaning
- ============== ============================================================
- ``\b`` matches at a word boundary
- ``\B`` matches when not at a word boundary
- ``\A`` matches at start of subject
- ``\Z`` matches at end of subject or before newline at end
- ``\z`` matches at end of subject
- ``\G`` matches at first matching position in subject
- ============== ============================================================
- These assertions may not appear in character classes (but note that ``\b``
- has a different meaning, namely the backspace character, inside a character
- class).
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match ``\w`` or ``\W`` (i.e.
- one matches ``\w`` and the other matches ``\W``), or the start or end of the
- string if the first or last character matches ``\w``, respectively.
- The ``\A``, ``\Z``, and ``\z`` assertions differ from the traditional
- circumflex and dollar in that they only ever match at the very start and
- end of the subject string, whatever options are set.
- The difference between ``\Z`` and ``\z`` is that ``\Z`` matches before
- a newline that is the last character of the string as well as at the end
- of the string, whereas ``\z`` matches only at the end.
|