123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539 |
- @c -*-texinfo-*-
- @c This is part of the GNU Guile Reference Manual.
- @c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
- @c Free Software Foundation, Inc.
- @c See the file guile.texi for copying conditions.
- @node Regular Expressions
- @section Regular Expressions
- @tpindex Regular expressions
- @cindex regular expressions
- @cindex regex
- @cindex emacs regexp
- A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
- describes a whole class of strings. A full description of regular
- expressions and their syntax is beyond the scope of this manual.
- If your system does not include a POSIX regular expression library,
- and you have not linked Guile with a third-party regexp library such
- as Rx, these functions will not be available. You can tell whether
- your Guile installation includes regular expression support by
- checking whether @code{(provided? 'regex)} returns true.
- The following regexp and string matching features are provided by the
- @code{(ice-9 regex)} module. Before using the described functions,
- you should load this module by executing @code{(use-modules (ice-9
- regex))}.
- @menu
- * Regexp Functions:: Functions that create and match regexps.
- * Match Structures:: Finding what was matched by a regexp.
- * Backslash Escapes:: Removing the special meaning of regexp
- meta-characters.
- @end menu
- @node Regexp Functions
- @subsection Regexp Functions
- By default, Guile supports POSIX extended regular expressions. That
- means that the characters @samp{(}, @samp{)}, @samp{+} and @samp{?} are
- special, and must be escaped if you wish to match the literal characters
- and there is no support for ``non-greedy'' variants of @samp{*},
- @samp{+} or @samp{?}.
- This regular expression interface was modeled after that
- implemented by SCSH, the Scheme Shell. It is intended to be
- upwardly compatible with SCSH regular expressions.
- Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
- strings, since the underlying C functions treat that as the end of
- string. If there's a zero byte an error is thrown.
- Internally, patterns and input strings are converted to the current
- locale's encoding, and then passed to the C library's regular expression
- routines (@pxref{Regular Expressions,,, libc, The GNU C Library
- Reference Manual}). The returned match structures always point to
- characters in the strings, not to individual bytes, even in the case of
- multi-byte encodings. This ensures that the match structures are
- correct when performing matching with characters that have a multi-byte
- representation in the locale encoding. Note, however, that using
- characters which cannot be represented in the locale encoding can
- lead to surprising results.
- @deffn {Scheme Procedure} string-match pattern str [start]
- Compile the string @var{pattern} into a regular expression and compare
- it with @var{str}. The optional numeric argument @var{start} specifies
- the position of @var{str} at which to begin matching.
- @code{string-match} returns a @dfn{match structure} which
- describes what, if anything, was matched by the regular
- expression. @xref{Match Structures}. If @var{str} does not match
- @var{pattern} at all, @code{string-match} returns @code{#f}.
- @end deffn
- Two examples of a match follow. In the first example, the pattern
- matches the four digits in the match string. In the second, the pattern
- matches nothing.
- @example
- (string-match "[0-9][0-9][0-9][0-9]" "blah2002")
- @result{} #("blah2002" (4 . 8))
- (string-match "[A-Za-z]" "123456")
- @result{} #f
- @end example
- Each time @code{string-match} is called, it must compile its
- @var{pattern} argument into a regular expression structure. This
- operation is expensive, which makes @code{string-match} inefficient if
- the same regular expression is used several times (for example, in a
- loop). For better performance, you can compile a regular expression in
- advance and then match strings against the compiled regexp.
- @deffn {Scheme Procedure} make-regexp pat flag@dots{}
- @deffnx {C Function} scm_make_regexp (pat, flaglst)
- Compile the regular expression described by @var{pat}, and
- return the compiled regexp structure. If @var{pat} does not
- describe a legal regular expression, @code{make-regexp} throws
- a @code{regular-expression-syntax} error.
- The @var{flag} arguments change the behavior of the compiled
- regular expression. The following values may be supplied:
- @defvar regexp/icase
- Consider uppercase and lowercase letters to be the same when
- matching.
- @end defvar
- @defvar regexp/newline
- If a newline appears in the target string, then permit the
- @samp{^} and @samp{$} operators to match immediately after or
- immediately before the newline, respectively. Also, the
- @samp{.} and @samp{[^...]} operators will never match a newline
- character. The intent of this flag is to treat the target
- string as a buffer containing many lines of text, and the
- regular expression as a pattern that may match a single one of
- those lines.
- @end defvar
- @defvar regexp/basic
- Compile a basic (``obsolete'') regexp instead of the extended
- (``modern'') regexps that are the default. Basic regexps do
- not consider @samp{|}, @samp{+} or @samp{?} to be special
- characters, and require the @samp{@{...@}} and @samp{(...)}
- metacharacters to be backslash-escaped (@pxref{Backslash
- Escapes}). There are several other differences between basic
- and extended regular expressions, but these are the most
- significant.
- @end defvar
- @defvar regexp/extended
- Compile an extended regular expression rather than a basic
- regexp. This is the default behavior; this flag will not
- usually be needed. If a call to @code{make-regexp} includes
- both @code{regexp/basic} and @code{regexp/extended} flags, the
- one which comes last will override the earlier one.
- @end defvar
- @end deffn
- @deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
- @deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
- Match the compiled regular expression @var{rx} against
- @code{str}. If the optional integer @var{start} argument is
- provided, begin matching from that position in the string.
- Return a match structure describing the results of the match,
- or @code{#f} if no match could be found.
- The @var{flags} argument changes the matching behavior. The following
- flag values may be supplied, use @code{logior} (@pxref{Bitwise
- Operations}) to combine them,
- @defvar regexp/notbol
- Consider that the @var{start} offset into @var{str} is not the
- beginning of a line and should not match operator @samp{^}.
- If @var{rx} was created with the @code{regexp/newline} option above,
- @samp{^} will still match after a newline in @var{str}.
- @end defvar
- @defvar regexp/noteol
- Consider that the end of @var{str} is not the end of a line and should
- not match operator @samp{$}.
- If @var{rx} was created with the @code{regexp/newline} option above,
- @samp{$} will still match before a newline in @var{str}.
- @end defvar
- @end deffn
- @lisp
- ;; Regexp to match uppercase letters
- (define r (make-regexp "[A-Z]*"))
- ;; Regexp to match letters, ignoring case
- (define ri (make-regexp "[A-Z]*" regexp/icase))
- ;; Search for bob using regexp r
- (match:substring (regexp-exec r "bob"))
- @result{} "" ; no match
- ;; Search for bob using regexp ri
- (match:substring (regexp-exec ri "Bob"))
- @result{} "Bob" ; matched case insensitive
- @end lisp
- @deffn {Scheme Procedure} regexp? obj
- @deffnx {C Function} scm_regexp_p (obj)
- Return @code{#t} if @var{obj} is a compiled regular expression,
- or @code{#f} otherwise.
- @end deffn
- @sp 1
- @deffn {Scheme Procedure} list-matches regexp str [flags]
- Return a list of match structures which are the non-overlapping
- matches of @var{regexp} in @var{str}. @var{regexp} can be either a
- pattern string or a compiled regexp. The @var{flags} argument is as
- per @code{regexp-exec} above.
- @example
- (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
- @result{} ("abc" "def")
- @end example
- @end deffn
- @deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
- Apply @var{proc} to the non-overlapping matches of @var{regexp} in
- @var{str}, to build a result. @var{regexp} can be either a pattern
- string or a compiled regexp. The @var{flags} argument is as per
- @code{regexp-exec} above.
- @var{proc} is called as @code{(@var{proc} match prev)} where
- @var{match} is a match structure and @var{prev} is the previous return
- from @var{proc}. For the first call @var{prev} is the given
- @var{init} parameter. @code{fold-matches} returns the final value
- from @var{proc}.
- For example to count matches,
- @example
- (fold-matches "[a-z][0-9]" "abc x1 def y2" 0
- (lambda (match count)
- (1+ count)))
- @result{} 2
- @end example
- @end deffn
- @sp 1
- Regular expressions are commonly used to find patterns in one string
- and replace them with the contents of another string. The following
- functions are convenient ways to do this.
- @c begin (scm-doc-string "regex.scm" "regexp-substitute")
- @deffn {Scheme Procedure} regexp-substitute port match item @dots{}
- Write to @var{port} selected parts of the match structure @var{match}.
- Or if @var{port} is @code{#f} then form a string from those parts and
- return that.
- Each @var{item} specifies a part to be written, and may be one of the
- following,
- @itemize @bullet
- @item
- A string. String arguments are written out verbatim.
- @item
- An integer. The submatch with that number is written
- (@code{match:substring}). Zero is the entire match.
- @item
- The symbol @samp{pre}. The portion of the matched string preceding
- the regexp match is written (@code{match:prefix}).
- @item
- The symbol @samp{post}. The portion of the matched string following
- the regexp match is written (@code{match:suffix}).
- @end itemize
- For example, changing a match and retaining the text before and after,
- @example
- (regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
- 'pre "37" 'post)
- @result{} "number 37 is good"
- @end example
- Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
- re-ordering and hyphenating the fields.
- @lisp
- (define date-regex
- "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
- (define s "Date 20020429 12am.")
- (regexp-substitute #f (string-match date-regex s)
- 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
- @result{} "Date 04-29-2002 12am. (20020429)"
- @end lisp
- @end deffn
- @c begin (scm-doc-string "regex.scm" "regexp-substitute")
- @deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
- @cindex search and replace
- Write to @var{port} selected parts of matches of @var{regexp} in
- @var{target}. If @var{port} is @code{#f} then form a string from
- those parts and return that. @var{regexp} can be a string or a
- compiled regex.
- This is similar to @code{regexp-substitute}, but allows global
- substitutions on @var{target}. Each @var{item} behaves as per
- @code{regexp-substitute}, with the following differences,
- @itemize @bullet
- @item
- A function. Called as @code{(@var{item} match)} with the match
- structure for the @var{regexp} match, it should return a string to be
- written to @var{port}.
- @item
- The symbol @samp{post}. This doesn't output anything, but instead
- causes @code{regexp-substitute/global} to recurse on the unmatched
- portion of @var{target}.
- This @emph{must} be supplied to perform a global search and replace on
- @var{target}; without it @code{regexp-substitute/global} returns after
- a single match and output.
- @end itemize
- For example, to collapse runs of tabs and spaces to a single hyphen
- each,
- @example
- (regexp-substitute/global #f "[ \t]+" "this is the text"
- 'pre "-" 'post)
- @result{} "this-is-the-text"
- @end example
- Or using a function to reverse the letters in each word,
- @example
- (regexp-substitute/global #f "[a-z]+" "to do and not-do"
- 'pre (lambda (m) (string-reverse (match:substring m))) 'post)
- @result{} "ot od dna ton-od"
- @end example
- Without the @code{post} symbol, just one regexp match is made. For
- example the following is the date example from
- @code{regexp-substitute} above, without the need for the separate
- @code{string-match} call.
- @lisp
- (define date-regex
- "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
- (define s "Date 20020429 12am.")
- (regexp-substitute/global #f date-regex s
- 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
- @result{} "Date 04-29-2002 12am. (20020429)"
- @end lisp
- @end deffn
- @node Match Structures
- @subsection Match Structures
- @cindex match structures
- A @dfn{match structure} is the object returned by @code{string-match} and
- @code{regexp-exec}. It describes which portion of a string, if any,
- matched the given regular expression. Match structures include: a
- reference to the string that was checked for matches; the starting and
- ending positions of the regexp match; and, if the regexp included any
- parenthesized subexpressions, the starting and ending positions of each
- submatch.
- In each of the regexp match functions described below, the @code{match}
- argument must be a match structure returned by a previous call to
- @code{string-match} or @code{regexp-exec}. Most of these functions
- return some information about the original target string that was
- matched against a regular expression; we will call that string
- @var{target} for easy reference.
- @c begin (scm-doc-string "regex.scm" "regexp-match?")
- @deffn {Scheme Procedure} regexp-match? obj
- Return @code{#t} if @var{obj} is a match structure returned by a
- previous call to @code{regexp-exec}, or @code{#f} otherwise.
- @end deffn
- @c begin (scm-doc-string "regex.scm" "match:substring")
- @deffn {Scheme Procedure} match:substring match [n]
- Return the portion of @var{target} matched by subexpression number
- @var{n}. Submatch 0 (the default) represents the entire regexp match.
- If the regular expression as a whole matched, but the subexpression
- number @var{n} did not match, return @code{#f}.
- @end deffn
- @lisp
- (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
- (match:substring s)
- @result{} "2002"
- ;; match starting at offset 6 in the string
- (match:substring
- (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
- @result{} "7654"
- @end lisp
- @c begin (scm-doc-string "regex.scm" "match:start")
- @deffn {Scheme Procedure} match:start match [n]
- Return the starting position of submatch number @var{n}.
- @end deffn
- In the following example, the result is 4, since the match starts at
- character index 4:
- @lisp
- (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
- (match:start s)
- @result{} 4
- @end lisp
- @c begin (scm-doc-string "regex.scm" "match:end")
- @deffn {Scheme Procedure} match:end match [n]
- Return the ending position of submatch number @var{n}.
- @end deffn
- In the following example, the result is 8, since the match runs between
- characters 4 and 8 (i.e.@: the ``2002'').
- @lisp
- (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
- (match:end s)
- @result{} 8
- @end lisp
- @c begin (scm-doc-string "regex.scm" "match:prefix")
- @deffn {Scheme Procedure} match:prefix match
- Return the unmatched portion of @var{target} preceding the regexp match.
- @lisp
- (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
- (match:prefix s)
- @result{} "blah"
- @end lisp
- @end deffn
- @c begin (scm-doc-string "regex.scm" "match:suffix")
- @deffn {Scheme Procedure} match:suffix match
- Return the unmatched portion of @var{target} following the regexp match.
- @end deffn
- @lisp
- (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
- (match:suffix s)
- @result{} "foo"
- @end lisp
- @c begin (scm-doc-string "regex.scm" "match:count")
- @deffn {Scheme Procedure} match:count match
- Return the number of parenthesized subexpressions from @var{match}.
- Note that the entire regular expression match itself counts as a
- subexpression, and failed submatches are included in the count.
- @end deffn
- @c begin (scm-doc-string "regex.scm" "match:string")
- @deffn {Scheme Procedure} match:string match
- Return the original @var{target} string.
- @end deffn
- @lisp
- (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
- (match:string s)
- @result{} "blah2002foo"
- @end lisp
- @node Backslash Escapes
- @subsection Backslash Escapes
- Sometimes you will want a regexp to match characters like @samp{*} or
- @samp{$} exactly. For example, to check whether a particular string
- represents a menu entry from an Info node, it would be useful to match
- it against a regexp like @samp{^* [^:]*::}. However, this won't work;
- because the asterisk is a metacharacter, it won't match the @samp{*} at
- the beginning of the string. In this case, we want to make the first
- asterisk un-magic.
- You can do this by preceding the metacharacter with a backslash
- character @samp{\}. (This is also called @dfn{quoting} the
- metacharacter, and is known as a @dfn{backslash escape}.) When Guile
- sees a backslash in a regular expression, it considers the following
- glyph to be an ordinary character, no matter what special meaning it
- would ordinarily have. Therefore, we can make the above example work by
- changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
- the regular expression engine to match only a single asterisk in the
- target string.
- Since the backslash is itself a metacharacter, you may force a regexp to
- match a backslash in the target string by preceding the backslash with
- itself. For example, to find variable references in a @TeX{} program,
- you might want to find occurrences of the string @samp{\let\} followed
- by any number of alphabetic characters. The regular expression
- @samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
- regexp each match a single backslash in the target string.
- @c begin (scm-doc-string "regex.scm" "regexp-quote")
- @deffn {Scheme Procedure} regexp-quote str
- Quote each special character found in @var{str} with a backslash, and
- return the resulting string.
- @end deffn
- @strong{Very important:} Using backslash escapes in Guile source code
- (as in Emacs Lisp or C) can be tricky, because the backslash character
- has special meaning for the Guile reader. For example, if Guile
- encounters the character sequence @samp{\n} in the middle of a string
- while processing Scheme code, it replaces those characters with a
- newline character. Similarly, the character sequence @samp{\t} is
- replaced by a horizontal tab. Several of these @dfn{escape sequences}
- are processed by the Guile reader before your code is executed.
- Unrecognized escape sequences are ignored: if the characters @samp{\*}
- appear in a string, they will be translated to the single character
- @samp{*}.
- This translation is obviously undesirable for regular expressions, since
- we want to be able to include backslashes in a string in order to
- escape regexp metacharacters. Therefore, to make sure that a backslash
- is preserved in a string in your Guile program, you must use @emph{two}
- consecutive backslashes:
- @lisp
- (define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
- @end lisp
- The string in this example is preprocessed by the Guile reader before
- any code is executed. The resulting argument to @code{make-regexp} is
- the string @samp{^\* [^:]*}, which is what we really want.
- This also means that in order to write a regular expression that matches
- a single backslash character, the regular expression string in the
- source code must include @emph{four} backslashes. Each consecutive pair
- of backslashes gets translated by the Guile reader to a single
- backslash, and the resulting double-backslash is interpreted by the
- regexp engine as matching a single backslash character. Hence:
- @lisp
- (define tex-variable-pattern (make-regexp "\\\\let\\\\=[A-Za-z]*"))
- @end lisp
- The reason for the unwieldiness of this syntax is historical. Both
- regular expression pattern matchers and Unix string processing systems
- have traditionally used backslashes with the special meanings
- described above. The POSIX regular expression specification and ANSI C
- standard both require these semantics. Attempting to abandon either
- convention would cause other kinds of compatibility problems, possibly
- more severe ones. Therefore, without extending the Scheme reader to
- support strings with different quoting conventions (an ungainly and
- confusing extension when implemented in other languages), we must adhere
- to this cumbersome escape syntax.
|