  1. @c -*-texinfo-*-
  2. @c This is part of the GNU Guile Reference Manual.
  3. @c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
  4. @c Free Software Foundation, Inc.
  5. @c See the file guile.texi for copying conditions.
  6. @node Regular Expressions
  7. @section Regular Expressions
  8. @tpindex Regular expressions
  9. @cindex regular expressions
  10. @cindex regex
  11. @cindex emacs regexp
  12. A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
  13. describes a whole class of strings. A full description of regular
  14. expressions and their syntax is beyond the scope of this manual.
  15. If your system does not include a POSIX regular expression library,
  16. and you have not linked Guile with a third-party regexp library such
  17. as Rx, these functions will not be available. You can tell whether
  18. your Guile installation includes regular expression support by
  19. checking whether @code{(provided? 'regex)} returns true.
  20. The following regexp and string matching features are provided by the
  21. @code{(ice-9 regex)} module. Before using the described functions,
  22. you should load this module by executing @code{(use-modules (ice-9
  23. regex))}.
  24. @menu
  25. * Regexp Functions:: Functions that create and match regexps.
  26. * Match Structures:: Finding what was matched by a regexp.
  27. * Backslash Escapes:: Removing the special meaning of regexp
  28. meta-characters.
  29. @end menu
  30. @node Regexp Functions
  31. @subsection Regexp Functions
By default, Guile supports POSIX extended regular expressions.  That
means that the characters @samp{(}, @samp{)}, @samp{+} and @samp{?} are
special, and must be escaped if you wish to match them literally.
There is no support for ``non-greedy'' variants of @samp{*},
@samp{+} or @samp{?}.
  37. This regular expression interface was modeled after that
  38. implemented by SCSH, the Scheme Shell. It is intended to be
  39. upwardly compatible with SCSH regular expressions.
Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
strings, since the underlying C functions treat a zero byte as the end
of the string.  If a pattern or input string contains a zero byte, an
error is thrown.
  43. Internally, patterns and input strings are converted to the current
  44. locale's encoding, and then passed to the C library's regular expression
  45. routines (@pxref{Regular Expressions,,, libc, The GNU C Library
  46. Reference Manual}). The returned match structures always point to
  47. characters in the strings, not to individual bytes, even in the case of
  48. multi-byte encodings. This ensures that the match structures are
  49. correct when performing matching with characters that have a multi-byte
  50. representation in the locale encoding. Note, however, that using
  51. characters which cannot be represented in the locale encoding can
  52. lead to surprising results.
  53. @deffn {Scheme Procedure} string-match pattern str [start]
  54. Compile the string @var{pattern} into a regular expression and compare
  55. it with @var{str}. The optional numeric argument @var{start} specifies
  56. the position of @var{str} at which to begin matching.
  57. @code{string-match} returns a @dfn{match structure} which
  58. describes what, if anything, was matched by the regular
  59. expression. @xref{Match Structures}. If @var{str} does not match
  60. @var{pattern} at all, @code{string-match} returns @code{#f}.
  61. @end deffn
  62. Two examples of a match follow. In the first example, the pattern
  63. matches the four digits in the match string. In the second, the pattern
  64. matches nothing.
@example
(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
@result{} #("blah2002" (4 . 8))

(string-match "[A-Za-z]" "123456")
@result{} #f
@end example
  71. Each time @code{string-match} is called, it must compile its
  72. @var{pattern} argument into a regular expression structure. This
  73. operation is expensive, which makes @code{string-match} inefficient if
  74. the same regular expression is used several times (for example, in a
  75. loop). For better performance, you can compile a regular expression in
  76. advance and then match strings against the compiled regexp.
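
As a rough sketch of that approach, the following compiles a pattern
once with @code{make-regexp} and reuses it through @code{regexp-exec}
(both described below).  The names @code{digits-rx} and
@code{keep-numeric} are hypothetical, chosen only for this illustration.

@lisp
(use-modules (ice-9 regex))

;; Compile the pattern once, outside the loop.
(define digits-rx (make-regexp "^[0-9]+$"))

;; Keep only the strings that consist entirely of digits.
;; regexp-exec returns a match structure or #f, so it can
;; serve directly as the predicate.
(define (keep-numeric strings)
  (filter (lambda (s) (regexp-exec digits-rx s)) strings))

(keep-numeric '("42" "abc" "007" "4x"))
;; @result{} ("42" "007")
@end lisp
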
  77. @deffn {Scheme Procedure} make-regexp pat flag@dots{}
  78. @deffnx {C Function} scm_make_regexp (pat, flaglst)
  79. Compile the regular expression described by @var{pat}, and
  80. return the compiled regexp structure. If @var{pat} does not
  81. describe a legal regular expression, @code{make-regexp} throws
  82. a @code{regular-expression-syntax} error.
  83. The @var{flag} arguments change the behavior of the compiled
  84. regular expression. The following values may be supplied:
  85. @defvar regexp/icase
  86. Consider uppercase and lowercase letters to be the same when
  87. matching.
  88. @end defvar
  89. @defvar regexp/newline
  90. If a newline appears in the target string, then permit the
  91. @samp{^} and @samp{$} operators to match immediately after or
  92. immediately before the newline, respectively. Also, the
  93. @samp{.} and @samp{[^...]} operators will never match a newline
  94. character. The intent of this flag is to treat the target
  95. string as a buffer containing many lines of text, and the
  96. regular expression as a pattern that may match a single one of
  97. those lines.
  98. @end defvar
  99. @defvar regexp/basic
  100. Compile a basic (``obsolete'') regexp instead of the extended
  101. (``modern'') regexps that are the default. Basic regexps do
  102. not consider @samp{|}, @samp{+} or @samp{?} to be special
  103. characters, and require the @samp{@{...@}} and @samp{(...)}
  104. metacharacters to be backslash-escaped (@pxref{Backslash
  105. Escapes}). There are several other differences between basic
  106. and extended regular expressions, but these are the most
  107. significant.
  108. @end defvar
  109. @defvar regexp/extended
  110. Compile an extended regular expression rather than a basic
  111. regexp. This is the default behavior; this flag will not
  112. usually be needed. If a call to @code{make-regexp} includes
  113. both @code{regexp/basic} and @code{regexp/extended} flags, the
  114. one which comes last will override the earlier one.
  115. @end defvar
  116. @end deffn
  117. @deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
  118. @deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
Match the compiled regular expression @var{rx} against
@var{str}.  If the optional integer @var{start} argument is
provided, begin matching from that position in the string.
  122. Return a match structure describing the results of the match,
  123. or @code{#f} if no match could be found.
The @var{flags} argument changes the matching behavior.  The following
flag values may be supplied; use @code{logior} (@pxref{Bitwise
Operations}) to combine them.
  127. @defvar regexp/notbol
  128. Consider that the @var{start} offset into @var{str} is not the
  129. beginning of a line and should not match operator @samp{^}.
  130. If @var{rx} was created with the @code{regexp/newline} option above,
  131. @samp{^} will still match after a newline in @var{str}.
  132. @end defvar
  133. @defvar regexp/noteol
  134. Consider that the end of @var{str} is not the end of a line and should
  135. not match operator @samp{$}.
  136. If @var{rx} was created with the @code{regexp/newline} option above,
  137. @samp{$} will still match before a newline in @var{str}.
  138. @end defvar
  139. @end deffn
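
The following sketch illustrates several of the flags accepted by
@code{make-regexp} and @code{regexp-exec}.  The results shown assume a
POSIX-conforming C library, and the variable names are made up for the
example.

@lisp
(use-modules (ice-9 regex))

;; regexp/newline lets ^ match just after a newline.
(define line-rx (make-regexp "^b" regexp/newline))
(match:start (regexp-exec line-rx "a\nb"))
;; @result{} 2

;; Without the flag, ^ matches only at the start of the string.
(regexp-exec (make-regexp "^b") "a\nb")
;; @result{} #f

;; regexp/basic makes + an ordinary character.
(define basic-rx (make-regexp "a+" regexp/basic))
(match:substring (regexp-exec basic-rx "aa+"))
;; @result{} "a+"

;; regexp/notbol and regexp/noteol, combined with logior,
;; stop ^ and $ from matching at the ends of the string.
(regexp-exec (make-regexp "^a$") "a" 0
             (logior regexp/notbol regexp/noteol))
;; @result{} #f
@end lisp
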
@lisp
;; Regexp to match a run of uppercase letters (possibly empty)
(define r (make-regexp "[A-Z]*"))
;; Regexp to match a run of letters, ignoring case
(define ri (make-regexp "[A-Z]*" regexp/icase))
;; Search for bob using regexp r
(match:substring (regexp-exec r "bob"))
@result{} ""                  ; only the empty string matches
;; Search for Bob using regexp ri
(match:substring (regexp-exec ri "Bob"))
@result{} "Bob"               ; matched case insensitive
@end lisp
  152. @deffn {Scheme Procedure} regexp? obj
  153. @deffnx {C Function} scm_regexp_p (obj)
  154. Return @code{#t} if @var{obj} is a compiled regular expression,
  155. or @code{#f} otherwise.
  156. @end deffn
  157. @sp 1
  158. @deffn {Scheme Procedure} list-matches regexp str [flags]
  159. Return a list of match structures which are the non-overlapping
  160. matches of @var{regexp} in @var{str}. @var{regexp} can be either a
  161. pattern string or a compiled regexp. The @var{flags} argument is as
  162. per @code{regexp-exec} above.
@example
(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
@result{} ("abc" "def")
@end example
  167. @end deffn
  168. @deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
  169. Apply @var{proc} to the non-overlapping matches of @var{regexp} in
  170. @var{str}, to build a result. @var{regexp} can be either a pattern
  171. string or a compiled regexp. The @var{flags} argument is as per
  172. @code{regexp-exec} above.
  173. @var{proc} is called as @code{(@var{proc} match prev)} where
  174. @var{match} is a match structure and @var{prev} is the previous return
  175. from @var{proc}. For the first call @var{prev} is the given
  176. @var{init} parameter. @code{fold-matches} returns the final value
  177. from @var{proc}.
  178. For example to count matches,
@example
(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
              (lambda (match count)
                (1+ count)))
@result{} 2
@end example
  185. @end deffn
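
As a further illustration, @code{fold-matches} can accumulate a value
computed from each match rather than a simple count.  The helper
@code{sum-numbers} below is a hypothetical name used only for this
sketch.

@lisp
(use-modules (ice-9 regex))

;; Add up every run of digits found in the string.
(define (sum-numbers str)
  (fold-matches "[0-9]+" str 0
                (lambda (m total)
                  (+ total (string->number (match:substring m))))))

(sum-numbers "a1 b22 c333")
;; @result{} 356
@end lisp
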
  186. @sp 1
  187. Regular expressions are commonly used to find patterns in one string
  188. and replace them with the contents of another string. The following
  189. functions are convenient ways to do this.
  190. @c begin (scm-doc-string "regex.scm" "regexp-substitute")
  191. @deffn {Scheme Procedure} regexp-substitute port match item @dots{}
  192. Write to @var{port} selected parts of the match structure @var{match}.
  193. Or if @var{port} is @code{#f} then form a string from those parts and
  194. return that.
  195. Each @var{item} specifies a part to be written, and may be one of the
  196. following,
  197. @itemize @bullet
  198. @item
  199. A string. String arguments are written out verbatim.
  200. @item
  201. An integer. The submatch with that number is written
  202. (@code{match:substring}). Zero is the entire match.
  203. @item
  204. The symbol @samp{pre}. The portion of the matched string preceding
  205. the regexp match is written (@code{match:prefix}).
  206. @item
  207. The symbol @samp{post}. The portion of the matched string following
  208. the regexp match is written (@code{match:suffix}).
  209. @end itemize
  210. For example, changing a match and retaining the text before and after,
@example
(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
                   'pre "37" 'post)
@result{} "number 37 is good"
@end example
  216. Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
  217. re-ordering and hyphenating the fields.
@lisp
(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute #f (string-match date-regex s)
                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
  226. @end deffn
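
The sketch below contrasts the two forms of the @var{port} argument to
@code{regexp-substitute}.  The input string and the variable @code{m}
are invented for the example; @code{call-with-output-string} is used
only as a convenient way to obtain a port.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([a-z]+)=([0-9]+)" "set width=80 now"))

;; With #f as the port, the result is returned as a string.
(regexp-substitute #f m 'pre 2 "=" 1 'post)
;; @result{} "set 80=width now"

;; With a real port, the same kinds of items are written to it.
(call-with-output-string
  (lambda (port)
    (regexp-substitute port m "key: " 1 ", value: " 2)))
;; @result{} "key: width, value: 80"
@end lisp
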
@c begin (scm-doc-string "regex.scm" "regexp-substitute/global")
  228. @deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
  229. @cindex search and replace
  230. Write to @var{port} selected parts of matches of @var{regexp} in
  231. @var{target}. If @var{port} is @code{#f} then form a string from
  232. those parts and return that. @var{regexp} can be a string or a
  233. compiled regex.
  234. This is similar to @code{regexp-substitute}, but allows global
  235. substitutions on @var{target}. Each @var{item} behaves as per
  236. @code{regexp-substitute}, with the following differences,
  237. @itemize @bullet
  238. @item
  239. A function. Called as @code{(@var{item} match)} with the match
  240. structure for the @var{regexp} match, it should return a string to be
  241. written to @var{port}.
  242. @item
  243. The symbol @samp{post}. This doesn't output anything, but instead
  244. causes @code{regexp-substitute/global} to recurse on the unmatched
  245. portion of @var{target}.
  246. This @emph{must} be supplied to perform a global search and replace on
  247. @var{target}; without it @code{regexp-substitute/global} returns after
  248. a single match and output.
  249. @end itemize
  250. For example, to collapse runs of tabs and spaces to a single hyphen
  251. each,
@example
(regexp-substitute/global #f "[ \t]+"  "this   is   the text"
                          'pre "-" 'post)
@result{} "this-is-the-text"
@end example
  257. Or using a function to reverse the letters in each word,
@example
(regexp-substitute/global #f "[a-z]+" "to do and not-do"
  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
@result{} "ot od dna ton-od"
@end example
Without the @code{post} symbol, just one regexp match is made.  Because
@var{regexp} may be given as a pattern string, the date example from
@code{regexp-substitute} above can also be written without the separate
@code{string-match} call.  Here @code{s} contains only one date, so
@code{post} finds no further match and simply writes out the rest of
the target.

@lisp
(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute/global #f date-regex s
                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
  275. @end deffn
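
The effect of omitting @code{post} can be seen in the following sketch,
which also shows a function item computing each replacement.  The input
strings are invented for the example.

@lisp
(use-modules (ice-9 regex))

;; Without post: processing stops after the first match, and the
;; text following that match is not written.
(regexp-substitute/global #f "[ \t]+" "this is the text"
                          'pre "-")
;; @result{} "this-"

;; A function item receives each match structure in turn.
(regexp-substitute/global #f "[0-9]+" "3 apples, 12 pears"
  'pre
  (lambda (m)
    (number->string (* 2 (string->number (match:substring m)))))
  'post)
;; @result{} "6 apples, 24 pears"
@end lisp
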
  276. @node Match Structures
  277. @subsection Match Structures
  278. @cindex match structures
  279. A @dfn{match structure} is the object returned by @code{string-match} and
  280. @code{regexp-exec}. It describes which portion of a string, if any,
  281. matched the given regular expression. Match structures include: a
  282. reference to the string that was checked for matches; the starting and
  283. ending positions of the regexp match; and, if the regexp included any
  284. parenthesized subexpressions, the starting and ending positions of each
  285. submatch.
  286. In each of the regexp match functions described below, the @code{match}
  287. argument must be a match structure returned by a previous call to
  288. @code{string-match} or @code{regexp-exec}. Most of these functions
  289. return some information about the original target string that was
  290. matched against a regular expression; we will call that string
  291. @var{target} for easy reference.
  292. @c begin (scm-doc-string "regex.scm" "regexp-match?")
  293. @deffn {Scheme Procedure} regexp-match? obj
  294. Return @code{#t} if @var{obj} is a match structure returned by a
  295. previous call to @code{regexp-exec}, or @code{#f} otherwise.
  296. @end deffn
  297. @c begin (scm-doc-string "regex.scm" "match:substring")
  298. @deffn {Scheme Procedure} match:substring match [n]
  299. Return the portion of @var{target} matched by subexpression number
  300. @var{n}. Submatch 0 (the default) represents the entire regexp match.
  301. If the regular expression as a whole matched, but the subexpression
  302. number @var{n} did not match, return @code{#f}.
  303. @end deffn
@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:substring s)
@result{} "2002"

;; match starting at offset 6 in the string
(match:substring
  (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
@result{} "7654"
@end lisp
  313. @c begin (scm-doc-string "regex.scm" "match:start")
  314. @deffn {Scheme Procedure} match:start match [n]
  315. Return the starting position of submatch number @var{n}.
  316. @end deffn
  317. In the following example, the result is 4, since the match starts at
  318. character index 4:
@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:start s)
@result{} 4
@end lisp
  324. @c begin (scm-doc-string "regex.scm" "match:end")
  325. @deffn {Scheme Procedure} match:end match [n]
  326. Return the ending position of submatch number @var{n}.
  327. @end deffn
  328. In the following example, the result is 8, since the match runs between
  329. characters 4 and 8 (i.e.@: the ``2002'').
@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:end s)
@result{} 8
@end lisp
  335. @c begin (scm-doc-string "regex.scm" "match:prefix")
  336. @deffn {Scheme Procedure} match:prefix match
  337. Return the unmatched portion of @var{target} preceding the regexp match.
@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:prefix s)
@result{} "blah"
@end lisp
  343. @end deffn
  344. @c begin (scm-doc-string "regex.scm" "match:suffix")
  345. @deffn {Scheme Procedure} match:suffix match
  346. Return the unmatched portion of @var{target} following the regexp match.
  347. @end deffn
@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:suffix s)
@result{} "foo"
@end lisp
  353. @c begin (scm-doc-string "regex.scm" "match:count")
  354. @deffn {Scheme Procedure} match:count match
  355. Return the number of parenthesized subexpressions from @var{match}.
  356. Note that the entire regular expression match itself counts as a
  357. subexpression, and failed submatches are included in the count.
  358. @end deffn
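
The following sketch shows how the accessors behave when a
parenthesized subexpression takes no part in the match.  The pattern
and the variable @code{m} are chosen only for illustration.

@lisp
(use-modules (ice-9 regex))

;; Two parenthesized subexpressions; only the second can match "b".
(define m (string-match "(a)|(b)" "b"))

(regexp-match? m)
;; @result{} #t
(match:count m)          ; whole match plus two subexpressions
;; @result{} 3
(match:substring m 1)    ; subexpression 1 did not take part
;; @result{} #f
(match:substring m 2)
;; @result{} "b"
(match:start m 2)
;; @result{} 0
(match:end m 2)
;; @result{} 1
@end lisp
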
  359. @c begin (scm-doc-string "regex.scm" "match:string")
  360. @deffn {Scheme Procedure} match:string match
  361. Return the original @var{target} string.
  362. @end deffn
@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:string s)
@result{} "blah2002foo"
@end lisp
  368. @node Backslash Escapes
  369. @subsection Backslash Escapes
  370. Sometimes you will want a regexp to match characters like @samp{*} or
  371. @samp{$} exactly. For example, to check whether a particular string
  372. represents a menu entry from an Info node, it would be useful to match
  373. it against a regexp like @samp{^* [^:]*::}. However, this won't work;
  374. because the asterisk is a metacharacter, it won't match the @samp{*} at
  375. the beginning of the string. In this case, we want to make the first
  376. asterisk un-magic.
  377. You can do this by preceding the metacharacter with a backslash
  378. character @samp{\}. (This is also called @dfn{quoting} the
  379. metacharacter, and is known as a @dfn{backslash escape}.) When Guile
  380. sees a backslash in a regular expression, it considers the following
  381. glyph to be an ordinary character, no matter what special meaning it
  382. would ordinarily have. Therefore, we can make the above example work by
  383. changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
  384. the regular expression engine to match only a single asterisk in the
  385. target string.
  386. Since the backslash is itself a metacharacter, you may force a regexp to
  387. match a backslash in the target string by preceding the backslash with
  388. itself. For example, to find variable references in a @TeX{} program,
  389. you might want to find occurrences of the string @samp{\let\} followed
  390. by any number of alphabetic characters. The regular expression
  391. @samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
  392. regexp each match a single backslash in the target string.
  393. @c begin (scm-doc-string "regex.scm" "regexp-quote")
  394. @deffn {Scheme Procedure} regexp-quote str
  395. Quote each special character found in @var{str} with a backslash, and
  396. return the resulting string.
  397. @end deffn
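
As a small sketch of when @code{regexp-quote} is useful, consider
searching for text that happens to contain a metacharacter.  The
variable @code{literal} and the sample strings are hypothetical.  The
exact form of the quoted string is not shown, since only its matching
behavior matters here.

@lisp
(use-modules (ice-9 regex))

(define literal "1+1=2")

;; Used as a pattern directly, + means "one or more", so the
;; literal text is not found.
(string-match literal "so 1+1=2 holds")
;; @result{} #f

;; Quoting first makes every character match itself.
(match:substring
  (string-match (regexp-quote literal) "so 1+1=2 holds"))
;; @result{} "1+1=2"
@end lisp
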
  398. @strong{Very important:} Using backslash escapes in Guile source code
  399. (as in Emacs Lisp or C) can be tricky, because the backslash character
  400. has special meaning for the Guile reader. For example, if Guile
  401. encounters the character sequence @samp{\n} in the middle of a string
  402. while processing Scheme code, it replaces those characters with a
  403. newline character. Similarly, the character sequence @samp{\t} is
  404. replaced by a horizontal tab. Several of these @dfn{escape sequences}
  405. are processed by the Guile reader before your code is executed.
  406. Unrecognized escape sequences are ignored: if the characters @samp{\*}
  407. appear in a string, they will be translated to the single character
  408. @samp{*}.
  409. This translation is obviously undesirable for regular expressions, since
  410. we want to be able to include backslashes in a string in order to
  411. escape regexp metacharacters. Therefore, to make sure that a backslash
  412. is preserved in a string in your Guile program, you must use @emph{two}
  413. consecutive backslashes:
@lisp
(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
@end lisp
  417. The string in this example is preprocessed by the Guile reader before
  418. any code is executed. The resulting argument to @code{make-regexp} is
  419. the string @samp{^\* [^:]*}, which is what we really want.
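
A brief sketch of this pattern in use follows; the sample strings are
invented for the example.

@lisp
(use-modules (ice-9 regex))

(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))

;; The escaped asterisk matches the literal * at the start.
(match:substring
  (regexp-exec Info-menu-entry-pattern "* Match Structures:: Finding"))
;; @result{} "* Match Structures"

;; A line that does not begin with "* " is rejected.
(regexp-exec Info-menu-entry-pattern "Match * Structures")
;; @result{} #f
@end lisp
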
  420. This also means that in order to write a regular expression that matches
  421. a single backslash character, the regular expression string in the
  422. source code must include @emph{four} backslashes. Each consecutive pair
  423. of backslashes gets translated by the Guile reader to a single
  424. backslash, and the resulting double-backslash is interpreted by the
  425. regexp engine as matching a single backslash character. Hence:
@lisp
(define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
@end lisp
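
The two levels of translation can be checked directly, as in this
minimal sketch; @code{backslash-rx} is a hypothetical name.

@lisp
(use-modules (ice-9 regex))

;; Four backslashes in source: the reader produces a two-character
;; string, which the regexp engine reads as one literal backslash.
(string-length "\\\\")
;; @result{} 2

(define backslash-rx (make-regexp "\\\\"))

;; "a\\b" in source is the three-character string a, backslash, b.
(match:start (regexp-exec backslash-rx "a\\b"))
;; @result{} 1
@end lisp
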
  429. The reason for the unwieldiness of this syntax is historical. Both
  430. regular expression pattern matchers and Unix string processing systems
  431. have traditionally used backslashes with the special meanings
  432. described above. The POSIX regular expression specification and ANSI C
  433. standard both require these semantics. Attempting to abandon either
  434. convention would cause other kinds of compatibility problems, possibly
  435. more severe ones. Therefore, without extending the Scheme reader to
  436. support strings with different quoting conventions (an ungainly and
  437. confusing extension when implemented in other languages), we must adhere
  438. to this cumbersome escape syntax.