api-peg.texi 33 KB

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916016116216316416516616716816917017117217317417517617717817918018118218318418518618718818919019119219319419519619719819920020120220320420520620720820921021121221321421521621721821922022122222322422522622722822923023123223323423523623723823924024124224324424524624724824925025125225325425525625725825926026126226326426526626726826927027127227327427527627727827928028128228328428528628728828929029129229329429529629729829930030130230330430530630730830931031131231331431531631731831932032132232332432532632732832933033133233333433533633733833934034134234334434534634734834935035135235335435535635735835936036136236336436536636736836937037137237337437537637737837938038138238338438538638738838939039139239339439539639739839940040140240340440540640740840941041141241341441541641741841942042142242342442542642742842943043143243343443543643743843944044144244344444544644744844945045145245345445545645745845946046146246346446546646746846947047147247347447547647747847948048148248348448548648748848949049149249349449549649749849950050150250350450550650750850951051151251351451551651751851952052152252352452552652752852953053153253353453553653753853954054154254354454554654754854955055155255355455555655755855956056156256356456556656756856957057157257357457557657757857958058158258358458558658758858959059159259359459559659759859960060160260360460560660760860961061161261361461561661761861962062162262362462562662762862963063163263363463563663763863964064164264364464564664764864965065165265365465565665765865966066166266366466566666766866967067167267367467567667767867968068168268368468568668768868969069169269369469569669769869970070170270370470570670770870971071171271371471571671771871972072172272372472572672772872973073173273373473573673773873974074174274374474574674774874975075175275375475575675775875976076176276376476576676776876977077177277377477577677777877978078178278378478578678778878979079179279379479579679779879980080180280380480580680780880981081181281381481581681781881982082182282382482582682782882983083183283383483583683783883984084184284384484584684784884985085185285385485585685785885986086186286386486586686786886987087187287387487587687787887988088188288388488588688788888989089189289389489589689789889990090190290390490590690790890991091191291391491591691791891992092192292392492592692792892993093193293393493593693793893994094194294394494594694794894995095195295395495595695795895996096196296396496596696796896997097197297397497597697797897998098198298398498598698798898999099199299399499599699799899910001001100210031004100510061007100810091010101110121013101410151016101710181019102010211022102310241025102610271028102910301031103210331034103510361037
  1. @c -*-texinfo-*-
  2. @c This is part of the GNU Guile Reference Manual.
  3. @c Copyright (C) 2006, 2010, 2011
  4. @c Free Software Foundation, Inc.
  5. @c See the file guile.texi for copying conditions.
  6. @node PEG Parsing
  7. @section PEG Parsing
  8. Parsing Expression Grammars (PEGs) are a way of specifying formal
  9. languages for text processing. They can be used either for matching
  10. (like regular expressions) or for building recursive descent parsers
  11. (like lex/yacc). Guile uses a superset of PEG syntax that allows more
  12. control over what information is preserved during parsing.
  13. Wikipedia has a clear and concise introduction to PEGs if you want to
  14. familiarize yourself with the syntax:
  15. @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
  16. The module works by compiling PEGs down to lambda expressions. These
  17. can either be stored in variables at compile-time by the define macros
  18. (@code{define-peg-pattern} and @code{define-peg-string-patterns}) or calculated
  19. explicitly at runtime with the compile functions
  20. (@code{compile-peg-pattern} and @code{peg-string-compile}).
  21. They can then be used for either parsing (@code{match-pattern}) or searching
  22. (@code{search-for-pattern}). For convenience, @code{search-for-pattern}
  23. also takes pattern literals in case you want to inline a simple search
  24. (people often use regular expressions this way).
  25. The rest of this documentation consists of a syntax reference, an API
  26. reference, and a tutorial.
  27. @menu
  28. * PEG Syntax Reference::
  29. * PEG API Reference::
  30. * PEG Tutorial::
  31. * PEG Internals::
  32. @end menu
  33. @node PEG Syntax Reference
  34. @subsection PEG Syntax Reference
  35. @subsubheading Normal PEG Syntax:
  36. @deftp {PEG Pattern} sequence a b
  37. Parses @var{a}. If this succeeds, continues to parse @var{b} from the
  38. end of the text parsed as @var{a}. Succeeds if both @var{a} and
  39. @var{b} succeed.
  40. @code{"a b"}
  41. @code{(and a b)}
  42. @end deftp
  43. @deftp {PEG Pattern} {ordered choice} a b
  44. Parses @var{a}. If this fails, backtracks and parses @var{b}.
  45. Succeeds if either @var{a} or @var{b} succeeds.
  46. @code{"a/b"}
  47. @code{(or a b)}
  48. @end deftp
  49. @deftp {PEG Pattern} {zero or more} a
  50. Parses @var{a} as many times in a row as it can, starting each @var{a}
  51. at the end of the text parsed by the previous @var{a}. Always
  52. succeeds.
  53. @code{"a*"}
  54. @code{(* a)}
  55. @end deftp
  56. @deftp {PEG Pattern} {one or more} a
  57. Parses @var{a} as many times in a row as it can, starting each @var{a}
  58. at the end of the text parsed by the previous @var{a}. Succeeds if at
  59. least one @var{a} was parsed.
  60. @code{"a+"}
  61. @code{(+ a)}
  62. @end deftp
  63. @deftp {PEG Pattern} optional a
  64. Tries to parse @var{a}. Succeeds if @var{a} succeeds.
  65. @code{"a?"}
  66. @code{(? a)}
  67. @end deftp
  68. @deftp {PEG Pattern} {followed by} a
  69. Makes sure it is possible to parse @var{a}, but does not actually parse
  70. it. Succeeds if @var{a} would succeed.
  71. @code{"&a"}
  72. @code{(followed-by a)}
  73. @end deftp
  74. @deftp {PEG Pattern} {not followed by} a
  75. Makes sure it is impossible to parse @var{a}, but does not actually
  76. parse it. Succeeds if @var{a} would fail.
  77. @code{"!a"}
  78. @code{(not-followed-by a)}
  79. @end deftp
  80. @deftp {PEG Pattern} {string literal} ``abc''
  81. Parses the string @var{"abc"}. Succeeds if that parsing succeeds.
  82. @code{"'abc'"}
  83. @code{"abc"}
  84. @end deftp
  85. @deftp {PEG Pattern} {any character}
  86. Parses any single character. Succeeds unless there is no more text to
  87. be parsed.
  88. @code{"."}
  89. @code{peg-any}
  90. @end deftp
  91. @deftp {PEG Pattern} {character class} a b
  92. Alternative syntax for ``Ordered Choice @var{a} @var{b}'' if @var{a} and
  93. @var{b} are characters.
  94. @code{"[ab]"}
  95. @code{(or "a" "b")}
  96. @end deftp
  97. @deftp {PEG Pattern} {range of characters} a z
  98. Parses any character falling between @var{a} and @var{z}.
  99. @code{"[a-z]"}
  100. @code{(range #\a #\z)}
  101. @end deftp
  102. Example:
  103. @example
  104. "(a !b / c &d*) 'e'+"
  105. @end example
  106. Would be:
  107. @lisp
  108. (and
  109. (or
  110. (and a (not-followed-by b))
  111. (and c (followed-by (* d))))
  112. (+ "e"))
  113. @end lisp
  114. @subsubheading Extended Syntax
  115. There is some extra syntax for S-expressions.
  116. @deftp {PEG Pattern} ignore a
  117. Ignore the text matching @var{a}
  118. @end deftp
  119. @deftp {PEG Pattern} capture a
  120. Capture the text matching @var{a}.
  121. @end deftp
  122. @deftp {PEG Pattern} peg a
  123. Embed the PEG pattern @var{a} using string syntax.
  124. @end deftp
  125. Example:
  126. @example
  127. "!a / 'b'"
  128. @end example
  129. Is equivalent to
  130. @lisp
  131. (or (peg "!a") "b")
  132. @end lisp
  133. and
  134. @lisp
  135. (or (not-followed-by a) "b")
  136. @end lisp
  137. @node PEG API Reference
  138. @subsection PEG API Reference
  139. @subsubheading Define Macros
  140. The most straightforward way to define a PEG is by using one of the
  141. define macros (both of these macroexpand into @code{define}
  142. expressions). These macros bind parsing functions to variables. These
  143. parsing functions may be invoked by @code{match-pattern} or
  144. @code{search-for-pattern}, which return a PEG match record. Raw data can be
  145. retrieved from this record with the PEG match deconstructor functions.
  146. More complicated (and perhaps enlightening) examples can be found in the
  147. tutorial.
  148. @deffn {Scheme Macro} define-peg-string-patterns peg-string
  149. Defines all the nonterminals in the PEG @var{peg-string}. More
  150. precisely, @code{define-peg-string-patterns} takes a superset of PEGs. A normal PEG
  151. has a @code{<-} between the nonterminal and the pattern.
  152. @code{define-peg-string-patterns} uses this symbol to determine what information it
  153. should propagate up the parse tree. The normal @code{<-} propagates the
  154. matched text up the parse tree, @code{<--} propagates the matched text
  155. up the parse tree tagged with the name of the nonterminal, and @code{<}
  156. discards that matched text and propagates nothing up the parse tree.
  157. Also, nonterminals may consist of any alphanumeric character or a ``-''
  158. character (in normal PEGs nonterminals can only be alphabetic).
  159. For example, if we:
  160. @lisp
  161. (define-peg-string-patterns
  162. "as <- 'a'+
  163. bs <- 'b'+
  164. as-or-bs <- as/bs")
  165. (define-peg-string-patterns
  166. "as-tag <-- 'a'+
  167. bs-tag <-- 'b'+
  168. as-or-bs-tag <-- as-tag/bs-tag")
  169. @end lisp
  170. Then:
  171. @lisp
  172. (match-pattern as-or-bs "aabbcc") @result{}
  173. #<peg start: 0 end: 2 string: aabbcc tree: aa>
  174. (match-pattern as-or-bs-tag "aabbcc") @result{}
  175. #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
  176. @end lisp
  177. Note that in doing this, we have bound 6 variables at the toplevel
  178. (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
  179. @var{as-or-bs-tag}).
  180. @end deffn
  181. @deffn {Scheme Macro} define-peg-pattern name capture-type peg-sexp
  182. Defines a single nonterminal @var{name}. @var{capture-type} determines
  183. how much information is passed up the parse tree. @var{peg-sexp} is a
  184. PEG in S-expression form.
  185. Possible values for capture-type:
  186. @table @code
  187. @item all
  188. passes the matched text up the parse tree tagged with the name of the
  189. nonterminal.
  190. @item body
  191. passes the matched text up the parse tree.
  192. @item none
  193. passes nothing up the parse tree.
  194. @end table
  195. For Example, if we:
  196. @lisp
  197. (define-peg-pattern as body (+ "a"))
  198. (define-peg-pattern bs body (+ "b"))
  199. (define-peg-pattern as-or-bs body (or as bs))
  200. (define-peg-pattern as-tag all (+ "a"))
  201. (define-peg-pattern bs-tag all (+ "b"))
  202. (define-peg-pattern as-or-bs-tag all (or as-tag bs-tag))
  203. @end lisp
  204. Then:
  205. @lisp
  206. (match-pattern as-or-bs "aabbcc") @result{}
  207. #<peg start: 0 end: 2 string: aabbcc tree: aa>
  208. (match-pattern as-or-bs-tag "aabbcc") @result{}
  209. #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
  210. @end lisp
  211. Note that in doing this, we have bound 6 variables at the toplevel
  212. (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
  213. @var{as-or-bs-tag}).
  214. @end deffn
  215. @subsubheading Compile Functions
  216. It is sometimes useful to be able to compile anonymous PEG patterns at
  217. runtime. These functions let you do that using either syntax.
  218. @deffn {Scheme Procedure} peg-string-compile peg-string capture-type
  219. Compiles the PEG pattern in @var{peg-string} propagating according to
  220. @var{capture-type} (capture-type can be any of the values from
  221. @code{define-peg-pattern}).
  222. @end deffn
  223. @deffn {Scheme Procedure} compile-peg-pattern peg-sexp capture-type
  224. Compiles the PEG pattern in @var{peg-sexp} propagating according to
  225. @var{capture-type} (capture-type can be any of the values from
  226. @code{define-peg-pattern}).
  227. @end deffn
  228. The functions return syntax objects, which can be useful if you want to
  229. use them in macros. If all you want is to define a new nonterminal, you
  230. can do the following:
  231. @lisp
  232. (define exp '(+ "a"))
  233. (define as (compile (compile-peg-pattern exp 'body)))
  234. @end lisp
  235. You can use this nonterminal with all of the regular PEG functions:
  236. @lisp
  237. (match-pattern as "aaaaa") @result{}
  238. #<peg start: 0 end: 5 string: bbbbb tree: bbbbb>
  239. @end lisp
  240. @subsubheading Parsing & Matching Functions
  241. For our purposes, ``parsing'' means parsing a string into a tree
  242. starting from the first character, while ``matching'' means searching
  243. through the string for a substring. In practice, the only difference
  244. between the two functions is that @code{match-pattern} gives up if it can't
  245. find a valid substring starting at index 0 and @code{search-for-pattern} keeps
  246. looking. They are both equally capable of ``parsing'' and ``matching''
  247. given those constraints.
  248. @deffn {Scheme Procedure} match-pattern nonterm string
  249. Parses @var{string} using the PEG stored in @var{nonterm}. If no match
  250. was found, @code{match-pattern} returns false. If a match was found, a PEG
  251. match record is returned.
  252. The @code{capture-type} argument to @code{define-peg-pattern} allows you to
  253. choose what information to hold on to while parsing. The options are:
  254. @table @code
  255. @item all
  256. tag the matched text with the nonterminal
  257. @item body
  258. just the matched text
  259. @item none
  260. nothing
  261. @end table
  262. @lisp
  263. (define-peg-pattern as all (+ "a"))
  264. (match-pattern as "aabbcc") @result{}
  265. #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
  266. (define-peg-pattern as body (+ "a"))
  267. (match-pattern as "aabbcc") @result{}
  268. #<peg start: 0 end: 2 string: aabbcc tree: aa>
  269. (define-peg-pattern as none (+ "a"))
  270. (match-pattern as "aabbcc") @result{}
  271. #<peg start: 0 end: 2 string: aabbcc tree: ()>
  272. (define-peg-pattern bs body (+ "b"))
  273. (match-pattern bs "aabbcc") @result{}
  274. #f
  275. @end lisp
  276. @end deffn
  277. @deffn {Scheme Macro} search-for-pattern nonterm-or-peg string
  278. Searches through @var{string} looking for a matching subexpression.
  279. @var{nonterm-or-peg} can either be a nonterminal or a literal PEG
  280. pattern. When a literal PEG pattern is provided, @code{search-for-pattern} works
  281. very similarly to the regular expression searches many hackers are used
  282. to. If no match was found, @code{search-for-pattern} returns false. If a match
  283. was found, a PEG match record is returned.
  284. @lisp
  285. (define-peg-pattern as body (+ "a"))
  286. (search-for-pattern as "aabbcc") @result{}
  287. #<peg start: 0 end: 2 string: aabbcc tree: aa>
  288. (search-for-pattern (+ "a") "aabbcc") @result{}
  289. #<peg start: 0 end: 2 string: aabbcc tree: aa>
  290. (search-for-pattern "'a'+" "aabbcc") @result{}
  291. #<peg start: 0 end: 2 string: aabbcc tree: aa>
  292. (define-peg-pattern as all (+ "a"))
  293. (search-for-pattern as "aabbcc") @result{}
  294. #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
  295. (define-peg-pattern bs body (+ "b"))
  296. (search-for-pattern bs "aabbcc") @result{}
  297. #<peg start: 2 end: 4 string: aabbcc tree: bb>
  298. (search-for-pattern (+ "b") "aabbcc") @result{}
  299. #<peg start: 2 end: 4 string: aabbcc tree: bb>
  300. (search-for-pattern "'b'+" "aabbcc") @result{}
  301. #<peg start: 2 end: 4 string: aabbcc tree: bb>
  302. (define-peg-pattern zs body (+ "z"))
  303. (search-for-pattern zs "aabbcc") @result{}
  304. #f
  305. (search-for-pattern (+ "z") "aabbcc") @result{}
  306. #f
  307. (search-for-pattern "'z'+" "aabbcc") @result{}
  308. #f
  309. @end lisp
  310. @end deffn
  311. @subsubheading PEG Match Records
  312. The @code{match-pattern} and @code{search-for-pattern} functions both return PEG
  313. match records. Actual information can be extracted from these with the
  314. following functions.
  315. @deffn {Scheme Procedure} peg:string match-record
  316. Returns the original string that was parsed in the creation of
  317. @code{match-record}.
  318. @end deffn
  319. @deffn {Scheme Procedure} peg:start match-record
  320. Returns the index of the first parsed character in the original string
  321. (from @code{peg:string}). If this is the same as @code{peg:end},
  322. nothing was parsed.
  323. @end deffn
  324. @deffn {Scheme Procedure} peg:end match-record
  325. Returns one more than the index of the last parsed character in the
  326. original string (from @code{peg:string}). If this is the same as
  327. @code{peg:start}, nothing was parsed.
  328. @end deffn
  329. @deffn {Scheme Procedure} peg:substring match-record
  330. Returns the substring parsed by @code{match-record}. This is equivalent to
  331. @code{(substring (peg:string match-record) (peg:start match-record) (peg:end
  332. match-record))}.
  333. @end deffn
  334. @deffn {Scheme Procedure} peg:tree match-record
  335. Returns the tree parsed by @code{match-record}.
  336. @end deffn
  337. @deffn {Scheme Procedure} peg-record? match-record
  338. Returns true if @code{match-record} is a PEG match record, or false
  339. otherwise.
  340. @end deffn
  341. Example:
  342. @lisp
  343. (define-peg-pattern bs all (peg "'b'+"))
  344. (search-for-pattern bs "aabbcc") @result{}
  345. #<peg start: 2 end: 4 string: aabbcc tree: (bs bb)>
  346. (let ((pm (search-for-pattern bs "aabbcc")))
  347. `((string ,(peg:string pm))
  348. (start ,(peg:start pm))
  349. (end ,(peg:end pm))
  350. (substring ,(peg:substring pm))
  351. (tree ,(peg:tree pm))
  352. (record? ,(peg-record? pm)))) @result{}
  353. ((string "aabbcc")
  354. (start 2)
  355. (end 4)
  356. (substring "bb")
  357. (tree (bs "bb"))
  358. (record? #t))
  359. @end lisp
  360. @subsubheading Miscellaneous
  361. @deffn {Scheme Procedure} context-flatten tst lst
  362. Takes a predicate @var{tst} and a list @var{lst}. Flattens @var{lst}
  363. until all elements are either atoms or satisfy @var{tst}. If @var{lst}
  364. itself satisfies @var{tst}, @code{(list lst)} is returned (this is a
  365. flat list whose only element satisfies @var{tst}).
  366. @lisp
  367. (context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
  368. (2 2 (1 1 (2 2)) 2 2 (1 1))
  369. (context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
  370. ((1 1 (1 1 (2 2)) (2 2 (1 1))))
  371. @end lisp
  372. If you're wondering why this is here, take a look at the tutorial.
  373. @end deffn
  374. @deffn {Scheme Procedure} keyword-flatten terms lst
  375. A less general form of @code{context-flatten}. Takes a list of terminal
  376. atoms @code{terms} and flattens @var{lst} until all elements are either
  377. atoms, or lists which have an atom from @code{terms} as their first
  378. element.
  379. @lisp
  380. (keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{}
  381. (c a b (a c) (b c) c (b a) c a)
  382. @end lisp
  383. If you're wondering why this is here, take a look at the tutorial.
  384. @end deffn
  385. @node PEG Tutorial
  386. @subsection PEG Tutorial
  387. @subsubheading Parsing /etc/passwd
  388. This example will show how to parse /etc/passwd using PEGs.
  389. First we define an example /etc/passwd file:
  390. @lisp
  391. (define *etc-passwd*
  392. "root:x:0:0:root:/root:/bin/bash
  393. daemon:x:1:1:daemon:/usr/sbin:/bin/sh
  394. bin:x:2:2:bin:/bin:/bin/sh
  395. sys:x:3:3:sys:/dev:/bin/sh
  396. nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
  397. messagebus:x:103:107::/var/run/dbus:/bin/false
  398. ")
  399. @end lisp
  400. As a first pass at this, we might want to have all the entries in
  401. /etc/passwd in a list.
  402. Doing this with string-based PEG syntax would look like this:
  403. @lisp
  404. (define-peg-string-patterns
  405. "passwd <- entry* !.
  406. entry <-- (! NL .)* NL*
  407. NL < '\n'")
  408. @end lisp
  409. A @code{passwd} file is 0 or more entries (@code{entry*}) until the end
  410. of the file (@code{!.} (@code{.} is any character, so @code{!.} means
  411. ``not anything'')). We want to capture the data in the nonterminal
  412. @code{passwd}, but not tag it with the name, so we use @code{<-}.
  413. An entry is a series of 0 or more characters that aren't newlines
  414. (@code{(! NL .)*}) followed by 0 or more newlines (@code{NL*}). We want
  415. to tag all the entries with @code{entry}, so we use @code{<--}.
  416. A newline is just a literal newline (@code{'\n'}). We don't want a
  417. bunch of newlines cluttering up the output, so we use @code{<} to throw
  418. away the captured data.
  419. Here is the same PEG defined using S-expressions:
  420. @lisp
  421. (define-peg-pattern passwd body (and (* entry) (not-followed-by peg-any)))
  422. (define-peg-pattern entry all (and (* (and (not-followed-by NL) peg-any))
  423. (* NL)))
  424. (define-peg-pattern NL none "\n")
  425. @end lisp
  426. Obviously this is much more verbose. On the other hand, it's more
  427. explicit, and thus easier to build automatically. However, there are
  428. some tricks that make S-expressions easier to use in some cases. One is
  429. the @code{ignore} keyword; the string syntax has no way to say ``throw
  430. away this text'' except breaking it out into a separate nonterminal.
  431. For instance, to throw away the newlines we had to define @code{NL}. In
  432. the S-expression syntax, we could have simply written @code{(ignore
  433. "\n")}. Also, for the cases where string syntax is really much cleaner,
  434. the @code{peg} keyword can be used to embed string syntax in
  435. S-expression syntax. For instance, we could have written:
  436. @lisp
  437. (define-peg-pattern passwd body (peg "entry* !."))
  438. @end lisp
  439. However we define it, parsing @code{*etc-passwd*} with the @code{passwd}
  440. nonterminal yields the same results:
  441. @lisp
  442. (peg:tree (match-pattern passwd *etc-passwd*)) @result{}
  443. ((entry "root:x:0:0:root:/root:/bin/bash")
  444. (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
  445. (entry "bin:x:2:2:bin:/bin:/bin/sh")
  446. (entry "sys:x:3:3:sys:/dev:/bin/sh")
  447. (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
  448. (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
  449. @end lisp
  450. However, here is something to be wary of:
  451. @lisp
  452. (peg:tree (match-pattern passwd "one entry")) @result{}
  453. (entry "one entry")
  454. @end lisp
  455. By default, the parse trees generated by PEGs are compressed as much as
  456. possible without losing information. It may not look like this is what
  457. you want at first, but uncompressed parse trees are an enormous headache
  458. (there's no easy way to predict how deep particular lists will nest,
  459. there are empty lists littered everywhere, etc. etc.). One side-effect
  460. of this, however, is that sometimes the compressor is too aggressive.
  461. No information is discarded when @code{((entry "one entry"))} is
  462. compressed to @code{(entry "one entry")}, but in this particular case it
  463. probably isn't what we want.
  464. There are two functions for easily dealing with this:
  465. @code{keyword-flatten} and @code{context-flatten}. The
  466. @code{keyword-flatten} function takes a list of keywords and a list to
  467. flatten, then tries to coerce the list such that the first element of
  468. all sublists is one of the keywords. The @code{context-flatten}
  469. function is similar, but instead of a list of keywords it takes a
  470. predicate that should indicate whether a given sublist is good enough
  471. (refer to the API reference for more details).
  472. What we want here is @code{keyword-flatten}.
  473. @lisp
  474. (keyword-flatten '(entry) (peg:tree (match-pattern passwd *etc-passwd*))) @result{}
  475. ((entry "root:x:0:0:root:/root:/bin/bash")
  476. (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
  477. (entry "bin:x:2:2:bin:/bin:/bin/sh")
  478. (entry "sys:x:3:3:sys:/dev:/bin/sh")
  479. (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
  480. (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
  481. (keyword-flatten '(entry) (peg:tree (match-pattern passwd "one entry"))) @result{}
  482. ((entry "one entry"))
  483. @end lisp
  484. Of course, this is a somewhat contrived example. In practice we would
  485. probably just tag the @code{passwd} nonterminal to remove the ambiguity
  486. (using either the @code{all} keyword for S-expressions or the @code{<--}
  487. symbol for strings)..
  488. @lisp
  489. (define-peg-pattern tag-passwd all (peg "entry* !."))
  490. (peg:tree (match-pattern tag-passwd *etc-passwd*)) @result{}
  491. (tag-passwd
  492. (entry "root:x:0:0:root:/root:/bin/bash")
  493. (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
  494. (entry "bin:x:2:2:bin:/bin:/bin/sh")
  495. (entry "sys:x:3:3:sys:/dev:/bin/sh")
  496. (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
  497. (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
  498. (peg:tree (match-pattern tag-passwd "one entry"))
  499. (tag-passwd
  500. (entry "one entry"))
  501. @end lisp
  502. If you're ever uncertain about the potential results of parsing
  503. something, remember the two absolute rules:
  504. @enumerate
  505. @item
  506. No parsing information will ever be discarded.
  507. @item
  508. There will never be any lists with fewer than 2 elements.
  509. @end enumerate
  510. For the purposes of (1), "parsing information" means things tagged with
  511. the @code{any} keyword or the @code{<--} symbol. Plain strings will be
  512. concatenated.
  513. Let's extend this example a bit more and actually pull some useful
  514. information out of the passwd file:
  515. @lisp
  516. (define-peg-string-patterns
  517. "passwd <-- entry* !.
  518. entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL*
  519. login <-- text
  520. pass <-- text
  521. uid <-- [0-9]*
  522. gid <-- [0-9]*
  523. nameORcomment <-- text
  524. homedir <-- path
  525. shell <-- path
  526. path <-- (SLASH pathELEMENT)*
  527. pathELEMENT <-- (!NL !C !'/' .)*
  528. text <- (!NL !C .)*
  529. C < ':'
  530. NL < '\n'
  531. SLASH < '/'")
  532. @end lisp
  533. This produces rather pretty parse trees:
  534. @lisp
  535. (passwd
  536. (entry (login "root")
  537. (pass "x")
  538. (uid "0")
  539. (gid "0")
  540. (nameORcomment "root")
  541. (homedir (path (pathELEMENT "root")))
  542. (shell (path (pathELEMENT "bin") (pathELEMENT "bash"))))
  543. (entry (login "daemon")
  544. (pass "x")
  545. (uid "1")
  546. (gid "1")
  547. (nameORcomment "daemon")
  548. (homedir
  549. (path (pathELEMENT "usr") (pathELEMENT "sbin")))
  550. (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
  551. (entry (login "bin")
  552. (pass "x")
  553. (uid "2")
  554. (gid "2")
  555. (nameORcomment "bin")
  556. (homedir (path (pathELEMENT "bin")))
  557. (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
  558. (entry (login "sys")
  559. (pass "x")
  560. (uid "3")
  561. (gid "3")
  562. (nameORcomment "sys")
  563. (homedir (path (pathELEMENT "dev")))
  564. (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
  565. (entry (login "nobody")
  566. (pass "x")
  567. (uid "65534")
  568. (gid "65534")
  569. (nameORcomment "nobody")
  570. (homedir (path (pathELEMENT "nonexistent")))
  571. (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
  572. (entry (login "messagebus")
  573. (pass "x")
  574. (uid "103")
  575. (gid "107")
  576. nameORcomment
  577. (homedir
  578. (path (pathELEMENT "var")
  579. (pathELEMENT "run")
  580. (pathELEMENT "dbus")))
  581. (shell (path (pathELEMENT "bin") (pathELEMENT "false")))))
  582. @end lisp
  583. Notice that when there's no entry in a field (e.g. @code{nameORcomment}
  584. for messagebus) the symbol is inserted. This is the ``don't throw away
  585. any information'' rule---we succesfully matched a @code{nameORcomment}
  586. of 0 characters (since we used @code{*} when defining it). This is
  587. usually what you want, because it allows you to e.g. use @code{list-ref}
  588. to pull out elements (since they all have known offsets).
  589. If you'd prefer not to have symbols for empty matches, you can replace
  590. the @code{*} with a @code{+} and add a @code{?} after the
  591. @code{nameORcomment} in @code{entry}. Then it will try to parse 1 or
  592. more characters, fail (inserting nothing into the parse tree), but
  593. continue because it didn't have to match the nameORcomment to continue.
  594. @subsubheading Embedding Arithmetic Expressions
  595. We can parse simple mathematical expressions with the following PEG:
  596. @lisp
  597. (define-peg-string-patterns
  598. "expr <- sum
  599. sum <-- (product ('+' / '-') sum) / product
  600. product <-- (value ('*' / '/') product) / value
  601. value <-- number / '(' expr ')'
  602. number <-- [0-9]+")
  603. @end lisp
  604. Then:
  605. @lisp
  606. (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2")) @result{}
  607. (sum (product (value (number "1")))
  608. "+"
  609. (sum (product
  610. (value (number "1"))
  611. "/"
  612. (product
  613. (value (number "2"))
  614. "*"
  615. (product (value (number "3")))))
  616. "+"
  617. (sum (product
  618. (value "("
  619. (sum (product (value (number "1")))
  620. "+"
  621. (sum (product (value (number "1")))))
  622. ")")
  623. "/"
  624. (product (value (number "2")))))))
  625. @end lisp
  626. There is very little wasted effort in this PEG. The @code{number}
  627. nonterminal has to be tagged because otherwise the numbers might run
  628. together with the arithmetic expressions during the string concatenation
  629. stage of parse-tree compression (the parser will see ``1'' followed by
  630. ``/'' and decide to call it ``1/''). When in doubt, tag.
  631. It is very easy to turn these parse trees into lisp expressions:
  632. @lisp
  633. (define (parse-sum sum left . rest)
  634. (if (null? rest)
  635. (apply parse-product left)
  636. (list (string->symbol (car rest))
  637. (apply parse-product left)
  638. (apply parse-sum (cadr rest)))))
  639. (define (parse-product product left . rest)
  640. (if (null? rest)
  641. (apply parse-value left)
  642. (list (string->symbol (car rest))
  643. (apply parse-value left)
  644. (apply parse-product (cadr rest)))))
  645. (define (parse-value value first . rest)
  646. (if (null? rest)
  647. (string->number (cadr first))
  648. (apply parse-sum (car rest))))
  649. (define parse-expr parse-sum)
  650. @end lisp
  651. (Notice all these functions look very similar; for a more complicated
  652. PEG, it would be worth abstracting.)
  653. Then:
  654. @lisp
  655. (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
  656. (+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2)))
  657. @end lisp
  658. But wait! The associativity is wrong! Where it says @code{(/ 1 (* 2
  659. 3))}, it should say @code{(* (/ 1 2) 3)}.
  660. It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-')
  661. sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) /
  662. product"}, but this is a Bad Idea. PEGs don't support left recursion.
  663. To see why, imagine what the parser will do here. When it tries to
  664. parse @code{sum}, it first has to try and parse @code{sum}. But to do
  665. that, it first has to try and parse @code{sum}. This will continue
  666. until the stack gets blown off.
  667. So how does one parse left-associative binary operators with PEGs?
  668. Honestly, this is one of their major shortcomings. There's no
  669. general-purpose way of doing this, but here the repetition operators are
  670. a good choice:
  671. @lisp
  672. (use-modules (srfi srfi-1))
  673. (define-peg-string-patterns
  674. "expr <- sum
  675. sum <-- (product ('+' / '-'))* product
  676. product <-- (value ('*' / '/'))* value
  677. value <-- number / '(' expr ')'
  678. number <-- [0-9]+")
  679. ;; take a deep breath...
  680. (define (make-left-parser next-func)
  681. (lambda (sum first . rest) ;; general form, comments below assume
  682. ;; that we're dealing with a sum expression
  683. (if (null? rest) ;; form (sum (product ...))
  684. (apply next-func first)
  685. (if (string? (cadr first));; form (sum ((product ...) "+") (product ...))
  686. (list (string->symbol (cadr first))
  687. (apply next-func (car first))
  688. (apply next-func (car rest)))
  689. ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
  690. (car
  691. (reduce ;; walk through the list and build a left-associative tree
  692. (lambda (l r)
  693. (list (list (cadr r) (car r) (apply next-func (car l)))
  694. (string->symbol (cadr l))))
  695. 'ignore
  696. (append ;; make a list of all the products
  697. ;; the first one should be pre-parsed
  698. (list (list (apply next-func (caar first))
  699. (string->symbol (cadar first))))
  700. (cdr first)
  701. ;; the last one has to be added in
  702. (list (append rest '("done"))))))))))
  703. (define (parse-value value first . rest)
  704. (if (null? rest)
  705. (string->number (cadr first))
  706. (apply parse-sum (car rest))))
  707. (define parse-product (make-left-parser parse-value))
  708. (define parse-sum (make-left-parser parse-product))
  709. (define parse-expr parse-sum)
  710. @end lisp
  711. Then:
  712. @lisp
  713. (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
  714. (+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))
  715. @end lisp
  716. As you can see, this is much uglier (it could be made prettier by using
  717. @code{context-flatten}, but the way it's written above makes it clear
  718. how we deal with the three ways the zero-or-more @code{*} expression can
  719. parse). Fortunately, most of the time we can get away with only using
  720. right-associativity.
  721. @subsubheading Simplified Functions
  722. For a more tantalizing example, consider the following grammar that
  723. parses (highly) simplified C functions:
  724. @lisp
  725. (define-peg-string-patterns
  726. "cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB
  727. ctype <-- cidentifier
  728. cname <-- cidentifier
  729. cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP
  730. carg <-- cSP ctype cSP cname
  731. cbody <-- cstatement *
  732. cidentifier <- [a-zA-z][a-zA-Z0-9_]*
  733. cstatement <-- (!';'.)*cSC cSP
  734. cSC < ';'
  735. cCOMMA < ','
  736. cLP < '('
  737. cRP < ')'
  738. cLB < '@{'
  739. cRB < '@}'
  740. cSP < [ \t\n]*")
  741. @end lisp
  742. Then:
  743. @lisp
  744. (match-pattern cfunc "int square(int a) @{ return a*a;@}") @result{}
  745. (32
  746. (cfunc (ctype "int")
  747. (cname "square")
  748. (cargs (carg (ctype "int") (cname "a")))
  749. (cbody (cstatement "return a*a"))))
  750. @end lisp
  751. And:
  752. @lisp
  753. (match-pattern cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{}
  754. (52
  755. (cfunc (ctype "int")
  756. (cname "mod")
  757. (cargs (carg (ctype "int") (cname "a"))
  758. (carg (ctype "int") (cname "b")))
  759. (cbody (cstatement "int c = a/b")
  760. (cstatement "return a- b*c"))))
  761. @end lisp
  762. By wrapping all the @code{carg} nonterminals in a @code{cargs}
  763. nonterminal, we were able to remove any ambiguity in the parsing
  764. structure and avoid having to call @code{context-flatten} on the output
  765. of @code{match-pattern}. We used the same trick with the @code{cstatement}
  766. nonterminals, wrapping them in a @code{cbody} nonterminal.
  767. The whitespace nonterminal @code{cSP} used here is a (very) useful
  768. instantiation of a common pattern for matching syntactically irrelevant
  769. information. Since it's tagged with @code{<} and ends with @code{*} it
  770. won't clutter up the parse trees (all the empty lists will be discarded
  771. during the compression step) and it will never cause parsing to fail.
  772. @node PEG Internals
  773. @subsection PEG Internals
  774. A PEG parser takes a string as input and attempts to parse it as a given
  775. nonterminal. The key idea of the PEG implementation is that every
  776. nonterminal is just a function that takes a string as an argument and
  777. attempts to parse that string as its nonterminal. The functions always
  778. start from the beginning, but a parse is considered successful if there
  779. is material left over at the end.
  780. This makes it easy to model different PEG parsing operations. For
  781. instance, consider the PEG grammar @code{"ab"}, which could also be
  782. written @code{(and "a" "b")}. It matches the string ``ab''. Here's how
  783. that might be implemented in the PEG style:
  784. @lisp
  785. (define (match-and-a-b str)
  786. (match-a str)
  787. (match-b str))
  788. @end lisp
  789. As you can see, the use of functions provides an easy way to model
  790. sequencing. In a similar way, one could model @code{(or a b)} with
  791. something like the following:
  792. @lisp
  793. (define (match-or-a-b str)
  794. (or (match-a str) (match-b str)))
  795. @end lisp
  796. Here the semantics of a PEG @code{or} expression map naturally onto
  797. Scheme's @code{or} operator. This function will attempt to run
  798. @code{(match-a str)}, and return its result if it succeeds. Otherwise it
  799. will run @code{(match-b str)}.
  800. Of course, the code above wouldn't quite work. We need some way for the
  801. parsing functions to communicate. The actual interface used is below.
  802. @subsubheading Parsing Function Interface
  803. A parsing function takes three arguments - a string, the length of that
  804. string, and the position in that string it should start parsing at. In
  805. effect, the parsing functions pass around substrings in pieces - the
  806. first argument is a buffer of characters, and the second two give a
  807. range within that buffer that the parsing function should look at.
  808. Parsing functions return either #f, if they failed to match their
  809. nonterminal, or a list whose first element must be an integer
  810. representing the final position in the string they matched and whose cdr
  811. can be any other data the function wishes to return, or '() if it
  812. doesn't have any more data.
  813. The one caveat is that if the extra data it returns is a list, any
  814. adjacent strings in that list will be appended by @code{match-pattern}. For
  815. instance, if a parsing function returns @code{(13 ("a" "b" "c"))},
  816. @code{match-pattern} will take @code{(13 ("abc"))} as its value.
  817. For example, here is a function to match ``ab'' using the actual
  818. interface.
  819. @lisp
  820. (define (match-a-b str len pos)
  821. (and (<= (+ pos 2) len)
  822. (string= str "ab" pos (+ pos 2))
  823. (list (+ pos 2) '()))) ; we return no extra information
  824. @end lisp
  825. The above function can be used to match a string by running
  826. @code{(match-pattern match-a-b "ab")}.
  827. @subsubheading Code Generators and Extensible Syntax
  828. PEG expressions, such as those in a @code{define-peg-pattern} form, are
  829. interpreted internally in two steps.
  830. First, any string PEG is expanded into an s-expression PEG by the code
  831. in the @code{(ice-9 peg string-peg)} module.
  832. Then, then s-expression PEG that results is compiled into a parsing
  833. function by the @code{(ice-9 peg codegen)} module. In particular, the
  834. function @code{compile-peg-pattern} is called on the s-expression. It then
  835. decides what to do based on the form it is passed.
  836. The PEG syntax can be expanded by providing @code{compile-peg-pattern} more
  837. options for what to do with its forms. The extended syntax will be
  838. associated with a symbol, for instance @code{my-parsing-form}, and will
  839. be called on all PEG expressions of the form
  840. @lisp
  841. (my-parsing-form ...)
  842. @end lisp
  843. The parsing function should take two arguments. The first will be a
  844. syntax object containing a list with all of the arguments to the form
  845. (but not the form's name), and the second will be the
  846. @code{capture-type} argument that is passed to @code{define-peg-pattern}.
  847. New functions can be registered by calling @code{(add-peg-compiler!
  848. symbol function)}, where @code{symbol} is the symbol that will indicate
  849. a form of this type and @code{function} is the code generating function
  850. described above. The function @code{add-peg-compiler!} is exported from
  851. the @code{(ice-9 peg codegen)} module.