chapter-01.org 11 KB

Introduction

This document contains notes about the book "Programming in Standard ML" by Robert Harper.

Keywords

val

The keyword val introduces a new binding:

val something = 3; something;

:results: val it = 3 : int :end:

A regular expression package

Specification / Type Declaration

What follows is a definition of signatures of a regular expression package, as described in the book.

signature REGEXP = sig datatype regexp = Zero | One | Char of char | Plus of regexp * regexp | Times of regexp * regexp | Star of regexp exception SyntaxError of string val parse : string -> regexp val format : regexp -> string end

signature MATCHER = sig structure RegExp : REGEXP val accepts : RegExp.regexp -> string -> bool end

:results: signature REGEXP sig datatype regexp Char of char | One | | Plus of regexp * regexp | | Star of regexp | | Times of regexp * regexp | | Zero | exception SyntaxError of string val parse : string -> regexp val format : regexp -> string end signature MATCHER sig structure RegExp : sig datatype regexp Char of char | One | | Plus of regexp * regexp | | Star of regexp | | Times of regexp * regexp | | Zero | exception SyntaxError of string val parse : string -> regexp val format : regexp -> string end val accepts : RegExp.regexp -> string -> bool end :end:

What can we get from the written definitions and the text in the book?

  • signature is used to describe modules. Apparently this is how one describes
  • the types inside modules in SML.
  • datatype is used to define a type regexp (regular expression), which is
  • one of the stated alternatives.
  • Alternatives for datatype are assigned using the equal sign =. The idea
  • is, that the the datatype really is the same as the union of all the listed types.
  • Alternatives for datatype are separated by pipes |.
  • val is used to introduce binding definitions inside module descriptions as
  • well.
  • Types of function bindings are specified using the colon : after the binding
  • name. The functions are not equal to the types, which is probably why the ~=~ sign is not used.
  • SML has the concept of exceptions.
  • SML seems to start definitions like signature definitions with a keyword, in
  • this example ~sig~ and end them using the keyword ~end~. (But why is this not used for ~structure~? Probably, because ~structure~ only allows one thing to follow after it and that is why the expression is unambigious. Or perhaps the difference is made between block expressions like ~sig~ - ~end~ and statements.
  • Function definitions use the arrow -> to separate types of input arguments
  • and types of result values. If a function takes multiple arguments, it is probably curried and therefore written with multiple arrows ~->~. For example the function ~accepts~ takes a ~RegExp.regexp~ and a ~string~ and returns a ~bool~, but perhaps it is curried and works by first taking a ~RegExp.regexp~ and returns a function, which takes a ~string~ and that function then returns the ~bool~.
  • There seems to be a convention of naming signatures in all capital letters.
  • MATCHER depends on REGEXP.
  • Signatures are descriptions of what needs to be implemented.
  • Structures implement signatures.
  • In SMLNJ the dot ~.~ is used to access structure components. For example
  • ~MyStructure.TheComponent~.

The book goes on to show how one would write structure declarations:

structure RegExp :> REGEXP = ... structure Matcher :> MATCHER = ...

From this we can gather, that apparently to implement a signature using a structure, one writes not :, but :> instead and then declare the structure to be equal to the actual implementation. The actual implementation needs to conform to the signature REGEXP. This is type checked. SML also type checks, that the implementation does not make use of anything, that is not specified in the signature.

The book shows then a usage example:

val regexp = Matcher.RegExp.parse "(a+b)*" val matcher = Matcher.accepts regexp (* currying in action, returning a function *) val ex1 = matches "aabba" (* true *) val ex2 = matches "abac" (* false *)

One should use the long identifier Matcher.RegExp.parse instead of ~RegExp.parse~, since SML makes a difference between the two and there can be issues with using RegExp.parse instead. The book calls this sharing, as the implementation of RegExp.parse, if used directly, could be used in multiple contexts in the code. However, always writing the long identifier, the full path to the function or member could become unreadable and typing it annoying. There are ways to alleviate this problem:

structure M = Matcher structure R = M.RegExp (* can also use dots here *) val regexp = R.parse "((a + %).(b + %))*" val matches = M.accepts regexp val ex1 = matches "aabba" val ex2 = matches "abac"

:results: stdIn:36.15-36.22 Error: unbound structure: Matcher :end:

Implementation

The book shows an overview of an implementation for the regular expression package:

structure RegExp :> REGEXP = struct datatype regexp Zero | One | Char of char | Plus of regexp * regexp | Times of regexp * regexp | Star of regexp fun tokenize s ... fun parse s = let val (r, s') = parse_rexp (tokenize (String.explode s)) in case s' of nil => r | _ => raise SyntaxError "Bad input." end handle LexicalError => raise SyntaxError "Bad input." ... fun format r = String.implode (format_exp r) end

What can we understand from the above code?

  • Structures:
  • Structures, which are signature implementations, are wrapped by struct and
  • ~end~. This might be useful, when defining a structure inline or simply to avoid giving whitespace or indentation meaning.
  • Let:
  • There is a syntax form let~-~in~-~end, which creates bindings for use within
  • its ~in~ block.
  • Exception handling:
  • The keyword raise is used to raise exceptions.
  • The keyword handle is used to specify, how to handle an exception from a
  • lower layer.
  • Functions:
  • The keyword fun is used to indicate implementations of functions. After
  • ~fun~ follows the function name.
  • Functions can return multiple values (perhaps only as list or tuple) and those
  • can be destructured by using ~val~ and a tuple of binding names.
  • After the binding name follow the arguments names of a function.
  • After the argument names of a function follows the equal sign = to
  • indicate, that the function is equal to its implementation.
  • Functions apparently do not end using the end keyword. This is
  • surprising. Perhaps the ~end~ keyword merely needs not to be written in such a definition. How would one write an anonymous function inline, if it is not delimited using any keyword or special character?
  • Cases:
  • There is a keyword case for distinguishing multiple cases and the cases are
  • separated using ~|~.
  • Perhaps the underscore _ stands for a case, in which the value of the
  • binding is irrelevant.
  • The fat arrow => is used to separate cases from their consequences.
  • Strings:
  • Strings are written in double quotes.
  • There seem to be static methods of String, for example String.explode and
  • ~String.implode~.

Implementation of tokenize

datatype token AtSign | Percent | Literal of char | PlusSign | TimesSign | Asterisk | LParen | RParen exception LexicalError fun tokenize nil nil | tokenize (#"+" :: cs) = PlusSign :: tokenize cs | tokenize (#"." :: cs) = TimesSign :: tokenize cs | tokenize (#"*" :: cs) = Asterisk :: tokenize cs | tokenize (#"(" :: cs) = LParen :: tokenize cs | tokenize (#")" :: cs) = RParen :: tokenize cs | tokenize (#"@" :: cs) = AtSign :: tokenize cs | tokenize (#"%" :: cs) = Percent :: tokenize cs | tokenize (#"\\" :: c :: cs) = Literal c :: tokenize cs | tokenize (#"\\" :: nil) = raise LexicalError | tokenize (#" " :: cs) = tokenize cs | tokenize (c :: cs) = Literal c :: tokenize cs

:results: datatype token = Asterisk | AtSign | | LParen | | Literal of char | | Percent | | PlusSign | | RParen | | TimesSign | exception LexicalError val tokenize = fn : char list -> token list :end:

What can we understand from the above code?

  • Characters are written as a string with a # in front.
  • :: is used as an operator for consing onto lists.
  • SML can do recursive function calls.
  • Pattern matching can use destructuring for lists.
  • Exceptions are specified above functions, which might raise them.
  • nil is the list terminating thing and probably the empty list.
  • Backslashes in strings or as characters need to be escaped.
  • The book mentions a few concepts and associates them with the characters,
  • which are used in regular expressions to express the concepts:
  • Alternation: ~+~: choose between things
  • Concatenation: ~.~: put things together
  • Empty regular expression: ~@~: ???
  • Null string: ~%~: ???
  • Iteration: ~*~: ???
  • Lists of things are displayed as TYPE list.

The book goes on giving an unexplained dump of abbreviations in a grammar for regular expressions, which is then implemented. It seems difficult to understand it, because the abbreviations are not explained. Was it really too much work, to write out what the abbreviations stand for? The grammar is the following:

rexp :: = rtrm | rtrm+rexp rtrm :: = rfac | rfac.rtrm rfac :: = ratm | ratm* ratm :: = @ | % | a | (rexp)

What I can guess is:

  • ~rexp~: stands for regular expression
  • ~rtrm~: stands for terminal symbol, perhaps
  • ~ratm~: stands for atom, perhaps

I have no idea what rfac stands for. "Factor"? But what sense does that make?

The code, that follows the grammar is not much clearer to me.

For this reason I skip the rest of the regular expression package implementation and go on with the rest of the book.

Test lab

3; 3 + 4; 4 div 3; 4 mod 3;

val it = 7 : int