Wed, 31 Dec 2008

Parsec and Expression Parsing.

Haskell's Parsec parsing library (distributed with the GHC compiler since at least version 6.8.2) contains a very powerful expression parsing module Text.ParserCombinators.Parsec.Expr. This module makes it easy to correctly parse arithmetic expressions like the following according to the usual programming language precedence rules.


  2 + a * 4 - b

Once parsed, the abstract syntax tree for that expression should look like this:


hackfest living room

The original Parsec paper by Daan Leijen even gives a small example of using the expression parser. As can be seen below, the operators are defined in a table (ie a list of lists) where the outer list is a set precedence levels (from highest precedence to lowest) and each inner list specifies all the operators for that precedence level.


  module Parser
  where
  
  import Text.ParserCombinators.Parsec
  import Text.ParserCombinators.Parsec.Expr
  
  expr :: Parser Integer
  expr = buildExpressionParser table factor <?> "expression"

  table :: [[ Operator Char st Integer ]]
  table = [
      [ op "*" (*) AssocLeft, op "/" div AssocLeft ],
      [ op "+" (+) AssocLeft, op "-" (-) AssocLeft ]
      ]
    where
      op s f assoc = Infix (do { string s ; return f }) assoc
  
  factor = do { char '(' ; x <- expr ; char ')' ; return x }
     <|> number
     <?> "simple expression"
  
  number :: Parser Integer
  number = do { ds <- many1 digit; return (read ds) } <?> "number"

The above simple example works fine but there is a potential for a rather subtle problem if the expression parser is used in conjunction with the Text.ParserCombinators.Parsec.Token module.

The problem arises when trying to parse expressions in C-like languages which have bitwise OR (|) as well as logical OR (||) and where the bitwise operation has higher precedence than the logical. The symptom was that code containing the logical OR would fail because the expression parser was finding two bitwise ORs. After banging my head against this problem for a considerable time, I posted a problem description with a request for clues to the Haskell Cafe mailing list with the following code snippet:


  import qualified Text.ParserCombinators.Parsec.Expr as E

  opTable :: [[ E.Operator Char st Expression ]]
  opTable = [
      -- Operators listed from highest precedence to lowest precedence.

      {- snip, snip -}

      [    binaryOp "&" BinOpBinAnd E.AssocLeft ],
      [    binaryOp "^"  BinOpBinXor E.AssocLeft ],
      [    binaryOp "|"  BinOpBinOr E.AssocLeft ],

      [    binaryOp "&&" BinOpLogAnd E.AssocLeft ],
      [    binaryOp "||" BinOpLogOr E.AssocLeft ]
      ]

  binaryOp :: String -> (SourcePos -> a -> a -> a)
                     -> E.Assoc -> E.Operator Char st a
  binaryOp name con assoc =
      E.Infix (reservedOp name >>
          getPosition >>=
          return . con) assoc

Unfortunately, no real answer was forthcoming and while I still held out hope that an answer might eventuate, I continued to work on the problem.

Eventually, after reasoning about the problem and looking at the source code to the Token module I realised that the problem was with the behaviour of the reservedOp combinator. This combinator was simply matching the given string at the first precedence level it found so that even if the code contained a logical OR (||) the higher precedence bitwise OR would match leaving the second vertical bar character un-parsed.

My first attempt at a solution to this problem was to define my own combinator reservedOpNf using Parsec's notFollowedBy combinator and use that in place of the problematic reservedOp.


  opChar :: String
  opChar = "+-/%*=!<>|&^~"

  reservedOpNf :: String -> CharParser st ()
  reservedOpNf name = try (reservedOp name >> notFollowedBy (oneOf opChar))

This solved the immediate problem of distinguishing between bitwise and logical OR, but I soon ran into another problem parsing expressions like this:


  if (whatever == -1)
     .....

which were failing at the unary minus.

A bit of experimenting suggested that again, the problem was with the behaviour of the reservedOp combinator. In this case it seemed the combinator was matching the given string, consuming any trailing whitespace and then the notFollowedBy was failing on the unary minus.

Once the problem was understood, the solution was easy, replace the reservedOp combinator with the string combinator (which matches the string exactly and doesn't consume trailing whitespace), followed by the notFollowedBy and then finally use the whiteSpace combinator to chew up trailing white space.


    reservedOpNf :: String -> CharParser st ()
    reservedOpNf name =
        try (string name >> notFollowedBy (oneOf opChar) >> whiteSpace)

Despite minor problems like the one above, I am still incredibly impressed with the ease of use and power of Parsec. Over the years I have written numerous parsers, in multiple host languages, using a number of parser generators and I have never before used anything that comes close to matching Haskell and Parsec.

Posted at: 07:28 | Category: CodeHacking/Haskell | Permalink

Mon, 29 Dec 2008

learnHaskell :: Problem -> IO Solution

Over the last week or so I've been learning Haskell. Unlike a previous attempt to learn Erlang I think this one will actually end up bearing fruit. The main thing that slowed down my attempt to learning Erlang was that I didn't have a task to apply it to that exploited the strengths of the language.

For Haskell I have a task that is ideally suited to the language and plays to its strengths; writing a parser for an Javascript-like language. Just like when I learned Ocaml I am jumping in at the deep end with a difficult problem and learning the language on the way (ie as a side-effect of tackling the task itself).

Normally my tool of choice for this task would be Ocaml and I did in fact start out writing a parser in Ocaml using the Menhir parser generator. Menhir creates a parser from an LR(1) grammar specification where the LR(1) means that the grammar is left recursive and must need no more than one token look-ahead to resolve ambiguities. Over two days of intense hacking I made quite a lot of progress on the grammar and then hit a wall; numerous rather incomprehensible shift/reduce and reduce/reduce warnings plus a part of the grammar (raw XML embedded directly into the source code) which was not context free and for which I could not see any possible way of parsing with only one token look-ahead.

Since I already knew Ocaml, I looked around at some of the other parser generators for Ocaml like Aurochs and PCL. Aurochs uses Parsing Expression Grammar (or Packrat parsing) and seemed interesting, but was rejected because it seemed to gave very little control over the Abstract Syntax Tree it generated. PCL on the other hand was a monadic parser combinator library which is basically a port of Haskell's Parsec library to Ocaml. The main problem I saw with PCL was that it used Monads and lazy evaluation, neither of which are very common in standard Ocaml code. PCL was also rather new and I wasn't sure if it was stable and mature enough for quite an advanced parsing task.

Of course I had been hearing good things about Haskell's Parsec library for a couple of years and decided that this was a good time to give it a try. The interesting thing about Parsec is that it uses lazy evaluation to provide what is effectively infinite token look-ahead. Obviously, a Parsec grammar should be written so that it only uses more than one look-ahead when that is absolutely necessary.

With a bit of googling I found a bunch of examples using Parsec for simple things and that was enough code to get me started. A week later, after an IRC question answered by Don Stewart, a bit of help from Conrad Parker and a question answered on the Haskell Cafe mailing list, I am almost as far along as I got with the Ocaml version of the parser.

My impressions of Haskell and Parsec so far are as follows:

All in all, I'm really enjoying my foray into the world of Haskell.

Posted at: 09:53 | Category: CodeHacking/Haskell | Permalink