Wed, 31 Dec 2008
Parsec and Expression Parsing.
Haskell's Parsec parsing library (distributed with the GHC compiler since at least version 6.8.2) contains a very powerful expression parsing module Text.ParserCombinators.Parsec.Expr. This module makes it easy to correctly parse arithmetic expressions like the following according to the usual programming language precedence rules.
2 + a * 4 - b
Once parsed, the abstract syntax tree for that expression should look like this:
The original Parsec paper by Daan Leijen even gives a small example of using the expression parser. As can be seen below, the operators are defined in a table (ie a list of lists) where the outer list is a set precedence levels (from highest precedence to lowest) and each inner list specifies all the operators for that precedence level.
module Parser
where
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
expr :: Parser Integer
expr = buildExpressionParser table factor <?> "expression"
table :: [[ Operator Char st Integer ]]
table = [
[ op "*" (*) AssocLeft, op "/" div AssocLeft ],
[ op "+" (+) AssocLeft, op "-" (-) AssocLeft ]
]
where
op s f assoc = Infix (do { string s ; return f }) assoc
factor = do { char '(' ; x <- expr ; char ')' ; return x }
<|> number
<?> "simple expression"
number :: Parser Integer
number = do { ds <- many1 digit; return (read ds) } <?> "number"
The above simple example works fine but there is a potential for a rather subtle problem if the expression parser is used in conjunction with the Text.ParserCombinators.Parsec.Token module.
The problem arises when trying to parse expressions in C-like languages which have bitwise OR (|) as well as logical OR (||) and where the bitwise operation has higher precedence than the logical. The symptom was that code containing the logical OR would fail because the expression parser was finding two bitwise ORs. After banging my head against this problem for a considerable time, I posted a problem description with a request for clues to the Haskell Cafe mailing list with the following code snippet:
import qualified Text.ParserCombinators.Parsec.Expr as E
opTable :: [[ E.Operator Char st Expression ]]
opTable = [
-- Operators listed from highest precedence to lowest precedence.
{- snip, snip -}
[ binaryOp "&" BinOpBinAnd E.AssocLeft ],
[ binaryOp "^" BinOpBinXor E.AssocLeft ],
[ binaryOp "|" BinOpBinOr E.AssocLeft ],
[ binaryOp "&&" BinOpLogAnd E.AssocLeft ],
[ binaryOp "||" BinOpLogOr E.AssocLeft ]
]
binaryOp :: String -> (SourcePos -> a -> a -> a)
-> E.Assoc -> E.Operator Char st a
binaryOp name con assoc =
E.Infix (reservedOp name >>
getPosition >>=
return . con) assoc
Unfortunately, no real answer was forthcoming and while I still held out hope that an answer might eventuate, I continued to work on the problem.
Eventually, after reasoning about the problem and looking at the source code to the Token module I realised that the problem was with the behaviour of the reservedOp combinator. This combinator was simply matching the given string at the first precedence level it found so that even if the code contained a logical OR (||) the higher precedence bitwise OR would match leaving the second vertical bar character un-parsed.
My first attempt at a solution to this problem was to define my own combinator reservedOpNf using Parsec's notFollowedBy combinator and use that in place of the problematic reservedOp.
opChar :: String opChar = "+-/%*=!<>|&^~" reservedOpNf :: String -> CharParser st () reservedOpNf name = try (reservedOp name >> notFollowedBy (oneOf opChar))
This solved the immediate problem of distinguishing between bitwise and logical OR, but I soon ran into another problem parsing expressions like this:
if (whatever == -1)
.....
which were failing at the unary minus.
A bit of experimenting suggested that again, the problem was with the behaviour of the reservedOp combinator. In this case it seemed the combinator was matching the given string, consuming any trailing whitespace and then the notFollowedBy was failing on the unary minus.
Once the problem was understood, the solution was easy, replace the reservedOp combinator with the string combinator (which matches the string exactly and doesn't consume trailing whitespace), followed by the notFollowedBy and then finally use the whiteSpace combinator to chew up trailing white space.
reservedOpNf :: String -> CharParser st ()
reservedOpNf name =
try (string name >> notFollowedBy (oneOf opChar) >> whiteSpace)
Despite minor problems like the one above, I am still incredibly impressed with the ease of use and power of Parsec. Over the years I have written numerous parsers, in multiple host languages, using a number of parser generators and I have never before used anything that comes close to matching Haskell and Parsec.
Posted at: 07:28 | Category: CodeHacking/Haskell | Permalink
Mon, 29 Dec 2008
learnHaskell :: Problem -> IO Solution
Over the last week or so I've been learning Haskell. Unlike a previous attempt to learn Erlang I think this one will actually end up bearing fruit. The main thing that slowed down my attempt to learning Erlang was that I didn't have a task to apply it to that exploited the strengths of the language.
For Haskell I have a task that is ideally suited to the language and plays to its strengths; writing a parser for an Javascript-like language. Just like when I learned Ocaml I am jumping in at the deep end with a difficult problem and learning the language on the way (ie as a side-effect of tackling the task itself).
Normally my tool of choice for this task would be Ocaml and I did in fact start out writing a parser in Ocaml using the Menhir parser generator. Menhir creates a parser from an LR(1) grammar specification where the LR(1) means that the grammar is left recursive and must need no more than one token look-ahead to resolve ambiguities. Over two days of intense hacking I made quite a lot of progress on the grammar and then hit a wall; numerous rather incomprehensible shift/reduce and reduce/reduce warnings plus a part of the grammar (raw XML embedded directly into the source code) which was not context free and for which I could not see any possible way of parsing with only one token look-ahead.
Since I already knew Ocaml, I looked around at some of the other parser generators for Ocaml like Aurochs and PCL. Aurochs uses Parsing Expression Grammar (or Packrat parsing) and seemed interesting, but was rejected because it seemed to gave very little control over the Abstract Syntax Tree it generated. PCL on the other hand was a monadic parser combinator library which is basically a port of Haskell's Parsec library to Ocaml. The main problem I saw with PCL was that it used Monads and lazy evaluation, neither of which are very common in standard Ocaml code. PCL was also rather new and I wasn't sure if it was stable and mature enough for quite an advanced parsing task.
Of course I had been hearing good things about Haskell's Parsec library for a couple of years and decided that this was a good time to give it a try. The interesting thing about Parsec is that it uses lazy evaluation to provide what is effectively infinite token look-ahead. Obviously, a Parsec grammar should be written so that it only uses more than one look-ahead when that is absolutely necessary.
With a bit of googling I found a bunch of examples using Parsec for simple things and that was enough code to get me started. A week later, after an IRC question answered by Don Stewart, a bit of help from Conrad Parker and a question answered on the Haskell Cafe mailing list, I am almost as far along as I got with the Ocaml version of the parser.
My impressions of Haskell and Parsec so far are as follows:
- Initially I disliked Haskell's syntax even more than I disliked that of Ocaml, but I am coming to terms with it. I'm even quite liking it.
- The error messages from the Haskell compiler are very much more wordy than those from the Ocaml compiler but not a whole lot more enlightening.
- You don't really need to be comfortable with monads to start cutting Haskell code.
- The two unusual language features of Haskell, monads and lazy evaluation make things like Parsec possible. Without both of these features, Parsec would not be possible.
- Hoogle, the Haskell API search engine is a fantastic resource.
- Parsec is fantastic to use, amazingly powerful and very complete (whenever I find myself writing too much code its usually because there is a already a combinator to do what I need).
- Writing the parser in the language itself (ie Haskell) is so much nicer than writing it in a separate Domain Specific Language like one does with yacc / bison / ocamlyacc / menhir etc.
- Parsec error and warning messages are much easier to understand and debug than the typical shift/reduce and reduce/reduce messages from yacc style parser generators (Menhir has a conflicts output file which helps a little).
- When the parser finds an error in the input its parsing, its much easier to generate good error messages than it is with yacc style parsers.
All in all, I'm really enjoying my foray into the world of Haskell.
Posted at: 09:53 | Category: CodeHacking/Haskell | Permalink