Antti-Juhani Kaijanaho




Operator-precedence parsers, yuck!

[Originally, the title was "Operator-precedence parsers considered harmful". But that's too old:)]

I spent a better part of today fixing grep-dctrl's new predicate parser (see #227543).

Grep-dctrl has nowadays the ability to handle arbitrarily complex[1] boolean queries (the boolean query expression is called predicate in grep-dctrl lingo). I handled this by writing a home-grown operator-precedence parser on top of libc's argp routines. It worked for valid command lines, most of the time. The problem was that people kept inventing invalid command lines, which the parser did not reject; instead it constructed nonsense internal representations of the predicate, which triggered SIGSEGVs and SIGABRTs from all over the place (see for example #241766). I added all sorts of diagnostic code to shoot down any malformed internal representations, but that strategy is doomed to fail (they just keep inventing better incantations to make it crash).

The culprit is, of course, the use of operator-precedence parsing. It is a bottom-up parsing technique based on the use of operator precedence relations to drive the parsing; it is discussed in many elementary compiler books. It is known to be a fragile algorithm, in that it is known to accept too much, which is what I tripped on. Its deceiving attractiveness comes from its simplicity; it is easy to write an operator-precedence parser. It's just very, very hard to make it robust.

Never, ever write an operator-precedence parser!

I just committed a fix to this (not yet uploaded); I rewrote the parser so that the argp routines just gather a list of tokens. A separate pass on the token list is done using a recursive-descent predictive parser. I am now much more confident about the parser.

[1] ... up to certain arbitrary implementation limits, which should be big enough for most uses.

23:51 - /en/programming - 0 comments