Parser Combinators are Easy
Let’s say we’ve been sent some brand new points. However, the Point Guru is having a burst of ‘creativity’ today and has devised a crazy transmission string:
This is clearly bonkers and you shouldn’t have to put up with it. Sadly, she’s your only connect for points in sets of varying sizes, though, and the points themselves look alright, so you have to roll up your sleeves and get ’em outta there.
I don’t know about you, but I (until now!) have always sighed and reached for a regular expression at this point, or started mucking about with string manipluations. It’ll be ugly as hell, but it’ll work. You can pull out each list with capture groups and then either use another regex on the captures or use string splitting and iterators to get what you need. It likely won’t be much fun, and will be completely illegible at a glance at the end (unless regex is really your thing).
BUT WAIT! There’s another way! And it’s even easier than it sounds!
I’m using the Parsimmon library to illustrate, but there are many others for JS and lots of other languages have libraries for this as well.
With Parsimmon, we create a “language” that contains mini parsers, composed of ever smaller parsers. Here’s a very basic example:
This code can be verified by using the following below:
Running node index.js should output 23 - not '23'. We’ve parsed a number! Now we can use this parser in bigger parsers. The next natural unit to look at is the point - [8 76]. Two numbers separated by a space.
The P.seq() combinator is used to chain combinators together in a sequence to match. This time the r we pass as an argument is short for rules and allows us to refer to the other combinators defined in this language. Then we just use the P.string() combinator to match the separators exactly, and use our r.Num combinator to handle recognizing and converting the numbers themselves. Then over in the map, we are passed an array of each part of the match. We ignore the brackets and the space returning by the P.string() combinators and just return the values our Num combinator took care of for us. Change the test snippet to:
Executing this will now return [ 78, 3 ]. Now, these points are further grouped into sets of varying size and (inexplicably) separated by the string ']['. We can create a mini parser for just that separator and then leverage the sepBy() combinator to handle these sets:
We don’t need to include the map portion on our Sep parser - we just want to return the match as is (it’ll be discarded later). In our PointSet parser, r.Point.seqBy(r.Sep) will return zero or more Points separated by whtever seaparater we provide as an array, dropping the separators themselvles. Try it out:
This will output [ [ 2, 3 ], [ 6, 2 ], [ 1, 2 ] ]. We’re almost there! The full string is just a bunch of PointSets, separated by that same separator with some frilly caps on each end:
And that’s it! Our parser will now successfully parse the whele input string, in only a handful of lines. Here’s the whole snippet:
We can even get fancy - just replace our Point combinator with:
Now we get:
This parser is easy to poke and prod at, or swap out components entirely - each part works independently of each other part.
There are libraries for parser combinators in a number of langauges - here’s an example of what PointSet might look like in Rust using combine, assuming we’ve already defined sep() and point() parsers:
Syntax aside it’s the same thing - composing arbitrary amounts of arbitrarily small parsers to parse any format you’d like. For Rust, there’s also nom which leverages macros instead of traits but at the end of the day it’s all the same good stuff.
Got a favorite parser combinator library? Let me know about it!