Fluent Pattern Matching with MetaSharp

My most recent work on MetaSharp has been focused on changing the pattern matching code to be, essentially, more functional in style. I had a few features I wanted to enable, specifically parameterized rules and virtual rules. In order to do this I needed to come up with a simpler and more functional approach.

I won’t bore you with all the details but at a high level what I used to be doing was to create a class for every rule you declared in a grammar. Using MEF I was exporting this class and at runtime the grammar was composing itself using all of these imports. It seemed like a good idea at the time but it had some serious limitations and it wasn’t very fast.

Parameterized Rules are any rule that accepts parameters, sort of like a macro. Here is an example of a basic rule without parameters:

Example = A | B;

Example is the name of the rule and it matches A or B (undefined in this example). A parameterized rule looks more like this:

Example(A, B) = A | B;

In this case A and B are parameters of the Rule. If you wanted to Match on this example you could do it like this:

X = Example('a'..'z', 'A'..'Z');

In this case the rule is named X and it matches on the parameterized rule Example, passing in characters ranges a to z and capital a to z. Doing this with the old system was nearly impossible.

Virtual Rules are another interesting new feature. In MetaSharp you declare all of your Rules in a grammar. Grammars are somewhat analogous to a class and as such can inherit from each other. A rule defined in a super grammar can be overriden by sub grammar. This allows for a grammar builder to be able to reuse complex grammars but selectively override what they want.

For example suppose you have a grammar like this:

grammar Foo:
  X = Start 'a'..'z' End;
  Start = '{';
  End = '}';
end

But suppose you didn’t want curly brackets but instead you wanted square brackets. You can override the Start and End rules without requiring any changes or duplication of the rules that reference them. For example:

grammar Bar < Foo:
  override Start = '[';
  override End = ']';
end

In this case we have a grammar called Bar that inherits from Foo ( the < denotes the flow of the grammar pipeline, not truly like inheritance) and overrides just the Start and End rules to use square brackets instead of their previous behavior.

After working on this feature for a while I realized some other very interesting side-effects of this change:

  • speed
  • parallel parsing
  • Fluent API for manual grammars

I believe these more functional grammars should be faster than the older style though I can’t prove it just yet. And parallel parsing is still theoretical at this point but I believe that it should be possible to parse Or expressions in parallel. Doing this you could theoretically go from a linear parse time to a sub-linear parse time, despite the duplication of rule evaluation in some cases. I still need to investigate this more. But, perhaps the most interesting side effect of all is the fluent API I ended up with!

It turns out you can create your very own Patterns programmatically without a lot of effort now. All you need is the set of “primitive” patterns I have created plus the Rule pattern in case you need recursion or you would like memoization. It can end up being very similar to how you might use a regular expression except with some interesting differences.

For example, suppose I want to naively parse a telephone number:

var digit = Range('0', '9');
var pattern = Or(
    ExactQuantity(digit, 7),
    And(ZeroOrOne(Value(1)), ExactQuantity(digit, 10)));

This C# is equivalent to the following MetaSharp grammar code:

digit = '0'..'9';
pattern = digit#7 | 1? digit#10;

Except it’s written inline with your other C# code, all you need is a reference to a single library. You can even do projections directly inline with lambda expressions. This is the very same code that the Grammar templates generate in fact. More than likely you’ll want to use the Grammar DSL but I could foresee some interesting usages of this API. Making it simple and more elegant also makes me feel like I’m getting closer to where I need to be. I’m pretty happy with where things are right now.

After some bug fixing and unit tests the next step in the master plan is to strip out the Lang, CodeDom and Template libraries (!!!) and replace them with CCI and T4. The result of this should be a drastically reduced code base! And I believe the CCI is much more powerful to boot, it should allow me to compile grammars truly dynamically. Then, finally, comes developer tooling…

Author: justinmchase

I'm a Software Developer from Minnesota.

Leave a Reply

%d bloggers like this: