I’ve been working on a custom variant of an OMeta parser for a couple of weeks now. It’s coming along pretty well: I think I’ve overcome most of the major hurdles, and now I’m going through what I currently have, cleaning it up and getting it to handle the edge cases I need.
Just now I was working on the grammar for parsing a character literal and realized how hard it really is. It sounds trivial; after all, it’s just two single quotes and a character, right? Wrong. Here’s my current grammar:
CharacterLiteralToken
    = '\'' '\\' 'u' Hex#4 '\''
    | '\'' '\\' 'U' Hex#8 '\''
    | '\'' '\\' 'x' Hex Hex? Hex? Hex? '\''
    | '\'' '\\' ('\'' | '\"' | '\\' | '0' | 'a' | 'b' | 'f' | 'n' | 'r' | 't' | 'v') '\''
    | '\'' '\u0000'..'\uffff' '\'';
It turns out that you have to account for a multitude of escape characters as well as escaped Unicode literals. I didn’t want to have to implement this, but as you can see, the last rule, which just matches every character under the sun, needed it.
This will match:
- '\u0000'
- '\U00000000'
- '\x0', '\x00', '\x000', '\x0000'
- '\'', '\"', '\\', '\0', '\a', '\b', '\f', '\n', '\r', '\t', '\v'
- 'a' …
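As a rough cross-check outside of OMeta, the same five alternatives can be sketched as a Python regular expression. This is only an illustration, not the OMeta implementation: the names `HEX`, `CHARACTER_LITERAL` and `is_character_literal` are made up for this example, and it assumes the `Hex` rule accepts both upper- and lowercase hex digits.

```python
import re

# Assumption: Hex means one hexadecimal digit in either case.
HEX = r"[0-9A-Fa-f]"

# Hypothetical regex with one branch per alternative of
# CharacterLiteralToken, in the same order as the grammar.
CHARACTER_LITERAL = re.compile(
    r"'(?:"
    rf"\\u{HEX}{{4}}"         # backslash, 'u', exactly four hex digits
    rf"|\\U{HEX}{{8}}"        # backslash, 'U', exactly eight hex digits
    rf"|\\x{HEX}{{1,4}}"      # backslash, 'x', one to four hex digits
    r"|\\['\"\\0abfnrtv]"     # backslash plus one simple escape character
    r"|[\u0000-\uffff]"       # catch-all: any single character in the range
    r")'"
)


def is_character_literal(text: str) -> bool:
    """Return True if the whole string is one character literal."""
    return CHARACTER_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [r"'\u0000'", r"'\U00000000'", r"'\x0'", r"'\x0000'",
               "'\\''", r"'\n'", "'a'", "''", "'ab'", r"'\x00000'"]
    for sample in samples:
        print(sample, is_character_literal(sample))
```

In this sketch the empty literal `''`, the two-character `'ab'`, and a five-digit `'\x00000'` are all rejected, while every form in the list above is accepted.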
Next I get to do the string parser… that should be even more interesting.