Surprisingly Difficult to Parse a Character

I’ve been working on a custom variant of an OMeta parser for a couple of weeks now. It’s coming along pretty well. I think I’ve overcome most of the major hurdles, and now I’m going through what I currently have, cleaning it up, and getting it to handle the edge cases that I need.

Just now I was working on the grammar for parsing a character literal and realized how hard it really is. It sounds trivial. After all, it’s just two single quotes and a character, right? Wrong. Here’s my current grammar:

CharacterLiteralToken
    = '\'' '\\' 'u' Hex#4 '\''
    | '\'' '\\' 'U' Hex#8 '\''
    | '\'' '\\' 'x' Hex Hex? Hex? Hex? '\''
    | '\'' '\\' ('\'' | '\"' | '\\' | '0' | 'a' | 'b' | 'f' | 'n' | 'r' | 't' | 'v') '\''
    | '\'' '\u0000'..'\uffff' '\'';

It turns out that you have to account for a multitude of escape characters as well as escaped Unicode literals. I didn’t want to have to implement all of this, but the last rule, which just matches every character under the sun, isn’t enough on its own: it only covers a single character between the quotes, so the escape sequences need rules of their own.

This will match:

  • '\u0000'
  • '\U00000000'
  • '\x0', '\x00', '\x000', '\x0000'
  • '\'', '\"', '\\', '\0', '\a', '\b', '\f', '\n', '\r', '\t', '\v'
  • 'a' …
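
As a sanity check, the same five alternatives can be sketched as a regular expression in Python. This is not the OMeta grammar itself, just a rough translation. It assumes that Hex means a single case-insensitive hexadecimal digit (the grammar above doesn't show that rule), and the names HEX, ALTERNATIVES, CHAR_LITERAL and is_char_literal are made up for the illustration.

```python
import re

# Assumption: Hex in the grammar is one hex digit, upper or lower case.
HEX = "[0-9A-Fa-f]"

# One entry per grammar alternative, in the same order as the grammar.
ALTERNATIVES = [
    r"\\u" + HEX + "{4}",         # backslash, 'u', exactly 4 hex digits
    r"\\U" + HEX + "{8}",         # backslash, 'U', exactly 8 hex digits
    r"\\x" + HEX + "{1,4}",       # backslash, 'x', 1 to 4 hex digits
    r"""\\['"\\0abfnrtv]""",      # backslash plus one simple escape character
    r"[\u0000-\uffff]",           # any single character in the 0000..ffff range
]

# Every alternative sits between an opening and a closing single quote.
CHAR_LITERAL = re.compile("'(?:" + "|".join(ALTERNATIVES) + ")'")


def is_char_literal(text: str) -> bool:
    """Return True if the whole of text is one character literal token."""
    return CHAR_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in [r"'\u0000'", r"'\x00'", r"'\n'", "'a'", "'ab'", r"'\q'"]:
        print(sample, is_char_literal(sample))
```

Because the last alternative accepts any single character, inputs such as ''' and '\' also count as matches here, just as they would under the grammar as written.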

Next I get to do the string parser… that should be even more interesting.

Author: justinmchase

I'm a Software Developer from Minnesota.
