Understanding asMSX Volume 2: lex.l

Published 08-19-2016 23:27:42

Welcome back to the mystery machine!

Today, the spooky lex.l. As we saw in the previous post, this file looks full of constants, but to be more precise, this is the file where you tell the scanner how to treat every string it finds. In fact, the constants are not defined in this file but in dura.y.

For example:

  • “ld              return MNEMO_LD”: Whenever you find a string made of the characters ld, treat it as a token with value MNEMO_LD (mnemonic for the load instruction).
  • “0x[0-9a-f]+ yylval.val=(int)strtol(yytext,NULL,16);return NUMERO;”: Whenever you find 0x followed by one or more characters from the set {0-9a-f} (hence, hex), convert the matched text to a long int (strtol can parse hex values directly), cast it to an int, and store it in yylval, the variable through which Flex hands the token’s value to the parser. Then return NUMERO to say that this token is a number.
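To see what the action of the hex rule does with the matched text, here is a minimal C sketch. The function name hex_text_to_int is hypothetical and does not appear in asMSX; it only mirrors the strtol call from the rule, taking the matched text as a parameter instead of reading it from yytext.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of asMSX). It mirrors the rule's action:
   yylval.val = (int)strtol(yytext, NULL, 16); */
int hex_text_to_int(const char *text)
{
    /* With base 16, strtol accepts an optional "0x" or "0X" prefix,
       so the matched text can be passed in unchanged. */
    return (int)strtol(text, NULL, 16);
}
```

For example, hex_text_to_int("0x1f") gives 31, because strtol skips the 0x prefix by itself when the base is 16.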

So as you have seen, at least in this case, Flex is the one in charge of recognizing each token as defined in this file.

Except for some lines in the last part of the file that I haven’t understood yet, the rest of it follows more or less one of the two previous examples.

Finally, I believe that I have found an issue in the tokenizer. If we take a close look at the following line, we will notice something that could be wrong:

.?db/[ \t]+ return PSEUDO_DB;

If we take a look at Flex’s manual, we can see the following sentence:

A ‘.’ inside ‘[]’s just means a literal ‘.’ (period), and NOT “any character except newline”.

In this case, .?db will match adb, xdb, zdb, 3db and so on, because outside brackets the dot means “match any character except newline”; it does not mean an optional literal dot. If this is true, one way to fix it is to write [.]?db/[ \t]+ instead, but I have yet to test it.
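The difference between .?db and [.]?db can be checked outside Flex with a small C sketch. It assumes a POSIX system that provides regex.h, and it uses POSIX extended regexes rather than Flex itself, so it is only an approximation: the trailing context /[ \t]+ is left out and ^ and $ are added to force a whole-string match. The helper name matches_whole is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` matches the POSIX extended
   regex `pattern`, 0 if it does not, and -1 if the pattern does not
   compile. Anchoring with ^ and $ is left to the caller. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, matches_whole("^.?db$", "adb") accepts the input, while matches_whole("^[.]?db$", "adb") rejects it. Both patterns still accept db and .db.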

Moreover, there are some regexes (regex stands for regular expression) that could be simplified. E.g.:

[a-z_]+[a-z0-9_]* yylval.tex=yytext;return IDENTIFICADOR;

Is equivalent to:

[a-z_][a-z0-9_]* yylval.tex=yytext;return IDENTIFICADOR;

In this regex we require that a label start with a letter or an underscore, but after that it can also contain digits. Therefore we do not need the + (one or more elements) in that expression, as the second and following characters are already covered by the more general [a-z0-9_]*.
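As a sanity check on the claim that both expressions accept the same strings, the following C sketch compares two patterns by brute force over every short string built from a small alphabet. It assumes a POSIX system with regex.h and uses POSIX extended regexes, not Flex, so the patterns are anchored with ^ and $ by the caller. The function name count_disagreements and the length cap are hypothetical choices for illustration. Testing short strings is not a proof; the actual argument is the one given above.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

enum { MAX_LEN_CAP = 8 };  /* arbitrary cap so the buffers stay small */

/* Hypothetical helper: counts the strings of length 0..max_len over
   `alphabet` that one pattern accepts and the other rejects.
   Returns -1 on bad arguments or if a pattern does not compile. */
int count_disagreements(const char *pat_a, const char *pat_b,
                        const char *alphabet, int max_len)
{
    regex_t a, b;
    char s[MAX_LEN_CAP + 1];
    int idx[MAX_LEN_CAP];
    int n = (int)strlen(alphabet);
    int diff = 0;

    if (n == 0 || max_len < 0 || max_len > MAX_LEN_CAP)
        return -1;
    if (regcomp(&a, pat_a, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    if (regcomp(&b, pat_b, REG_EXTENDED | REG_NOSUB) != 0) {
        regfree(&a);
        return -1;
    }

    for (int len = 0; len <= max_len; len++) {
        for (int i = 0; i < len; i++)
            idx[i] = 0;
        for (;;) {
            /* Build the current candidate string from the index array. */
            for (int i = 0; i < len; i++)
                s[i] = alphabet[idx[i]];
            s[len] = '\0';

            int in_a = regexec(&a, s, 0, NULL, 0) == 0;
            int in_b = regexec(&b, s, 0, NULL, 0) == 0;
            if (in_a != in_b)
                diff++;

            /* Advance the index array like an odometer. */
            int pos = len - 1;
            while (pos >= 0 && ++idx[pos] == n) {
                idx[pos] = 0;
                pos--;
            }
            if (pos < 0)
                break;  /* every string of this length has been tried */
        }
    }

    regfree(&a);
    regfree(&b);
    return diff;
}
```

Calling count_disagreements("^[a-z_]+[a-z0-9_]*$", "^[a-z_][a-z0-9_]*$", "a1_-", 5) should report no disagreement between the original and the simplified expression.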

Step by step, every part of this great work is becoming clearer!

See you next time!


P.S.: One mystery of the file: \042 or \42 is an octal escape that refers to the character " (the double quote, ASCII code 34).
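C character literals use the same octal escape notation, so the value can be checked with a tiny sketch. The function name octal_042 is hypothetical and exists only for this illustration.

```c
/* Hypothetical helper: returns the character written as the octal
   escape \042. Octal 42 is 4*8 + 2 = 34, the ASCII double quote. */
char octal_042(void)
{
    return '\042';
}
```

Both '\042' and '\42' denote the same character, since the leading zero does not change the octal value.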