Saturday, 23 July 2011

Ignore a word in a regular expression

Today a friend of mine asked me a tricky regexp-related question: he wanted to match against a set of strings like:

WORD_FOO
WORD_BAR
WORD_FOOBAR
WORD_QUUX
WORD_FAR

He wanted to match and store any string starting with WORD_ and followed by a valid word, unless that word was FOO. So it should match every line above except WORD_FOO.

Tricky.

He was using something like /^(WORD_[^\s]+)$/, so I suggested using [^\s|FOO] (I had woken up only a few moments earlier). Of course that doesn't work: a character class excludes single characters, not words, so it would reject every string containing an F, an O or a | after the prefix, including WORD_FOOBAR and WORD_FAR.
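To see why the character class fails, here is a minimal sketch in Python (my choice for illustration; the `naive` and `samples` names are made up, and the pattern is only a reconstruction of that first attempt, using FOO as the word to skip):

```python
import re

# Hypothetical reconstruction of the first attempt: a negated character
# class that lists the letters of FOO (plus a literal "|").
naive = re.compile(r"^(WORD_[^\s|FOO]+)$")

samples = ["WORD_FOO", "WORD_BAR", "WORD_FOOBAR", "WORD_QUUX", "WORD_FAR"]

# Keep only the strings the naive pattern accepts.
naive_matches = [s for s in samples if naive.match(s)]
print(naive_matches)  # ['WORD_BAR', 'WORD_QUUX']
```

WORD_FOOBAR and WORD_FAR are lost along with WORD_FOO, simply because they contain an F.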

It took me some digging, but I finally found this answer on Stack Overflow that documents the use of negative look-arounds.

Using these constructs I came up with this regex: WORD_((?!FOO\W)\S+), which satisfies the requirements (you can check it on Rubular).

How does it work?


(?!FOO\W) is a negative look-ahead: it checks the next characters of the string without consuming them. If they are NOT (!) the word FOO followed by a non-word character (\W: whitespace, punctuation, etc.), the matching continues with \S+ (one or more non-whitespace characters). Since \S+ sits inside the capturing group, you'll get the second part of the word in your \1, $1, etc.
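The behaviour described above can be tried out with a short sketch. It uses Python's `re` module instead of Rubular (the look-ahead syntax is the same for this pattern); `pattern`, `strict` and `text` are names I made up for the example.

```python
import re

# The regex from the post: skip FOO only when a non-word character follows it.
pattern = re.compile(r"WORD_((?!FOO\W)\S+)")

# One candidate per line, each terminated by a newline.
text = "WORD_FOO\nWORD_BAR\nWORD_FOOBAR\nWORD_QUUX\nWORD_FAR\n"

# With a single capturing group, findall returns the captured suffixes.
suffixes = pattern.findall(text)
print(suffixes)  # ['BAR', 'FOOBAR', 'QUUX', 'FAR']

# Caveat: \W must match an actual character, so a WORD_FOO sitting at the
# very end of the input (no trailing newline) still slips through.
tail = pattern.search("WORD_FOO")
print(tail.group(1))  # FOO

# A possible variant (an assumption, not part of the original answer):
# \b also succeeds at the end of the string.
strict = re.compile(r"WORD_((?!FOO\b)\S+)")
print(strict.search("WORD_FOO"))              # None
print(strict.search("WORD_FOOBAR").group(1))  # FOOBAR
```

If the input always ends each line with a newline or other separator, the original \W version behaves as intended; the \b variant only matters for a WORD_FOO that is the last thing in the string.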

If you want to ignore all sub-strings starting with FOO (WORD_FOOBAR included), you can get rid of that \W and use (?!FOO) alone.
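As a quick check of the variant without \W, here is the same kind of sketch in Python (again, `loose` and `text` are illustrative names):

```python
import re

# Same regex without \W: any suffix that merely starts with FOO is rejected.
loose = re.compile(r"WORD_((?!FOO)\S+)")

text = "WORD_FOO\nWORD_BAR\nWORD_FOOBAR\nWORD_QUUX\nWORD_FAR\n"

loose_suffixes = loose.findall(text)
print(loose_suffixes)  # ['BAR', 'QUUX', 'FAR']
```

WORD_FAR still matches: the look-ahead compares against the whole word FOO, not against its individual letters.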
