Perl compatible regular expression to test which of two words comes first

a3nm

I am given a string containing a comma-separated list of words (where whitespace and case are not significant) and I want a Perl regexp to test the following: the string contains the (complete) word "french" and the (complete) word "english" does not occur earlier. For instance, I want to accept "french", "foobar, french", "bar, french, quux, english", "french, english, french"; but reject "foo, bar", "english, french", "foo, english, bar, french, english".
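To pin down the requirement before looking at regexps, here is a minimal reference check in plain Python. It is only a sketch for testing candidate regexps against; the function name `french_before_english` and the two example lists are hypothetical names introduced here, and the code assumes that "whitespace is not significant" means whitespace around each comma-separated word can be stripped.

```python
def french_before_english(header: str) -> bool:
    """Return True if 'french' occurs as a complete word and no
    complete word 'english' occurs earlier in the list."""
    # Split on commas, ignore surrounding whitespace and case.
    words = [w.strip().lower() for w in header.split(",")]
    if "french" not in words:
        return False
    first_french = words.index("french")
    # Only words before the first 'french' can disqualify the string.
    return "english" not in words[:first_french]


# Examples taken from the question.
accepted = ["french", "foobar, french",
            "bar, french, quux, english", "french, english, french"]
rejected = ["foo, bar", "english, french",
            "foo, english, bar, french, english"]
```

Any regexp that solves the task should agree with this function on both lists.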

My goal is to use a regexp of this kind in a lighttpd configuration. To be precise, I want to parse Accept-Language headers, with the naive heuristic that languages are listed in decreasing order of preference, which is often true although not prescribed by the RFC. Hence, I can only use a Perl-compatible regular expression; I cannot use any other features of Perl.

In terms of formal language theory, such a regular expression must exist, but the straightforward solution requires regexp negation, which is painful to perform. (This is why I ask the question with "french" and "english" rather than "fr" and "en", where regexp negation would be tedious but doable by hand.) Are there any Perl-specific regexp features to make it possible to write a concise regexp for my task, or is there a tool to automatically compile a regexp to perform this?



answered 4 years ago sln #1

Something like this should work.

Fail only if 'english' occurs before the first 'french':

 # /(?i)^(?:(?!\benglish\b).)*?\bfrench\b/

 (?i)                          # Case insensitive
 ^                             # BOS
 (?:                           # Consume characters one at a time ...
      (?! \b english \b )      # ... as long as 'english' does not start here
      .
 )*?
 \b french \b                  # 'french'
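As a quick sanity check, the one-line form of this regexp can be run against the examples from the question. The sketch below uses Python's `re` module rather than PCRE, on the assumption that lookahead, `\b`, lazy quantifiers, and a leading `(?i)` behave the same way for this pattern; `FIRST_ONLY` and `matches_first_only` are hypothetical names.

```python
import re

# Same pattern as the one-liner above; (?i) is at the very start,
# which newer Python versions require for inline global flags.
FIRST_ONLY = re.compile(r"(?i)^(?:(?!\benglish\b).)*?\bfrench\b")


def matches_first_only(s: str) -> bool:
    """True if 'french' appears and no 'english' starts before it."""
    return FIRST_ONLY.search(s) is not None


# Examples taken from the question.
accepted = ["french", "foobar, french",
            "bar, french, quux, english", "french, english, french"]
rejected = ["foo, bar", "english, french",
            "foo, english, bar, french, english"]

for s in accepted + rejected:
    print(f"{s!r}: {matches_first_only(s)}")
```

Under these assumptions the pattern accepts all four strings the question wants accepted and rejects the other three.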

Fail if any 'english' occurs before any 'french':

 # /(?i)^(?!.*\benglish\b.*\bfrench\b).*\bfrench\b/

 (?i)                          # Case insensitive
 ^                             # BOS
 (?!                           # Not 'english' .. 'french'
      .*
      \b english \b 
      .*
      \b french \b 
 )
 .*
 \b french \b                  # Must contain 'french' 
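This stricter variant does not behave like the first one on every input. The following sketch, again using Python's `re` as a stand-in for PCRE (an assumption, with `ANY_BEFORE` and `matches_any_before` as hypothetical names), shows that it rejects "french, english, french", which the question lists among the strings to accept.

```python
import re

# Same pattern as the one-liner above: reject whenever some 'english'
# is followed, anywhere later in the string, by some 'french'.
ANY_BEFORE = re.compile(r"(?i)^(?!.*\benglish\b.*\bfrench\b).*\bfrench\b")


def matches_any_before(s: str) -> bool:
    """True if 'french' appears and no 'english' precedes any 'french'."""
    return ANY_BEFORE.search(s) is not None


examples = ["french", "foobar, french", "bar, french, quux, english",
            "french, english, french", "foo, bar", "english, french",
            "foo, english, bar, french, english"]

for s in examples:
    print(f"{s!r}: {matches_any_before(s)}")
```

So if the question's examples are the specification, the first pattern is the one to use; this one suits the case where a later 'french' after 'english' should also count as a failure.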
