Match word boundary before non-alphanumerical character

I want to find words that start with a single non-alphanumerical character, say '$', in a string using re.findall.

Example of matching words


Example of non-matching words


Why \b does not work

If the first character were alphanumerical, say 'A', I could do this:

re.findall(r'\bA\w+', s)

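But this does not work for a pattern like \b\$\w+ because \b matches the empty string only between a \w and a \W character (or between a \w character and the start or end of the string). Since '$' is itself a \W character, \b\$ can only match when the '$' directly follows a word character.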

# The line below matches only the last '$baz', which is the one that should not be matched
re.findall(r'\b\$\w+', '$foo $bar x$baz')

The above outputs ['$baz'], but the desired pattern should output ['$foo', '$bar'].
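To see why, it can help to list every position where \b matches in the example string. This is only a small sketch for illustration; the variable names are made up for this example.

```python
import re

s = '$foo $bar x$baz'

# \b is zero-width: it matches wherever a \w character sits next to
# a \W character or next to an edge of the string.
boundaries = [m.start() for m in re.finditer(r'\b', s)]
print(boundaries)  # [1, 4, 6, 9, 10, 11, 12, 15]

# For each '$', check whether a word boundary sits directly in front of it.
dollar_checks = [(i, i in boundaries) for i, ch in enumerate(s) if ch == '$']
print(dollar_checks)  # [(0, False), (5, False), (11, True)]
```

Only the '$' at index 11 (the one in 'x$baz') has a boundary in front of it, which is why \b\$\w+ finds '$baz' and nothing else.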

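I tried replacing \b with a positive lookbehind on ^|\s, that is (?<=^|\s), but this does not work because lookbehinds in the re module must be fixed-width, and the two alternatives differ in width (^ matches zero characters, \s matches one).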
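The failure can be reproduced at compile time, before any string is searched. The helper `compiles` below is a hypothetical name used only for this sketch.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of width 0 (^) and width 1 (\s): rejected by re.
print(compiles(r'(?<=^|\s)\$\w+'))  # False

# A lookbehind on \s alone is fixed-width and compiles,
# but it misses a word at the very start of the string.
print(compiles(r'(?<=\s)\$\w+'))  # True
print(re.findall(r'(?<=\s)\$\w+', '$foo $bar x$baz'))  # ['$bar']
```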

What is the correct way to handle this pattern?



answered 5 months ago Olivier Melançon #1

One way is to use a negative lookbehind with the non-whitespace metacharacter \S.

import re

s = '$Python $foo foo$bar baz'

re.findall(r'(?<!\S)\$\w+', s) # output: ['$Python', '$foo']
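The negative lookbehind (?<!\S) succeeds both at the start of the string, where there is no previous character at all, and after any whitespace, so it covers the two cases that ^|\s was meant to cover while staying fixed-width. As a rough sketch, the same idea can be wrapped in a helper that takes the prefix character as a parameter. The function name and the `prefix` argument are assumptions made for this example, not part of the answer above.

```python
import re

def find_prefixed_words(text, prefix='$'):
    """Find words that start with `prefix` and are not preceded by a non-space character."""
    # re.escape keeps characters such as '$' or '+' literal inside the pattern.
    pattern = r'(?<!\S)' + re.escape(prefix) + r'\w+'
    return re.findall(pattern, text)

print(find_prefixed_words('$foo $bar x$baz'))         # ['$foo', '$bar']
print(find_prefixed_words('#a b#c\t#d', prefix='#'))  # ['#a', '#d']
```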

answered 5 months ago Evan #2

The following will match a word starting with a single non-alphanumerical character.

re.findall(r'''
(?:     # start non-capturing group
  ^         # start of string
  |         # or
  \s        # whitespace character
)       # end non-capturing group
(       # start capturing group
  [^\w\s]   # character that is neither a word nor a whitespace character
  \w+       # one or more word characters
)       # end capturing group
''', s, re.X)

or just:

re.findall(r'(?:^|\s)([^\w\s]\w+)', s, re.X)

results in:

'$a $b a$c $$d' -> ['$a', '$b']
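One detail to keep in mind with this pattern is that (?:^|\s) consumes the preceding whitespace, so the whole match can include a leading space; re.findall hides this because it returns only the capturing group. The sketch below shows the difference with finditer. The second pattern is an assumption for comparison: it combines the character class from this answer with the (?<!\S) lookbehind from answer #1.

```python
import re

s = '$a $b a$c $$d'

consuming = re.compile(r'(?:^|\s)([^\w\s]\w+)')
lookbehind = re.compile(r'(?<!\S)[^\w\s]\w+')  # assumed combination, for comparison

# group(0) is the whole match, including the whitespace that was consumed.
print([m.group(0) for m in consuming.finditer(s)])  # ['$a', ' $b']

# group(1) is the captured word only, which is what re.findall returns.
print([m.group(1) for m in consuming.finditer(s)])  # ['$a', '$b']

# With a zero-width lookbehind, the whole match is already the word.
print(lookbehind.findall(s))  # ['$a', '$b']
```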
