Perl-REGEXP How to match substring from words w/o alternate patterns?

Good afternoon all,

I have a string of blank-separated words. I need to find the words in that string that match an alphanumeric pattern, either as the whole word or as part of it. I only want words made entirely of alphanumeric characters.

To make my purpose clearer I have the string:

'foo bar quux foofoo foobar fooquux barfoo barbar barquux ' .
'quuxfoo quuxbar quuxquux [foo] (foo) {foo} foofoo barfoo ' .
'quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo'

and I want to find all words containing 'foo' (reported only once per word), but not those containing special (non-alphanumeric) characters, like "[foo]", "{foo}"...

I have done this with the following piece of code in Perl:

use feature 'say';   # say needs Perl 5.10 or later

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';
my @m = ($s=~/(\w+foo|foo\w+|^foo|foo$)/g) ;
say "@m";
say "Number of sub-strings matching the pattern: ", scalar @m;
print( sprintf("%02d: ",$_),
       ($s=~/(\w+foo|foo\w+|^foo|foo$)/g)[$_],
       qq(\n) )
    for (0 .. $#m);

I get the result I want:

foo foofoo foobar fooquux barfoo quuxfoo foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo
Number of sub-strings matching the pattern: 15 
00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo

But if I need (and I will) to add more patterns to search for in a more complex string, it quickly becomes messy and I get confused by the chain of alternations ('|').

Can someone help me write a shorter, cleaner regexp that delimits the 'foo' (or any other) word or sub-word, so that it can be written as one single pattern?

Thank you in advance.

GM

Strawberry Perl 5.022 on W7/64, but I think it's fairly generic to any Perl above 5.016 or even 5.008.


dawg's solution (and steffen's too) suits me. It is not the most readable, and the grep one is closer to my level of Perl, but since it is purely regexp-based I think it will cope better when I add more words and have to handle word limits.

$s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g


(?:(?<=\h)|^)  Assert either after a \h (horizontal space) or at start of line ^
(\w*foo\w*)    Capture a 'word' with 'foo' and only \w characters (or, [a-zA-Z0-9_] characters)
(?=\h|$)       Assert before either a \h horizontal space or end of line $

I would like to write down here what I understood of it, so that you can correct me if I'm wrong before I expand it for my actual needs.

(?:         # You start a non-capturing group.
(?<=        # You start a lookbehind (so non-capturing BY NATURE, am I right? Because
            # if not, being enclosed in round brackets '()', it would be capturing
            # again even inside a non-capturing group, wouldn't it?)
 \h         # In the lookbehind you look for a horizontal space (could \s have been used
            # there?)
 ^          # In the non-capturing group but outside of the lookbehind you look for the
            # start-of-string anchor. It must not be inside the lookbehind group because
            # that requires a fixed-length pattern, and ^ has length==0 while \h is
            # non-zero.
\w*foo\w*   # You look for foo within an alphanumeric word. No problem having '*' rather
            # than '+' because your left bound (and the right one, as we'll see below)
            # has been well restricted.
(?=         # You start a lookahead (non-capturing by nature here again, right?),
            # to look for:
\h or $     # a horizontal space or the end-of-string anchor. However, the lengths
            # differ here too, as $ is still zero-length (like the ^ anchor) and \h is
            # still non-zero. "AND YET IT MOVES" (I tested your regexp and it worked)
            # because only the lookbehind has the 'same-size' pattern restriction, right?

Thank you for your help, all of you, after that last point I won't bother you any longer with my little problems and consider my question fully answered. G.

regex perl

Answers

answered 6 days ago dawg #1

Perhaps filter the unwanted words first then use grep against the filtered words:

use strict;
use warnings;

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

my @words = ( $s=~/(?:(?<=\h)|^)(\w+)(?=\h|$)/g );

my @foos = grep(/foo/, @words);

while (my ($i, $v) = each @foos) {
    printf "%02d: %s\n", $i,$v;
}

Prints:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo

Alternatively, you can split the string on horizontal spaces and, in a single grep, keep only the words that contain foo and are entirely alphanumeric:

@foos=grep {/foo/ && /^\w+$/} split /\h/, $s;  # same result

Or,

@foos=grep {/^\w*foo\w*$/} split /\h/, $s; 

Or, in a single regex:

@foos=($s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g);

As requested in the comments, here is a breakdown of:

$s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g


(?:(?<=\h)|^)  Assert either after a \h (horizontal space) or at start of string ^ (start of line only with /m)
(\w*foo\w*)    Capture a 'word' with 'foo' and only \w characters (or, [a-zA-Z0-9_] characters)
(?=\h|$)       Assert before either a \h horizontal space or end of string $ (end of line only with /m)

The only tricky part is (?:(?<=\h)|^). Perl versions before 5.30 reject a variable-width lookbehind such as (?<=\h|^), since ^ is zero-width and \h is not (5.30 added limited, experimental support for it). (Interestingly, the regex (?<=\h|^) is legal in the PCRE library.) So (?:(?<=\h)|^) splits the two assertions into separate alternatives inside one non-capturing group.
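If the split group still feels awkward, one possible alternative (a different technique, not the pattern above) is to use negative lookarounds: "not preceded by a non-space" and "not followed by a non-space". Both are fixed-width and both hold at the string edges, so no alternation is needed. A minimal sketch, assuming a hypothetical helper name words_containing and treating any whitespace (not just \h) as a separator:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): returns the
# whitespace-delimited words of $s that contain $needle and consist
# only of \w characters.
sub words_containing {
    my ($s, $needle) = @_;
    # (?<!\S) : previous char is not a non-space (also true at start of string)
    # (?!\S)  : next char is not a non-space (also true at end of string)
    # \Q...\E : keep any regex metacharacters in $needle literal
    return $s =~ /(?<!\S)(\w*\Q$needle\E\w*)(?!\S)/g;
}

my @foos = words_containing('foo [foo] barfoo (foo) foo2bar quux', 'foo');
print "@foos\n";    # prints: foo barfoo foo2bar
```

Because \S is the complement of \s, this version also treats newlines as separators, which the \h version does not.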

answered 6 days ago steffen #2

It depends: if you want to get foobar from (foobar), it's easy. You just match foo with optional word characters before and after, and then on both sides a word boundary \b (a position between a word character and either a non-word character or the begin/end of input):

my @m = ($s=~/(\b\w*foo\w*\b)/g);
print( sprintf("%02d: ",$_),
    ($s=~/(\b\w*foo\w*\b)/g)[$_],
    qq(\n) )
for (0 .. $#m);

Output:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foo
07: foo
08: foo
09: foofoo
10: barfoo
11: quuxfoo
12: foo2foo
13: foo2bar
14: foo2quux
15: foo2foo
16: bar2foo
17: quux2foo

If not, then it's a bit more difficult. Here I'd match begin-of-input or a whitespace character, then foo surrounded by optional word characters, and then a (zero-length) assertion that requires a whitespace character or end-of-input:

my @m = ($s=~/(?:^|\s)(\w*foo\w*)(?=\s|$)/g);
print( sprintf("%02d: ",$_),
    ($s=~/(?:^|\s)(\w*foo\w*)(?=\s|$)/g)[$_],
    qq(\n) )
for (0 .. $#m);

Output:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo
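On the question of whether \s could replace \h: for the sample string the two give the same result, but they can differ once the input contains newlines. A small sketch on a made-up input (the string $text is an assumption for illustration, not taken from the question), comparing the \h pattern from the other answer with the \s pattern above:

```perl
use strict;
use warnings;

# Hypothetical input: words separated by a newline, a space and a tab.
my $text = "foo\nfoobar barfoo\tquuxfoo";

# \h matches only horizontal whitespace (space, tab, ...), and without /m
# the anchors ^ and $ match only at the edges of the whole string.
my @with_h = $text =~ /(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g;

# \s also matches the newline, so every word is delimited.
my @with_s = $text =~ /(?:^|\s)(\w*foo\w*)(?=\s|$)/g;

print "\\h: @with_h\n";    # prints: \h: barfoo quuxfoo
print "\\s: @with_s\n";    # prints: \s: foo foobar barfoo quuxfoo
```

With \h, the two words touching the newline are dropped; adding the /m modifier or switching to \s are two ways to pick them up, depending on what the real input looks like.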

answered 6 days ago Casimir et Hippolyte #3

You can split your string and filter the array:

use strict;
use warnings;

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

my @res = grep {/foo/ && !/\W/}  split /\s/, $s;

print join(" ", @res);
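Since the question mentions adding more search words later, the split-and-filter approach can be extended by building one alternation from a list. A minimal sketch, assuming a hypothetical helper name words_with_any and a non-empty list of literal search words:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): keep the words of $s
# that contain at least one of @needles and have no non-word character.
# Assumes @needles is non-empty; an empty list would match every word.
sub words_with_any {
    my ($s, @needles) = @_;
    # quotemeta keeps regex metacharacters in the needles literal.
    my $alt = join '|', map { quotemeta } @needles;
    my $any = qr/$alt/;
    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    return grep { /$any/ && !/\W/ } split ' ', $s;
}

my @res = words_with_any(
    'foo bar quux [foo] foobar barquux (bar) foo2bar',
    'foo', 'quux',
);
print join(' ', @res), "\n";    # prints: foo quux foobar barquux foo2bar
```

The search words stay in a plain Perl list, so adding one does not make the pattern itself any longer or harder to read.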
