Capture word between optional hyphens regex

Abhishek

I have the following types of strings:

abc - xyz
abc - pqr - xyz
abc - - xyz
abc - pqr uvw - xyz

I want to retrieve the text xyz from the 1st string, pqr from the 2nd, an empty string from the 3rd, and pqr uvw from the 4th. The 2nd hyphen is optional. abc is a static string and must always be present. I tried the following regex:

/^(?:abc) - (.*)[^ -]?/

But it gives me the following output:

xyz
pqr - xyz
- xyz
pqr uvw - xyz

I don't need the last part in the second string. I'm using Perl for scripting. Can it be done via regex?

regex perl hyphen

Answers

answered 3 months ago Wiktor Stribiżew #1

Note that the (.*) part is a greedily quantified dot: it grabs any 0+ chars other than line break chars, as many as possible, up to the end of the line. The [^ -]? that follows can match an empty string thanks to the ? quantifier (1 or 0 repetitions), so it matches the empty string at the end of the line. Thus, the pqr - xyz output for abc - pqr - xyz is exactly what the regex engine is expected to produce.
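To see this behaviour concretely, here is a minimal Perl sketch that runs the pattern from the question over the four sample strings. The helper name greedy_capture is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the pattern from the question and returns
# whatever group 1 captured (undef if the pattern does not match).
sub greedy_capture {
    my ($line) = @_;
    my ($captured) = $line =~ /^(?:abc) - (.*)[^ -]?/;
    return $captured;
}

# The four sample strings from the question.
my @samples = ('abc - xyz', 'abc - pqr - xyz', 'abc - - xyz', 'abc - pqr uvw - xyz');

for my $s (@samples) {
    # (.*) consumes the rest of the line; [^ -]? is then left with nothing to match.
    printf "%-20s => '%s'\n", $s, greedy_capture($s);
}
```

The captures are xyz, pqr - xyz, - xyz and pqr uvw - xyz, the same unwanted output reported in the question.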

You need to use a more restrictive pattern here. E.g.

/^abc\h*-\h*((?:[^\s-]+(?:\h+[^\s-]+)*)?)/


Details

  • ^ - start of the string
  • abc - the literal text abc
  • \h*-\h* - a hyphen surrounded by 0+ horizontal whitespace chars
  • ((?:[^\s-]+(?:\h+[^\s-]+)*)?) - Group 1, capturing an optional occurrence of:
    • [^\s-]+ - 1 or more chars other than whitespace and -
    • (?:\h+[^\s-]+)* - zero or more repetitions of:
      • \h+ - 1+ horizontal whitespace chars
      • [^\s-]+ - 1 or more chars other than whitespace and -
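The pattern broken down above can be tried with a short Perl sketch like the one below. The sub name middle_part is an assumption for illustration, and \h needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text captured in group 1 by the
# restrictive pattern, or undef when the line does not start with "abc -".
sub middle_part {
    my ($line) = @_;
    if ($line =~ /^abc\h*-\h*((?:[^\s-]+(?:\h+[^\s-]+)*)?)/) {
        return $1;
    }
    return undef;
}

# The four sample strings from the question.
my @samples = ('abc - xyz', 'abc - pqr - xyz', 'abc - - xyz', 'abc - pqr uvw - xyz');

for my $s (@samples) {
    my $got = middle_part($s);
    printf "%-20s => '%s'\n", $s, defined $got ? $got : '(no match)';
}
```

For the sample strings this yields xyz, pqr, an empty string and pqr uvw. Because group 1 is optional, a line such as abc - - xyz still matches and returns an empty string rather than failing.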

answered 3 months ago PJProudhon #2

You could use ^[^-]*-\s*\K[^\s-]*.

Here's how it works:

^       # Matches at the start of the string (or of each line with /m)
[^-]*   # Matches any run of characters other than -
-       # Followed by a literal -
\s*     # Matches any whitespace characters
\K      # Discards everything matched so far from the reported match
[^\s-]* # Matches any run of characters that are neither whitespace nor -



Update for multiple enclosed words: ^[^-]*-\s*\K[^\s-]*(?:\s*[^\s-]+)*

The last part, (?:\s*[^\s-]+)*, matches any further words, each optionally preceded by whitespace.
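As a rough sketch of how the updated \K pattern might be used from Perl: since \K drops the prefix from the reported match, the wanted text ends up in $& and no capture group is needed. The sub name kept_match is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the part of the match kept after \K,
# or undef when the line contains no hyphen at all.
sub kept_match {
    my ($line) = @_;
    if ($line =~ /^[^-]*-\s*\K[^\s-]*(?:\s*[^\s-]+)*/) {
        return $&;    # whole match, minus everything before \K
    }
    return undef;
}

# The four sample strings from the question.
my @samples = ('abc - xyz', 'abc - pqr - xyz', 'abc - - xyz', 'abc - pqr uvw - xyz');

for my $s (@samples) {
    my $got = kept_match($s);
    printf "%-20s => '%s'\n", $s, defined $got ? $got : '(no match)';
}
```

This gives xyz, pqr, an empty string and pqr uvw for the sample strings. Note that this pattern does not check for the literal abc prefix, so it accepts any text before the first hyphen.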


answered 3 months ago 7stud #3

Can it be done via regex?

Yes, with three simple regexes: - and ^\s+ and \s+$.

use strict;
use warnings; 
use 5.020;
use autodie;
use Data::Dumper;

open my $INFILE, '<', 'data.txt';

my @results = map {
    (undef, my $target) = split /-/, $_, 3;  #split into at most 3 fields, keep the 2nd
    $target =~ s/^\s+//;  #remove leading spaces
    $target =~ s/\s+$//;  #remove trailing spaces
    $target;
} <$INFILE>;

close $INFILE;

say Dumper \@results;

--output:--
$VAR1 = [
          'xyz',
          'pqr',
          '',
          'pqr uvw'
        ];

answered 3 months ago Willa #4

You could use split:

$answer = (split / \- /, $t)[1];

Here $t is the text string and [1] selects the 2nd field, since indexing starts from 0. This works for everything except abc - - xyz: with " - " as the separator, that string would need 2 spaces between the hyphens for the middle field to come back empty. If abc - - xyz is valid input, you can run this before the split to make all cases work:

$t =~ s/\- \-/-  -/;

It simply inserts an extra space so that " - " matches twice with nothing in between.
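Putting the substitution and the split together, a minimal sketch could look like this. The sub name second_field is invented for illustration, and the code assumes the separator is always exactly " - ".

```perl
use strict;
use warnings;

# Hypothetical helper: pads an adjacent "- -" with an extra space, then
# returns the 2nd field of a split on " - " (undef if there is no 2nd field).
sub second_field {
    my ($t) = @_;
    $t =~ s/- -/-  -/;                  # "abc - - xyz" becomes "abc -  - xyz"
    my $field = (split / - /, $t)[1];   # index 1 is the 2nd field
    return $field;
}

# The four sample strings from the question.
my @samples = ('abc - xyz', 'abc - pqr - xyz', 'abc - - xyz', 'abc - pqr uvw - xyz');

for my $s (@samples) {
    my $got = second_field($s);
    printf "%-20s => '%s'\n", $s, defined $got ? $got : '(no 2nd field)';
}
```

With the padding in place, the four sample strings give xyz, pqr, an empty string and pqr uvw. Unlike the regex answers, this approach does not tolerate missing or extra spaces around the hyphens.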
