Coding an awk command inside a perl script

Greyson B

I have two types of tab-separated input files. The first is a matrix with names listed vertically in the first column and numerical values in the subsequent columns. The second contains a single column holding a subset of the names found in the first column of the first file type.

EX: input1

Gary 1 2 3
Yolanda 3 4 5
Biff 5 6 7
Hubert 8 9 10

EX: input2

Gary
Biff 

While there are several variations of input2, there is only a single input1. I have a Perl script with an embedded awk command that is supposed to match names from input2 against input1 and write an output file containing the names from input2 along with their respective values from input1.

EX: outputfile

Gary 1 2 3
Biff 5 6 7

Here is my code:

#!/usr/bin/perl

use strict;
use warnings;

my $dir1 = '../FeatureSelection/Chunks/ArffPreprocessing';
my $dir2 = '../DataFiles';

opendir(DIR, $dir1) or die $!;
while (my $file = readdir(DIR)) {

    # We only want files
    next unless (-f "$dir1/$file");

    # Use a regular expression to find files with .txt
    next unless ($file =~ m/\.txt/);

    my @partialName = (split /\./, $file);

    #The $matchingFile is the file which contains attributes listed vertically, along side their respective data

    my $matchingFile = "$dir2/input1\.txt ";

    system("awk -F\"\t\" 'FILENAME==\"$dir1/$file\"{a[\$1]=\$1} FILENAME==\"$matchingFile\"{if(a[\$1]){print \$0}}' $dir1/$file $matchingFile > $dir1/$partialName[0]'\_matched.out' ");

}

closedir(DIR);
exit 0;

This line works on the command line, but it refuses to work from my Perl script:

awk -F"\t" 'FILENAME=="input2.txt"{a[$1]=$1} FILENAME=="../../../DataFiles/input1.txt"{if(a[$1]){print $0}}' input2.txt ../../../DataFiles/input1.txt > input2_matched.out

By the way, the sheer number of input2 files makes hard-coding the above awk line at the command prompt a real pain in the butt, which is why I use a Perl script that performs the desired function on every input2 file in the directory AND keeps the naming convention for the output files. I've written similar programs, so I know the syntax of

system("awk ...blah blah... ");

can and does work properly.

I've been stuck on this problem for days now, so any help would be most appreciated!

perl awk

Answers

answered 2 years ago sjsam #1

I would suggest using find plus a comparison function to achieve your objective:

matcher(){
awk 'NR==FNR{input1record[$1]=$0;next}
    $1 in input1record{print input1record[$1]}' /path/to/input1 "$@" >> /path/to/result
}
export -f matcher
find /path/to/input2_files -type f -name "input2" \
     -exec bash -c 'matcher "$@"' _ {} +

References

  1. With {} +, find appends as many file names as fit onto one command line, so the subshell command (our function in this case) runs as few times as possible, usually once. See the find manpage.

  2. Note that I have used >> to append the output of subsequent runs to the output file. If this is not desired, use > instead.

  3. The pattern given to -name should be adjusted to match all the input2 filenames.
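
If each input2 file needs its own output file, as in the question, a plain shell loop may be simpler than find. The following is only a sketch: match_names and match_all are hypothetical helper names, and it assumes the name files end in .txt and sit in a single directory. It uses the NR==FNR idiom rather than comparing FILENAME against a path, which sidesteps one likely problem in the Perl script: $matchingFile appears to end with a space inside the quotes ("$dir2/input1\.txt "), so the FILENAME== test probably never matches the name awk actually receives.

```shell
# Hypothetical helper: print the rows of the matrix file ($1) whose first
# tab-separated field appears in the names file ($2).
match_names() {
    awk -F'\t' '
        NR == FNR { wanted[$1] = 1; next }   # first file read: collect the names
        $1 in wanted                         # second file read: print matching rows
    ' "$2" "$1"
}

# Hypothetical wrapper: for every NAME.txt in the directory ($2), write
# NAME_matched.out into the same directory, using the matrix file ($1).
match_all() {
    for names_file in "$2"/*.txt; do
        [ -f "$names_file" ] || continue
        base=${names_file##*/}    # strip the directory part
        base=${base%%.*}          # keep the text before the first dot, like the Perl split
        match_names "$1" "$names_file" > "$2/${base}_matched.out"
    done
}

# Assumed usage, with the paths from the question:
#   match_all ../DataFiles/input1.txt ../FeatureSelection/Chunks/ArffPreprocessing
```

Rows come out in the order they appear in input1, and an input2 file with no matching names simply produces an empty output file.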
