Count the number of records and remove newline characters inside a record in a CSV file (almost 132 GB) using shell/awk/perl

diksha ojha

I'm unable to count the number of records and remove the newline characters that occur inside a single record, so that I can pass the resulting output on to another program. The CSV file looks like this:

RandomName,FileName,Date,OwnerName
"f","df",10/12/1298,"dgds"
-13,"fg
dhd
fd
f",10/22/1029,"dvg 
tr
-456
3gf"
"123","fd13",13/23/1245,"13
sdg
fsdg"
dv,"Df",12/12/3455,"adf"

Expected Output

RandomName,FileName,Date,OwnerName
"f","df",10/12/1298,"dgds"
-13,"fgdhdfdf",10/22/1029,"dvgtr-4563gf"
"123","fd13",13/23/1245,"13sdgfsdg"
dv,"Df",12/12/3455,"adf"

The file is 132 GB in size. I'm using this solution:

perl -0777 -pe 's/((?:,"|(?!^)\G)[^",\n]*)\n/\1/g; s/,\n/,/' "${dir}" | wc -l

But it's throwing a kernel soft lockup error. I have shell/awk/perl on my server. My file can have:

  1. any number of records
  2. a size of up to 132 GB
  3. special characters ($, @, #, *, -, _, %)
  4. newline characters occurring more than once within a single record

Kindly help me find a solution that prints the output to the console and also writes it to another CSV file. Thanks in advance.

bash shell perl csv awk

Answers

answered 6 months ago Borodin #1

You just need to use the Text::CSV_XS module with the binary option enabled. This allows quoted fields to contain control characters, including CR and LF. The _XS suffix indicates that the module has a substantial C component, so it should be about as fast as you can get without writing the whole thing in C

This program expects the input file as a parameter on the command line

You don't say anything about the output that you want, so I have used the Data::Dump module to display the result of using Text::CSV_XS to parse each row of your example data

use strict;
use warnings 'all';

use Data::Dump 'pp';
use Text::CSV_XS;

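# The input CSV file is the first command-line argument
my ( $csv_file ) = @ARGV or die "CSV File parameter missing";

open my $fh, '<', $csv_file or die qq{Unable to open "$csv_file" for input: $!};

# binary => 1 allows quoted fields to contain embedded CR and LF
my $csv = Text::CSV_XS->new( {
    binary => 1,
} );

my $num_records = 0;

# getline reads one complete record, which may span several physical lines
while ( my $row = $csv->getline( $fh ) ) {

    print pp($row), "\n\n";

    ++$num_records;
}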

printf "Total of %d %s\n\n",
        $num_records,
        $num_records == 1 ? 'record' : 'records';

output

["RandomName", "FileName", "Date", "OwnerName"]

["f", "df", "10/12/1298", "dgds"]

[-13, "fg\ndhd\nfd\nf", "10/22/1029", "dvg \ntr\n-456\n3gf"]

[123, "fd13", "13/23/1245", "13\nsdg\nfsdg"]

["dv", "Df", "12/12/3455", "adf"]

Total of 5 records
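
The question also asks for a cleaned copy of the file, which the program above does not produce. The following is a minimal sketch of one way to extend the same Text::CSV_XS approach. It reads one record at a time, so the whole 132 GB file is never held in memory (unlike perl -0777, which slurps the entire input). The helper name clean_csv, the two-argument command line, and the choice to drop spaces or tabs that sit just before a line break (inferred from the expected output, where "dvg " followed by "tr" becomes "dvgtr") are assumptions for illustration, not part of the original answer.

```perl
use strict;
use warnings;

use Text::CSV_XS;

# Hypothetical helper (not from the answer): copy CSV rows from $in_fh to
# $out_fh, deleting line breaks inside fields, and return the row count.
sub clean_csv {
    my ( $in_fh, $out_fh ) = @_;

    # binary => 1 lets quoted fields contain CR/LF; auto_diag reports parse errors
    my $csv_in  = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );

    # eol => "\n" makes print() terminate every output row with a newline
    my $csv_out = Text::CSV_XS->new( { binary => 1, eol => "\n" } );

    my $num_records = 0;

    while ( my $row = $csv_in->getline( $in_fh ) ) {

        for my $field ( @$row ) {
            next unless defined $field;

            # Assumption: remove each line break plus any spaces or tabs just before it
            $field =~ s/[ \t]*\R//g;
        }

        $csv_out->print( $out_fh, $row ) or die "Unable to write row";

        ++$num_records;
    }

    return $num_records;
}

if ( @ARGV == 2 ) {
    my ( $in_file, $out_file ) = @ARGV;

    open my $in_fh,  '<', $in_file  or die qq{Unable to open "$in_file" for input: $!};
    open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};

    my $num_records = clean_csv( $in_fh, $out_fh );

    close $out_fh or die qq{Unable to close "$out_file": $!};

    printf "Total of %d %s\n", $num_records, $num_records == 1 ? 'record' : 'records';
}
else {
    warn "Usage: $0 input.csv output.csv\n";
}
```

Saved under a hypothetical name such as clean_csv.pl, it would be run as perl clean_csv.pl input.csv output.csv. As in the program above, the count includes the header row. Text::CSV_XS quotes fields on output only where needed, so the quoting in the result will not match the input exactly; the always_quote option quotes every field if that is preferred. To show the rows on the console as well, the same print call can be repeated with \*STDOUT as the handle.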
