Combining (rbind) multiple large csv files into one with different column orders

asked by Rushabh

I have multiple large CSV files in a folder and I am trying to rbind (concatenate) them into one CSV. While doing this, I want to make sure all column values end up in the appropriate column after concatenating. I can't do this in R because of memory limitations. I am fairly new to shell scripting, and I suspect there is a way to do this without loading all the CSV files into memory.

Eg.

> csv1
     A  B  C  D  E
     1  2  4  5  6
     4  5  7  8  9
     3  5  6  7  8
     2  3  4  5  8

> csv2
    C  B  E  D  A
    10 22 43 35 66
    14 15 37 48 99
    33 25 56 67 88

> Desired Output
         A  B  C  D  E
         1  2  4  5  6
         4  5  7  8  9
         3  5  6  7  8
         2  3  4  5  8
        66 22 10 35 43
        99 15 14 48 37
        88 25 33 67 56

My attempts:

I tried setting the column order in R for each file while saving, and then used the code below to concatenate. I want to know a way to do everything in the Linux shell.

nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > result.csv
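For reference, a quick way to reproduce the problem with the sample data above (whitespace-separated, as shown) and see the misalignment:

```shell
# Recreate the sample inputs shown above (whitespace-separated).
cat > csv1 <<'EOF'
A  B  C  D  E
1  2  4  5  6
4  5  7  8  9
3  5  6  7  8
2  3  4  5  8
EOF

cat > csv2 <<'EOF'
C  B  E  D  A
10 22 43 35 66
14 15 37 48 99
33 25 56 67 88
EOF

# Header-skipping concatenation alone keeps csv2's values in csv2's
# column order, so e.g. the C values land under the A header.
awk 'FNR==1 && NR!=1{next;}{print}' csv1 csv2
```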

Any help is highly appreciated.

rlinuxbashshellawk

Answers

answered 3 months ago RavinderSingh13 #1

Since I don't have large data, I couldn't test this at scale. Could you please try it and let me know if it helps?

Solution 1: record the field order from the first file's header, note each subsequent file's own header positions, and print fields in the first file's order:

awk '
FNR==NR && FNR==1{        # header of the first file: remember its column order
  for(i=1;i<=NF;i++){
      b[i]=$i};
  print;
  i--;                    # i is now the number of columns
  next
}
FNR==NR{                  # body of the first file: print as-is
  print;
  next
}
FNR!=NR && FNR==1{        # header of a later file: map column name -> position
  for(j=1;j<=NF;j++){
      c[$j]=j};
  next
}
FNR!=NR && FNR>1{         # body of a later file: print fields in the first file'"'"'s order
  for(k=1;k<=i;k++){
      printf("%s%s",$c[b[k]],k==i?RS:FS)}
}
' csv1 csv2

Output will be as follows:

A  B  C  D  E
1  2  4  5  6
4  5  7  8  9
3  5  6  7  8
2  3  4  5  8
66 22 10 35 43
99 15 14 48 37
88 25 33 67 56
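To check this end-to-end, the sample inputs can be recreated and the script run unchanged (the reordered rows come out separated by awk's default single-space field separator):

```shell
# Recreate the sample inputs and run the script above as-is.
cat > csv1 <<'EOF'
A  B  C  D  E
1  2  4  5  6
4  5  7  8  9
3  5  6  7  8
2  3  4  5  8
EOF

cat > csv2 <<'EOF'
C  B  E  D  A
10 22 43 35 66
14 15 37 48 99
33 25 56 67 88
EOF

awk '
FNR==NR && FNR==1{for(i=1;i<=NF;i++){b[i]=$i}; print; i--; next}
FNR==NR{print; next}
FNR!=NR && FNR==1{for(j=1;j<=NF;j++){c[$j]=j}; next}
FNR!=NR && FNR>1{for(k=1;k<=i;k++){printf("%s%s",$c[b[k]],k==i?RS:FS)}}
' csv1 csv2
```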

Solution 2: the OP confirmed that printing the columns in sorted order is acceptable, so we need not record the field sequence of the first file each time (this also assumes every input file has the same set of fields). It should be faster than the previous solution. Note that PROCINFO["sorted_in"] requires GNU awk (gawk):

awk '
BEGIN{
  PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk: traverse arrays in sorted index order
}
FNR==1{                  # each header: map column name -> position
  for(i=1;i<=NF;i++){
     a[$i]=i};
  if(NR==1){             # print the header only for the first file
     print};
  next
}
{
  for(j in a){           # print fields in sorted column-name order
     printf("%s ",$a[j])}
  print ""
}
' csv1 csv2

answered 3 months ago karakfa #2

Another, similar awk:

$ awk 'NR==1   {split($0,t)} 
       NR==FNR {print; next}  
       FNR==1  {for(i=1;i<=NF;i++) k[$i]=i; next}
               {for(i=1;i<=NF;i++) 
                   printf("%s%s", $(k[t[i]]), (i==NF?ORS:OFS))}' file1 file2 | column -t

A   B   C   D   E
1   2   4   5   6
4   5   7   8   9
3   5   6   7   8
2   3   4   5   8
66  22  10  35  43
99  15  14  48  37
88  25  33  67  56

This uses the order of the column headers from the first file, and assumes a matching number of columns in every file.
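If the column sets can differ between files, a small variant of the same idea can print a placeholder for columns a later file lacks, instead of a misaligned field. This is a sketch, not part of the answer above: the `NA` placeholder and the file names `file1`/`file3` are assumptions.

```shell
# Sample inputs: file3 lacks column E and uses a different order.
cat > file1 <<'EOF'
A B C D E
1 2 4 5 6
EOF

cat > file3 <<'EOF'
C B D A
10 22 35 66
EOF

# Like the answer above, but n remembers file1's column count, the
# mapping k is reset per file, and missing columns print "NA".
awk 'NR==1   {split($0,t); n=NF}
     NR==FNR {print; next}
     FNR==1  {split("",k); for(i=1;i<=NF;i++) k[$i]=i; next}
             {for(i=1;i<=n;i++)
                 printf("%s%s", (t[i] in k ? $(k[t[i]]) : "NA"), (i==n?ORS:OFS))}' file1 file3
```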
