unix to the rescue

Problem: I have 3 flat files F1, F2 and F3 with ‘|’ as the delimiter.
F1 has columns C1|CX|C2|CY|CZ
F2 has columns C1|C3|CN|CM|C4
F3 has columns C1|C3|CD|C5|CE

My requirement is to get C1, C2, C3, C4 and C5 in one flat file.
Conditions:
1> F1 can be joined with both F2 and F3 on C1.
2> F2 and F3 have to be joined on both columns C1 and C3.
3> All records of F2 should be captured.
4> C1, C2 and C3 are numerical columns.

In SQL this would somewhat look like below

Select ....
From F2 left outer join F1 on F1.C1=F2.C1
left outer join F3 on F2.C1=F3.C1 and F2.C3=F3.C3

But I need this done in Unix without loading into DB.

My Solution:

Step 1>

less F1 | cut -f 1,3 -d "|" | sort -k1 -n -t $'|' > tempF1
less F1 | cut -f 1,2,5 -d "|" | sort -k1 -n -t $'|' > tempF2
less F1 | cut -f 1,3,4 -d "|" | sort -k1 -n -t $'|' > tempF3

Explanation:
I pick the desired columns from each file then sort and store them in temp files.
tempF1 : C1|C2
tempF2 : C1|C3|C4
tempF3 : C1|C3|C5

Step 2>

join -a1 tempF2 tempF1 -t $'|' > tempF2F1

Explanation:
I do a left outer join (-a1 suggests that) on tempF2 and tempF1.
By default join would happen on their first columns (its as good as mentioning -1 1 -2 1)
tempF2F1 : C1|C3|C4|C2

Now I would like to join this tempF2F1 with tempF3 on C1 and C3, but apparently joining on multiple columns is tricky.
So I would concatenate C1 and C3 into C1C3 in both these files and then join, but obviously would also want to preserve the columns C1 and C3 separately.
This will make my life easier as I would then have to join on single column.

Step 3>

awk 'BEGIN {FS="|";OFS="|"}{$1=$2 "|" $1} 1' < tempF3 > temp1F3

Explanation:
I mention the Input/Output Field Separator (FS and OFS) and copy the 2nd column($2) which is C3 to the beginning.
This step helps in preserving C3 and is the first step to get the concatenated C1C3
temp1F3 : C3|C1|C3|C5

Step 4>

awk 'BEGIN {FS="|";OFS="|"}{$1=$2 "|" $1} 1' < temp1F3 | sed 's/|//' | sort -k1 -n -t $'|' > temp2F3

Explanation:
Here I again copy the 2nd column which is C1 to the beginning.
The awk command gives us C1|C3|C1|C3|C5 and now I can concatenate the first two columns C1 and C3 as C1C3.
I use sed substitution which would substitute the first occurence of | and replace it with nothing.
I then sort based on this concatenated column C1C3 and store the result in temp2F3
temp2F3 : C1C3|C1|C3|C5

Step 5>

sed 's/|//' tempF2F1 | sort -k1 -n -t $'|' | join - temp2F3 -a1 -t $'|' | cut -f2- -d "|" | awk 'BEGIN{FS="|";OFS="|"}{temp=$1; $1=$3; $3=temp; print; }' > FinalFile.txt

Explanation:
Firstly with the sed substitution I would concatenate C1 and C3 in file tempF2F1 making it look like C1C3|C4|C2.
I then sort based on the concatenated column, this is required else join wouldnt work.
Now I join C1C3|C4|C2 and C1C3|C1|C3|C5 and get C1C3|C4|C2|C1|C3|C5
I then would remove the column C1C3 (-f2- means cut columns starting from 2nd column) so the output I get is C4|C2|C1|C3|C5
The awk command is now to reorganize the columns (not important), but as you see, it swaps C1 and C4.

FinalFile.txt : C1|C2|C4|C3|C5

I’m pretty sure this ain’t the best solution. If you have better solution please mention in comment. Thanks

Advertisements

4 responses to “unix to the rescue

  1. Good work…though might not require the whole solution but explanation on awk and SED commands is really helpful…

  2. I found a bug in my code. Remember I had mentioned C1 and C3 are numeric columns, also my entire logic of joining on multiple columns was based on first concatenating C1 and C3 into C1C3. Guess what, when C1 is 2 and C3 is 21 I got 221 and when C1 was 22 and C3 was 1 I again got 221. Classic bug. to avoid this I made the following changes:

    Step 4>

    awk ‘BEGIN {FS=”|”;OFS=”|”}{$1=$2 “|” $1} 1’ < temp1F3 | sed ‘s/|/-/’ | sort -k1 -t $’|’ > temp2F3

    Basically I’m now concatenating C1 and C3 as C1-C3, this would eliminate the bug as I would have 2 distinct records 2-21 and 22-1.

    Note I had to remove -n in sort as its no longer a number.

    Also I would have to make similar change in Step 5..

    sed ‘s/|/-/’ tempF2F1 | sort -k1 -t $’|’ | join – temp2F3 -a1 -t $’|’ | cut -f2- -d “|” | awk ‘BEGIN{FS=”|”;OFS=”|”}{temp=$1; $1=$3; $3=temp; print; }’ > FinalFile.txt

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s