Monday, August 13, 2007

Diffing files in a different way

The diff utility compares the contents of file1 and file2 and writes to standard output the list of changes needed to convert file1 into file2. diff may not be the most convenient tool when all you want to find out is which lines in file1 are not present in file2, and vice versa.

Let's say you have an expected-results output file and a current-results output file:

expected-output.txt:
12 something some blah
15 ok ok and not ok
14 someone at somewhere
20 and many more such records

current-output.txt:
15 ok ok and not ok
13 this is not present in expected output
20 and many more such records

One quick way to do this is to use the power of Unix pipes together with the sed, sort and uniq commands.

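# Tag every line with the name of the file it came from.
sed 's/^/expected-output.txt /' expected-output.txt > mixed.txt
sed 's/^/current-output.txt /' current-output.txt >> mixed.txt
# Sort on everything after the tag (sort -k 2 replaces the obsolete "sort +1"),
# then keep only the lines whose untagged text is not repeated.
sort -k 2 mixed.txt | uniq -u -f 1 | sort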

I still need to test this on files with a really large number of records, but for now I am satisfied with the above solution.
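For reference, here is the same pipeline run end to end on the two sample files. Treat it as a minimal sketch: the scratch directory and the `unique_lines` function name are made up for illustration, `sort -k 2` is used as the current spelling of `sort +1`, and `LC_ALL=C` is set only to make the sort order predictable.

```shell
#!/bin/sh
export LC_ALL=C                      # predictable byte-wise sort order

workdir=$(mktemp -d)                 # hypothetical scratch directory
cd "$workdir" || exit 1

cat > expected-output.txt <<'EOF'
12 something some blah
15 ok ok and not ok
14 someone at somewhere
20 and many more such records
EOF

cat > current-output.txt <<'EOF'
15 ok ok and not ok
13 this is not present in expected output
20 and many more such records
EOF

# unique_lines FILE1 FILE2 (hypothetical helper): print the lines found in
# only one of the two files, each prefixed with the file it came from.
# Assumes the file names contain no "|" or "&", which sed would misread.
unique_lines() {
    {
        sed "s|^|$1 |" "$1"
        sed "s|^|$2 |" "$2"
    } | sort -k 2 | uniq -u -f 1 | sort
}

result=$(unique_lines expected-output.txt current-output.txt)
printf '%s\n' "$result"

cd / && rm -rf "$workdir"            # clean up the scratch directory
```

With the sample data this prints the record 13 line tagged with current-output.txt, followed by the record 12 and record 14 lines tagged with expected-output.txt. One caveat: `uniq -u` drops every repeated line, so a record that occurs twice in the same file and never in the other will go unreported.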

Saturday, August 11, 2007

Don't take my stdin Mr. rsh

My colleague came to me and said, "I have a script which reads a file line by line and does some processing for each line, but strangely it processes only the first line and exits after that." I said, "OK, I guess there must be something in your processing part that is eating all your stdin, or an exit command triggered by some condition. Let me see your script."

Just to give you an idea, his script was something like this:

while read line
do
   # blah blah
   rsh <remote_server_name> ls /home/someone/something.txt
   # blah blah
done < input.txt

While I was looking at his script he said, "When I comment out this part, it processes all the lines, and when I uncomment it, it processes only the first line." OK, so what is in that commented-out part? Bingo! There is an rsh command. So rsh seems to be the culprit.

After reading more about rsh, I found that rsh cannot tell whether the remote program it is about to run needs to read from stdin or not. Therefore rsh copies its stdin across the network to the remote program, absorbing ALL of the input (i.e. it reads until EOF to know when to stop). Inside the loop, rsh inherits the loop's stdin, which is input.txt, so after the first iteration there is nothing left for read to pick up.
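The effect is easy to reproduce without a remote host. In the sketch below, `cat > /dev/null` is only a stand-in for rsh (an assumption made for illustration: like rsh here, it reads its stdin until EOF), and the temporary file and the `count` variable are hypothetical.

```shell
#!/bin/sh
input=$(mktemp)                      # hypothetical stand-in for input.txt
printf 'one\ntwo\nthree\n' > "$input"

count=0
while read -r line
do
    count=$((count + 1))
    # Stand-in for rsh: inherits the loop's stdin and drains it to EOF.
    cat > /dev/null
done < "$input"

echo "lines processed: $count"       # the loop body runs once, not three times
rm -f "$input"
```

Any command in the loop body that reads stdin behaves the same way, which is why commenting out the rsh line made the loop process every line again.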

"OK, so you would need to store your stdin (which is mapped to input.txt) before executing rsh and restore it after that. Let us try something like this.", I said.

exec 4<&0
while read line
do
   # blah blah
   exec 5<&0
   exec 0<&4
   rsh <remote_server_name> ls /home/someone/something.txt
   exec 4<&0
   exec 0<&5
   # blah blah
done < input.txt

Somehow it didn't work, and I am not sure why. Maybe I will update this post once I find the reason. Then I thought, why don't we map /dev/null as stdin and let Mr. rsh eat that? And bingo! It worked.

exec 4</dev/null
while read line
do
   # blah blah
   exec 5<&0
   exec 0<&4
   rsh <remote_server_name> ls /home/someone/something.txt
   exec 4</dev/null
   exec 0<&5
   # blah blah
done < input.txt
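A shorter variant of the same idea is to redirect stdin for that one command only, as in `rsh <remote_server_name> ls /home/someone/something.txt < /dev/null`; rsh also has a `-n` option that redirects its input from /dev/null. This is a minimal sketch rather than what we actually ran: `cat > /dev/null` again stands in for rsh so the example works without a remote host, and the temporary file and `count` variable are hypothetical.

```shell
#!/bin/sh
input=$(mktemp)                      # hypothetical stand-in for input.txt
printf 'one\ntwo\nthree\n' > "$input"

count=0
while read -r line
do
    count=$((count + 1))
    # Stand-in for: rsh <remote_server_name> ls ... < /dev/null
    # The per-command redirection leaves the loop's own stdin untouched.
    cat < /dev/null > /dev/null
done < "$input"

echo "lines processed: $count"       # all three lines are processed
rm -f "$input"
```

Because the redirection applies only to that command, there are no descriptors to save and restore around it.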

My colleague was happy, because he got one more thing to mention in his weekly report :). I was happy too, for having helped and especially for learning something new in the process.

We should always help others; there is no better way to learn than that.