I want to find the maximum number in a file, where the numbers are integers that can occur anywhere in the file.
I thought about doing the following:
In awk you can say:
awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
In my experience, awk is the fastest text-processing tool for most tasks; the only things of comparable speed I have seen (on Linux systems) are programs written in C/C++.
In the code above, using a minimal set of functions and commands allows for faster execution.
for(i=1;i<=NF;i++) - Loops through the fields on the line. Using the default FS/RS and looping
this way is usually faster than using custom ones, as awk is optimised
for the defaults.
if(int($i)) - Checks that the field is not equal to zero; since int() evaluates
strings to zero, the next block is skipped when the field is a
string. I believe this is the quickest way to perform the check.
{a[$i]=$i} - Sets an array element with the number as both key and value. This
means there are only as many array elements as there are distinct
numbers in the file, which should be quicker than comparing every
number.
END{x=asort(a)} - At the end of the file, sorts the array with asort() and stores
the size of the array in x.
print a[x] - Prints the last (largest) element of the array.
Mine:
time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
took
real 0m0.434s
user 0m0.357s
sys 0m0.008s
hek2mgl's:
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file
took
real 0m1.256s
user 0m1.134s
sys 0m0.019s
For those wondering why it is faster: it uses the default FS and RS, which awk is optimised for.
Changing
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'
to
awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'
provides the time
real 0m0.574s
user 0m0.497s
sys 0m0.011s
Which is still a little slower than my command.
I believe the slight difference that remains is because asort()
only has to work on around 6 numbers, as each number is saved only once in the array.
In comparison, the other command performs a comparison on every single number in the file, which is more computationally expensive.
I think they would be around the same speed if all the numbers in the file were unique.
Tom Fenech's:
time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile
real 0m0.716s
user 0m0.612s
sys 0m0.013s
A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
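This is easy to reproduce with a small all-negative file (neg.txt is just an illustration):

```shell
# Every number is below zero
printf 'a -5 b -3\n' > neg.txt

# max is never assigned: each record compares below the initial empty/zero value
awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' neg.txt
# prints a blank line
```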
Glenn Jackman's:
time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file
real 0m1.492s
user 0m1.258s
sys 0m0.022s
and
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
real 0m0.790s
user 0m0.686s
sys 0m0.034s
The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g'
is that it is the only answer here that works if 0 is the largest number in the file, and it also works if all the numbers are negative.
All times are the average of 3 runs.