Any idea why sort utility gives me incorrect results?

Anonymous (unverified), posted 2019-12-03 01:06:02

Question:

EDIT: To be clear, we got our STDOUT from a for loop that goes something like this:

for (( i=1; i<="$FILE_AMOUNT"; i++ )); do
    MY_FILE=`find $DIR -type f | head -$i | tail -1`
    FILE_TYPE=`file -b "$MY_FILE"
    FILE_TYPE_COUNT=`echo $FILE_TYPE" | sort | uniq -c`
    echo "$FILE_TYPE_COUNT"
done

Hence our STDOUT is basically the output of the file utility printed one line at a time, rather than an actual set of strings we can copy, which is likely the root of all these issues.





So there's a pickle I absolutely can't wrap my head around.

Basically I'm creating a shell script that prints out the various file types we have in our directory. It pretty much works; however, for some odd reason, when I try to use uniq on my output, it doesn't work. This is my output:

POSIX shell script, ASCII text executable
ASCII text
Bourne-Again shell script, ASCII text executable
UTF-8 Unicode text, with overstriking
Bourne-Again shell script, ASCII text executable

Seems fairly self-explanatory, however when I use

FILE_TYPE_COUNT=`echo "$FILE_TYPE" | sort | uniq -c` 

this is the result it prints

  1 POSIX shell script, ASCII text executable
  1 ASCII text
  1 Bourne-Again shell script, ASCII text executable
  1 UTF-8 Unicode text, with overstriking
  1 Bourne-Again shell script, ASCII text executable

Obviously it should be

  1 POSIX shell script, ASCII text executable
  1 ASCII text
  2 Bourne-Again shell script, ASCII text executable
  1 UTF-8 Unicode text, with overstriking

Any idea what I'm doing wrong?

Apparently uniq thinks the lines are different, which I assume is the fault of sort, because it can't sort my STDOUT. Any clue how to sort the list properly, alphabetically?

Answer 1:

Your approach seems overly complicated. Try this:

find $DIR -type f -exec file -b -- {} \; | sort | uniq -c 

If you're not familiar with -exec, it executes the given command, in our case file -b -- {}, once per file. The placeholder {} is replaced with the path of the file currently being processed.
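To see what -exec actually builds, here is a small sketch that prints the command for each file instead of running it. The directory and file names are made up for the demonstration, and echo stands in for the real invocation so the output does not depend on the file utility.

```shell
# Throwaway directory with two empty files (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/one.txt" "$demo_dir/two.sh"

# Print, rather than run, the command line that -exec builds for each file.
commands=$(find "$demo_dir" -type f -exec echo file -b -- {} \; | sort)
printf '%s\n' "$commands"

rm -rf "$demo_dir"
```

Each printed line is one invocation, with {} already replaced by a path. POSIX find also accepts + in place of \;, which passes many paths to a single invocation and so starts fewer processes.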

Why your approach doesn't work:

You run echo "$FILE_TYPE" | sort | uniq -c inside the for loop, where $FILE_TYPE contains only the type of a single file. You need to move sort | uniq -c out of the loop.
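The difference is easy to reproduce without touching the filesystem. In the sketch below, the types variable is hypothetical sample data standing in for the output of file -b; the two pipelines differ only in where sort | uniq -c sits.

```shell
# Hypothetical sample data standing in for the output of `file -b`.
types='Bourne-Again shell script, ASCII text executable
ASCII text
Bourne-Again shell script, ASCII text executable'

# Counting inside the loop: each sort | uniq -c sees exactly one line.
inside=$(printf '%s\n' "$types" | while IFS= read -r t; do
    printf '%s\n' "$t" | sort | uniq -c
done)

# Counting after the loop: a single sort | uniq -c sees every line.
outside=$(printf '%s\n' "$types" | sort | uniq -c)

printf 'inside the loop:\n%s\n\nafter the loop:\n%s\n' "$inside" "$outside"
```

The first listing reports a count of 1 for every line, duplicates included, while the second merges the two Bourne-Again lines into one entry with a count of 2.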

I adjusted your code so it works:

declare -a TYPES=()
for (( i=1; i<="$FILE_AMOUNT"; i++ )); do
    MY_FILE=`find $DIR -type f | head -$i | tail -1`
    FILE_TYPE=`file -b "$MY_FILE"`
    TYPES+=("$FILE_TYPE") # add type of current file to TYPES array
done

# TYPES now contains the types of all files and we can now count them
printf "%s\n" "${TYPES[@]}" | sort | uniq -c


Answer 2:

The issue you are seeing is because you are sorting a set of one item, for every iteration of the loop.

You'd need to sort the whole output of the loop instead.

Your (syntactically fixed) script:

for (( i=1; i<="$FILE_AMOUNT"; i++ )); do
    MY_FILE=`find $DIR -type f | head -$i | tail -1`
    FILE_TYPE=`file -b "$MY_FILE"`
    FILE_TYPE_COUNT=`echo "$FILE_TYPE" | sort | uniq -c`
    echo "$FILE_TYPE_COUNT"
done

Modified to work properly:

for (( i=1; i<="$FILE_AMOUNT"; i++ )); do
    MY_FILE=`find $DIR -type f | head -$i | tail -1`
    file -b "$MY_FILE"
done | sort | uniq -c

Optimised once:

for FILE in $(find $DIR -type f); do
    file -b "$FILE"
done | sort | uniq -c

Optimised twice (See @P. Gerber's Answer):

find $DIR -type f -exec file -b -- {} \; | sort | uniq -c 

Your original script is horrifically inefficient.

Notes on efficiency & operation:

  • ${FILE_AMOUNT} has to be correct to iterate over the whole dataset
  • On every iteration you run find, which returns the whole dataset, and then discard everything except the one line you are interested in
  • You run sort and uniq on every iteration, on a dataset of size one
  • Because you constantly re-compute your dataset, if it changes halfway through your script's execution (e.g. a file or directory is created or deleted), your results become invalid
  • Remember that every time you start a new program you pay a performance penalty; this is exacerbated by the fact that you continually compute your dataset and then discard everything that you don't want
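If a loop is still wanted, one way to address these notes is to run find once and iterate over a saved snapshot. This is only a sketch: the directory, the file names, and the snapshot temp file are invented for the demonstration, and it assumes no file name contains a newline.

```shell
# Throwaway directory with three empty files (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b c" "$demo_dir/d"

# Run find exactly once and save the result as a snapshot.
snapshot=$(mktemp)
find "$demo_dir" -type f > "$snapshot"

# Iterate over the snapshot; IFS= read -r keeps names with spaces intact.
names=$(while IFS= read -r path; do
    printf '%s\n' "${path##*/}"
done < "$snapshot" | sort)

# The count comes from the snapshot itself, so it cannot drift out of sync.
count=$(wc -l < "$snapshot" | tr -d ' ')
printf '%s\nfiles found: %s\n' "$names" "$count"

rm -rf "$demo_dir" "$snapshot"
```

Because the list is computed once, files created or deleted while the loop runs do not shift which entry each iteration sees, and no separately maintained count is needed.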


Answer 3:

In addition to the other good solutions here, be sure to understand the sorting rule set that you are using. To inspect your current sorting rule, you can do:

echo anything | sort --debug 

to see your results with annotations. Consider:

echo -e "a 2\na1" | sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
a1
__
a 2
___

Note that the rule set is sorting with perhaps an unexpected result. If you're looking for a simple byte comparison, then be sure to set LC_ALL=C as in:

LC_ALL=C sort 

For example:

echo -e "a 2\na1" | LC_ALL=C sort --debug
sort: using simple byte comparison
a 2
___
a1
__

The use of LC_ALL is important in getting the results you expect. Lastly, run the locale command and read the man page to get locale-specific information.
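As a small sketch of what byte comparison means in practice, the snippet below pins LC_ALL=C for two inputs. The variable names are only for illustration, and printf is used in place of echo -e because it behaves the same across shells.

```shell
# The same two lines as in the example above, sorted by byte value.
c_sorted=$(printf 'a 2\na1\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"

# Byte order also places uppercase ASCII letters before lowercase ones.
case_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$case_sorted"
```

In the C locale the space (byte 0x20) sorts before the digit 1 (byte 0x31), so "a 2" comes first; likewise A and B sort before a and b.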


