How to access the prefix when using uniq -c

Question

I ran into a problem in my program. I have a list of files, and I use this pipeline to find the 10 most frequent file types in the list.

find $DIR -type f | file -b $SAVEFILES | cut -c1-40 | sort -n | uniq -c | sort -nr | head -10

My output looks like this

    168 HTML document, ASCII text
    114 C source, ASCII text
    102 ASCII text
     33 ASCII text, with very long lines
     30 HTML document, UTF-8 Unicode text, with 
     26 HTML document, ASCII text, with very lon
     21 C source, UTF-8 Unicode text
     20 LaTeX document, UTF-8 Unicode text, with
     15 SVG Scalable Vector Graphics image
     12 LaTeX document, ASCII text, with very lo

What I want to do is access the counts in front of the file types and replace each of them with that many # characters. I can do that with a for loop, but first I have to access them somehow.

The expected output is something like this:

   __HTML document, ASCII text               : ################
   __C source, ASCII text                    : ###########
   __ASCII text                              : ##########
   __ASCII text, with very long lines        : ########
   __HTML document, UTF-8 Unicode text, with : #######
   __HTML document, ASCII text, with very lon: ####
   __C source, UTF-8 Unicode text            : #### 
   __LaTeX document, UTF-8 Unicode text, with: ###
   __SVG Scalable Vector Graphics image      : #
   __LaTeX document, ASCII text, with very lo: #

EDIT: The # characters in my example do not represent the exact numbers. The first line should have 168 #, the second 114 #, and so on.

Answer 1:

Append this:

| while read -r n text; do printf "__%s%$((48-${#text}))s: " "$text"; for ((i=0;i<$n;i++)); do printf "%s" "#"; done; echo; done

Change 48 according to your needs.

Output with your input:

__HTML document, ASCII text                       : ########################################################################################################################################################################
__C source, ASCII text                            : ##################################################################################################################
__ASCII text                                      : ######################################################################################################
__ASCII text, with very long lines                : #################################
__HTML document, UTF-8 Unicode text, with         : ##############################
__HTML document, ASCII text, with very lon        : ##########################
__C source, UTF-8 Unicode text                    : #####################
__LaTeX document, UTF-8 Unicode text, with        : ####################
__SVG Scalable Vector Graphics image              : ###############
__LaTeX document, ASCII text, with very lo        : ############
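
The one-liner above works because `read -r n text` strips the leading blanks that `uniq -c` emits, puts the first word (the count) in `n`, and leaves the rest of the line in `text`. A minimal sketch of the same idea, spread over several lines and using only POSIX shell features, might look like this. The function name `bars` and its width argument are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn `uniq -c` style lines ("    168 some text")
# into "__some text    : ###..." bars. $1 is the label width (default 40).
bars() {
    width=${1:-40}
    while read -r n text; do
        # n holds the count, text holds everything after it.
        # Print n spaces, then turn each space into a '#'.
        bar=$(printf "%${n}s" '' | tr ' ' '#')
        printf "__%-${width}s: %s\n" "$text" "$bar"
    done
}

# Small demo with made-up counts.
printf '%s\n' '      3 ASCII text' '      1 C source, ASCII text' | bars 20
```

In the original pipeline this would be appended as `... | head -10 | bars 40`.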

Answer 2:

A shell loop is never the right way to manipulate text; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

You can do what you asked for with this awk command:

$ awk '{printf "%-40s: %s\n", substr($0,9), gensub(/ /,"#","g",sprintf("%*s",$1,""))}' file
HTML document, ASCII text               : ########################################################################################################################################################################
C source, ASCII text                    : ##################################################################################################################
ASCII text                              : ######################################################################################################
ASCII text, with very long lines        : #################################
HTML document, UTF-8 Unicode text, with : ##############################
HTML document, ASCII text, with very lon: ##########################
C source, UTF-8 Unicode text            : #####################
LaTeX document, UTF-8 Unicode text, with: ####################
SVG Scalable Vector Graphics image      : ###############
LaTeX document, ASCII text, with very lo: ############

But the right way to do this is to get rid of everything from cut onward and just do something like this:

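find "$DIR" -type f | file -b "$SAVEFILES" |
awk '
# Count each file type, truncated to its first 40 characters.
{ types[substr($0,1,40)]++ }
END {
    # Visit the array in descending order of count (GNU awk only).
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (type in types) {
        printf "%-*s: %s\n", 40, type, gensub(/ /,"#","g",sprintf("%*s",types[type],""))
        # Stop after the 10 most frequent types.
        if (++cnt == 10) {
            break
        }
    }
}
'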

The above use GNU awk for sorted_in and gensub(), and the second one is untested, since you only provided sample input for the last part (printing the "#"s).

Answer 3:
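
If GNU awk is not available, the same counting can be sketched with POSIX awk plus `sort` and `head`, building each bar with a plain loop instead of gensub(). This is a swapped-in variant, not the answer's own code. The function name `top_types` is hypothetical, and the sketch assumes the type descriptions contain no tab characters.

```shell
#!/bin/sh
# Hypothetical portable variant: no gensub(), no PROCINFO["sorted_in"].
# Reads one file-type description per line on stdin.
top_types() {
    # 1) count each description (first 40 characters), emit "count<TAB>type"
    awk '{ types[substr($0, 1, 40)]++ }
         END { for (t in types) printf "%d\t%s\n", types[t], t }' |
    # 2) sort by count, largest first, and keep the top 10
    sort -t "$(printf '\t')" -k1,1nr |
    head -n 10 |
    # 3) draw one "#" per occurrence
    awk -F '\t' '{
        bar = ""
        for (i = 0; i < $1; i++) bar = bar "#"
        printf "%-40s: %s\n", $2, bar
    }'
}
```

It would be fed the output of `file -b`, for example `file -b $SAVEFILES | top_types`.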

The Perl approach: append this to your pipeline:

| perl -lpE 's/\s*(\d+)\s(.*)/sprintf "__%-40s: %s", $2, "#"x$1/e'

Output:

__HTML document, ASCII text               : ########################################################################################################################################################################
__C source, ASCII text                    : ##################################################################################################################
__ASCII text                              : ######################################################################################################
__ASCII text, with very long lines        : #################################
__HTML document, UTF-8 Unicode text, with : ##############################
__HTML document, ASCII text, with very lon: ##########################
__C source, UTF-8 Unicode text            : #####################
__LaTeX document, UTF-8 Unicode text, with: ####################
__SVG Scalable Vector Graphics image      : ###############
__LaTeX document, ASCII text, with very lo: ############

Following @Ed's approach, but using Perl:

find "$DIR" -type f | file -b "$SAVEFILES" |\
  perl -lnE '$s{substr$_,0,40}++;}{printf"__%-40s: %s\n",$_,"#"x$s{$_}for(splice@{[sort{$s{$b}<=>$s{$a}}keys%s]},0,10)'

A more readable version:

perl -lnE '
$seen{ substr $_,0,40 }++;
END {
   printf"__%-40s: %s\n", $_, "#" x $seen{$_}
      for( splice @{[sort { $seen{$b} <=> $seen{$a} } keys %seen]},0,10 )
}'

P.S. Note that the file utility only tests the files named in $SAVEFILES and ignores its standard input, so the find ... | part of find ... | file -b $SAVEFILES is pointless.
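
If the intent was to classify the files that find locates, one way to sketch that is to let find pass the names to file directly with `-exec ... {} +`. The function name `count_types` is hypothetical, and the rest of the pipeline is taken from the question (with a plain `sort` before `uniq -c`).

```shell
#!/bin/sh
# Hypothetical helper: print the 10 most frequent file types under directory $1.
count_types() {
    # find runs file on the names it finds, so nothing is piped into file.
    find "$1" -type f -exec file -b {} + |
        cut -c1-40 | sort | uniq -c | sort -nr | head -n 10
}
```

Calling `count_types "$DIR"` should then produce the same count-prefixed lines that the answers above turn into bars.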

Source: https://stackoverflow.com/questions/43028856/how-to-access-the-prefix-when-using-uniq-c

Tags

dash-shell