Sort a text file by line length including spaces

故里飘歌 2020-11-27 11:21

I have a CSV file that looks like this:

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Exampl         


        
11 answers
  •  一向 (original poster)
    2020-11-27 11:38

    Here is a multibyte-compatible method of sorting lines by length. It requires:

    1. wc -m is available to you (macOS has it).
    2. Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile or by prepending it to the following command.
    3. testfile has a character encoding matching your locale (e.g., UTF-8).

    Here's the full command:

    cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
    

    Explaining part-by-part:

    • l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and escapes every ' as '"'"' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
    • cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
    • cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
    • close(cmd); ← closes the pipe to the shell command to avoid hitting the system limit on the number of open files in one process.
    • sub(/ */, "", c); ← trims leading spaces from the character count value returned by wc.
    • { print c, $0 } ← prints the line's character count value, a space, and the original line.
    • | sort -ns ← sorts the lines numerically (-n) by the prepended character count values, keeping lines of equal length in their original order (-s, stable sort).
    • | cut -d" " -f2- ← removes the prepended character count values.
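The quoting step is the least obvious part, so here is a minimal sketch of it in plain shell for a single line. The helper names shell_quote and count_chars are hypothetical and only for illustration, and printf is swapped in for echo because some shells' echo interprets backslashes. As in the awk pipeline, the count includes the trailing newline, which does not change the sort order.

```shell
#!/bin/sh
# Hypothetical helper: wrap a string in single quotes for the shell,
# replacing each embedded ' with '"'"' (close quote, quoted ', reopen quote).
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\"'\"'/g")"
}

# Hypothetical helper: count characters the way the awk pipeline does,
# by building a command string and running it in a child shell.
count_chars() {
    cmd="printf '%s\n' $(shell_quote "$1") | wc -m"
    # Some wc implementations pad the number with spaces; strip them.
    sh -c "$cmd" | tr -d ' '
}

# 11 characters plus the newline added by printf
count_chars "it's a test"   # prints 12
```

Without the quoting step, the apostrophe in the sample line would end the single-quoted string early and the generated command would fail or do something unintended.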

    It's slow (only about 160 lines per second on a fast MacBook Pro) because it must spawn a shell sub-command for every line.

    Alternatively, do this solely with gawk (multibyte-aware as of version 3.1.5), which would be significantly faster. Escaping and quoting each line so it can pass safely through a shell command from awk is a lot of trouble, but the awk and wc approach above is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
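For comparison, a gawk-only version might look like the sketch below. It assumes gawk is installed and the locale is UTF-8, so that length() counts characters rather than bytes. The function name sort_by_length is hypothetical, and the fallback to the system awk is my own addition so the sketch still runs without gawk (that awk may count bytes instead).

```shell
#!/bin/sh
# Sketch: prefix each line with its length, sort on that prefix, strip it.
# Assumes gawk in a UTF-8 locale for character-accurate lengths.
sort_by_length() {
    if command -v gawk >/dev/null 2>&1; then
        awk_bin=gawk
    else
        awk_bin=awk   # fallback: may count bytes, not characters
    fi
    # length($0) is the line length including spaces;
    # sort -ns keeps equal-length lines in their original order;
    # cut removes the length prefix and leaves the line untouched.
    "$awk_bin" '{ print length($0), $0 }' "$@" | sort -ns | cut -d" " -f2-
}
```

It reads from the named files or from standard input, e.g. sort_by_length testfile. No per-line shell command is spawned, which is where the speed-up comes from.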
