read in bash on whitespace-delimited file without empty fields collapsing

无人共我 2020-11-28 14:52

I'm trying to read a multi-line tab-separated file in bash. The format is such that empty fields are expected. Unfortunately, the shell is collapsing together field separat

5 Answers
  • 2020-11-28 15:00

    Here's a fast and simple function I use that avoids calling external programs or restricting the range of input characters. It works in bash only (I guess).

    As written, it assumes the line has at least as many fields as there are variables. To allow for more variables than fields, it needs to be modified along the lines of Charles Duffy's answer.

    # Substitute for `read -r' that doesn't merge adjacent delimiters.
    myread() {
            local input
            IFS= read -r input || return $?
            while [[ "$#" -gt 1 ]]; do
                    IFS= read -r "$1" <<< "${input%%[$IFS]*}"
                    input="${input#*[$IFS]}"
                    shift
            done
            IFS= read -r "$1" <<< "$input"
    }
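
    One way that modification might look is sketched below. The name myread_padded is hypothetical, and the sketch assumes the caller sets IFS to the delimiter for the duration of the call; variables left over after the fields run out are set to empty.

```bash
#!/bin/bash
# Hypothetical variant of myread: variables with no matching field are blanked
# instead of all receiving the remaining input.
myread_padded() {
    local input
    IFS= read -r input || return $?
    while [[ $# -gt 1 ]]; do
        case $input in
            *[$IFS]*)                       # a delimiter is still present
                printf -v "$1" '%s' "${input%%[$IFS]*}"
                input=${input#*[$IFS]}
                ;;
            *)                              # last field: consume it entirely
                printf -v "$1" '%s' "$input"
                input=
                ;;
        esac
        shift
    done
    printf -v "$1" '%s' "$input"            # the last variable gets the rest
}

# Three fields (the middle one empty), four variables.
IFS=$'\t' myread_padded one two three four <<< $'one\t\tthree'
printf '<%s> ' "$one" "$two" "$three" "$four"; printf '\n'
# prints: <one> <> <three> <>
```

    Like the original, this uses local names (here only input) that would clash with a caller's variable of the same name.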
    
  • 2020-11-28 15:05

    Here's an approach with some niceties:

    • input data from wherever becomes a pseudo-2D array in the main code (avoiding a common problem where the data is only available within one stage of a pipeline).
    • no use of awk, tr, or other external programs
    • a get/put accessor pair to hide the hairier syntax
    • works on tab-delimited lines by using parameter pattern matching instead of IFS=

    The code follows. file_data and file_input only generate input as though it came from an external command called by the script. data and cols could be passed as parameters to the get and put calls, but this script doesn't go that far.

    #!/bin/bash
    
    file_data=( $'\t\t'       $'\t\tbC'     $'\tcB\t'     $'\tdB\tdC'   \
                $'eA\t\t'     $'fA\t\tfC'   $'gA\tgB\t'   $'hA\thB\thC' )
    file_input () { printf '%s\n' "${file_data[@]}" ; }  # simulated input file
    delim=$'\t'
    
    # the IFS=$'\n' has a side-effect of skipping blank lines; acceptable:
    OIFS="$IFS" ; IFS=$'\n' ; oset="$-" ; set -f
    lines=($(file_input))                    # read the "file"
    case "$oset" in *f*) ;; *) set +f ;; esac  # re-enable globbing unless it was already off
    IFS="$OIFS" ; unset oset OIFS            # cleanup the environment mods.
    
    # the read-in data has (rows * cols) fields, with cols as the stride:
    data=()
    cols=0
    get () { local r=$1 c=$2 i ; (( i = cols * r + c )) ; printf '%s\n' "${data[$i]}" ; }
    put () { local r=$1 c=$2 i ; (( i = cols * r + c )) ; data[$i]="$3" ; }
    
    # convert the lines from input into the pseudo-2D data array:
    i=0 ; row=0 ; col=0
    for line in "${lines[@]}" ; do
        line="$line$delim"
        while [ -n "$line" ] ; do
            case "$line" in
                *${delim}*) data[$i]="${line%%${delim}*}" ; line="${line#*${delim}}" ;;
                *)          data[$i]="${line}"            ; line=                     ;;
            esac
            (( ++i ))
        done
        [ 0 = "$cols" ] && (( cols = i )) 
    done
    rows=${#lines[@]}
    
    # output the data array as a matrix, using the get accessor
    for    (( row=0 ; row < rows ; ++row )) ; do
       printf 'row %2d: ' $row
       for (( col=0 ; col < cols ; ++col )) ; do
           printf '%5s ' "$(get $row $col)"
       done
       printf '\n'
    done
    

    Output:

    $ ./tabtest 
    row  0:                   
    row  1:                bC 
    row  2:          cB       
    row  3:          dB    dC 
    row  4:    eA             
    row  5:    fA          fC 
    row  6:    gA    gB       
    row  7:    hA    hB    hC 
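
    A shorter sketch of the same parameter-matching idea, under different assumptions: it needs bash 4 or later for mapfile, which also keeps blank lines instead of skipping them. The names split_tabs and fields are hypothetical and not part of the script above.

```bash
#!/bin/bash
# Sketch: split one tab-delimited line into the global array "fields",
# keeping empty fields. split_tabs and fields are illustrative names.
split_tabs() {
    local line=$1 delim=$'\t'
    fields=()
    line+=$delim                          # sentinel so the last field is emitted too
    while [[ -n $line ]]; do
        fields+=("${line%%"$delim"*}")    # text before the first delimiter
        line=${line#*"$delim"}            # drop that field and its delimiter
    done
}

# mapfile reads every line into the current shell, not a pipeline subshell.
mapfile -t lines < <(printf '%s\n' $'a\t\tc' $'\tb\t')
for l in "${lines[@]}"; do
    split_tabs "$l"
    printf '%d fields:' "${#fields[@]}"
    printf ' <%s>' "${fields[@]}"
    printf '\n'
done
# prints:
# 3 fields: <a> <> <c>
# 3 fields: <> <b> <>
```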
    
  • 2020-11-28 15:12

    I've written a function which works around this issue. This implementation is specific to tab-separated columns and newline-separated rows, but that limitation could be removed as a straightforward exercise:

    read_tdf_line() {
        # A local IFS is restored automatically on every return path
        # (including the early return below), and an IFS that was unset
        # in the caller stays unset there.
        local IFS=$'\n'
        local line element at_end
    
        if ! read -r line ; then
            return 1
        fi
        at_end=0
        while read -r element; do
            if (( $# > 1 )); then
                printf -v "$1" '%s' "$element"
                shift
            else
                if (( at_end )) ; then
                    # replicate read behavior of assigning all excess content
                    # to the last variable given on the command line
                    printf -v "$1" '%s\t%s' "${!1}" "$element"
                else
                    printf -v "$1" '%s' "$element"
                    at_end=1
                fi
            fi
        done < <(tr '\t' '\n' <<<"$line")
    
        # if other arguments exist on the end of the line after all
        # input has been eaten, they need to be blanked
        if ! (( at_end )) ; then
            while (( $# )) ; do
                printf -v "$1" '%s' ''
                shift
            done
        fi
    }
    

    Usage as follows:

    $ read_tdf_line one two three rest <<<$'one\t\tthree\tfour\tfive'
    $ printf '<%s> ' "$one" "$two" "$three" "$rest"; printf '\n'
    <one> <> <three> <four       five>
    
  • 2020-11-28 15:13

    It's not necessary to use tr, but it is necessary that IFS is a non-whitespace character (otherwise multiples get collapsed to singles as you've seen).

    $ IFS=, read -r one two three <<<'one,,three'
    $ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
    <one> <> <three>
    
    $ var=$'one\t\tthree'
    $ var=${var//$'\t'/,}
    $ IFS=, read -r one two three <<< "$var"
    $ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
    <one> <> <three>
    
    $ idel=$'\t' odel=','
    $ var=$'one\t\tthree'
    $ var=${var//$idel/$odel}
    $ IFS=$odel read -r one two three <<< "$var"
    $ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
    <one> <> <three>
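
    The comma substitution breaks as soon as the data itself contains a comma. A possible variation, sketched here, swaps in a control character instead; it assumes the data never contains the ASCII unit separator (0x1F), and the variable names tab, sep and line are only for illustration.

```bash
#!/bin/bash
# Assumption: the input never contains the ASCII unit separator (0x1F).
tab=$'\t'
sep=$'\x1f'
line=$'one\t\tthree, with a comma'

# Replace every tab with the non-whitespace separator, then split on it.
IFS=$sep read -r one two three <<< "${line//$tab/$sep}"
printf '<%s> ' "$one" "$two" "$three"; printf '\n'
# prints: <one> <> <three, with a comma>
```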
    
  • 2020-11-28 15:16

    Sure


    printf 'one\t\tthree\n' | tr '\t' , | (
      IFS=, read -r one two three
      printf '<%s> ' "$one" "$two" "$three"; printf '\n'
    )
    

    I've rearranged the example just a bit, but only to make it work in any POSIX shell.

    Update: Yeah, it seems that white space is special, at least if it's in IFS. See the second half of this paragraph from bash(1):

       The shell treats each character of IFS as a delimiter, and  splits  the
       results of the other expansions into words on these characters.  If IFS
       is unset, or its value is exactly <space><tab><newline>,  the  default,
       then  any  sequence  of IFS characters serves to delimit words.  If IFS
       has a value other than the default, then sequences  of  the  whitespace
       characters  space  and  tab are ignored at the beginning and end of the
       word, as long as the whitespace character is in the value  of  IFS  (an
       IFS whitespace character).  Any character in IFS that is not IFS white-
       space, along with any adjacent IFS whitespace  characters,  delimits  a
       field.   A  sequence  of IFS whitespace characters is also treated as a
       delimiter.  If the value of IFS is null, no word splitting occurs.
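
    A small bash sketch of what that paragraph means for read. The variable names are arbitrary; the only difference between the two calls is whether the delimiter is an IFS whitespace character.

```bash
#!/bin/bash
# Tab is IFS whitespace: the two adjacent tabs act as one delimiter,
# so the empty middle field disappears and c is left empty.
IFS=$'\t' read -r a b c <<< $'one\t\tthree'
printf '<%s> ' "$a" "$b" "$c"; printf '\n'    # <one> <three> <>

# Comma is not IFS whitespace: each comma delimits a field,
# so the empty middle field is preserved.
IFS=, read -r x y z <<< 'one,,three'
printf '<%s> ' "$x" "$y" "$z"; printf '\n'    # <one> <> <three>
```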
    