Longest common prefix of two strings in bash

前端 未结 13 2311
一向
一向 2020-12-03 00:49

I have two strings. For the sake of the example they are set like this:

string1=\"test toast\"
string2=\"test test\"

What I want is to find

13条回答
  •  时光取名叫无心
    2020-12-03 01:23

    This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.

    longest_common_prefix () {
      local prefix= n
      ## Truncate the two strings to the minimum of their lengths
      if [[ ${#1} -gt ${#2} ]]; then
        set -- "${1:0:${#2}}" "$2"
      else
        set -- "$1" "${2:0:${#1}}"
      fi
      ## Binary search for the first differing character, accumulating the common prefix
      while [[ ${#1} -gt 1 ]]; do
        n=$(((${#1}+1)/2))
        if [[ ${1:0:$n} == ${2:0:$n} ]]; then
          prefix=$prefix${1:0:$n}
          set -- "${1:$n}" "${2:$n}"
        else
          set -- "${1:0:$n}" "${2:0:$n}"
        fi
      done
      ## Add the one remaining character, if common
      if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
      printf %s "$prefix"
    }
    

    The standard toolbox includes cmp to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.

    longest_common_prefix () {
      local LC_ALL=C offset prefix
      offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
      if [[ -n $offset ]]; then
        offset=${offset%,*}; offset=${offset##* }
        prefix=${1:0:$((offset-1))}
      else
        if [[ ${#1} -lt ${#2} ]]; then
          prefix=$1
        else
          prefix=$2
        fi
      fi
      printf %s "$prefix"
    }
    

    Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.

    longest_common_prefix () {
      local offset prefix LC_CTYPE="${LC_ALL:=LC_CTYPE}"
      offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32)
                                               <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
      if [[ -n $offset ]]; then
        offset=${offset%,*}; offset=${offset##* }
        prefix=${1:0:$((offset/4-1))}
      else
        if [[ ${#1} -lt ${#2} ]]; then
          prefix=$1
        else
          prefix=$2
        fi
      fi
      printf %s "$prefix"
    }
    

提交回复
热议问题