I want to be able to extract two different sequences from one line

若如初见. Submitted on 2019-12-25 01:44:17

Question


I want to be able to extract two different sequences from one line.

For example:

atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag

I want to create a loop in which the program reads from the first atg to tag and writes that sequence to a file, then reads from the second atg to tag and writes that sequence to the same file.

Output I want:

atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag
atg ttg tca aat tca tgg atc tag

How can I go about this?

Thank you for the help.


Answer 1:


Would you please try the following:

str="atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag"
start="atg"    # start marker of the sequence
end="tag"      # end marker of the sequence

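index_s=(); index_e=()    # positions of the start and end markers
read -r -a ary <<< "$str"
for (( i=0; i<${#ary[@]}; i++ )); do
    # quote the right-hand side so it is compared literally, not as a glob pattern
    if [[ ${ary[$i]} = "$start" ]]; then
        index_s+=("$i")
    elif [[ ${ary[$i]} = "$end" ]]; then
        index_e+=("$i")
    fi
done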

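# first sequence: from the first start marker to the first end marker
s=${index_s[0]}; n=$(( ${index_e[0]} - ${index_s[0]} + 1 ))
echo "${ary[@]:$s:$n}" > "result.txt"
# second sequence: from the second start marker to the same end marker
s=${index_s[1]}; n=$(( ${index_e[0]} - ${index_s[1]} + 1 ))
echo "${ary[@]:$s:$n}" >> "result.txt"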

Result:

atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag
atg ttg tca aat tca tgg atc tag

[How it works]

  • read -r -a ary <<< "$str" splits $str on whitespace (IFS) and stores the elements in an array ary.
  • The for loop then iterates over the array elements, looking for the start and end markers.
  • Each time the start marker atg is found, its position is appended to the array index_s. After the loop, ${index_s[0]} holds the position of the first start marker, ${index_s[1]} holds the second one, and so on. The end marker tag is handled the same way with index_e.
  • Finally the script outputs two array slices. One starts at the first atg and ends at the first tag. The other starts at the second atg and ends at the same (first) tag, because the input contains only one tag.
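
The steps above handle exactly two start markers. As a rough sketch of how the same idea might be generalized to any number of them, the function below prints one line per start marker, running to the first end marker that follows it. The name extract_sequences and its three arguments are hypothetical and not part of the original answer.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer):
# usage: extract_sequences "<line>" <start marker> <end marker>
extract_sequences() {
    local str=$1 start=$2 end=$3
    local -a ary=() index_s=() index_e=()
    local i s e n

    # Split the line on whitespace and record the marker positions.
    read -r -a ary <<< "$str"
    for (( i = 0; i < ${#ary[@]}; i++ )); do
        if [[ ${ary[i]} == "$start" ]]; then
            index_s+=("$i")
        elif [[ ${ary[i]} == "$end" ]]; then
            index_e+=("$i")
        fi
    done

    # For each start marker, print the slice up to the first end marker after it.
    for s in "${index_s[@]}"; do
        for e in "${index_e[@]}"; do
            if (( e > s )); then
                n=$(( e - s + 1 ))
                echo "${ary[@]:$s:$n}"
                break
            fi
        done
    done
}

extract_sequences "atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag" atg tag
```

For the sample line this should print the same two lines shown under Result; redirect the output with > result.txt to write it to a file.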

Hope this helps.




Answer 2:


If you need at most two sequences, you can run grep over both the original string and a copy with everything up to the first atg removed:

s='atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag'
printf "%s\n" "$s" "${s#*atg}" | grep -Eo "atg.*tag"

To extract more than two substrings when they are available, you need a loop:

s='atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag'
while [ "$s" ]; do
   s=$(grep -Eo "atg.*tag" <<< "$s")
   if [ "$s" ]; then
      echo "$s"
      s="${s#atg}"
   fi
done
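
The loop above starts one grep process per iteration. A possible variant that relies only on bash parameter expansion is sketched below; it is an alternative technique, not the answer's method. The function name extract_greedy is hypothetical, and the sketch assumes the two markers cannot overlap, which holds for atg and tag.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer):
# usage: extract_greedy "<line>" <start marker> <end marker>
# Prints, for every start marker, the text from it to the last end marker.
extract_greedy() {
    local s=$1 start=$2 end=$3

    # Nothing to print unless the line contains an end marker.
    [[ $s == *"$end"* ]] || return 0

    # Drop everything after the last end marker (like the greedy .*tag).
    s=${s%"$end"*}$end

    while [[ $s == *"$start"* ]]; do
        s=$start${s#*"$start"}   # drop everything before the first start marker
        echo "$s"
        s=${s#"$start"}          # step past this start marker
    done
}

extract_greedy "atg ttg tca aat tca tgg atc atg ttg tca aat tca tgg atc tag" atg tag
```

Like the grep version, this matches the markers as plain substrings rather than as whole whitespace-separated words.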


Source: https://stackoverflow.com/questions/58295188/i-want-to-be-able-to-extract-two-different-sequences-from-one-line
