bash routine to return the page number of a given line number from text file

Consider a plain text file containing the page-breaking ASCII control character "Form Feed" ($'\f'):

alpha\n
beta\n
gamma\n\f
one\n
two\n
three\n
four\n
five\n\f
earth\n
wind\n
fire\n
water\n\f

Note that each page has a random number of lines.
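For anyone who wants to reproduce the sample, here is a minimal sketch that writes it to disk. The file name sample.txt is an assumption, not something from the original data.

```shell
# Write the three-page sample; every \f is a form feed, placed as in the listing above.
# "sample.txt" is an assumed file name.
printf 'alpha\nbeta\ngamma\n\fone\ntwo\nthree\nfour\nfive\n\fearth\nwind\nfire\nwater\n\f' > sample.txt

# Dump the bytes so the otherwise invisible form feeds show up as \f.
od -c sample.txt
```

The file contains 12 newline-terminated lines and 3 form feeds, the last of which sits at the very end of the file.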

I need a bash routine that returns the page number of a given line number in a text file containing the page-breaking ASCII control character.

After a long time researching the solution I finally came across this piece of code:

function get_page_from_line
{
    local nline="$1"
    local input_file="$2"

    local npag=0
    local ln=0
    local total=0

    while IFS= read -d $'\f' -r page; do

        npag=$(( ++npag ))

        ln=$(echo -n "$page" | wc -l)

        total=$(( total + ln ))

        if [ $total -ge $nline ]; then
            echo "${npag}"
            return
        fi

    done < "$input_file"

    echo "0"

    return
}

But, unfortunately, this solution proved to be very slow in some cases.

Is there a better solution?

Thanks!

The idea to use read -d $'\f' and then to count the lines is good.

This version might not look elegant: if nline is less than or equal to the number of lines in the file, the file is read twice (once by wc, then again by head).

Give it a try, because it is super fast:

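function get_page_from_line ()
{
    local nline="${1}"
    local input_file="${2}"
    # Print 0 when the requested line lies beyond the end of the file.
    if [[ $(wc -l "${input_file}" | awk '{print $1}') -lt nline ]] ; then
        printf "0\n"
    else
        # Count the lines that start with a form feed among the first nline
        # lines; the page number is that count plus one.
        printf "%d\n" $(( $(head -n "${nline}" "${input_file}" | grep -c "^"$'\f') + 1 ))
    fi
}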
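To see why counting form feeds in the first nline lines gives the page, here is a small trace of that one step on the sample from the question. The helper name ff_count and the temporary file are assumptions made for this illustration only.

```shell
# Rebuild the question's sample in a temporary file.
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n\fone\ntwo\nthree\nfour\nfive\n\fearth\nwind\nfire\nwater\n\f' > "$sample"

# ff_count (hypothetical name): how many of the first N lines start with a form feed.
ff_count() {
    head -n "$1" "$2" | grep -c $'^\f'
}

for n in 3 4 9; do
    c=$(ff_count "$n" "$sample")
    printf 'line %d: %d form feed(s) so far, page %d\n' "$n" "$c" "$(( c + 1 ))"
done
rm -f "$sample"
```

Line 3 has no form feed before it (page 1), line 4 starts with the first one (page 2), and line 9 starts with the second (page 3).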

awk performs better than the bash version above; it was designed for this kind of text processing.

Give this tested version a try:

function get_page_from_line ()
{
  awk -v nline="${1}" '
    BEGIN {
      npag=1;
    }
    {
      if (index($0,"\f")>0) {
        npag++;
      }
      if (NR==nline) {
        print npag;
        linefound=1;
        exit;
      }
    }
    END {
      if (!linefound) {
        print 0;
      }
    }' "${2}"
}

When \f is encountered, the page number is increased.

NR is the current line number.
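A quick way to check the behaviour is to run the same logic over the question's sample. The sketch below condenses the awk program into pattern-action rules; the function name page_of_line and the temporary file are assumptions for this example.

```shell
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n\fone\ntwo\nthree\nfour\nfive\n\fearth\nwind\nfire\nwater\n\f' > "$sample"

# page_of_line (hypothetical name): same idea as the function above, written compactly.
page_of_line() {
    awk -v nline="$1" '
        /\f/        { breaks++ }                        # a form feed starts a new page
        NR == nline { print breaks + 1; found = 1; exit }
        END         { if (!found) print 0 }             # line number past end of file
    ' "$2"
}

for n in 1 3 4 8 9 12 20; do
    printf 'line %2d -> page %s\n' "$n" "$(page_of_line "$n" "$sample")"
done
rm -f "$sample"
```

On this sample the loop reports pages 1, 1, 2, 2, 3, 3 and finally 0 for line 20, which does not exist.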

----

For the record, here is another bash version.

This version uses only built-in commands to count the lines in the current page.

The speedtest.sh you provided in the comments showed it slightly ahead (by about 20 seconds), which makes it roughly equivalent to your version:

function get_page_from_line ()
{
    local nline="$1"
    local input_file="$2"

    local npag=0
    local total=0

    while IFS= read -d $'\f' -r page; do
        npag=$(( npag + 1 ))
        IFS=$'\n'
        for line in ${page}
        do
            total=$(( total + 1 ))
            if [[ total -eq nline ]] ; then
                printf "%d\n" ${npag}
                unset IFS
                return
            fi
        done
        unset IFS
    done < "$input_file"
    printf "0\n"
    return
}

awk to the rescue!

awk -v RS='\f' -v n=09 '$0~"^"n"." || $0~"\n"n"." {print NR}' file

3

Updated the anchoring as suggested in the comments.

$ for i in $(seq -w 12); do awk -v RS='\f' -v n="$i" \
      '$0~"^"n"." || $0~"\n"n"." {print n,"->",NR}' file; done

01 -> 1
02 -> 1
03 -> 1
04 -> 2
05 -> 2
06 -> 2
07 -> 2
08 -> 2
09 -> 3
10 -> 3
11 -> 3
12 -> 3
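The input file for this answer is not shown. Judging from the pattern, it seems to expect every line to begin with a zero-padded line number followed by one more character. The sketch below builds such a file purely as an assumption (the "NN. text" layout, the temporary file and the name page_of_label are all made up here) and runs the same one-liner against it.

```shell
# Assumed layout: each line starts with "NN." and form feeds precede lines 04 and 09.
numbered=$(mktemp)
printf '01. alpha\n02. beta\n03. gamma\n\f04. one\n05. two\n06. three\n07. four\n08. five\n\f09. earth\n10. wind\n11. fire\n12. water\n' > "$numbered"

# page_of_label (hypothetical name) wraps the one-liner: with RS set to form feed,
# each record is a page and NR is the page number.
page_of_label() {
    awk -v RS='\f' -v n="$1" '$0 ~ "^" n "." || $0 ~ "\n" n "." { print NR }' "$2"
}

page_of_label 09 "$numbered"
rm -f "$numbered"
```

Note that this looks up the label printed on the line rather than counting lines, so it only helps when the file really carries such labels.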

A script of similar length can be written in bash itself to locate and respond to the <form-feed> characters embedded in a file. (It will work in a POSIX shell as well, with a substitute for the string indexing and expr for the math.) For example:

#!/bin/bash

declare -i ln=1     ## line count
declare -i pg=1     ## page count

fname="${1:-/dev/stdin}"            ## read from file or stdin

printf "\nln:pg  text\n"            ## print header

while IFS= read -r l; do            ## read each line
    if [ "${l:0:1}" = $'\f' ]; then ## if form-feed found
        ((pg++))
        printf "<ff>\n%2s:%2s  '%s'\n" "$ln" "$pg" "${l:1}"
    else
        printf "%2s:%2s  '%s'\n" "$ln" "$pg" "$l"
    fi
    ((ln++))
done < "$fname"

Example Input File

The simple input file with embedded <form-feed>'s was created with:

$ echo -e "a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl" > dat/affex.txt

Which when output gives:

$ cat dat/affex.txt
a
b
c

d
e
f
g
h

i
j
k
l

Example Use/Output

$ bash affex.sh <dat/affex.txt

ln:pg  text
 1: 1  'a'
 2: 1  'b'
 3: 1  'c'
<ff>
 4: 2  'd'
 5: 2  'e'
 6: 2  'f'
 7: 2  'g'
 8: 2  'h'
<ff>
 9: 3  'i'
10: 3  'j'
11: 3  'k'
12: 3  'l'
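The same loop can be cut down to answer the original question directly, printing only the page for one line number. This is a minimal sketch, assuming form feeds appear only at the start of a line as in the example file; page_of_line is a hypothetical name and the data is written to a temporary file.

```shell
# page_of_line LINE FILE (hypothetical helper): print the page holding LINE, or 0.
page_of_line() {
    local -i target=$1 ln=0 pg=1
    local l
    # "|| [ -n "$l" ]" also handles a last line that lacks a trailing newline.
    while IFS= read -r l || [ -n "$l" ]; do
        ln+=1
        if [ "${l:0:1}" = $'\f' ]; then   # a leading form feed starts a new page
            pg+=1
        fi
        if (( ln == target )); then
            printf '%d\n' "$pg"
            return 0
        fi
    done < "$2"
    printf '0\n'                          # LINE is past the end of the file
}

data=$(mktemp)
printf 'a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl\n' > "$data"
page_of_line 4 "$data"
page_of_line 12 "$data"
rm -f "$data"
```

With the example data this prints 2 and then 3, matching the ln:pg table above. Being a pure bash loop, it is unlikely to beat the awk approaches on large files.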

With Awk, you can set RS (the record separator, default newline) to form feed (\f) and FS (the input field separator, default any sequence of whitespace) to newline (\n), and obtain the number of lines as the number of "fields" in a "record", which is a "page".

The placement of form feeds in your data will produce some empty lines within a page so the counts are off where that happens.

awk -F '\n' -v RS='\f' '{ print NF }' file

You could reduce the number by one if $NF == "", and perhaps pass in the number of the desired page as a variable:

awk -F '\n' -v RS='\f' -v p="2" 'NR==p { print NF - ($NF == "") }' file
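As a worked example, here are both commands run on the sample from the question. The wrapper name lines_per_page and the temporary file are assumptions for this sketch.

```shell
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n\fone\ntwo\nthree\nfour\nfive\n\fearth\nwind\nfire\nwater\n\f' > "$sample"

# lines_per_page (hypothetical name): one count per page, ignoring the empty
# trailing field left by the newline that ends each page.
lines_per_page() {
    awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' "$1"
}

lines_per_page "$sample"

# Only the count for page 2.
awk -F '\n' -v RS='\f' -v p=2 'NR == p { print NF - ($NF == "") }' "$sample"
rm -f "$sample"
```

The first command prints 3, 5 and 4 on separate lines, and the second prints 5.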

To obtain the page number for a particular line, pipe the output of head -n number into the script and count the records, or loop over the per-page counts until their running sum reaches the line number.

line=1
page=1
for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' file); do
    old=$line
    ((line += count))
    echo "Lines $old through $((line - 1)) are on page $page"
    ((page++))
done
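The accumulating loop can also stop as soon as the running total reaches the requested line, which answers the original question. A minimal sketch follows; page_of_line is a hypothetical name and the sample is the one from the question, written to a temporary file.

```shell
# page_of_line LINE FILE (hypothetical helper): add up per-page line counts
# until the running total covers LINE; print 0 if the file is too short.
page_of_line() {
    local -i target=$1 total=0 page=0
    local count
    for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' "$2"); do
        page+=1
        total+=$count
        if (( total >= target )); then
            printf '%d\n' "$page"
            return 0
        fi
    done
    printf '0\n'
}

sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n\fone\ntwo\nthree\nfour\nfive\n\fearth\nwind\nfire\nwater\n\f' > "$sample"
page_of_line 4 "$sample"
page_of_line 13 "$sample"
rm -f "$sample"
```

Here line 4 is reported on page 2, and line 13 yields 0 because the pages hold only 12 lines in total.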

This GNU awk script prints the "page" for the line number given as a command-line argument:

BEGIN   { ffcount=1;
      search = ARGV[2]
      delete ARGV[2]
      if (!search ) {
        print "Please provide linenumber as argument"  
        exit(1);
      }
    }

$1 ~ search { printf( "line %s is on page %d\n", search, ffcount) }

/[\f]/ { ffcount++ }

Use it like awk -f formfeeds.awk formfeeds.txt 05, where formfeeds.awk is the script, formfeeds.txt is the file and '05' is a line number.

The BEGIN rule deals mostly with the command-line argument. The other rules are simple:

$1 ~ search applies when the first field matches the command-line argument stored in search
/[\f]/ applies when the line contains a form feed

Source: https://stackoverflow.com/questions/36655478/bash-routine-to-return-the-page-number-of-a-given-line-number-from-text-file
