How to extract data from html table in shell script?

[愿得一人] 2020-11-30 11:42

I am trying to create a Bash script that would extract the data from an HTML table. Below is an example of the table from which I need to extract data:

6 Answers
  •  粉色の甜心
    2020-11-30 12:07

    A solution based on xidel, a multi-platform web-scraping CLI, and XQuery:

    xidel -s --xquery 'for $tr in //tr[position()>1] return join($tr/td, " ")' file
    

    With the sample input, this yields:

    SAVE_DOCUMENT OK 0.406 s
    GET_DOCUMENT OK 0.332 s
    DVK_SEND OK 0.001 s
    DVK_RECEIVE OK 0.001 s
    GET_USER_INFO OK 0.143 s
    NOTIFICATIONS OK 0.001 s
    ERROR_LOG OK 0.001 s
    SUMMARY_STATUS OK 0.888 s
    

    Explanation:

    • XQuery query for $tr in //tr[position()>1] return join($tr/td, " ") processes the tr elements starting with the 2nd one (position()>1, to skip the header row) in a loop, and joins the values of the child td elements ($tr/td) with a single space as the separator.

    • -s makes xidel silent (suppresses output of status information).
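    To use the extracted rows inside a shell script, as the question asks, the space-separated lines can be read field by field. The following is only a minimal sketch: it assumes every line has the form NAME STATUS VALUE UNIT, as in the output shown above, the function name report_rows is hypothetical, and the here-document stands in for the output of the xidel command.

```shell
#!/bin/sh
# Hypothetical helper (not part of xidel): reads "NAME STATUS VALUE UNIT"
# lines from stdin and prints NAME=VALUE for every row whose status is OK.
report_rows() {
  while read -r name status value unit; do
    # Skip blank lines and rows that did not report OK.
    [ -n "$name" ] || continue
    [ "$status" = "OK" ] || continue
    printf '%s=%s\n' "$name" "$value"
  done
}

# Stand-in for: xidel -s --xquery '...' file | report_rows
report_rows <<'EOF'
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
EOF
```

    Splitting on whitespace with read works here only because none of the first three fields contains a space; cells with embedded spaces would need a different separator.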


    html2text is convenient for displaying the extracted data, but getting machine-parseable output from it is unfortunately non-trivial:

    html2text file | awk -F' *\\|' 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'
    

    The Awk command removes the hidden \b-based (backspace-based) sequences that html2text outputs by default, splits each line into fields on |, and then outputs the fields separated by a space (Awk's default output field separator; to use a tab instead, for instance, add -v OFS='\t').
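    The effect of the $1=$1 assignment together with -v OFS='\t' can be tried in isolation. This is a simplified sketch rather than the command above: it assumes input that is already free of backspace sequences and has no leading or trailing |, it writes the separator as a bracket expression, and the function name pipes_to_tabs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: re-joins pipe-separated fields with tabs.
# Assumes plain text input (not raw html2text output) with no
# leading or trailing "|" and no backspace sequences.
pipes_to_tabs() {
  # FS: optional spaces, a literal "|", optional spaces.
  # Assigning $1 to itself makes Awk rebuild the record using OFS.
  awk -F' *[|] *' -v OFS='\t' '{ $1 = $1; print }'
}

printf '%s\n' 'SAVE_DOCUMENT  | OK | 0.406 s' | pipes_to_tabs
```

    Without the $1=$1 assignment, print would emit the line unchanged, because Awk only applies OFS when the record is rebuilt.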

    Note: Use of -nobs to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default _ instances used for padding and actual _ characters in the data.

    Note: Given that html2text seemingly invariably uses | as the column separator, the above will only work robustly if there are no | instances in the data being extracted.
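    One way to at least notice a stray | in the data is to verify that every row splits into the expected number of fields. The sketch below is a heuristic under stated assumptions: the input is plain pipe-separated text, the expected count is known in advance, and the function name check_field_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: reports every line that does not split into exactly
# the expected number of "|"-separated fields; exits non-zero if any did.
check_field_count() {
  awk -F'[|]' -v want="$1" '
    NF != want {
      printf "line %d: %d fields, expected %d\n", NR, NF, want
      bad = 1
    }
    END { exit bad }
  '
}

# The second row contains an extra "|" inside a cell.
printf '%s\n' 'A|OK|1 s' 'B|x|y|2 s' | check_field_count 3 ||
  echo "unexpected field count"
```

    This only flags rows whose field count changes; it cannot tell which | belongs to the data.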
