Extract part of the code and parse HTML in bash

深忆病人 2020-12-10 15:11

I have an external HTML site and I need to extract data from a table on that site. However, the HTML source of the site is badly formatted everywhere except the table itself, so I

2 answers

刺人心 (original poster)

2020-12-10 15:50
I will break the answer down into steps. It uses xmllint, which supports an --html flag for parsing HTML files.

First, you can check the sanity of your HTML file by parsing it as shown below. xmllint either prints the parsed document, which confirms the file can be processed, or reports the errors it finds:
```
$ xmllint --html YourHTML.html





Lorem ipsum ....
  
    
      Company
      Contact
    
  
... dolor.
```
Here YourHTML.html is simply the input HTML file from your question.
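
If only the diagnostics are of interest, the check above can be sketched as follows. This is a minimal sketch, assuming xmllint writes its parser complaints to standard error and that --noout suppresses the re-serialized document. The file names sample.html and html_errors.log, and the sample markup itself, are made up for illustration and are not the file from the question.

```shell
#!/usr/bin/env bash
# Hypothetical sample input (an assumption, not the asker's real page).
cat > sample.html <<'EOF'
<html><body>
<h2>Lorem ipsum ....</h2>
<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
  </tr>
</table>
<h2>... dolor.</h2>
</body></html>
EOF

# --html selects the HTML parser; --noout drops the serialized tree,
# so whatever the parser complains about ends up in the log file.
xmllint --html --noout sample.html 2> html_errors.log

# -s is true when the log exists and is non-empty.
if [ -s html_errors.log ]; then
  echo "parser reported problems, see html_errors.log"
else
  echo "no parser diagnostics"
fi
```

Keeping the diagnostics in a separate file makes it easier to judge whether the broken markup outside the table affects the part you care about.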

Now for the value extraction part:

Address the table node from the root node with the XPath //html/body/table, and run xmllint with the HTML parser in interactive shell mode (xmllint --html --shell).

Running the command as-is produces this result:
```
$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html
/ >  -------

    
      Company
      Contact
    
  
/ > 
```
The lines that start with the shell prompt (/ >) can then be removed with sed, i.e. sed '/^\/ >/d', which produces
```
$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d'

    
      Company
      Contact
    
  
```
which is the output structure you expected. Tested on xmllint: using libxml version 20900

Going one step further, if you want only the values inside the table tag, you can add another sed command to strip the markup:
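
As a possible alternative to the interactive shell, the same node can be selected with the --xpath option, which prints the matching nodes directly, so there are no prompt lines to delete. This is only a sketch: it assumes the installed xmllint build supports --xpath, and the file sample.html and the variable table_markup are hypothetical names used for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample input (an assumption, not the asker's real page).
cat > sample.html <<'EOF'
<html><body>
<h2>Lorem ipsum ....</h2>
<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
  </tr>
</table>
<h2>... dolor.</h2>
</body></html>
EOF

# --xpath evaluates the expression and prints the selected node(s).
# Parser diagnostics go to stderr and are discarded here so that only
# the table markup is captured.
table_markup=$(xmllint --html --xpath '//html/body/table' sample.html 2>/dev/null)

printf '%s\n' "$table_markup"
```

Discarding stderr is a design choice for badly formatted pages. Drop the 2>/dev/null redirection if you want to see what the parser had to recover from.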
```
$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | xargs
Company Contact
```
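
The tag-stripping step can also be tried on its own, without xmllint. The sketch below wraps it in a helper named strip_tags (a hypothetical name) and feeds it made-up markup. It uses the pattern <[^>]*>, which matches one complete tag, instead of the <[^>]*. form shown above. Note that xargs treats quotes and backslashes specially, so cell text containing an apostrophe would need a different whitespace-collapsing step.

```shell
#!/usr/bin/env bash
# strip_tags: hypothetical helper that reads markup on stdin, removes
# everything that looks like a tag, and joins the remaining words on
# one line (xargs without a command runs echo on its input words).
strip_tags() {
  sed 's/<[^>]*>//g' | xargs
}

# Made-up input resembling the extracted table rows.
printf '<tr>\n<th>Company</th>\n<th>Contact</th>\n</tr>\n' | strip_tags
# prints: Company Contact
```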