sed command to extract text from HTML

时光毁灭记忆、已成空白 submitted on 2019-12-11 07:27:23

Question


I am grabbing the source of a page using curl and want to extract the text from a specific tag.

The text is inside this unique tag:

href="http://www.website.com/some/unique/page.php?q=xyz">TEXT</a>

What I did was:

curl -s "http://www.website.com" | sed 's|PATTERN|\1|'

where PATTERN is one of the many regex patterns I have tried, for example:

href="http://www.website.com/some/unique/page.php?q=xyz">\(.*\)</a>

But for some reason, I couldn't get it to work. I get either the full page source or nothing at all, depending on the pattern I use.

Would appreciate any help.


Answer 1:


Assuming your desired output is just TEXT, this will work with the input you gave:

sed 's/^.*>\([^<]*\)<.*$/\1/'
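As a quick check, the command can be run against the sample line from the question. This is only a sketch: the `printf` input stands in for the `curl` output, and the variable names `line` and `result` are made up for illustration.

```shell
# Stand-in for one line of the page source (copied from the question).
line='href="http://www.website.com/some/unique/page.php?q=xyz">TEXT</a>'

# Replace the whole line with the text captured between a '>' and the following '<'.
result=$(printf '%s\n' "$line" | sed 's/^.*>\([^<]*\)<.*$/\1/')

printf '%s\n' "$result"   # prints: TEXT
```

Note that without -n, sed still prints every line that does not match, unchanged, which is one way to end up with the full page source. Also, because the leading .* is greedy, a line with more tags after </a> would probably not yield TEXT, so this fits the sample input rather than arbitrary HTML.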

If you want only TEXT in the output, and only from a line whose URL contains the word unique in its path, use this instead:

sed -n '/http:.*\/unique\//s/^.*>\([^<]*\)<.*$/\1/p'
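To see the filtering at work, here is a small sketch that feeds the command a hypothetical three-line page instead of real `curl` output. The `page` contents (including the OTHER link) and the variable names are assumptions made for the example, not part of the original site.

```shell
# Hypothetical page source standing in for the curl output.
page='<html><body>
<a href="http://www.website.com/other/page.php">OTHER</a>
<a href="http://www.website.com/some/unique/page.php?q=xyz">TEXT</a>'

# -n suppresses automatic printing; the address /http:.*\/unique\// selects
# only lines whose URL contains /unique/; the trailing p prints a line only
# when the substitution succeeded on it.
result=$(printf '%s\n' "$page" | sed -n '/http:.*\/unique\//s/^.*>\([^<]*\)<.*$/\1/p')

printf '%s\n' "$result"   # prints: TEXT
```

The combination of -n and p is what keeps the rest of the page out of the output: lines that do not match the address, such as the OTHER link, are never printed.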


Source: https://stackoverflow.com/questions/4464170/sed-command-to-extract-text-from-html
