Delete html tags in sed or similar

爱一瞬间的悲伤 2020-12-05 21:07

I am trying to fetch the contents of a table from a webpage. I just need the contents, not the tags. I don't even need "tr" or "td", just

2 answers
  • 2020-12-05 21:20

    sed 's/<[^>]\+>//g' will strip all tags out, but you might want to replace them with a space so that text from adjacent tags doesn't run together: <td>one</td><td>two</td> would otherwise become onetwo. So you could do sed 's/<[^>]\+>/ /g', which outputs one two (well, strictly " one  two ", with a leading, a trailing and a doubled space).
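
    The difference between the two substitutions can be sketched as follows. The sample row is the one from the paragraph above. Writing the pattern as [^>][^>]* instead of using \+ is an assumption made here for portability, since \+ is a GNU extension to basic regular expressions; the final cleanup step is likewise an addition, not part of the original commands.

    ```shell
    # Sample row taken from the answer text.
    row='<td>one</td><td>two</td>'

    # Deleting the tags outright glues neighbouring cells together.
    glued=$(printf '%s\n' "$row" | sed 's/<[^>][^>]*>//g')

    # Replacing each tag with a space keeps the cells apart,
    # at the cost of leading, trailing and doubled spaces.
    spaced=$(printf '%s\n' "$row" | sed 's/<[^>][^>]*>/ /g')

    # Optional cleanup (an addition): squeeze runs of spaces, then trim both ends.
    clean=$(printf '%s\n' "$spaced" | tr -s ' ' | sed 's/^ //; s/ $//')

    # Brackets make the surrounding whitespace visible.
    printf '[%s]\n[%s]\n[%s]\n' "$glued" "$spaced" "$clean"
    ```

    This prints [onetwo], [ one  two ] and [one two], one per line.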

    That said, unless you need just the raw text (and it sounds like you want to transform the data after stripping the tags), a scripting language like Perl might be a better fit for this.

    As mu is too short mentioned, scraping HTML can be a bit dicey; using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.
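
    To make the "dicey" part concrete, here is a small sketch of one way the regex approach breaks: sed works line by line, so a tag that is split across two lines is never matched as a whole. The sample markup is made up for illustration, and joining the lines with tr is only a workaround under that assumption, not a substitute for a real parser.

    ```shell
    # Hypothetical input: the opening <a ...> tag is split across two lines.
    html=$(printf '<a\nhref="http://www.google.com/">here</a>')

    # Line-by-line stripping removes only the complete </a> tag;
    # both halves of the opening tag survive in the output.
    broken=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

    # Workaround: turn newlines into spaces first, so every tag sits on one line.
    # Note that this flattens the whole input into a single line of text.
    joined=$(printf '%s\n' "$html" | tr '\n' ' ' | sed 's/<[^>]*>//g')

    printf '%s\n---\n%s\n' "$broken" "$joined"
    ```

    The first result still contains the tag fragments; the second contains only the link text, followed by the space that replaced the final newline.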

  • 2020-12-05 21:28

    Original:

    Regular expressions in the Mac Terminal behave a bit differently. I was able to do this on my Mac with the following example:

    $ curl google.com | sed 's/<[^>]*>//g'
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   219  100   219    0     0    385      0 --:--:-- --:--:-- --:--:--   385
    
    301 Moved
    301 Moved
    The document has moved
    here.
    
    $ bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
    Copyright (C) 2007 Free Software Foundation, Inc.
    

    Edit:

    Just for clarification's sake, the original looked like:

    $ curl google.com
    <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
    <TITLE>301 Moved</TITLE></HEAD><BODY>
    <H1>301 Moved</H1>
    The document has moved
    <A HREF="http://www.google.com/">here</A>.
    </BODY></HTML>
    

    Also, the annoying curl progress meter can be suppressed with the -s option:

    $ curl -s google.com | sed 's/<[^>]*>//g' 
    
    301 Moved
    301 Moved
    The document has moved
    here.
    
    $
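
    Putting the pieces of this answer together, a reusable filter might look like the sketch below. The function name strip_tags is made up for illustration, and deleting the whitespace-only lines left behind is an extra step that the original command does not perform. The sample input is the 301 page shown above, so the sketch runs without network access.

    ```shell
    # strip_tags (hypothetical helper): read HTML on stdin, write plain text on stdout.
    strip_tags() {
        # First expression removes the tags; second deletes lines left empty or blank.
        sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d'
    }

    # Sample input: the 301 page from the curl output above.
    page=$(printf '%s\n' \
        '<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">' \
        '<TITLE>301 Moved</TITLE></HEAD><BODY>' \
        '<H1>301 Moved</H1>' \
        'The document has moved' \
        '<A HREF="http://www.google.com/">here</A>.' \
        '</BODY></HTML>')

    text=$(printf '%s\n' "$page" | strip_tags)
    printf '%s\n' "$text"

    # Against a live site (needs network access):
    #   curl -s google.com | strip_tags
    ```

    Keeping the filter separate from the fetch makes it easy to try on saved files as well as on curl output.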
    