Regexp for HTML tags with MATLAB

南旧 2020-12-20 18:05

I'm looking for a way to use regexp to remove all HTML tags from a string.
So if I have Hello&

4 answers
  • 2020-12-20 18:33

    To match such a tag, use:

    <[^>]*>
    

    See online here at Rubular
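
    As a minimal sketch of how this pattern could be applied in MATLAB, the snippet below passes it to regexprep with an empty replacement. The sample string is the one used in the other answers here; the variable name clean is only illustrative.

    ```matlab
    % Sample input taken from the other answers in this thread.
    str = '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';

    % Replace every "<...>" run that contains no ">" with the empty string.
    clean = regexprep(str, '<[^>]*>', '');

    disp(clean)   % prints: Hello
    ```

    Because [^>]* cannot run past the first ">", each match stops at the end of its own tag, so no lazy quantifier is needed.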

  • 2020-12-20 18:34

    It is widely accepted that using regexes to parse general HTML is bad form. If your HTML is much more complicated than the example given, you should use an XML parser instead.

    For further discussion, see the famous SO question "RegEx match open tags except XHTML self-contained tags".

    If you want to parse the content properly, then download xml_io_tools and use

    doc = xml_read('test.html')
    doc.b.FONT.CONTENT
    

    If you want to stick with regexes, then use ilya's answer, but with one of the regexes from the linked answer, e.g.,

    str = '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';
    % Inside a MATLAB char literal, each single quote must be doubled ('').
    rx = '<(?:"[^"]*"[''"]*|''[^'']*''[''"]*|[^''">])+>';
    regexprep(str, rx, '')
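
    To illustrate why the quote-aware pattern can matter, here is a small sketch comparing it with the simpler <[^>]*> pattern. The input string is a hypothetical example (not from the question) whose attribute value contains a literal ">"; the variable names are illustrative only.

    ```matlab
    % Hypothetical input: the title attribute contains a literal ">".
    str = '<a title="a > b">link</a>';

    simple = '<[^>]*>';                                         % stops at the first ">"
    quoted = '<(?:"[^"]*"[''"]*|''[^'']*''[''"]*|[^''">])+>';   % skips over quoted values

    naive  = regexprep(str, simple, '');   % leaves part of the attribute: ' b">link'
    robust = regexprep(str, quoted, '');   % 'link'

    disp(naive)
    disp(robust)
    ```

    The simple pattern ends the first match at the ">" inside the attribute, so part of the tag leaks into the output, while the quote-aware pattern consumes the whole quoted value before looking for the closing ">".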
    
  • 2020-12-20 18:37

    My solution is:

    >> str='<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>';
    >> regexprep(str, '<.*?>','')
    
    ans =
    
    Hello
    
  • 2020-12-20 18:47

    Since you mentioned that you want to extract "Hello" from the HTML above (saved as, say, filename.html), you can use the following in MATLAB, provided the file is well-formed XML:

    doc = xmlread('filename.html'); content = doc.item(0).getTextContent
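
    For a self-contained way to try this, the sketch below writes the sample markup to a temporary file first and then reads it back. It assumes a MATLAB session with Java available (xmlread relies on it) and markup that is well-formed XML; the temporary file and the variable names are illustrative.

    ```matlab
    % Write the sample markup to a temporary file (the file name is generated).
    fname = [tempname '.html'];
    fid = fopen(fname, 'w');
    fprintf(fid, '%s', '<HTML><b><FONT color="red" size="3">Hello</FONT></b></HTML>');
    fclose(fid);

    % Parse the file into a DOM document and collect the text of the root element.
    doc = xmlread(fname);
    content = char(doc.getDocumentElement.getTextContent);   % Java string -> MATLAB char

    delete(fname);   % clean up the temporary file
    disp(content)    % prints: Hello
    ```

    Wrapping the result in char converts the returned Java string into an ordinary MATLAB character array, which is usually easier to work with afterwards.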

    Hope this helps!
