Using an NSXMLParser to parse HTML

I'm working on an app which aggregates some feeds from the internet and reformats the content. So I'm looking for a way to parse some HTML. Given XML and HTML are very sim

3 Answers

忘掉有多难

2020-12-07 04:25

There's absolutely nothing special about "p" as the name of an element. While it is hard to be sure because you haven't provided an example of the HTML you are parsing, the problem is most likely caused by HTML that is not well-formed XML. In other words, using NSXMLParser would work on XHTML, but not necessarily plain-old HTML.

The "p" element is frequently found in HTML without the matching closing tag, which is not valid XML. My guess is that you would have to convert the HTML to XHTML before trying to parse it with an NSXMLParser.
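
The difference shows up with any strict XML parser. The following is only a minimal sketch in Python rather than Cocoa: it assumes the standard library's `xml.etree.ElementTree` as a stand-in for NSXMLParser, since both insist on well-formed input. The two sample strings and the helper `try_parse_as_xml` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: the same content as XHTML and as plain HTML.
XHTML = "<body><p>one</p><p>two</p></body>"
HTML = "<body><p>one<p>two</body>"


def try_parse_as_xml(markup):
    """Return the text of each <p>, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [p.text for p in root.iter("p")]


xhtml_result = try_parse_as_xml(XHTML)  # closed <p> tags: parses
html_result = try_parse_as_xml(HTML)    # unclosed <p> tags: rejected
print(xhtml_result, html_result)
```

Only the XHTML variant yields a list of paragraphs; the plain HTML variant is rejected outright instead of being partially parsed.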

耶瑟儿～

2020-12-07 04:40
I recommend you use my DTHTMLParser, which is modeled after NSXMLParser and uses libxml2 to parse HTML perfectly. You generally cannot rely on HTML being well-formed and parseable as XML.

libxml2 has an HTML mode in which it can ignore things like unclosed tags and other HTML idiosyncrasies.
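
To illustrate what a lenient, event-driven HTML parser looks like in practice, here is a rough sketch. It does not use libxml2 or DTHTMLParser; it assumes Python's standard `html.parser` module as an analogous tolerant parser whose callbacks play a role similar to NSXMLParser delegate methods. The class name `ParagraphCollector` and the sample input are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text following each <p> start tag; closing tags are optional."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly ends the previous one.
            self.paragraphs.append("")
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag in ("p", "body"):
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed("<body><p>one<p>two<p>three</body>")  # no </p> anywhere
collector.close()
paragraphs = [text.strip() for text in collector.paragraphs]
print(paragraphs)
```

The parser reports start tags and text as it meets them and never demands a closing tag, so the handler itself decides that a new `<p>` ends the previous paragraph.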

HTML parsing explained:
- http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/
- http://www.cocoanetics.com/2012/01/taming-html-parsing-with-libxml-2/
DTHTMLParser documentation:
- https://docs.cocoanetics.com/DTFoundation/Classes/DTHTMLParser.html
Source, part of DTFoundation:
- DTHTMLParser.h
- DTHTMLParser.m
遥遥无期

2020-12-07 04:44
HTML is not necessarily well-formed XML, and that's the trouble when you parse it as XML.

Take the following example:
```
<body>
    <p>123
    <p>abc
    <p>789
</body>
```
If you view this chunk of HTML in a browser, it will display just as you would expect. But if you parse it as XML, there will be trouble, because those p tags are not closed.
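
To see where a strict parser gives up on that snippet, here is a small sketch. It assumes Python's `xml.etree.ElementTree` in place of NSXMLParser, so the exact error text is that of Python's underlying expat parser, not of Cocoa.

```python
import xml.etree.ElementTree as ET

# The example above, fed to a strict XML parser.
SNIPPET = """<body>
    <p>123
    <p>abc
    <p>789
</body>"""

try:
    ET.fromstring(SNIPPET)
    error = None
except ET.ParseError as exc:
    error = exc

# The parser only complains once it reaches </body> while a <p> is still open.
message = str(error)
line, column = error.position
print(message, "at line", line)
```

The failure is reported at the closing `</body>` tag, not at the first `<p>`: nested, still-open `p` elements are legal XML until a closing tag arrives that does not match the innermost open element.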