可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I want to process a (very poorly defined) html, which has the information grouped in pairs of rows, like this:
<html> <body> <table> <tr> <td> <font > <a href="a">ABC</a></font> </td> </tr> <tr> <td height="50"> <font>When:</font><font>19-1-2013</font> <b><font> </font></b> <font>Where:</font><font>Here</font> <font>Who:</font><font>Me</font> </td> </tr> <tr> <td> <font > <a href="b">EFG</a> </font> </td> </tr> <tr> <td height="50"> <font>When:</font><font>19-2-2013</font> <b><font> </font></b> <font>Where:</font><font>There</font> <font>Who:</font><font>You</font> </td> </tr> <tr> <td> <font > <a href="c">HIJ</a> </font> </td> </tr> <tr> <td height="50"> <font>When:</font><font>19-3-2013</font><b> <font> </font></b> <font>Where:</font><font>Far away</font> <font>Who:</font><font>Him</font> </td> </tr> </table> </body> </html>
To this, after several iterations, I arrived at this code to achieve what I want:
import Data.List import Control.Arrow.ArrowNavigatableTree import Text.XML.HXT.Core import Text.HandsomeSoup group2 [] = [] group2 (x0:x1:xs) = [x0,x1]:(group2 xs) countRows html = html >>> deep (hasName "tr") >. length parsePage sz html = let n x = deep (hasName "tr") >. (( -> a !! x) . group2 ) >>> unlistA m = deep (hasName "td") >>> css "a" /> getText o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText p x = (((n x) >>> m) &&& ((n x) >>> o)) in html >>> catA [p x | x <- [0..sz]] main = do dt <- readFile "test.html" let html = parseHtml dt count <- (runX . countRows) html let cnt = ((head count) `div` 2) - 1 prcssd <- (runX . (parsePage cnt)) html print prcssd
And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]
However, I don't think this is a very good aproach, having to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.
The question at extract multiples html tables with hxt, while useful, presents a simpler situation, I believe.
回答1:
Here's a somewhat simpler implementation.
import Text.XML.HXT.Core import Text.HandsomeSoup group2 :: [a] -> [(a, a)] group2 [] = [] group2 (x0:x1:xs) = (x0, x1) : group2 xs parsePage :: ArrowXml a => a XmlTree (String, String) parsePage = let trPairs = deep (hasName "tr") >>. group2 insideLink = deep (hasName "a") /> getText insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText in trPairs >>> (insideLink *** insideFont) main = do dt <- readFile "test.html" let html = parseHtml dt prcssd <- runX $ html >>> parsePage print prcssd
The >>.
operator can be used instead of >.
so that you don't need to call unlistA
afterwards.
I changed the group2
function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.
The type of trPairs
is
trPairs :: ArrowXml a => a XmlNode (XmlNode, XmlNode)
i.e. it's an arrow that takes in nodes and outputs a pair of nodes (i.e. the paired up <tr>
nodes). Now we can use the ***
operator from Control.Arrow
to apply a transformation to either element of the pair, insideLink
for the first one and insideFont
for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.
回答2:
I did some html parsing with hxt a few weeks ago and thought, that xpath comes in quite handy. Unfortunately, I didn't come up with a perfect solution for your problem, but it might be a start for a new try.
import Text.XML.HXT.Core import Text.XML.HXT.XPath.Arrows type XmlTreeValue a = a XmlTree String type ParsedXmlTree a = a XmlTree XmlTree type IOXmlTree = IOSArrow XmlTree XmlTree -- parses a given .html file parseHtml :: FilePath -> IOStateArrow s b XmlTree parseHtml path = readDocument [withParseHTML yes, withWarnings no] path -- "" for stdout saveHtml :: IOXmlTree saveHtml = writeDocument [withIndent yes] "" extract :: IOXmlTree extract = processChildren (process `when` isElem) -- main processing functon processHtml :: FilePath -> IO () processHtml src = runX (parseHtml src >>> extract >>> saveHtml) >> return () -- process the html structure process :: ArrowXml cat => ParsedXmlTree cat process = -- create tag <structure> for the expression given next selem "structure" -- navigate to <html><body><table><tr>... [(getXPathTrees "/html/body/table/tr") -- then combine the results >>> (getTheName <+> getWhere)] -- selects text at path <td><font><a...> </a></font></td> and creates <name>-Tag -- (// means that all <td>-tags are analysed, -- but I'm not quite sure why this is relevant here) getTheName :: ArrowXml cat => ParsedXmlTree cat getTheName = selem "name" [getXPathTrees "//td/font/a/text()"] -- selects text at path <td><font><a...> </a></font></td> -- (where the forth font-tag is taken) and creates <where>-Tag getWhere :: ArrowXml cat => ParsedXmlTree cat getWhere = selem "where" [getXPathTrees "//td/font[4]/text()"]
The result looks like this:
*Main> processHtml "test.html" <?xml version="1.0" encoding="UTF-8"?> <structure> <name>ABC</name> <where/> <name/> <where>Here</where> <name>EFG</name> <where/> <name/> <where>There</where> <name>HIJ</name> <where/> <name/> <where>Far away</where> </structure>
Like I said, not quite perfect, but hopefully a start.
EDIT: Maybe this looks more like your approach. Still, instead of dropping the elements you don't care about, we first choose all elements that fit and filter the results. I think it's quite fascinating that there's no generic approach for such a problem. Because, somehow, the font[4]-selection does not work with my other approach - but maybe I'm just not a good xpath user.
processHtml :: FilePath -> IO [(String,String)] processHtml src = do names <- runX (parseHtml src >>> process1) fontTags <- runX (parseHtml src >>> process2) let wheres = filterAfterWhere fontTags let result = zip names wheres return result where filterAfterWhere [] = [] filterAfterWhere xs = case dropWhile (/= "Where:") xs of [] -> [] [x] -> [x] _:y:ys -> y : filterAfterWhere ys process1 :: ArrowXml cat => XmlTreeValue cat process1 = textNodeToText getTheName process2 :: ArrowXml cat => XmlTreeValue cat process2 = textNodeToText getWhere getTheName :: ArrowXml cat => ParsedXmlTree cat getTheName = getXPathTrees "//td/font/a/text()" getWhere :: ArrowXml cat => ParsedXmlTree cat getWhere = getXPathTrees "//td/font/text()" -- neet function to select a value within a XmlTree as String textNodeToText :: ArrowXml cat => ParsedXmlTree cat -> XmlTreeValue cat textNodeToText selector = selector `when` isElem >>> getText
This way, you get the result you showed in your question:
*Main> processHtml "test.html" [("ABC","Here"),("EFG","There"),("HIJ","Far away")]
Edit2:
Fun fact: it seems like the hxt-xpath library does not work quite right for such an index-selection. An online XPath-evaluator shows the right behaviour for //td/font[4]/text()
.