I want to process a (very poorly defined) html, which has the information grouped in pairs of rows, like this:
<html>
<body>
<table>
<tr>
<td>
<font >
<a href="a">ABC</a></font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-1-2013</font>
<b><font> </font></b>
<font>Where:</font><font>Here</font>
<font>Who:</font><font>Me</font>
</td>
</tr>
<tr>
<td>
<font >
<a href="b">EFG</a>
</font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-2-2013</font>
<b><font> </font></b>
<font>Where:</font><font>There</font>
<font>Who:</font><font>You</font>
</td>
</tr>
<tr>
<td>
<font >
<a href="c">HIJ</a>
</font>
</td>
</tr>
<tr>
<td height="50">
<font>When:</font><font>19-3-2013</font><b>
<font> </font></b>
<font>Where:</font><font>Far away</font>
<font>Who:</font><font>Him</font>
</td>
</tr>
</table>
</body>
</html>
To this, after several iterations, I arrived at this code to achieve what I want:
import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.Core
import Text.HandsomeSoup
group2 [] = []
group2 (x0:x1:xs) = [x0,x1]:(group2 xs)
countRows html = html >>> deep (hasName "tr") >. length
parsePage sz html = let
n x = deep (hasName "tr") >. (( -> a !! x) . group2 ) >>> unlistA
m = deep (hasName "td") >>> css "a" /> getText
o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
p x = (((n x) >>> m) &&& ((n x) >>> o))
in html >>> catA [p x | x <- [0..sz]]
main = do
dt <- readFile "test.html"
let html = parseHtml dt
count <- (runX . countRows) html
let cnt = ((head count) `div` 2) - 1
prcssd <- (runX . (parsePage cnt)) html
print prcssd
And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]
However, I don't think this is a very good aproach, having to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.
The question at extract multiples html tables with hxt, while useful, presents a simpler situation, I believe.
Here's a somewhat simpler implementation.
import Text.XML.HXT.Core
import Text.HandsomeSoup
group2 :: [a] -> [(a, a)]
group2 [] = []
group2 (x0:x1:xs) = (x0, x1) : group2 xs
parsePage :: ArrowXml a => a XmlTree (String, String)
parsePage = let
trPairs = deep (hasName "tr") >>. group2
insideLink = deep (hasName "a") /> getText
insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText
in trPairs >>> (insideLink *** insideFont)
main = do
dt <- readFile "test.html"
let html = parseHtml dt
prcssd <- runX $ html >>> parsePage
print prcssd
The >>.
operator can be used instead of >.
so that you don't need to call unlistA
afterwards.
I changed the group2
function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.
The type of trPairs
is
trPairs :: ArrowXml a => a XmlNode (XmlNode, XmlNode)
i.e. it's an arrow that takes in nodes and outputs a pair of nodes (i.e. the paired up <tr>
nodes). Now we can use the ***
operator from Control.Arrow
to apply a transformation to either element of the pair, insideLink
for the first one and insideFont
for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.
I did some html parsing with hxt a few weeks ago and thought, that xpath comes in quite handy. Unfortunately, I didn't come up with a perfect solution for your problem, but it might be a start for a new try.
import Text.XML.HXT.Core
import Text.XML.HXT.XPath.Arrows
type XmlTreeValue a = a XmlTree String
type ParsedXmlTree a = a XmlTree XmlTree
type IOXmlTree = IOSArrow XmlTree XmlTree
-- parses a given .html file
parseHtml :: FilePath -> IOStateArrow s b XmlTree
parseHtml path = readDocument [withParseHTML yes, withWarnings no] path
-- "" for stdout
saveHtml :: IOXmlTree
saveHtml = writeDocument [withIndent yes] ""
extract :: IOXmlTree
extract = processChildren (process `when` isElem)
-- main processing functon
processHtml :: FilePath -> IO ()
processHtml src =
runX (parseHtml src >>> extract >>> saveHtml)
>> return ()
-- process the html structure
process :: ArrowXml cat => ParsedXmlTree cat
process =
-- create tag <structure> for the expression given next
selem "structure"
-- navigate to <html><body><table><tr>...
[(getXPathTrees "/html/body/table/tr")
-- then combine the results
>>> (getTheName <+> getWhere)]
-- selects text at path <td><font><a...> </a></font></td> and creates <name>-Tag
-- (// means that all <td>-tags are analysed,
-- but I'm not quite sure why this is relevant here)
getTheName :: ArrowXml cat => ParsedXmlTree cat
getTheName = selem "name" [getXPathTrees "//td/font/a/text()"]
-- selects text at path <td><font><a...> </a></font></td>
-- (where the forth font-tag is taken) and creates <where>-Tag
getWhere :: ArrowXml cat => ParsedXmlTree cat
getWhere = selem "where" [getXPathTrees "//td/font[4]/text()"]
The result looks like this:
*Main> processHtml "test.html"
<?xml version="1.0" encoding="UTF-8"?>
<structure>
<name>ABC</name>
<where/>
<name/>
<where>Here</where>
<name>EFG</name>
<where/>
<name/>
<where>There</where>
<name>HIJ</name>
<where/>
<name/>
<where>Far away</where>
</structure>
Like I said, not quite perfect, but hopefully a start.
EDIT: Maybe this looks more like your approach. Still, instead of dropping the elements you don't care about, we first choose all elements that fit and filter the results. I think it's quite fascinating that there's no generic approach for such a problem. Because, somehow, the font[4]-selection does not work with my other approach - but maybe I'm just not a good xpath user.
processHtml :: FilePath -> IO [(String,String)]
processHtml src = do
names <- runX (parseHtml src >>> process1)
fontTags <- runX (parseHtml src >>> process2)
let wheres = filterAfterWhere fontTags
let result = zip names wheres
return result
where filterAfterWhere [] = []
filterAfterWhere xs = case dropWhile (/= "Where:") xs of
[] -> []
[x] -> [x]
_:y:ys -> y : filterAfterWhere ys
process1 :: ArrowXml cat => XmlTreeValue cat
process1 = textNodeToText getTheName
process2 :: ArrowXml cat => XmlTreeValue cat
process2 = textNodeToText getWhere
getTheName :: ArrowXml cat => ParsedXmlTree cat
getTheName = getXPathTrees "//td/font/a/text()"
getWhere :: ArrowXml cat => ParsedXmlTree cat
getWhere = getXPathTrees "//td/font/text()"
-- neet function to select a value within a XmlTree as String
textNodeToText :: ArrowXml cat => ParsedXmlTree cat -> XmlTreeValue cat
textNodeToText selector = selector `when` isElem >>> getText
This way, you get the result you showed in your question:
*Main> processHtml "test.html"
[("ABC","Here"),("EFG","There"),("HIJ","Far away")]
Edit2:
Fun fact: it seems like the hxt-xpath library does not work quite right for such an index-selection. An online XPath-evaluator shows the right behaviour for //td/font[4]/text()
.
来源:https://stackoverflow.com/questions/14962293/group-html-table-rows-with-hxt-in-haskell