Group html table rows with HXT in Haskell

匿名 (未验证) 提交于 2019-12-03 08:28:06

问题:

I want to process a (very poorly defined) html, which has the information grouped in pairs of rows, like this:

<html> <body> <table>  <tr>      <td>          <font >          <a href="a">ABC</a></font>      </td>  </tr>  <tr>      <td height="50">          <font>When:</font><font>19-1-2013</font>           <b><font>&nbsp; </font></b>          <font>Where:</font><font>Here</font>          <font>Who:</font><font>Me</font>      </td>  </tr>  <tr>      <td>         <font >              <a href="b">EFG</a>         </font>      </td>  </tr>  <tr>      <td height="50">          <font>When:</font><font>19-2-2013</font>          <b><font>&nbsp; </font></b>          <font>Where:</font><font>There</font>          <font>Who:</font><font>You</font>      </td>  </tr>  <tr>      <td>         <font >             <a href="c">HIJ</a>         </font>      </td>  </tr>  <tr>      <td height="50">          <font>When:</font><font>19-3-2013</font><b>          <font>&nbsp; </font></b>          <font>Where:</font><font>Far away</font>          <font>Who:</font><font>Him</font>      </td>  </tr> </table> </body> </html> 

To this, after several iterations, I arrived at this code to achieve what I want:

import Data.List import Control.Arrow.ArrowNavigatableTree import Text.XML.HXT.Core import Text.HandsomeSoup  group2 [] = [] group2 (x0:x1:xs) = [x0,x1]:(group2 xs)  countRows html = html >>> deep (hasName "tr") >. length  parsePage sz html = let   n x = deep (hasName "tr") >. (( -> a !! x) . group2 ) >>> unlistA   m = deep (hasName "td") >>> css "a" /> getText   o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText   p x = (((n x) >>> m) &&& ((n x) >>> o))   in html >>> catA [p x | x <- [0..sz]]  main = do     dt <- readFile "test.html"     let html = parseHtml dt     count <- (runX . countRows) html     let cnt = ((head count) `div` 2) - 1     prcssd <- (runX . (parsePage cnt)) html     print prcssd 

And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]

However, I don't think this is a very good aproach, having to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.

The question at extract multiples html tables with hxt, while useful, presents a simpler situation, I believe.

回答1:

Here's a somewhat simpler implementation.

import Text.XML.HXT.Core import Text.HandsomeSoup  group2 :: [a] -> [(a, a)] group2 [] = [] group2 (x0:x1:xs) = (x0, x1) : group2 xs  parsePage :: ArrowXml a => a XmlTree (String, String) parsePage = let     trPairs    = deep (hasName "tr") >>. group2     insideLink = deep (hasName "a") /> getText     insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText      in trPairs >>> (insideLink *** insideFont)   main = do     dt <- readFile "test.html"     let html = parseHtml dt     prcssd <- runX $ html >>> parsePage     print prcssd 

The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.

I changed the group2 function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.

The type of trPairs is

trPairs :: ArrowXml a => a XmlNode (XmlNode, XmlNode) 

i.e. it's an arrow that takes in nodes and outputs a pair of nodes (i.e. the paired up <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a transformation to either element of the pair, insideLink for the first one and insideFont for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.



回答2:

I did some html parsing with hxt a few weeks ago and thought, that xpath comes in quite handy. Unfortunately, I didn't come up with a perfect solution for your problem, but it might be a start for a new try.

import Text.XML.HXT.Core import Text.XML.HXT.XPath.Arrows  type XmlTreeValue a = a XmlTree String type ParsedXmlTree a = a XmlTree XmlTree type IOXmlTree = IOSArrow XmlTree XmlTree  -- parses a given .html file parseHtml :: FilePath -> IOStateArrow s b XmlTree parseHtml path = readDocument [withParseHTML yes, withWarnings no] path  -- "" for stdout saveHtml :: IOXmlTree saveHtml = writeDocument [withIndent yes] ""  extract :: IOXmlTree extract = processChildren (process `when` isElem)  -- main processing functon processHtml :: FilePath -> IO () processHtml src =   runX (parseHtml src >>> extract >>> saveHtml)    >> return ()  -- process the html structure process :: ArrowXml cat => ParsedXmlTree cat process =   -- create tag <structure> for the expression given next   selem "structure"     -- navigate to <html><body><table><tr>...     [(getXPathTrees "/html/body/table/tr")       -- then combine the results       >>> (getTheName <+> getWhere)]   -- selects text at path <td><font><a...> </a></font></td> and creates <name>-Tag  -- (// means that all <td>-tags are analysed,  --  but I'm not quite sure why this is relevant here)  getTheName :: ArrowXml cat => ParsedXmlTree cat  getTheName = selem "name" [getXPathTrees "//td/font/a/text()"]   -- selects text at path <td><font><a...> </a></font></td>  -- (where the forth font-tag is taken) and creates <where>-Tag  getWhere  :: ArrowXml cat => ParsedXmlTree cat  getWhere = selem "where" [getXPathTrees "//td/font[4]/text()"] 

The result looks like this:

*Main> processHtml "test.html" <?xml version="1.0" encoding="UTF-8"?> <structure>  <name>ABC</name>  <where/>  <name/>  <where>Here</where>  <name>EFG</name>  <where/>  <name/>  <where>There</where>  <name>HIJ</name>  <where/>  <name/>  <where>Far away</where> </structure> 

Like I said, not quite perfect, but hopefully a start.

EDIT: Maybe this looks more like your approach. Still, instead of dropping the elements you don't care about, we first choose all elements that fit and filter the results. I think it's quite fascinating that there's no generic approach for such a problem. Because, somehow, the font[4]-selection does not work with my other approach - but maybe I'm just not a good xpath user.

processHtml :: FilePath -> IO [(String,String)] processHtml src = do   names <- runX (parseHtml src >>> process1)   fontTags <- runX (parseHtml src >>> process2)   let wheres = filterAfterWhere fontTags   let result = zip names wheres   return result  where filterAfterWhere [] = []        filterAfterWhere xs = case dropWhile (/= "Where:") xs of                                []     -> []                                [x]    -> [x]                                _:y:ys -> y : filterAfterWhere ys  process1 :: ArrowXml cat => XmlTreeValue cat process1 = textNodeToText getTheName  process2 :: ArrowXml cat => XmlTreeValue cat process2 =  textNodeToText getWhere  getTheName :: ArrowXml cat => ParsedXmlTree cat getTheName = getXPathTrees "//td/font/a/text()"  getWhere  :: ArrowXml cat => ParsedXmlTree cat getWhere = getXPathTrees "//td/font/text()"  -- neet function to select a value within a XmlTree as String textNodeToText :: ArrowXml cat => ParsedXmlTree cat -> XmlTreeValue cat textNodeToText selector = selector `when` isElem >>> getText 

This way, you get the result you showed in your question:

*Main> processHtml "test.html" [("ABC","Here"),("EFG","There"),("HIJ","Far away")] 

Edit2:

Fun fact: it seems like the hxt-xpath library does not work quite right for such an index-selection. An online XPath-evaluator shows the right behaviour for //td/font[4]/text().



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!