Haskell HXT for extracting a list of values

烂漫一生 提交于 2019-12-19 03:24:05

问题


I'm trying to figure my way through HXT with XPath and arrows at the same time and I'm completely stuck on how to think through this problem. I've got the following HTML:

<div>
<div class="c1">a</div> 
<div class="c2">b</div> 
<div class="c3">123</div> 
<div class="c4">234</div> 
</div>

which I've extracted into an HXT XmlTree. What I'd like to do is define a function (I think?):

getValues :: [String] -> IOSArrow Xmltree [(String, String)]

Which, if used as getValues ["c1", "c2", "c3", "c4"], will get me:

[("c1", "a"), ("c2", "b"), ("c3", "123"), ("c4", "234")]

Help please?


回答1:


Here's one approach (my types are a bit more general and I'm not using XPath):

{-# LANGUAGE Arrows #-}
module Main where

import qualified Data.Map as M
import Text.XML.HXT.Arrow

classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
      cls <- getAttrValue "class" -< div
      val <- deep getText         -< div
      returnA -< (cls, val)

getValues :: (ArrowXml a) => [String] -> a XmlTree [(String, Maybe String)]
getValues cs = classes >>> arr (zip cs . lookupValues cs)
  where lookupValues cs m = map (flip M.lookup m) cs

main = do
  let xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
            \<div class='c3'>123</div><div class='c4'>234</div></div>"

  print =<< runX (readString [] xml >>> getValues ["c1", "c2", "c3", "c4"])

I would probably run an arrow to get the map and then do the lookups, but this way works as well.


To answer your question about listA: divs >>> divs >>> pairs is a list arrow with type a XmlTree (String, String)—i.e., it's a non-deterministic computation that takes an XML tree and returns string pairs.

arr M.fromList has type a [(String, String)] (M.Map String String). This means we can't just compose it with divs >>> divs >>> pairs, since the types don't match up.

listA solves this problem: it collapses divs >>> divs >>> pairs into a deterministic version with type a XmlTree [(String, String)], which is exactly what we need.




回答2:


Here is a way to do it using HandsomeSoup:

-- For the join function.
import Data.String.Utils
import Text.HandsomeSoup
import Text.XML.HXT.Core

-- Of each element, get class attribute and text.
getItem = (this ! "class" &&& (this /> getText))  
getItems selectors = css (join "," selectors) >>> getItem

main = do
  let selectors = [".c1", ".c2", ".c3", ".c4"]
  items <- runX (readDocument [] "data.html" >>> getItems selectors)
  print items

data.html is the HTML file.



来源:https://stackoverflow.com/questions/3897283/haskell-hxt-for-extracting-a-list-of-values

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!