Haskell HXT for extracting a list of values

佐手、 提交于 2019-11-30 20:56:33

Here's one approach (my types are a bit more general and I'm not using XPath):

{-# LANGUAGE Arrows #-}
module Main where

import qualified Data.Map as M
import Text.XML.HXT.Arrow

classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
      cls <- getAttrValue "class" -< div
      val <- deep getText         -< div
      returnA -< (cls, val)

getValues :: (ArrowXml a) => [String] -> a XmlTree [(String, Maybe String)]
getValues cs = classes >>> arr (zip cs . lookupValues cs)
  where lookupValues cs m = map (flip M.lookup m) cs

main = do
  let xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
            \<div class='c3'>123</div><div class='c4'>234</div></div>"

  print =<< runX (readString [] xml >>> getValues ["c1", "c2", "c3", "c4"])

I would probably run an arrow to get the map and then do the lookups, but this way works as well.


To answer your question about listA: divs >>> divs >>> pairs is a list arrow with type a XmlTree (String, String)—i.e., it's a non-deterministic computation that takes an XML tree and returns string pairs.

arr M.fromList has type a [(String, String)] (M.Map String String). This means we can't just compose it with divs >>> divs >>> pairs, since the types don't match up.

listA solves this problem: it collapses divs >>> divs >>> pairs into a deterministic version with type a XmlTree [(String, String)], which is exactly what we need.

Here is a way to do it using HandsomeSoup:

-- For the join function.
import Data.String.Utils
import Text.HandsomeSoup
import Text.XML.HXT.Core

-- Of each element, get class attribute and text.
getItem = (this ! "class" &&& (this /> getText))  
getItems selectors = css (join "," selectors) >>> getItem

main = do
  let selectors = [".c1", ".c2", ".c3", ".c4"]
  items <- runX (readDocument [] "data.html" >>> getItems selectors)
  print items

data.html is the HTML file.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!