With Haskell, how do I process large volumes of XML?


I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently, since String == [Char], which is an inappropriate representation for very large flat files.
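For example, even a trivial pass over a big file benefits from the lazy ByteString representation (a minimal sketch, with "posts.xml" standing in for whatever large file you are scanning):

import qualified Data.ByteString.Lazy.Char8 as BL

-- Count the newlines in a large file. BL.readFile streams the file in
-- chunks, whereas String IO would allocate a cons cell per character.
main :: IO ()
main = BL.readFile "posts.xml" >>= print . BL.count '\n'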

That then implies you'll need a Haskell XML library that supports ByteString. The couple of dozen XML libraries on Hackage are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml

I'm not sure which of them support ByteString, but that's the property you're looking for.

Below is an example that uses hexpat:

{-# LANGUAGE PatternGuards #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (StartElement "row" as)
  | Just ouid <- lookup "OwnerUserId" as = ouid == uid
  | otherwise = False
ownedBy _ _ = False

The definition of ownedBy is a little clunky. Maybe a view pattern instead:

{-# LANGUAGE ViewPatterns #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
ownedBy _ _ = False

-- project the OwnerUserId attribute out of a <row> start element, if any
ownerUserId :: SAXEvent String String -> Maybe String
ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
ownerUserId _ = Nothing

You could try my fast-tagsoup library. It is a simple replacement for tagsoup and parses at 20-200 MB/s.

The problem with the tagsoup package is that it works with String internally even if you use the Text or ByteString interface. fast-tagsoup works with strict ByteStrings, using high-performance low-level parsing while still returning a lazy list of tags as output.
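For illustration, here is a minimal sketch of the same row lookup with fast-tagsoup. It assumes the package exposes parseTags for strict ByteStrings in Text.HTML.TagSoup.Fast, so check the package documentation for the exact module and signature:

import qualified Data.ByteString.Char8 as BS
import Text.HTML.TagSoup (fromAttrib, (~==))
import Text.HTML.TagSoup.Fast (parseTags)   -- assumption: fast-tagsoup's strict-ByteString parser

userid :: BS.ByteString
userid = BS.pack "83805"

main :: IO ()
main = do
  tags <- parseTags <$> BS.readFile "posts.xml"   -- strict read, lazily produced tag list
  print . head $ filter owned tags
  where
    owned t = t ~== ("<row>" :: String)
              && fromAttrib (BS.pack "OwnerUserId") t == userid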

Yes: TagSoup supports ByteString via its Text.StringLike class. The only changes needed to your example were to call ByteString.Lazy's readFile and to wrap the attribute name passed to fromAttrib in fromString:

import Control.Monad (liftM)
import Text.HTML.TagSoup
import Text.StringLike
import qualified Data.ByteString.Lazy as BSL

userid = "83805"
file = "blah//posts.xml"

main :: IO ()
main = do
  posts <- liftM parseTags (BSL.readFile file)
  -- print the Id of the first row owned by our user
  print $ head $ map (fromAttrib (fromString "Id")) $
                 filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
                 posts

Your example ran for me on a machine with 4 GB of RAM, taking 6 minutes; the ByteString version took 10 minutes.

I had a similar problem with HXT; I avoided the memory issue by using the Expat parser together with HXT. On a 5 MB XML file, just reading the document and printing it, peak memory consumption went from 2 GB to about 180 MB, and the execution time was much shorter (I didn't measure it).

Perhaps you need a lazy XML parser: your usage looks like a pretty straightforward scan through the input. HaXml has a lazy parser, although you must ask for it explicitly by importing the correct module.
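A minimal sketch of that import swap, counting <row> elements; it assumes a recent HaXml in which the lazy parser lives in Text.XML.HaXml.ParseLazy with the same xmlParse signature as the strict Text.XML.HaXml.Parse:

import Text.XML.HaXml.Types (Document(..), Content(CElem))
import Text.XML.HaXml.Posn (noPos)
import Text.XML.HaXml.Combinators (deep, tag)
import Text.XML.HaXml.ParseLazy (xmlParse)   -- the lazy parser; Text.XML.HaXml.Parse is the strict one

main :: IO ()
main = do
  s <- readFile "posts.xml"                        -- HaXml's parsers still take String input
  let Document _ _ root _ = xmlParse "posts.xml" s -- first argument is only used in error messages
  -- count <row> elements; deep (tag "row") walks the tree as lazily as the parse allows
  print . length $ deep (tag "row") (CElem root noPos)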
