Haskell: Can I perform several folds over the same lazy list without keeping list in memory?

Submitted by 泪湿孤枕 on 2019-11-28 18:21:47

This is a comment on sdcvvc's comment referring to the 'beautiful folding' essay. It was so cool (beautiful, as he says) that I couldn't resist adding Functor and Applicative instances and a few other bits of modernization. Simultaneous folding of, say, x, y and z is a straightforward product: (,,) <$> x <*> y <*> z. I made a half-gigabyte file of small random ints, and it took 10 seconds to give the admittedly trivial calculation of length, sum and maximum on my rusty laptop. It doesn't seem to be helped by further annotations, but the compiler could see Int was all I was interested in. The obvious map read . lines as a parser led to a hopeless space and time catastrophe, so I unfolded with a crude use of ByteString.readInt; otherwise it is basically a Data.List process.

{-# LANGUAGE GADTs, BangPatterns #-}

import Data.List (foldl', unfoldr)
import Control.Applicative 
import qualified Data.ByteString.Lazy.Char8 as B

main = fmap readInts (B.readFile "int.txt") >>= print . fold allThree
  where allThree = (,,) <$> length_ <*> sum_ <*> maximum_

data Fold b c where  F ::  (a -> b -> a) -> a -> (a -> c) -> Fold b c
data Pair a b = P !a !b

instance Functor (Fold b) where  fmap f (F op x g) = F op x (f . g)

instance Applicative (Fold b) where
  pure c = F const () (const c)
  (F f x c) <*> (F g y c') = F (comb f g) (P x y) (c *** c')
    where comb f g (P a a') b = P (f a b) (g a' b)
          (***) f g (P x y) = f x ( g y)

fold :: Fold b c -> [b] -> c
fold (F f x c) bs = c $ (foldl' f x bs)

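sum_, product_ :: Num a => Fold a a
length_ :: Fold a Int
maximum_ :: (Ord a, Num a) => Fold a a
sum_     = F (+) 0 id
product_ = F (*) 1 id
length_  = F (const . (+1)) 0 id
-- Seeded with 0, so this is only a true maximum for non-negative input.
maximum_ = F max 0 id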
readInts  = unfoldr $ \bs -> case B.readInt bs of
  Nothing      -> Nothing
  Just (n,bs2) -> if not (B.null bs2) then Just (n,B.tail bs2) 
                                      else Just (n,B.empty)

Edit: unsurprisingly, since we are dealing with an unboxed type above, and an unboxed vector derived from e.g. a 2G file can fit in memory, this is all twice as fast and somewhat better behaved if it is given the obvious relettering for Data.Vector.Unboxed (http://hpaste.org/69270). Of course, this isn't relevant where one has types like LogEntry. Note, though, that the Fold type and Fold 'multiplication' generalize over sequential types without revision; thus, e.g., the Folds associated with operations on Chars or Word8s can be simultaneously folded directly over a ByteString. One must first define a foldB, by relettering fold to use the foldl's in the various ByteString modules. But the Folds and products of Folds are the same ones you would fold over a list or vector of Chars or Word8s.
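As a rough illustration of that last point, here is one way foldB might look for strict ByteStrings read as Chars. This is only a sketch: the Fold type and its instances are repeated so the block stands alone, and countIf is a hypothetical helper invented for the example, not something from the original code.

```haskell
{-# LANGUAGE GADTs #-}

import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)

-- Same shape as the Fold type used for lists, repeated here so the
-- sketch compiles on its own.
data Fold b c where
  F :: (a -> b -> a) -> a -> (a -> c) -> Fold b c

data Pair a b = P !a !b

instance Functor (Fold b) where
  fmap f (F op x g) = F op x (f . g)

instance Applicative (Fold b) where
  pure c = F const () (const c)
  F f x c <*> F g y c' = F step (P x y) done
    where
      step (P a a') b = P (f a b) (g a' b)
      done (P a a')   = c a (c' a')

-- 'fold' relettered: the only change is which foldl' drives the Fold.
foldB :: Fold Char c -> B.ByteString -> c
foldB (F f x c) bs = c (B.foldl' f x bs)

-- Hypothetical helper: count the Chars satisfying a predicate.
countIf :: (Char -> Bool) -> Fold Char Int
countIf p = F (\n ch -> if p ch then n + 1 else n) 0 id
```

With these definitions, foldB ((,) <$> countIf isSpace <*> countIf (== 'a')) would count spaces and the letter 'a' in one pass over the ByteString. A lazy ByteString or Word8 version should only need a different import and element type.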

To process lazy data multiple times in constant space, you can do one of three things:

  • re-build the lazy list from scratch n times
  • fuse the n passes into a single sequential fold that performs each step in lockstep
  • use par to do n parallel traversals at the same time
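The second option can be sketched by hand, without any Fold machinery, as a single foldl' over a strict accumulator. The names Stats and lengthSumMax are made up for this illustration, and the list is assumed to hold Ints.

```haskell
import Data.List (foldl')

-- Strict fields keep each component evaluated as the fold proceeds,
-- so no chain of thunks builds up behind the traversal.
data Stats = Stats !Int !Int !Int   -- length, sum, maximum so far

-- One pass computing length, sum and maximum together.
-- For an empty list the maximum component is minBound.
lengthSumMax :: [Int] -> (Int, Int, Int)
lengthSumMax = finish . foldl' step (Stats 0 0 minBound)
  where
    step (Stats n s m) x = Stats (n + 1) (s + x) (max m x)
    finish (Stats n s m) = (n, s, m)
```

Because the list is consumed once and nothing else holds a reference to its head, the garbage collector can reclaim cells as soon as the fold has passed them.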

Those are your options. The last one is the coolest :)
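A minimal sketch of the par approach, for two traversals, might look like the following. It imports par and pseq from GHC.Conc in base; they are more commonly imported from Control.Parallel in the parallel package. The name sumAndLength is invented for the example. Note that par is only a hint: the space behaviour depends on the program being built with -threaded, run on more than one core, and on both traversals advancing at a similar pace. If the spark is never picked up, the list is retained while the first traversal finishes.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Spark the sum, compute the length on the current thread, then pair
-- the results. Both traversals share the same lazy list.
sumAndLength :: [Int] -> (Int, Int)
sumAndLength xs = s `par` (l `pseq` (s, l))
  where
    s = foldl' (+) 0 xs
    l = length xs
```

The result is the same whether or not the spark actually runs in parallel; only the time and memory profile changes.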
