Haskell: Scan Through a List and Apply A Different Function for Each Element

我寻月下人不归 2021-02-06 11:43

I need to scan through a document and accumulate the output of different functions for each string in the file. The function run on any given line of the file depends on what i

4 answers
  •  甜味超标
    2021-02-06 12:17

    First of all, I think the answers others have supplied will work at least 95% of the time. It's always good practice to model the problem at hand with appropriate data types (or tuples in some cases). Sometimes, however, you really don't know in advance what you're looking for in the list, and trying to enumerate every possibility is difficult, time-consuming, and error-prone. Or you're writing multiple variants of the same sort of thing (manually inlining several folds into one) and you'd like to capture the abstraction.

    Fortunately, there are a few techniques that can help.

    The framework solution

    (somewhat self-evangelizing)

    First, the various "iteratee/enumerator" packages often provide functions to deal with this sort of problem. I'm most familiar with iteratee, which would let you do the following:

    import Data.Iteratee as I
    import Data.Iteratee.Char
    import qualified Data.ByteString.Char8 as B
    import Data.Maybe
    
    -- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
    -- if you want to just return them as a list, you can use the built-in
    -- stream2list function
    
    -- next, create stream transformers
    -- given at :: B.ByteString -> Maybe Atom
    -- (and, likewise, ot :: B.ByteString -> Maybe Sheet)
    -- create a stream transformer from ByteString lines to Atoms
    atIter :: Monad m => Enumeratee [B.ByteString] [Atom] m a
    atIter = I.mapChunks (mapMaybe at)
    
    otIter :: Monad m => Enumeratee [B.ByteString] [Sheet] m a
    otIter = I.mapChunks (mapMaybe ot)
    
    -- finally, combine multiple processors into one
    -- if you have more than one processor, you can use zip3, zip4, etc.
    procFile :: Monad m => Iteratee [B.ByteString] m ([Atom],[Sheet])
    procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)
    
    -- and run it on some data
    runner :: FilePath -> IO ([Atom],[Sheet])
    runner filename = do
      resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
      run resultIter
    

    One benefit of this approach is extra composability: you can create transformers as you like and combine them with zip. You can even run the consumers in parallel (only if you're working in the IO monad, and probably not worth it unless the consumers do a lot of work) by changing to this:

    import Data.Iteratee.Parallel
    
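    -- parI runs each consumer in its own thread, so the monad is fixed to IO
    parProcFile :: Iteratee [B.ByteString] IO ([Atom],[Sheet])
    parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)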
    

    The result isn't the same as a single for-loop: the data is still traversed more than once. What changes is the traversal pattern. A fixed amount of data (defaultBufSize bytes) is loaded at once, and that chunk is traversed once per consumer, with partial results stored as necessary. Once a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.

    Hopefully this will demonstrate the difference:

    Data.List.zip:
      x1 x2 x3 .. x_n
                       x1 x2 x3 .. x_n
    
    Data.Iteratee.zip:
      x1 x2      x3 x4      x_n-1 x_n
           x1 x2      x3 x4           x_n-1 x_n
    

    If you're doing enough work that parallelism makes sense, this isn't a problem at all. Because of memory locality, performance should also be considerably better than making multiple traversals over the entire input, as Data.List.zip would.
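
The chunked traversal pattern can be imitated with plain lists. This is only a rough sketch of the idea, not how iteratee is implemented: `chunksOf` and `zipChunked` are names made up for this illustration, and the chunk size plays the role of defaultBufSize. Each chunk is walked once per consumer before the next chunk is touched.

```haskell
import Data.List (foldl')

-- Split a list into chunks of at most n elements (n is assumed positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (chunk, rest) = splitAt n xs in chunk : chunksOf n rest

-- Run two left folds side by side over chunked input. Every chunk is
-- traversed twice (once per consumer); only the two accumulators are
-- carried over from one chunk to the next.
zipChunked :: Int -> (a -> x -> a) -> a -> (b -> x -> b) -> b -> [x] -> (a, b)
zipChunked n f a0 g b0 = foldl' step (a0, b0) . chunksOf n
  where
    step (a, b) chunk =
      let a' = foldl' f a chunk   -- first consumer walks the chunk
          b' = foldl' g b chunk   -- second consumer walks the same chunk
      in a' `seq` b' `seq` (a', b')
```

For example, `zipChunked 2 (+) 0 (\n _ -> n + 1) 0 [1 .. 5]` computes the sum and the length of the list together, visiting the chunks `[1,2]`, `[3,4]` and `[5]` in turn.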

    The beautiful solution

    If a single-traversal solution really does make the most sense, you might be interested in Max Rabkin's Beautiful Folding post and Conal Elliott's follow-up work. The essential idea is that you can create data structures to represent folds and zips, and combining these gives you a new fold/zip function that needs only one traversal. It's maybe a little advanced for a Haskell beginner, but since you're thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.
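
A minimal sketch of the folds-as-data idea, written from memory rather than copied from either post. The `Atom` and `Sheet` types and the prefix-based `parseAtom` and `parseSheet` below are hypothetical stand-ins for the question's own types and its `at`/`ot` functions. `Fold`, `Pair`, `collect` and `runFold` are likewise names chosen for this illustration.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl', isPrefixOf)

-- A left fold packaged as data: step function, initial state, final extraction.
data Fold a b = forall s. Fold (s -> a -> s) s (s -> b)

-- Strict pair, so the combined state does not accumulate thunks.
data Pair x y = Pair !x !y

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

-- Combining two folds pairs their states; both advance on every element.
instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  Fold stepF beginF doneF <*> Fold stepX beginX doneX =
    Fold step (Pair beginF beginX) done
    where
      step (Pair sf sx) a = Pair (stepF sf a) (stepX sx a)
      done (Pair sf sx) = doneF sf (doneX sx)

-- Run a fold with a single pass over the list.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl' step begin

-- Collect every successfully parsed value, in input order
-- (the state is a difference list, so appending is cheap).
collect :: (a -> Maybe b) -> Fold a [b]
collect p = Fold step id (\k -> k [])
  where step k a = maybe k (\b -> k . (b :)) (p a)

-- Hypothetical stand-ins for the question's types and parsers.
newtype Atom = Atom String deriving (Eq, Show)
newtype Sheet = Sheet String deriving (Eq, Show)

parseAtom :: String -> Maybe Atom
parseAtom l = if "ATOM" `isPrefixOf` l then Just (Atom l) else Nothing

parseSheet :: String -> Maybe Sheet
parseSheet l = if "SHEET" `isPrefixOf` l then Just (Sheet l) else Nothing

-- Two consumers combined into one fold that needs a single traversal.
procLines :: Fold String ([Atom], [Sheet])
procLines = (,) <$> collect parseAtom <*> collect parseSheet
```

`runFold procLines (lines contents)` then visits each line exactly once and hands it to both collectors. Adding a third consumer is one more `<*>`, with no change to the traversal itself.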
