Listing all the contents of a directory by breadth-first order results in low efficiency

问题

I writed a Haskell module to list all the contents of a directory by breadth-first order. The below is the source code.

module DirElements (dirElem) where

import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

dirElem :: FilePath -> IO [[FilePath]]
dirElem dirPath = iterateM (not.null) (concatMapM getDirectoryContents') [dirPath] >>= return.tail

getDirectoryContents' :: FilePath -> IO [FilePath]
getDirectoryContents' dirPath = do
  isDir <- do doesDirectoryExist dirPath
  if isDir then dirContent else return [] where
    dirContent = do
      contents <- getDirectoryContents dirPath
      return.(map (dirPath</>)).tail.tail $ contents

iterateM :: (Monad m) => (a -> Bool) -> (a -> m a) -> a -> m [a]
iterateM fb f x = do --Notice: Due to the the implementation of >>=, iterateM can't be writen like iterate which gives a infinite list and have type of iterateM :: (Monad m) => (a -> Bool) -> (a -> m a) -> a -> m [a]
  if fb x
    then do
      tail <- do {fx <- f x; iterateM fb f fx}
      return (x:tail)
    else return []

concatMapM :: Monad m => (a -> m[b]) -> [a] -> m[b]
concatMapM f list = mapM f list >>= return.concat

It works correct but when performing on a large directory, it will "suspend" for a little while, and spring out all the results.

After a research I find it is the same question with sequence $ map return [1..]::[[Int]] see Why the Haskell sequence function can't be lazy or why recursive monadic functions can't be lazy

回答1:

I modified the older answer that Davorak linked to to use the new pipes library.

It uses StateP to keep a queue of untraversed directories so that it can do a breadth first traversal. It uses MaybeP for exiting from the loop, as a convenience.

import Control.Monad
import Control.Proxy
import Control.Proxy.Trans.Maybe
import Control.Proxy.Trans.State as S
import Data.Sequence hiding (filter)
import System.FilePath.Posix
import System.Directory

getUsefulContents :: FilePath -> IO [FilePath]
getUsefulContents path
  = fmap (filter (`notElem` [".", ".."])) $ getDirectoryContents path

traverseTree
    :: (Proxy p)
    => FilePath
    -> () -> Producer (MaybeP (StateP (Seq FilePath) p)) FilePath IO r
traverseTree path () = do
    liftP $ S.modify (|> path)
    forever $ do
        x <- liftP $ S.gets viewl
        case x of
            EmptyL    -> mzero
            file :< s -> do
                liftP $ S.put s
                respond file
                p <- lift $ doesDirectoryExist file
                when p $ do
                    names <- lift $ getUsefulContents file
                    let namesfull = map (file </>) names
                    liftP $ forM_ namesfull $ \name ->
                        S.modify (|> name)

This defines a breadth-first lazy producer of files. If you hook it up to a printing stage, it will print out the files as it traverses the tree:

main = runProxy $ evalStateK empty $ runMaybeK $
    traverseTree "/tmp" >-> putStrLnD

Laziness means that if you only demand 3 files, it will only traverse the tree as much as necessary to generate three files, then it will stop:

    main = runProxy $ evalStateK empty $ runMaybeK $
        traverseTree "/tmp" >-> takeB_ 3 >-> putStrLnD

If you want to learn more about the pipes library, then I recommend you read the tutorial.

回答2:

This comes up every once in a while and the answer ends up being use an iteratee like library. Most often suggested recently has been the Proxy library.

Streaming recursive descent of a directory in Haskell
Older pipes solution out of date and non-iteratee like solution breadth-first traversal of directory tree is not lazy

I have seen Conduit solutions before and a few elegant monadic solutions, but I am not finding them now.

回答3:

First of all, that's not related to strictness. Like many monads, IO is actually nonstrict in its monadic operations. This is related to lazy vs. eager I/O.

The problem is that you first do the directory traversal and then you process the result. You can improve that by using coroutines to interleave them. One simple way is to make the directory traversal take a callback as argument:

getDirectoryContents' :: (MonadIO m) => (FilePath -> m a) -> FilePath -> m ()
getDirectoryContents' k fp = {- ... -}

This is the simplest and least flexible solution. A more flexible solution is to actually implement coroutines. You can either roll your own coroutine monad by using free, monad-coroutine or operational, or you can use one of the many streaming abstractions like conduit, enumerator or pipes with the last one being my personal recommentation for simple cases like this one.

回答4:

Everyone is telling you to use iteratees or pipes or the like, which are the current popular approach. But there's another, classic way to do this! Just use unsafeInterleaveIO from System.IO.Unsafe. All this function of type IO a -> IO a does is modify an IO action so that it only actually performs the IO when the value thunk is demanded, which is exactly what you were asking for. You can use this to write an iterateM with your desired semantics trivially.

Examples like this are where unsafeInterleaveIO shines.

You have, however, I'm sure, noted the "unsafe" in the name -- there are other examples, where you want direct control over filehandles and resource usage or the like, where unsafeInterleaveIO will indeed be bad news, and potentially even introduce violations of referential transparency.

(see this answer for more discussion: When is unsafeInterleaveIO unsafe?)

But again, in a case like this, I think unsafeInterleaveIO is the obvious, correct, and straightforward result.

来源：https://stackoverflow.com/questions/14474545/listing-all-the-contents-of-a-directory-by-breadth-first-order-results-in-low-ef

标签

performance

loops

haskell

lazy-evaluation