I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep.
Yet another solution that comes to mind is to use unsafeInterleaveIO from System.IO.Unsafe. See the reply of Tomasz Zielonka in this thread in Haskell Cafe.
It defers an input-output operation (opening a file) until it is actually required. Thus it is possible to avoid opening all files at once, and instead read and process them sequentially (open them lazily).
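To make the deferral concrete, here is a minimal sketch that uses only base. The names demo and action are hypothetical, and a counter stands in for "opening a file"; the point is that mapM returns without running any of the wrapped actions, and an action runs only when its result is forced.

```haskell
import Control.Exception (evaluate)
import Data.IORef (modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- | Returns how many actions had run before and after forcing the
-- first result. The counter stands in for "files opened so far".
demo :: IO (Int, Int)
demo = do
  counter <- newIORef (0 :: Int)
  -- Each action bumps the counter when it actually runs.
  let action n = modifyIORef' counter (+ 1) >> return (n * 2)
  -- Wrapping each action defers it until its result is demanded.
  results <- mapM (unsafeInterleaveIO . action) [1 .. 5 :: Int]
  before <- readIORef counter
  -- Force only the first result; the other four stay deferred.
  mapM_ evaluate (take 1 results)
  after <- readIORef counter
  return (before, after)

main :: IO ()
main = demo >>= print  -- prints (0,1)
```

Without the unsafeInterleaveIO wrapper, all five actions would have run by the time mapM returned, which is the analogue of opening every file up front.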
As for the original code, I believe mapM getFileLine opens all the files up front but does not start reading from them until putStr . unlines consumes the results. That leaves a lot of thunks holding open file handles floating around, which is the problem. (Please correct me if I am wrong.)
A modified example with unsafeInterleaveIO has been running against a 100 GB directory for several minutes now, in constant space.
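
    -- Imports this snippet assumes; BS must be the lazy ByteString that pureMD5's md5 expects.
    import Control.Monad (liftM)
    import qualified Data.ByteString.Lazy as BS
    import Data.Digest.Pure.MD5 (md5)
    import System.IO.Unsafe (unsafeInterleaveIO)

    getList :: FilePath -> IO [String]
    getList p =
        let getFileLine path =
                liftM (\c -> (show . md5 $ c) ++ " " ++ path)
                      (unsafeInterleaveIO $ BS.readFile path)
        in mapM getFileLine =<< getRecursiveContents p
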
(I switched to the pureMD5 implementation of the hash.)
P.S. I am not sure if this is good style. I believe that solutions with iteratees and strict IO are better, but this one is quicker to write. I use it in small scripts, but I'd be afraid of relying on it in a bigger program.
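For comparison, the strict-IO route could look roughly like the sketch below. It is not taken from the original program: hashFileWith and printHashes are hypothetical names, and the digest function is passed in as a parameter so the example needs only base and bytestring (with pureMD5 you would pass something like show . md5). Each file is opened, fully consumed, and closed before the next one is touched, so at most one handle is open at a time.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), withFile)

-- | Hash one file and make sure it is fully read before its handle closes.
-- 'digest' is a placeholder for a real hash such as @show . md5@.
hashFileWith :: (BL.ByteString -> String) -> FilePath -> IO String
hashFileWith digest path =
  withFile path ReadMode $ \h -> do
    contents <- BL.hGetContents h          -- lazy read, streamed in chunks
    let line = digest contents ++ " " ++ path
    -- Force every character of the output line, so the digest is
    -- computed (and the file consumed) while the handle is still open.
    mapM_ evaluate line
    return line

-- | Process the files strictly one after another.
printHashes :: (BL.ByteString -> String) -> [FilePath] -> IO ()
printHashes digest = mapM_ (\p -> hashFileWith digest p >>= putStrLn)

main :: IO ()
main = getArgs >>= printHashes (show . BL.length)  -- byte count as a stand-in digest
```

The file is still streamed lazily inside withFile, so memory use should stay bounded even for large files, but the laziness never escapes the scope in which the handle is open.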