Fastest way to read large binary file in Haskell?

悲哀的现实 2020-12-17 23:20

I want to process a binary file that is too large to read into memory. Currently I use ByteString.Lazy.readFile to stream the bytes. I thought it would be a good idea to use…
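(For reference, a minimal sketch of the lazy-I/O setup the question describes — the file name is hypothetical. Data.ByteString.Lazy.readFile reads the file lazily in chunks, so the whole file is never resident in memory at once.)

    import qualified Data.ByteString.Lazy as BL

    -- Lazily stream the file; 'BL.length' forces it chunk by chunk.
    main :: IO ()
    main = do
        contents <- BL.readFile "big.bin"
        print (BL.length contents)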

2 Answers
  • 2020-12-17 23:54

    With only streaming and bytestring, one can write something like:

    import           Data.ByteString (ByteString)
    import qualified Data.ByteString as B
    import           Streaming
    import qualified Streaming.Prelude as S
    import           System.IO

    -- Read a handle as a stream of strict chunks of at most chunkSize bytes.
    fromHandle :: Int -> Handle -> Stream (Of ByteString) IO ()
    fromHandle chunkSize h =
        S.untilRight $ do
            bytes <- B.hGet h chunkSize
            pure $ if B.null bytes
                       then Right ()   -- end of file: finish the stream
                       else Left bytes -- emit this chunk and continue
    

    This uses hGet and null from bytestring, and untilRight from streaming. You will need withFile to obtain the Handle, and to consume the Stream within the callback:

    dump :: FilePath -> IO ()
    dump file = withFile file ReadMode go
      where
        -- Copy the file to stdout in 4096-byte chunks.
        go :: Handle -> IO ()
        go = S.mapM_ (B.hPut stdout) . fromHandle 4096
    
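    A hypothetical usage sketch (not from the original answer), reusing the fromHandle defined above: count the bytes in a file by folding over the streamed chunks with fold_ from Streaming.Prelude.

    -- Count the total number of bytes by summing chunk lengths.
    countBytes :: FilePath -> IO Int
    countBytes file = withFile file ReadMode $ \h ->
        S.fold_ (\acc bs -> acc + B.length bs) 0 id (fromHandle 4096 h)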
  • 2020-12-17 23:56

    To elaborate on @Cubic's comment: while there's a general consensus that lazy I/O should be avoided in production code and replaced with a streaming approach, that consensus is not really about performance. If you're writing a program to do some one-off processing of a large file, then as long as your lazy I/O version runs fine now, there's probably no good performance reason to port it to a streaming package. (A sketch of such one-off processing follows below.)
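    For instance, a minimal lazy-I/O sketch of one-off chunk-wise processing (function and file names are mine, not from the answer): count the zero bytes in a file. foldlChunks from Data.ByteString.Lazy walks the file one strict chunk at a time, so memory use stays bounded.

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as BL

    -- Count zero bytes chunk by chunk; only one strict chunk
    -- is resident in memory at a time.
    countZeros :: FilePath -> IO Int
    countZeros path = do
        bytes <- BL.readFile path
        pure (BL.foldlChunks (\n c -> n + B.count 0 c) 0 bytes)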

    If anything, streaming is likely to add some overhead, so I suspect that a well-optimized lazy I/O solution would outperform a well-optimized streaming solution in most cases.

    The main reasons for avoiding lazy I/O have been discussed before on SO. In a nutshell, lazy I/O makes it difficult to manage resources reliably (e.g., file handles and network sockets), makes it hard to reason about space usage (a small program change can cause your memory usage to explode), and is occasionally "unsafe" when the timing and ordering of the I/O matters (usually not a problem if you're just reading in one set of files and/or writing out another set of files). A sketch of the timing pitfall follows below.
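    To illustrate the timing issue, here is the classic pitfall, sketched with the String-based hGetContents from System.IO (the function name is mine):

    import System.IO

    -- Pitfall: the handle is closed before the lazily-read contents
    -- are forced, so the result is silently truncated.
    fileLength :: FilePath -> IO Int
    fileLength path = do
        h <- openFile path ReadMode
        s <- hGetContents h  -- returns immediately; nothing is read yet
        hClose h             -- fixes 's' to whatever was read so far
        pure (length s)      -- typically 0, because nothing was ever read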

    Short-running utility programs for reading and/or writing large files are probably good candidates to be written in a lazy I/O style. As long as they don't have any obvious space leaks when they're run, they're probably fine.
