Question
The data
First of all, let's generate some input so we have concrete data to talk about:
python -c 'for f in xrange(4000000):  print f' > input.txt
This Python 2 one-liner generates a file input.txt containing the numbers from 0 to 3999999, each on its own line. That gives a file with 4,000,000 lines totalling 30,888,890 bytes, roughly 29 MiB.
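As a sanity check on those figures, here is a minimal sketch that recomputes the expected size without touching the file. The names digitBytes, lineCount and totalBytes are made up for illustration, and it assumes every line ends in a single newline byte.

```haskell
-- Every number from 0 to 3999999 contributes its decimal digits
-- plus one newline byte to input.txt.
digitBytes :: Int
digitBytes = sum [length (show n) | n <- [0 .. 3999999 :: Int]]

lineCount :: Int
lineCount = 4000000

-- Assumption: exactly one '\n' per line, including the last one.
totalBytes :: Int
totalBytes = digitBytes + lineCount

main :: IO ()
main = do
  putStrLn ("bytes without newlines: " ++ show digitBytes)  -- 26888890
  putStrLn ("bytes with newlines:    " ++ show totalBytes)  -- 30888890
```

The first number is the 26888890 that shows up in the size estimates, the second is the file size.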
Everything as a list
Right, let's load everything into memory as a [Text]:
import Data.Conduit
import Data.Text (Text)
import Control.Monad.Trans.Resource (runResourceT)
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
main :: IO ()
main = do
  hs <- (runResourceT
          $ CB.sourceFile "input.txt"
         $$ CT.decode CT.utf8
         =$ CT.lines
         =$ CL.fold (\b a -> a `seq` b `seq` a:b) [])
  print $ head hs
and run it:
$ ghc -fforce-recomp -O3 -rtsopts Test.hs && time ./Test +RTS -sstderr
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...
"3999999"
2,425,996,328 bytes allocated in the heap
972,945,088 bytes copied during GC
280,665,656 bytes maximum residency (13 sample(s))
5,120,528 bytes maximum slop
533 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed)  Avg pause  Max pause
Gen  0      4378 colls,     0 par    0.296s   0.309s     0.0001s    0.0009s
Gen  1        13 colls,     0 par    0.452s   0.661s     0.0508s    0.2511s
INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.460s  (  0.465s elapsed)
GC      time    0.748s  (  0.970s elapsed)
EXIT    time    0.002s  (  0.034s elapsed)
Total   time    1.212s  (  1.469s elapsed)
%GC     time      61.7%  (66.0% elapsed)
Alloc rate    5,271,326,694 bytes per MUT second
Productivity  38.3% of total user, 31.6% of total elapsed
real    0m1.481s
user    0m1.212s
sys 0m0.232s
It runs in about 1.5 s and takes 533 MB of memory. According to the Haskell Wiki's Memory Footprint page, a Text value should take 6 words + 2N bytes of memory. I'm on 64 bit, so one word is 8 bytes. That means the 4M Text values should take (6 * 8 bytes * 4000000) + (2 * 26888890) bytes = 234 MiB. (The 26888890 are all the bytes in input.txt that are not newline characters.) For the list that holds them all, we need an additional (1 + 3N) words + N * sizeof(v). sizeof(v) should be 8 because each element is a pointer to a Text. The list should then use (1 + 3 * 4000000) * 8 bytes + 4000000 * 8 bytes = 122 MiB. So in total (list + strings) we'd expect 356 MiB of memory used. I don't know where the difference of 177 MiB (about 50% more than expected) went, but let's ignore that for now.
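The estimate above can be reproduced with a small sketch. It only re-does the arithmetic under the stated assumptions (8-byte words, 6 words + 2N bytes per Text, (1 + 3N) words + N pointers for the list). It does not measure anything, and all names are hypothetical.

```haskell
wordSize, n, payloadChars :: Int
wordSize     = 8         -- bytes per word on a 64-bit machine
n            = 4000000   -- number of lines
payloadChars = 26888890  -- characters in input.txt, excluding newlines

-- Text: 6 words of overhead per value plus 2 bytes per character (UTF-16).
textBytes :: Int
textBytes = 6 * wordSize * n + 2 * payloadChars

-- List: (1 + 3N) words plus one pointer-sized element per entry.
listBytes :: Int
listBytes = (1 + 3 * n) * wordSize + n * wordSize

toMiB :: Int -> Double
toMiB b = fromIntegral b / (1024 * 1024)

main :: IO ()
main = do
  putStrLn ("Text values: " ++ show (toMiB textBytes) ++ " MiB")
  putStrLn ("list:        " ++ show (toMiB listBytes) ++ " MiB")
  putStrLn ("total:       " ++ show (toMiB (textBytes + listBytes)) ++ " MiB")
```

Truncated to whole MiB, this gives the 234, 122 and 356 MiB used in the text.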
The large hash set
Finally, we come to the use case that I'm actually interested in: storing all the words in a large Data.HashSet. For that, I changed the program ever so slightly:
import Data.Conduit
import Data.Text (Text)
import Control.Monad.Trans.Resource (runResourceT)
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.HashSet as HS
main :: IO ()
main = do
  hs <- (runResourceT
          $ CB.sourceFile "input.txt"
         $$ CT.decode CT.utf8
         =$ CT.lines
         =$ CL.fold (\b a -> a `seq` b `seq` HS.insert a b) HS.empty)
  print $ HS.size hs
If we run that again:
$ ghc -fforce-recomp -O3 -rtsopts Test.hs && time ./Test +RTS -sstderr
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...
4000000
6,544,900,208 bytes allocated in the heap
6,314,477,464 bytes copied during GC
442,295,792 bytes maximum residency (26 sample(s))
8,868,304 bytes maximum slop
1094 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed)  Avg pause  Max pause
Gen  0     12420 colls,     0 par    5.756s   5.869s     0.0005s    0.0034s
Gen  1        26 colls,     0 par    3.068s   3.633s     0.1397s    0.6409s
INIT    time    0.000s  (  0.000s elapsed)
MUT     time    3.567s  (  3.592s elapsed)
GC      time    8.823s  (  9.502s elapsed)
EXIT    time    0.008s  (  0.097s elapsed)
Total   time   12.399s  ( 13.192s elapsed)
%GC     time      71.2%  (72.0% elapsed)
Alloc rate    1,835,018,578 bytes per MUT second
Productivity  28.8% of total user, 27.1% of total elapsed
real    0m13.208s
user    0m12.399s
sys 0m0.646s
It's quite bad: 13 s and 1094 MiB of memory used. The memory footprint page lists 4.5N words + N * sizeof(v) for a hash set, which should come to (4.5 * 4000000 * 8 bytes) + (4000000 * 8 bytes) = 167 MiB. Adding the storage for the strings (234 MiB), I'd expect 401 MiB. The actual usage is more than double that, and it feels quite slow on top of that :(.
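To make the gap concrete, here is a rough sketch that recomputes the expected hash set footprint and compares it with the figure GHC reported. It assumes the 4.5N words + N * sizeof(v) formula and 8-byte words, and it treats the reported 1094 MB as 1094 * 1024 * 1024 bytes. The names are illustrative only.

```haskell
wordSize, n :: Int
wordSize = 8        -- bytes per word on a 64-bit machine
n        = 4000000  -- number of set elements

-- Hash set: 4.5N words of structure plus one pointer per element.
hashSetBytes :: Int
hashSetBytes = (9 * n `div` 2) * wordSize + n * wordSize

-- Text payload: 6 words per value plus 2 bytes per character.
textBytes :: Int
textBytes = 6 * wordSize * n + 2 * 26888890

expectedBytes :: Int
expectedBytes = hashSetBytes + textBytes

-- "1094 MB total memory in use", read as MiB (an assumption).
reportedBytes :: Int
reportedBytes = 1094 * 1024 * 1024

main :: IO ()
main = do
  putStrLn ("expected bytes: " ++ show expectedBytes)
  putStrLn ("reported bytes: " ++ show reportedBytes)
  putStrLn ("reported / expected: "
            ++ show (fromIntegral reportedBytes / fromIntegral expectedBytes :: Double))
```

The ratio comes out above 2.5, which matches the "more than double" observation.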
Thought experiment: manually managing the memory
As a thought experiment: in a language where we can control the memory layout manually and implement the hash set with open addressing, I'd expect the following sizes. For fairness, I'll assume the strings are still in UTF-16 (which is what Data.Text uses). Given 26888890 characters in total (without newlines), the strings in UTF-16 should take roughly 53777780 bytes (2 * 26888890) = 51 MiB. We need to store the length of every string, which is 8 bytes * 4000000 = 30 MiB. And we need space for the hash set itself (4000000 * 8 bytes), again 30 MiB. Given that hash tables normally grow exponentially, one would maybe expect 32 MiB, or 64 MiB in the worst case. Let's go with the worst case: 64 MiB for the table + 30 MiB for the string lengths + 51 MiB for the actual string data, for a grand total of 145 MiB.
So given that Data.HashSet is not a specialised implementation for storing strings, the calculated 401 MiB would not be too bad, but the 1094 MiB actually used seems rather wasteful.
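The numbers in the thought experiment can be sketched as follows. One assumption here is not stated in the text: the 32 MiB and 64 MiB table sizes are read as "slot count rounded up to a power of two" and "one further doubling". The helper nextPowerOfTwo and the other names are hypothetical.

```haskell
-- Smallest power of two that is >= k (for k >= 1).
nextPowerOfTwo :: Int -> Int
nextPowerOfTwo k = head (dropWhile (< k) (iterate (* 2) 1))

n, slotSize :: Int
n        = 4000000  -- number of strings
slotSize = 8        -- bytes per table slot (one pointer or offset)

stringBytes, lengthBytes :: Int
stringBytes = 2 * 26888890  -- UTF-16 payload, 2 bytes per character
lengthBytes = 8 * n         -- one 8-byte length per string

-- Table rounded up to a power of two, and the same table doubled once more.
tableBytes, worstTableBytes :: Int
tableBytes      = nextPowerOfTwo n * slotSize
worstTableBytes = 2 * tableBytes

worstCaseBytes :: Int
worstCaseBytes = worstTableBytes + lengthBytes + stringBytes

main :: IO ()
main = do
  putStrLn ("table (rounded up): " ++ show tableBytes ++ " bytes")
  putStrLn ("table (worst case): " ++ show worstTableBytes ++ " bytes")
  putStrLn ("worst-case total:   " ++ show worstCaseBytes ++ " bytes")
```

Under these assumptions the table sizes are exactly 32 MiB and 64 MiB, and the worst-case total is roughly 145 MiB.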
The questions finally :)
So we finally got there:
- Where is the error in my calculations?
- Is there some problem in my implementation or is 1094 MiB really the best we can get?
Versions and stuff
- I should probably use ByteStrings instead of Text, as I only need ASCII characters
- I'm on GHC 7.10.1 and unordered-containers-0.2.5.1
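For reference, a ByteString variant of the hash set program could look roughly like this. It is an untested sketch, not something that was measured here: it assumes the bytestring, hashable and unordered-containers packages are available, reads the whole file strictly instead of streaming it through conduit, and uses a made-up helper called lineSet. Whether it actually uses less memory would have to be checked with +RTS -sstderr.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.HashSet as HS

-- Split strict input into lines and collect them in a hash set.
lineSet :: BC.ByteString -> HS.HashSet BC.ByteString
lineSet = HS.fromList . BC.lines

main :: IO ()
main = do
  contents <- BC.readFile "input.txt"  -- loads the whole file into memory
  print (HS.size (lineSet contents))
```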
For comparison: 4,000,000 Ints:
import Data.List (foldl')
import qualified Data.HashSet as HS
main = do
    let hs = foldl' (\b a -> a `seq` b `seq` HS.insert a b) (HS.empty :: HS.HashSet Int) [1..4000000]
    print $ HS.size hs
This doesn't look any better:
[...]
798 MB total memory in use (0 MB lost due to fragmentation)
[...]
real    0m9.956s
That's almost 800 MiB for 4M Ints!
Source: https://stackoverflow.com/questions/36251583/haskell-data-hashset-from-unordered-container-performance-for-large-sets