Reading CSV in Julia is slow compared to Python

北荒 2020-12-16 11:10

Reading large text/CSV files in Julia takes a long time compared to Python. I timed this on a file that is 486.6 MB in size and has 153895 rows and 644 columns.

7 Answers
  •  不知归路
    2020-12-16 11:27

    To make this reproducible, let us first create a file like the one you describe:

    open("myFile.txt", "w") do io
        foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
    end
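
To see what the generator above writes, here is a scaled-down sketch of the same `join` call (3 rows and 5 columns instead of 153895 and 644). The helper name `sample_rows` is made up for illustration. The point is that every field is an integer, so the benchmark measures pure integer parsing, with no strings, dates, or missing values:

```julia
# Scaled-down version of the generator: 3 rows x 5 columns.
# `sample_rows` is a hypothetical helper, not part of the original answer.
function sample_rows(nrows::Integer, ncols::Integer)
    # Row i holds the consecutive integers i+1, ..., i+ncols joined by '|'.
    return [join(i+1:i+ncols, '|') for i in 1:nrows]
end

rows = sample_rows(3, 5)
foreach(println, rows)
# prints:
# 2|3|4|5|6
# 3|4|5|6|7
# 4|5|6|7|8
```

Results on files with mixed column types may differ from the timings reported here.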
    

    Now I read this file in Julia 1.4.2 with CSV.jl 0.7.1.

    Single threaded:

    julia> using CSV
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
    

    and using e.g. 4 threads:

    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      0.812742 seconds (47.28 k allocations: 1.208 GiB)
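
The transcript does not show how the 4 threads were enabled. A minimal sketch, assuming the usual mechanism: the thread count of a Julia session is fixed at startup, typically through the `JULIA_NUM_THREADS` environment variable (the `-t` command-line flag is only available from Julia 1.5 on). My understanding is that CSV.jl picks the thread count up from the session, so it is worth checking it before timing:

```julia
# Start Julia with, e.g., `JULIA_NUM_THREADS=4 julia`
# (or `julia -t 4` on Julia 1.5 and later); the value cannot be changed afterwards.
nthreads = Threads.nthreads()
println("Julia is running with ", nthreads, " thread(s)")
```

If this reports 1, the timings to compare against are the single-threaded ones.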
    

    In R it is:

    > system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
    +                                stringsAsFactors=F,na.strings=""))
       user  system elapsed 
     28.615   0.436  29.048 
    

    In Python (Pandas) it is:

    >>> import pandas as pd
    >>> import time
    >>> start=time.time()
    >>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
    >>> print(time.time()-start)
    25.95710587501526
    

    Now if we test fread from the data.table package in R (which is fast), we get:

    > library(data.table)
    > system.time(fread("myFile.txt", sep="|", header=F,
    +                   stringsAsFactors=F, na.strings="", nThread=1))
       user  system elapsed 
      1.043   0.036   1.082 
    > system.time(fread("myFile.txt", sep="|", header=F,
    +                   stringsAsFactors=F, na.strings="", nThread=4))
       user  system elapsed 
      1.361   0.028   0.416 
    

    So in this case the summary is:

    • even including the cost of compiling CSV.File on its first run, Julia is significantly faster than base R or Pandas
    • it is comparable in speed to fread in R (slightly slower in this case, but other benchmarks show cases where it is faster)

    EDIT: As requested, I have added a benchmark for a small file (10 columns, 100,000 rows), Julia vs Pandas.

    Data preparation step:

    open("myFile.txt", "w") do io
        foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
    end
    

    CSV.jl, single threaded:

    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      0.029965 seconds (248 allocations: 17.037 MiB)
    

    Pandas:

    >>> import pandas as pd
    >>> import time
    >>> start=time.time()
    >>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
    >>> print(time.time()-start)
    0.07587623596191406
    

    Conclusions:

    • the compilation cost is a one-time cost per session, and it is roughly constant (it does not depend on how big the file you want to read is)
    • for small files CSV.jl is faster than Pandas (if we exclude the compilation cost)
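
The two-call pattern used in the timings above can be sketched with the standard library alone. Here `parse_line` is a hypothetical stand-in for `CSV.File`, not the real parser: the first call pays for JIT compilation, the second runs already compiled code.

```julia
# `parse_line` is a made-up stand-in for CSV.File, used only to show the effect.
parse_line(line) = parse.(Int, split(line, '|'))

line = join(1:644, '|')

first_call  = @elapsed parse_line(line)   # includes compilation of parse_line
second_call = @elapsed parse_line(line)   # compiled code is reused
println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The first number is normally much larger than the second, and it stays about the same whether `line` has 10 fields or 644, which is the sense in which the compilation cost is constant.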

    Now, if you would like to avoid paying the compilation cost in every fresh Julia session, this is doable with PackageCompiler.jl: https://github.com/JuliaLang/PackageCompiler.jl.
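
A minimal sketch of what that can look like, assuming PackageCompiler.jl 2.x and CSV.jl are installed in the active project. The file names `sys_csv.so` and `precompile_csv.jl` are made up for illustration, and the keyword names are the ones I know from the PackageCompiler 2.x documentation, so check them against the version you have installed:

```julia
# Assumption: PackageCompiler.jl 2.x; older versions take symbols, e.g. :CSV.
using PackageCompiler

# precompile_csv.jl is a hypothetical script that exercises the calls you want
# compiled ahead of time, for example:
#     using CSV
#     CSV.File("myFile.txt", delim='|', header=false)
create_sysimage(["CSV"];
                sysimage_path = "sys_csv.so",   # hypothetical name; .dylib on macOS, .dll on Windows
                precompile_execution_file = "precompile_csv.jl")

# Afterwards, start Julia with the custom system image:
#     julia --sysimage sys_csv.so
```

Building the image takes a few minutes, but sessions started with it should no longer pay the first-call compilation cost for the calls covered by the precompile script.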

    In my experience, when doing data science work where, e.g., I read in thousands of CSV files, I do not mind waiting 2 seconds for compilation if it later saves me hours. It takes more than 2 seconds just to write the code that reads in the files.

    Of course, if you write a script that does little work and terminates as soon as it is done, that is a different use case, because compilation time would then make up most of the total run time. In that case, using PackageCompiler.jl is the strategy I use.
