Reading CSV in Julia is slow compared to Python

北荒 2020-12-16 11:10

Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB in size and has 153895 rows and 644 columns.

7 answers

不知归路 (OP)

2020-12-16 11:27

Let us first create a file like the one you describe, so that the benchmark is reproducible:

open("myFile.txt", "w") do io
    foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
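For readers who want to reproduce the Pandas side without Julia, the same file can be generated from Python. This is a minimal sketch that mirrors the Julia snippet above; the function name write_test_file and its parameters are my own, not part of the original benchmark.

```python
def write_test_file(path, n_rows=153895, n_cols=644, delim="|"):
    """Write n_rows lines; line i holds the integers i+1 .. i+n_cols joined by delim."""
    with open(path, "w") as io:
        for i in range(1, n_rows + 1):
            # Same content as Julia's join(i+1:i+644, '|') for row i.
            io.write(delim.join(str(v) for v in range(i + 1, i + n_cols + 1)))
            io.write("\n")

# Usage (writes a file of several hundred MB with the default arguments):
# write_test_file("myFile.txt")
```

Passing smaller n_rows and n_cols values (for example 100_000 and 10) gives the small-file case used later in this answer.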

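Now I read this file in Julia 1.4.2 with CSV.jl 0.7.1 (after loading the package with using CSV). Each command is run twice, because the first run includes compilation time.

Single-threaded: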

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)

And using, e.g., 4 threads:

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  0.812742 seconds (47.28 k allocations: 1.208 GiB)

In R it is:

> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+                                stringsAsFactors=F,na.strings=""))
   user  system elapsed 
 28.615   0.436  29.048

In Python (Pandas) it is:

>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526

Now if we test fread from R's data.table package (which is fast), we get:

> system.time(fread("myFile.txt", sep="|", header=F,
+                    stringsAsFactors=F, na.strings="", nThread=1))
   user  system elapsed 
  1.043   0.036   1.082 
> system.time(fread("myFile.txt", sep="|", header=F,
+                    stringsAsFactors=F, na.strings="", nThread=4))
   user  system elapsed 
  1.361   0.028   0.416

So in this case the summary is:

- despite the cost of compiling CSV.File the first time you run it, Julia is significantly faster than base R or Python (Pandas), even on that first run
- it is comparable in speed to fread in R (slightly slower in this case, but other benchmarks show cases where it is faster)

EDIT: As requested, I have added a Julia vs Pandas benchmark for a small file: 10 columns, 100,000 rows.

Data preparation step:

open("myFile.txt", "w") do io
    foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end

CSV.jl, single-threaded:

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  0.029965 seconds (248 allocations: 17.037 MiB)

Pandas:

>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406

Conclusions:

- the compilation cost is a one-time cost per Julia session, and it is roughly constant (it does not depend on how big the file you want to read is)
- for small files CSV.jl is faster than Pandas (if we exclude the compilation cost)
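The "roughly constant" claim can be checked against the @time outputs quoted above by subtracting the second call from the first. This is a sketch under one assumption of mine: that the whole gap between the two calls is first-call overhead dominated by compilation (other warm-up effects, such as the operating system caching the file, would be folded into the same number).

```python
# Seconds copied from the @time outputs above: (first call, second call).
timings = {
    "large file, 1 thread": (4.747160, 2.780213),
    "large file, 4 threads": (4.546945, 0.812742),
    "small file, 1 thread": (1.898649, 0.029965),
}

def first_call_overhead(first: float, second: float) -> float:
    """Extra seconds spent on the first call relative to the second one."""
    return first - second

overheads = {name: first_call_overhead(*pair) for name, pair in timings.items()}

for name, seconds in overheads.items():
    print(f"{name}: {seconds:.3f} s of first-call overhead")
```

The two single-threaded cases, which are the like-for-like pair, come out within about 0.1 s of each other even though the files differ greatly in size. The 4-thread run shows a larger first-call gap, so the constancy should be read as holding for a fixed configuration.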

Now, if you would like to avoid paying the compilation cost in every fresh Julia session, this is doable with https://github.com/JuliaLang/PackageCompiler.jl.

In my experience, if you are doing data science work where, e.g., you read in thousands of CSV files, waiting 2 seconds for compilation is not a problem if it later saves hours. It takes more than 2 seconds just to write the code that reads in the files.
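To put a rough number on this trade-off, one can ask after how many small files the one-time compilation cost is repaid. The sketch below uses the small-file timings quoted above and rests on several assumptions of mine: every file resembles the 10-column, 100,000-row one, the single measurements are representative, and start-up costs on both sides (launching Julia and loading CSV.jl, importing Pandas) are ignored. The function name break_even_files is hypothetical.

```python
import math

def break_even_files(one_time_cost: float, slow_per_file: float,
                     fast_per_file: float) -> int:
    """Smallest number of files for which one_time_cost + n * fast <= n * slow."""
    saving = slow_per_file - fast_per_file
    if saving <= 0:
        raise ValueError("the fast reader must be faster per file")
    return math.ceil(one_time_cost / saving)

# Small-file timings from this answer, in seconds.
csv_jl_first_call = 1.898649           # includes compilation
csv_jl_per_file = 0.029965             # second call, compilation already paid
pandas_per_file = 0.07587623596191406

compile_cost = csv_jl_first_call - csv_jl_per_file
n_files = break_even_files(compile_cost, pandas_per_file, csv_jl_per_file)
print(n_files)
```

Under these assumptions the break-even point is a few dozen files, far below the thousands of files mentioned above.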

Of course, if you write a script that does little work and then terminates, it is a different use case, as compilation time would then make up the majority of the total run time. In that case, using PackageCompiler.jl is the strategy I use.
