reading large text / csv files in Julia takes a long time compared to Python. Here are the times to read a file whose size is 486.6 MB and has 153895 rows and 644 columns. <
Let us first create a file you are talking about to provide reproducibility:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
Now I read this file in in Julia 1.4.2 and CSV.jl 0.7.1.
Single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
and using e.g. 4 threads:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.812742 seconds (47.28 k allocations: 1.208 GiB)
In R it is:
> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+ stringsAsFactors=F,na.strings=""))
user system elapsed
28.615 0.436 29.048
In Python (Pandas) it is:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526
Now if we test fread from R (which is fast) we get:
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=1))
user system elapsed
1.043 0.036 1.082
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=4))
user system elapsed
1.361 0.028 0.416
So in this case the summary is:
CSV.File in Julia when you run it for the first time it is significantly faster than base R or Pythonfread in R (in this case slightly slower, but other benchmark made here shows cases when it is faster)EDIT: Following the request I have added a benchmark for a small file: 10 columns, 100,000 rows Julia vs Pandas.
Data preparation step:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end
CSV.jl, single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.029965 seconds (248 allocations: 17.037 MiB)
Pandas:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406
Conclusions:
Now, if you would like to avoid having to pay compilation cost on every fresh Julia session this is doable with https://github.com/JuliaLang/PackageCompiler.jl.
From my experience, if you are doing data science work, where e.g. you read-in thousands of CSV files, I do not have a problem with waiting 2 seconds for the compilation, if later I can save hours. It takes more than 2 seconds to write the code that reads in the files.
Of course - if you write a script that does little work and terminates after it is done then it is a different use case as compilation time would be a majority of computational cost actually. In this case using PackageCompiler.jl is a strategy I use.