Read CSV files faster in Julia

Question

I have noticed that loading a CSV file using CSV.read is quite slow. For reference, here is one timing example:

using CSV, DataFrames
file = download("https://github.com/foursquare/twofishes")
@time CSV.read(file, DataFrame)

Output: 
9.450861 seconds (22.77 M allocations: 960.541 MiB, 5.48% gc time)
297 rows × 2 columns

This is a random dataset, and the Python equivalent of this operation completes in a fraction of the time that Julia takes. Since Julia is faster than Python, why does this operation take so long? Moreover, is there a faster alternative that reduces the compile time?

Answer 1:

You are measuring compile time together with run time.

One correct way to measure the time would be:

@time CSV.read(file, DataFrame)
@time CSV.read(file, DataFrame)

On the first run the function is compiled; on the second run the compiled code is reused, so only the second timing reflects the actual run time.
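The same effect can be seen without any packages. The following is a minimal sketch using only Base; `sum_of_squares` is a hypothetical function introduced purely for illustration, and the exact timings will differ from machine to machine.

```julia
# Hypothetical helper, used only to illustrate the first-call compile cost.
sum_of_squares(xs) = sum(x -> x^2, xs)

data = rand(1_000)

# The first call triggers JIT compilation for this argument type,
# so the measured time includes compilation.
first_call = @elapsed sum_of_squares(data)

# The second call reuses the already compiled code.
second_call = @elapsed sum_of_squares(data)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

On a typical run the first number should be noticeably larger than the second, which is the same pattern as with the two `@time CSV.read(file, DataFrame)` calls.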

Another option is using BenchmarkTools:

using BenchmarkTools
@btime CSV.read(file, DataFrame)

Normally, one uses Julia to work with huge datasets, so a single initial compile time is not important. However, it is possible to compile CSV and DataFrames into Julia's system image and get fast execution from the first run. For instructions see here: Why julia takes long time to import a package? (This is more advanced, however, and usually one does not need it.)
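As a rough sketch of what building such a system image might look like, assuming the PackageCompiler.jl package is used (the answer itself does not name a tool) and that CSV and DataFrames are installed in the active project. The file names `precompile_csv.jl` and `sys_csv.so` are hypothetical, and the build can take several minutes.

```julia
using PackageCompiler

# Hypothetical warm-up script: it calls CSV.read once so that the
# compiled code for this call is stored in the system image.
write("precompile_csv.jl", """
using CSV, DataFrames
CSV.read(IOBuffer("a,b\\n1,2\\n"), DataFrame)
""")

# Build a custom system image containing CSV and DataFrames.
# The output file name is an assumption; pick any path you like.
create_sysimage([:CSV, :DataFrames];
                sysimage_path = "sys_csv.so",
                precompile_execution_file = "precompile_csv.jl")
```

Julia would then be started with the custom image, for example `julia --sysimage sys_csv.so my_code.jl`. The image has to be rebuilt whenever the packages are updated, which is part of the complexity the answer alludes to.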

You also have yet another option, which is reducing the optimization level of the compiler. This is for scenarios where your workload is small and restarted frequently, and you do not want all the complexity that comes with image building. In this case you would run Julia as:

julia --optimize=0 my_code.jl

Finally, as mentioned by @Oscar Smith, compile times will be slightly shorter in the forthcoming Julia 1.6.

Source: https://stackoverflow.com/questions/65660180/read-csv-files-faster-in-julia

Tags

performance

csv

time

julia

benchmarking