Julia | DataFrame | Replacing missing Values

爱一瞬间的悲伤 2021-02-20 13:53

How can we replace missing values with 0.0 for a column in a DataFrame?

4 Answers
  • 2021-02-20 14:14

    This is a shorter, more up-to-date answer, since Julia now has a built-in missing value.

    using DataFrames
    # Five random integers in 1:50 per column; column C gets one missing entry.
    df = DataFrame(A=rand(1:50, 5), B=rand(1:50, 5), C=vcat(rand(1:50, 3), missing, rand(1:50)))
    # replace! is not defined for a DataFrame, so convert to a Matrix first,
    # then rebuild the DataFrame with the original column names.
    df = DataFrame(replace!(Matrix(df), missing => 0), names(df))
    
  • 2021-02-20 14:15

    There are a few different approaches to this problem (valid for Julia 1.x):

    Base.replace!

    Probably the easiest approach is to use replace! or replace from base Julia. Here is an example with replace!:

    julia> using DataFrames
    
    julia> df = DataFrame(x = [1, missing, 3])
    3×1 DataFrame
    │ Row │ x       │
    │     │ Int64⍰  │
    ├─────┼─────────┤
    │ 1   │ 1       │
    │ 2   │ missing │
    │ 3   │ 3       │
    
    julia> replace!(df.x, missing => 0);
    
    julia> df
    3×1 DataFrame
    │ Row │ x      │
    │     │ Int64⍰ │
    ├─────┼────────┤
    │ 1   │ 1      │
    │ 2   │ 0      │
    │ 3   │ 3      │
    

    However, note that at this point the type of column x still allows missing values:

    julia> typeof(df.x)
    Array{Union{Missing, Int64},1}
    

    This is also indicated by the question mark following Int64 in column x when the data frame is printed out. You can change this by using disallowmissing! (from the DataFrames.jl package):

    julia> disallowmissing!(df, :x)
    3×1 DataFrame
    │ Row │ x     │
    │     │ Int64 │
    ├─────┼───────┤
    │ 1   │ 1     │
    │ 2   │ 0     │
    │ 3   │ 3     │
    

    Alternatively, if you use replace (without the exclamation mark) as follows, then the output will already disallow missing values:

    julia> df = DataFrame(x = [1, missing, 3]);
    
    julia> df.x = replace(df.x, missing => 0);
    
    julia> df
    3×1 DataFrame
    │ Row │ x     │
    │     │ Int64 │
    ├─────┼───────┤
    │ 1   │ 1     │
    │ 2   │ 0     │
    │ 3   │ 3     │
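
    A minimal sketch of the same difference on a plain vector, using only Base (the variable names are illustrative, not from the answer): replace returns a new array with a narrowed element type, while replace! mutates in place and keeps the Union element type.

    ```julia
    x = [1, missing, 3]             # Vector{Union{Missing, Int}}
    y = replace(x, missing => 0)    # new vector; element type narrowed to Int
    replace!(x, missing => 0)       # in place; element type is unchanged

    println(eltype(y))              # element type of the copy
    println(eltype(x))              # element type of the mutated original
    ```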
    

    Base.ismissing with logical indexing

    You can use ismissing with logical indexing to assign a new value to all missing entries of an array:

    julia> df = DataFrame(x = [1, missing, 3]);
    
    julia> df.x[ismissing.(df.x)] .= 0;
    
    julia> df
    3×1 DataFrame
    │ Row │ x      │
    │     │ Int64⍰ │
    ├─────┼────────┤
    │ 1   │ 1      │
    │ 2   │ 0      │
    │ 3   │ 3      │
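
    As the printed column type suggests, logical indexing fills the values in place but leaves the element type as a Union with Missing. A hedged sketch on a plain Float64 vector, using only Base (names are illustrative), with 0.0 as in the question and an explicit convert to narrow the type afterwards:

    ```julia
    x = [1.5, missing, 3.0]                  # Vector{Union{Missing, Float64}}
    x[ismissing.(x)] .= 0.0                  # overwrite the missing entries in place
    narrowed = convert(Vector{Float64}, x)   # succeeds only because no missing remains

    println(eltype(x))
    println(eltype(narrowed))
    ```

    For a DataFrame column, disallowmissing! (shown earlier in this answer) plays the role of the convert call.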
    

    Base.coalesce

    Another approach is to use coalesce:

    julia> df = DataFrame(x = [1, missing, 3]);
    
    julia> df.x = coalesce.(df.x, 0);
    
    julia> df
    3×1 DataFrame
    │ Row │ x     │
    │     │ Int64 │
    ├─────┼───────┤
    │ 1   │ 1     │
    │ 2   │ 0     │
    │ 3   │ 3     │
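
    The same idea extends to several columns. The following is only a sketch, assuming DataFrames.jl 1.x (where names(df) returns strings and df[!, col] accesses a column without copying); the example data and column names are made up for illustration.

    ```julia
    using DataFrames

    # Hypothetical data: two Float64 columns with some missing entries.
    df = DataFrame(a = [1.0, missing, 3.0], b = [missing, 2.0, missing])

    # Replace each column with a copy in which missing is coalesced to 0.0.
    for col in names(df)
        df[!, col] = coalesce.(df[!, col], 0.0)
    end

    println(df)
    ```

    Because coalesce. builds a new vector for each column, the resulting columns no longer allow missing values.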
    

    DataFramesMeta

    Both replace and coalesce can be used with the @transform macro from the DataFramesMeta.jl package:

    julia> using DataFramesMeta
    
    julia> df = DataFrame(x = [1, missing, 3]);
    
    julia> @transform(df, x = replace(:x, missing => 0))
    3×1 DataFrame
    │ Row │ x     │
    │     │ Int64 │
    ├─────┼───────┤
    │ 1   │ 1     │
    │ 2   │ 0     │
    │ 3   │ 3     │
    
    julia> df = DataFrame(x = [1, missing, 3]);
    
    julia> @transform(df, x = coalesce.(:x, 0))
    3×1 DataFrame
    │ Row │ x     │
    │     │ Int64 │
    ├─────┼───────┤
    │ 1   │ 1     │
    │ 2   │ 0     │
    │ 3   │ 3     │
    

    Additional documentation

    • Julia manual
    • Julia manual - function reference
    • DataFrames.jl manual
  • 2021-02-20 14:16

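    The other answers are good overall. If you are a real speed junkie, the following might be for you. Note that it relies on the NA type and the DataArray fields used by older DataFrames.jl versions, before missing was introduced: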

    # prepare example
    using DataFrames
    df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
    df[ df[:A] %2 .== 0, :B ] = NA
    
    
    df[:B].data[df[:B].na] = 0.0 # put the 0.0 into NAs
    df[:B] = df[:B].data         # with no NAs might as well use array
    
  • 2021-02-20 14:19

    Create a DataFrame with some NAs:

    using DataFrames
    df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
    df[ df[:B] %2 .== 0, :A ] = NA
    

    You will see some NA values in df. Now convert them to 0.0:

    df[ isna(df[:A]), :A] = 0
    

    EDIT: NaN → NA. Thanks @Reza
