Create lag / lead time series with by groups in Julia?

喜夏-厌秋 提交于 2021-01-02 18:12:23

问题


I am wondering if there is an easy way to create a lag (or lead) of a time series variable in Julia according to a by group or condition? For example: I have a dataset of the following form

julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
                             var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1   │ var2  │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ a      │ 0     │
│ 2   │ a      │ 1     │
│ 3   │ a      │ 2     │
│ 4   │ a      │ 3     │
│ 5   │ b      │ 0     │
│ 6   │ b      │ 1     │
│ 7   │ b      │ 2     │
│ 8   │ b      │ 3     │

And I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather they should be set to missing or zero or some default value.

I have tried the following code which produces the following error.

julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2,2)) |> DataFrame

ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
  merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
  merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
  merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
  ...
Stacktrace:
 [1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
 [2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
 [3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
 [4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
 [5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
 [6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
 [7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
 [8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
 [9] top-level scope at none:0

Appreciate any help with this approach or alternate approaches. Thanks.


回答1:


You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:

julia> using DataFrames, ShiftedArrays

julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
                                    var2=[0,1,2,3,0,1,2,3]);

julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2)))
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64⍰  │
├─────┼────────┼─────────┤
│ 1   │ a      │ missing │
│ 2   │ a      │ missing │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ missing │
│ 6   │ b      │ missing │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │

Note that I used Base.Fix2 here to get a single argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function you can also set the default value like l2(x) = lag(x, 2, default = -1000) if you want to avoid missing values:

julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)

julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64   │
├─────┼────────┼─────────┤
│ 1   │ a      │ -1000   │
│ 2   │ a      │ -1000   │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ -1000   │
│ 6   │ b      │ -1000   │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │

EDIT

Under DataFrames.jl 0.22.2 the correct syntax is:

julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
 Row │ var1    var2_l2 
     │ String  Int64?  
─────┼─────────────────
   1 │ a       missing 
   2 │ a       missing 
   3 │ a             0
   4 │ a             1
   5 │ b       missing 
   6 │ b       missing 
   7 │ b             0
   8 │ b             1


来源:https://stackoverflow.com/questions/61038354/create-lag-lead-time-series-with-by-groups-in-julia

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!