Selected rows in data.table not being removed first time (must remove twice)

Submitted by 冷眼眸甩不掉的悲伤 on 2020-01-16 05:05:11

Question


I'm getting some strange behaviour with data.table in R. I want to keep only a certain subset of rows, e.g., DT <- DT[max.seq == 1], which (I thought) always worked fine in the past. But with this particular data set I don't know if it's my code or some data.table functionality that I've misunderstood.

It seems the command to remove rows I don't want needs to be run twice to work properly.

Specifically, I'm trying to remove non-sequential firm-level time series by keeping only the longest continuous sequence for each firm (or the most recent sequence if there are multiple maximal length sequences).

========

Here's a subset of the data I'm using:

library(data.table)
DT <- data.table(
       gvkey =  c(7221, 7221, 7221, 7221, 7221, 7221, 7221, 7221, 7392, 7392, 7392, 7392, 7392, 
                  7392, 7392, 7392, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 
                  8344, 8344, 10589, 10589, 10589, 10589, 11759, 11759, 12675, 12675, 12675, 12675, 
                  12675, 12675, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 
                  1312, 1312, 13910, 13910, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 
                  17286, 17286, 17286, 17286, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 
                  2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 
                  2090, 2090, 2090, 2090, 2090, 2090, 2212, 2212, 2212),
       fyear =  c(1982, 1983, 1984, 1985, 1990, 1991, 1992, 1993, 1975, 1976, 1977, 1978, 1983, 
                  1984, 1985, 1986, 1982, 1983, 1984, 1985, 1986, 1987, 1990, 1991, 1992, 1993, 
                  1994, 1995, 1978, 1979, 1983, 1984, 1984, 1988, 1985, 1986, 1987, 2001, 2002, 
                  2003, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 
                  1986, 1986, 1989, 1989, 1990, 1991, 1992, 1993, 1994, 2001, 2002, 2003, 2004, 
                  2005, 2006, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 
                  1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 
                  1980, 1981, 1982, 1983, 1982, 1983, 1984))

setkey(DT, gvkey)

===========

I then run the following commands to compute, for every row, the length of its sequence (seq.lengths) and the length of its firm's (i.e., gvkey's) longest sequence (max.seq), and keep the rows where the two match. I then repeat the idea with a binary variable (one.segment) to keep only the most recent sequence where several are tied for longest.

DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
DT[, gap := fyear - fyear.lag]

DT[,  step.idx := 0]    # initialize
DT[gap >=2, step.idx := 1]    # 1's at each multi-year jump
DT[,        step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ ,  seq.lengths := .N,  by=.(gvkey,step.idx)]      # length of each sequence
DT[,   max.seq := max(seq.lengths), by = gvkey]       # each firm's longest sequence

DT <- DT[max.seq == seq.lengths]  # Keep only the longest sequence(s)
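
To make the bookkeeping concrete, here is a minimal base R sketch of the same steps for a single firm, using the fyear values of gvkey 8344 from the data above. It does not use data.table: `ave()` stands in for the grouped `.N`, and the explicit `!is.na(gap)` test is an assumption about how the NA in the firm's first row should be treated (as "no jump").

```r
# fyear values for gvkey 8344, copied from the sample data
fyear <- c(1982:1987, 1990:1995)

# gap to the previous year; the first row has no predecessor
gap <- fyear - c(NA, head(fyear, -1))

# start a new sequence at every multi-year jump (NA counts as no jump)
step.idx <- cumsum(!is.na(gap) & gap >= 2)

# length of the sequence each row belongs to, and the firm's maximum
seq.lengths <- ave(seq_along(fyear), step.idx, FUN = length)
max.seq     <- max(seq.lengths)

# rows that survive the first filter
keep <- seq.lengths == max.seq
sum(keep)  # 12
```

Both sequences for this firm have length 6, so the first filter keeps all 12 rows, which is why a second tie-breaking step is needed.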

Now this is not the most efficient method, since I make a copy above when removing the non-longest time series and then again below when I keep only the most recent of the equal-length maximal series, but I don't think this should affect the functionality issue I'm having.

DT[, one.segment := 1*(max.seq == .N), by= gvkey] # 0 if multiple series remain

DT[one.segment == 0,  # make the last max.seq elements 1, leave the rest as 0
    one.segment := c(rep(0, (.N-max.seq[1])), rep(1, max.seq[1])), by=gvkey]
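
As a sanity check on what that tie-breaking vector should look like, here is a small base R sketch for the same firm (gvkey 8344), which has two sequences of length 6. It assumes the rows are already in ascending fyear order within the firm, since the expression simply marks the last max.seq rows.

```r
# fyear values for gvkey 8344, already sorted by year
fyear   <- c(1982:1987, 1990:1995)
max.seq <- 6L             # length of the longest sequence(s)
N       <- length(fyear)  # rows remaining for this firm

# zeros for everything except the last max.seq rows
one.segment <- c(rep(0, N - max.seq), rep(1, max.seq))

fyear[one.segment == 1]  # 1990 1991 1992 1993 1994 1995
```

So for this firm, exactly six rows should carry one.segment == 0 and six should carry one.segment == 1, and filtering on one.segment == 1 should leave only 1990 to 1995.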

EDITED to Report Full Output

I start with

 nrow(DT) # [1] 98
 DT[one.segment ==0, .N] # [1] 14

Then keep only the one.segment==1 rows.

DT.out <- DT[one.segment == 1] # Finished! ... or am I?

I should now have no one.segment == 0 cases left, but I do.

 nrow(DT.out) # [1] 76
 DT.out[one.segment ==0, .N] # [1] 13

But if I run the row removal command again then the problem is solved (both for this example and for my full data set nrow(DT)>35000).

DT.out2 <- DT.out[one.segment == 1]
nrow(DT.out2)  # [1] 63
DT.out2[one.segment ==0, .N]  # [1] 0

What am I missing?

Thanks!
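
One diagnostic that might help narrow this down is to compare the `col == value` form of the subset with a subset driven by a logical vector that base R computes first. The sketch below uses a hypothetical toy table (`toy`, `id`, `flag` are made-up names, not the data above) and repeats the same pattern of filtering on a column and then updating that same column with `:=`. Whether the two forms disagree on data.table 1.9.6 is not something the post establishes; this only shows how to check.

```r
library(data.table)

# Hypothetical toy table, not the data from the post
toy <- data.table(id = rep(1:2, each = 4), flag = 0)

# Same pattern as one.segment: filter on a column, then overwrite it by group
toy[flag == 0, flag := rep(c(0, 1), each = 2), by = id]

fast  <- toy[flag == 1]      # form that data.table may optimise internally
plain <- toy[toy$flag == 1]  # logical vector evaluated by base R first

nrow(fast)
nrow(plain)
```

If the two row counts differ on the real data, that would suggest the optimised lookup is working from outdated information about the column rather than its current values. data.table also documents an option, `options(datatable.auto.index = FALSE)`, and rerunning the pipeline with it set is another way to test that hypothesis.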

** OUTPUT **

> DT.out
    gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
 1:  1312  1974        NA  NA        0          13      13           1
 2:  1312  1975      1974   1        0          13      13           1
 3:  1312  1976      1975   1        0          13      13           1
 4:  1312  1977      1976   1        0          13      13           1
 5:  1312  1978      1977   1        0          13      13           1
 6:  1312  1979      1978   1        0          13      13           1
 7:  1312  1980      1979   1        0          13      13           1
 8:  1312  1981      1980   1        0          13      13           1
 9:  1312  1982      1981   1        0          13      13           1
10:  1312  1983      1982   1        0          13      13           1
11:  1312  1984      1983   1        0          13      13           1
12:  1312  1985      1984   1        0          13      13           1
13:  1312  1986      1985   1        0          13      13           1
14:  2090  1956        NA  NA        0          28      28           1
15:  2090  1957      1956   1        0          28      28           1
16:  2090  1958      1957   1        0          28      28           1
17:  2090  1959      1958   1        0          28      28           1
18:  2090  1960      1959   1        0          28      28           1
19:  2090  1961      1960   1        0          28      28           1
20:  2090  1962      1961   1        0          28      28           1
21:  2090  1963      1962   1        0          28      28           1
22:  2090  1964      1963   1        0          28      28           1
23:  2090  1965      1964   1        0          28      28           1
24:  2090  1966      1965   1        0          28      28           1
25:  2090  1967      1966   1        0          28      28           1
26:  2090  1968      1967   1        0          28      28           1
27:  2090  1969      1968   1        0          28      28           1
28:  2090  1970      1969   1        0          28      28           1
29:  2090  1971      1970   1        0          28      28           1
30:  2090  1972      1971   1        0          28      28           1
31:  2090  1973      1972   1        0          28      28           1
32:  2090  1974      1973   1        0          28      28           1
33:  2090  1975      1974   1        0          28      28           1
34:  2090  1976      1975   1        0          28      28           1
35:  2090  1977      1976   1        0          28      28           1
36:  2090  1978      1977   1        0          28      28           1
37:  2090  1979      1978   1        0          28      28           1
38:  2090  1980      1979   1        0          28      28           1
39:  2090  1981      1980   1        0          28      28           1
40:  2090  1982      1981   1        0          28      28           1
41:  2090  1983      1982   1        0          28      28           1
42:  2212  1982        NA  NA        0           3       3           1
43:  2212  1983      1982   1        0           3       3           1
44:  2212  1984      1983   1        0           3       3           1
45:  8344  1990      1987   3        1           6       6           1
46:  8344  1991      1990   1        1           6       6           1
47:  8344  1992      1991   1        1           6       6           1
48:  8344  1993      1992   1        1           6       6           1
49:  8344  1994      1993   1        1           6       6           1
50:  8344  1995      1994   1        1           6       6           1
51: 10589  1978        NA  NA        0           2       2           0
52: 10589  1979      1978   1        0           2       2           0
53: 10589  1983      1979   4        1           2       2           1
54: 10589  1984      1983   1        1           2       2           1
55: 11759  1984        NA  NA        0           1       1           0
56: 11759  1988      1984   4        1           1       1           1
57: 12675  1985        NA  NA        0           3       3           0
58: 12675  1986      1985   1        0           3       3           0
59: 12675  1987      1986   1        0           3       3           0
60: 12675  2001      1987  14        1           3       3           1
61: 12675  2002      2001   1        1           3       3           1
62: 12675  2003      2002   1        1           3       3           1
63: 13910  1986        NA  NA        0           1       1           0
64: 13910  1989      1986   3        1           1       1           1
65: 17286  1989        NA  NA        0           6       6           0
66: 17286  1990      1989   1        0           6       6           0
67: 17286  1991      1990   1        0           6       6           0
68: 17286  1992      1991   1        0           6       6           0
69: 17286  1993      1992   1        0           6       6           0
70: 17286  1994      1993   1        0           6       6           0
71: 17286  2001      1994   7        1           6       6           1
72: 17286  2002      2001   1        1           6       6           1
73: 17286  2003      2002   1        1           6       6           1
74: 17286  2004      2003   1        1           6       6           1
75: 17286  2005      2004   1        1           6       6           1
76: 17286  2006      2005   1        1           6       6           1
    gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment

** Session Info ***

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] tools_3.2.3  chron_2.3-47

Source: https://stackoverflow.com/questions/36238521/selected-rows-in-data-table-not-being-removed-first-time-must-remove-twice
