Question
I'm getting some strange behaviour with data.table in R. I want to keep only a certain subset of rows, e.g., DT <- DT[max.seq == 1], which (I thought) always worked fine in the past. But with this particular data set I don't know if it's my code or some data.table functionality that I've misunderstood.
It seems the command to remove rows I don't want needs to be run twice to work properly.
Specifically, I'm trying to remove non-sequential firm-level time series by keeping only the longest continuous sequence for each firm (or the most recent sequence if there are multiple maximal length sequences).
========
Here's a subset of the data I'm using:
library(data.table)
DT <- data.table(
gvkey = c(7221, 7221, 7221, 7221, 7221, 7221, 7221, 7221, 7392, 7392, 7392, 7392, 7392,
7392, 7392, 7392, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344,
8344, 8344, 10589, 10589, 10589, 10589, 11759, 11759, 12675, 12675, 12675, 12675,
12675, 12675, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312,
1312, 1312, 13910, 13910, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286,
17286, 17286, 17286, 17286, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
2090, 2090, 2090, 2090, 2090, 2090, 2212, 2212, 2212),
fyear = c(1982, 1983, 1984, 1985, 1990, 1991, 1992, 1993, 1975, 1976, 1977, 1978, 1983,
1984, 1985, 1986, 1982, 1983, 1984, 1985, 1986, 1987, 1990, 1991, 1992, 1993,
1994, 1995, 1978, 1979, 1983, 1984, 1984, 1988, 1985, 1986, 1987, 2001, 2002,
2003, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
1986, 1986, 1989, 1989, 1990, 1991, 1992, 1993, 1994, 2001, 2002, 2003, 2004,
2005, 2006, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966,
1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
1980, 1981, 1982, 1983, 1982, 1983, 1984))
setkey(DT, gvkey)
===========
I then run the following commands to identify, via max.seq, the rows belonging to each firm's (i.e., each gvkey's) longest sequence, and then create a binary variable one.segment to keep only the most recent sequence where several tie for longest.
DT[, fyear.lag := shift(fyear, n=1L, type = "lag"), by = gvkey]
DT[, gap := fyear - fyear.lag]
DT[, step.idx := 0] # initialize
DT[gap >=2, step.idx := 1] # 1's at each multi-year jump
DT[, step.idx := cumsum(step.idx), by = gvkey] # indexes each sequence by firm
DT[ , seq.lengths := .N, by=.(gvkey,step.idx)] # length of each sequence
DT[, max.seq := max(seq.lengths), by = gvkey] # each firm's longest sequence
DT <- DT[max.seq == seq.lengths] # Keep only the longest sequence(s)
This is not the most efficient method: I make a copy above when removing the non-longest time series, and another below when I keep only the most recent of several equal-length maximal series. But I don't think this should affect the functionality issue I'm having.
DT[, one.segment := 1*(max.seq == .N), by= gvkey] # 0 if multiple series remain
DT[one.segment == 0, # make the last max.seq elements 1, leave the rest as 0
one.segment := c(rep(0, (.N-max.seq[1])), rep(1, max.seq[1])), by=gvkey]
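The sequence-indexing idea used above (flag multi-year jumps, then take a cumulative sum to label each run) can be illustrated on a toy vector. This is only a minimal base-R sketch for a single hypothetical firm. The fyear values here are made up, and the helper names jump and longest are not part of the code above.

```r
# Toy fiscal years for one hypothetical firm: two runs separated by a gap
fyear <- c(1982, 1983, 1984, 1990, 1991)

gap <- c(NA, diff(fyear))                    # year-over-year difference; NA for the first row
jump <- as.integer(!is.na(gap) & gap >= 2)   # 1 wherever a new sequence starts
step.idx <- cumsum(jump)                     # labels each run: 0 0 0 1 1

# Length of the run each row belongs to (same role as seq.lengths above)
seq.lengths <- as.vector(ave(fyear, step.idx, FUN = length))

# Rows belonging to the longest run
longest <- fyear[seq.lengths == max(seq.lengths)]

print(step.idx)
print(longest)
```

The same labels are what the by = .(gvkey, step.idx) grouping above relies on, just computed per firm.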
EDITED to Report Full Output
I start with
nrow(DT) # [1] 98
DT[one.segment ==0, .N] # [1] 14
Then keep only the one.segment==1 rows.
DT.out <- DT[one.segment == 1] # Finished! ... or am I?
I should now have no one.segment == 0 cases left, but I do.
nrow(DT.out) # [1] 76
DT.out[one.segment ==0, .N] # [1] 13
But if I run the row removal command again then the problem is solved (both for this example and for my full data set, nrow(DT)>35000).
DT.out2 <- DT.out[one.segment == 1]
nrow(DT.out2) # [1] 63
DT.out[one.segment ==0, .N] # [1] 0
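One possible cross-check (a sketch, not a diagnosis) is to compute the filter as a plain logical vector outside of [ and compare the result with the direct subset expression. The toy table and the names toy, keep, by.vector and by.expr below are hypothetical. The same comparison could be run on DT with one.segment in place of flag.

```r
library(data.table)

# Hypothetical toy table, not the data from the question
toy <- data.table(id = c(1, 1, 2, 2, 2), flag = c(0, 1, 0, 0, 1))

keep <- toy$flag == 1          # plain logical vector computed outside of `[`
by.vector <- toy[keep]         # subset using the precomputed vector
by.expr   <- toy[flag == 1]    # subset using the expression inside `[`

# If both approaches agree, the row counts match and no flag == 0 rows remain
print(nrow(by.vector) == nrow(by.expr))
print(by.vector[, sum(flag == 0)])
```

If the two subsets differ on the real data, that would suggest the discrepancy arises in how the expression inside [ is evaluated rather than in the values of one.segment themselves.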
What am I missing?
Thanks!
** OUTPUT **
> DT.out
gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
1: 1312 1974 NA NA 0 13 13 1
2: 1312 1975 1974 1 0 13 13 1
3: 1312 1976 1975 1 0 13 13 1
4: 1312 1977 1976 1 0 13 13 1
5: 1312 1978 1977 1 0 13 13 1
6: 1312 1979 1978 1 0 13 13 1
7: 1312 1980 1979 1 0 13 13 1
8: 1312 1981 1980 1 0 13 13 1
9: 1312 1982 1981 1 0 13 13 1
10: 1312 1983 1982 1 0 13 13 1
11: 1312 1984 1983 1 0 13 13 1
12: 1312 1985 1984 1 0 13 13 1
13: 1312 1986 1985 1 0 13 13 1
14: 2090 1956 NA NA 0 28 28 1
15: 2090 1957 1956 1 0 28 28 1
16: 2090 1958 1957 1 0 28 28 1
17: 2090 1959 1958 1 0 28 28 1
18: 2090 1960 1959 1 0 28 28 1
19: 2090 1961 1960 1 0 28 28 1
20: 2090 1962 1961 1 0 28 28 1
21: 2090 1963 1962 1 0 28 28 1
22: 2090 1964 1963 1 0 28 28 1
23: 2090 1965 1964 1 0 28 28 1
24: 2090 1966 1965 1 0 28 28 1
25: 2090 1967 1966 1 0 28 28 1
26: 2090 1968 1967 1 0 28 28 1
27: 2090 1969 1968 1 0 28 28 1
28: 2090 1970 1969 1 0 28 28 1
29: 2090 1971 1970 1 0 28 28 1
30: 2090 1972 1971 1 0 28 28 1
31: 2090 1973 1972 1 0 28 28 1
32: 2090 1974 1973 1 0 28 28 1
33: 2090 1975 1974 1 0 28 28 1
34: 2090 1976 1975 1 0 28 28 1
35: 2090 1977 1976 1 0 28 28 1
36: 2090 1978 1977 1 0 28 28 1
37: 2090 1979 1978 1 0 28 28 1
38: 2090 1980 1979 1 0 28 28 1
39: 2090 1981 1980 1 0 28 28 1
40: 2090 1982 1981 1 0 28 28 1
41: 2090 1983 1982 1 0 28 28 1
42: 2212 1982 NA NA 0 3 3 1
43: 2212 1983 1982 1 0 3 3 1
44: 2212 1984 1983 1 0 3 3 1
45: 8344 1990 1987 3 1 6 6 1
46: 8344 1991 1990 1 1 6 6 1
47: 8344 1992 1991 1 1 6 6 1
48: 8344 1993 1992 1 1 6 6 1
49: 8344 1994 1993 1 1 6 6 1
50: 8344 1995 1994 1 1 6 6 1
51: 10589 1978 NA NA 0 2 2 0
52: 10589 1979 1978 1 0 2 2 0
53: 10589 1983 1979 4 1 2 2 1
54: 10589 1984 1983 1 1 2 2 1
55: 11759 1984 NA NA 0 1 1 0
56: 11759 1988 1984 4 1 1 1 1
57: 12675 1985 NA NA 0 3 3 0
58: 12675 1986 1985 1 0 3 3 0
59: 12675 1987 1986 1 0 3 3 0
60: 12675 2001 1987 14 1 3 3 1
61: 12675 2002 2001 1 1 3 3 1
62: 12675 2003 2002 1 1 3 3 1
63: 13910 1986 NA NA 0 1 1 0
64: 13910 1989 1986 3 1 1 1 1
65: 17286 1989 NA NA 0 6 6 0
66: 17286 1990 1989 1 0 6 6 0
67: 17286 1991 1990 1 0 6 6 0
68: 17286 1992 1991 1 0 6 6 0
69: 17286 1993 1992 1 0 6 6 0
70: 17286 1994 1993 1 0 6 6 0
71: 17286 2001 1994 7 1 6 6 1
72: 17286 2002 2001 1 1 6 6 1
73: 17286 2003 2002 1 1 6 6 1
74: 17286 2004 2003 1 1 6 6 1
75: 17286 2005 2004 1 1 6 6 1
76: 17286 2006 2005 1 1 6 6 1
gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
** Session Info ***
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6
loaded via a namespace (and not attached):
[1] tools_3.2.3 chron_2.3-47
Source: https://stackoverflow.com/questions/36238521/selected-rows-in-data-table-not-being-removed-first-time-must-remove-twice