Extracting rows from a data frame based on whether values change across rows

Question

I have tried unsuccessfully to do the task described below, so any help would be much appreciated.

The larger table below contains data on the quota ownership of fishers (and another variable, 'cpue') over time. I categorized fishers according to the number of quotas they own ('category'). Fishers may increase or reduce the number of quotas they own; therefore, their ownership category may also change. I need to extract the information every time a fisher changes ownership category, namely the row for the year before the number of quotas increased or decreased. For instance, if the number of quotas was 20 in 2000 and 45 in 2001, I need the information (row) for the year 2000. Additionally, I need a new column with a category indicating between which ownership levels the fisher is moving. The second table below shows the new data frame that I need to create from the extracted rows.

My data:

ID  fisher  year    qtty    category    cpue
1   1   1998    13  1   0.5994452
2   1   1999    13  1   0.6176183
3   1   2000    13  1   0.6871764
4   1   2001    20  2   0.3228005
5   1   2002    20  2   0.6505336
6   1   2003    20  2   0.8615834
7   1   2004    20  2   0.6871764
8   1   2005    20  2   0.7469739
9   1   2006    20  2   0.7380952
10  1   2007    45  3   0.7516396
11  1   2008    45  3   0.6808454
12  1   2009    45  3   0.6734158
13  1   2010    45  3   0.70367
14  1   2011    45  3   0.5434572
15  1   2012    45  3   0.6181238
16  2   2000    50  3   0.5191856
17  2   2001    50  3   0.6098226
18  2   2002    50  3   1.0018519
19  2   2003    50  3   1.2049724
20  2   2004    50  3   0.5857708
21  2   2005    10  1   0.6744186
22  2   2006    10  1   0.8123333
23  2   2007    10  1   0.3228005
24  2   2008    10  1   0.6505336
25  2   2009    10  1   0.8615834
26  2   2010    0   4   0
27  3   1998    25  2   0.7469739
28  3   1999    25  2   0.7380952
29  3   2000    25  2   0.7516396
30  3   2001    25  2   0.6808454
31  3   2002    10  1   0.6734158
32  3   2003    10  1   0.70367
33  3   2004    10  1   0.5434572
34  3   2005    10  1   0.6181238
35  3   2006    45  3   0.4698849
36  3   2007    45  3   1.0714286
37  3   2008    45  3   1.242439
38  3   2009    45  3   1.0614261
39  3   2010    45  3   0.9761391
40  3   2011    45  3   1.0041898
41  3   2012    45  3   0.9429851
42  4   2005    45  3   0.9310958
43  4   2006    50  3   0.8932985
44  4   2007    50  3   0.7867613
45  4   2008    20  2   0.7994713
46  4   2009    20  2   0.9368927
47  4   2010    10  1   0.8123333
48  4   2011    0   4   0
49  5   1998    20  2   0.4698849
50  5   1999    20  2   1.0714286
51  5   2000    20  2   1.242439
52  5   2001    20  2   1.0614261
53  5   2002    20  2   0.9761391
54  5   2003    20  2   1.0041898
55  5   2004    20  2   0.7469739
56  5   2005    0   4   0.7380952
57  6   2000    55  3   0.7516396
58  6   2001    55  3   0.6808454
59  6   2002    55  3   0.6734158
60  6   2003    55  3   0.6505336
61  6   2004    55  3   0.8615834
62  6   2005    55  3   0.6871764
63  6   2006    55  3   0.6181238
64  6   2007    0   4   0

This is what I need:

ID  fisher  year    qtty    category    cpue    category2
3   1   2000    13  1   0.6871764   1
25  2   2009    10  1   0.8615834   1
34  3   2005    10  1   0.6181238   1
47  4   2010    10  1   0.8123333   1
9   1   2006    20  2   0.7380952   2
30  3   2001    25  2   0.6808454   3
46  4   2009    20  2   0.9368927   3
44  4   2007    50  3   0.7867613   4
20  2   2004    50  3   0.5857708   5
25  2   2009    10  1   0.8615834   6
47  4   2010    10  1   0.8123333   6
55  5   2004    20  2   0.7469739   7
63  6   2006    55  3   0.6181238   8

The ownership categories are 1 (1-15 quotas), 2 (16-40 quotas), 3 (>40 quotas) and 4 (0 quotas, those who exited the fishery). The new category should show the transition between ownership categories (e.g. category2 = 1 is the transition from ownership level 1 to ownership level 2). Full details are in the following table:

From    to  category2
1   2   1
2   3   2
2   1   3
3   2   4
3   1   5
1   0   6
2   0   7
3   0   8

Thanks!!

Answer 1:

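With data as your first data frame and cats as the category table (with exits coded as 4 rather than 0 in its to column, so that they match the category column). Transitions missing from cats, such as 1 to 3, come out as NA: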

> w<-which(diff(data$fisher)==0 & diff(data$category)!= 0)
> merge(data.frame(data[w,],From=data$category[w],to=data$category[w+1]),cats,all.x=T)[,-(1:2)]
   ID fisher year qtty category      cpue category2
1   3      1 2000   13        1 0.6871764         1
2  34      3 2005   10        1 0.6181238        NA
3  25      2 2009   10        1 0.8615834         6
4  47      4 2010   10        1 0.8123333         6
5  46      4 2009   20        2 0.9368927         3
6  30      3 2001   25        2 0.6808454         3
7   9      1 2006   20        2 0.7380952         2
8  55      5 2004   20        2 0.7469739         7
9  20      2 2004   50        3 0.5857708         5
10 44      4 2007   50        3 0.7867613         4
11 63      6 2006   55        3 0.6181238         8
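
To see why the `diff()` test picks the row before each change, here is a minimal sketch on a made-up six-row data frame (`toy` and its values are hypothetical, not taken from the question). `diff()` returns one value fewer than its input, so position `i` of the result compares row `i` with row `i + 1`. Requiring `diff(fisher) == 0` keeps the comparison from crossing from one fisher to the next.

```r
# Hypothetical toy data: two fishers, category changes within each
toy <- data.frame(
  fisher   = c(1, 1, 1, 2, 2, 2),
  year     = c(2000, 2001, 2002, 2000, 2001, 2002),
  category = c(1, 1, 2, 3, 1, 1)
)

# TRUE at position i when rows i and i + 1 belong to the same fisher
same_fisher <- diff(toy$fisher) == 0
# TRUE at position i when the category differs between rows i and i + 1
cat_changed <- diff(toy$category) != 0

# Row indices of the last year before each change
w <- which(same_fisher & cat_changed)

# Attach both ends of the transition to the extracted rows
changes <- data.frame(toy[w, ], From = toy$category[w], to = toy$category[w + 1])
print(changes)
```

Rows 3 and 4 of `toy` also differ in category, but they belong to different fishers, so that pair is not reported as a transition.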

Answer 2:

This should work for you, if I understood your problem correctly. df is the big dataset you've shown in your question -

library(data.table)
dt <- data.table(df)
# Change in quota to the following year; the NA pads each fisher's last year,
# because diff() returns one value fewer than the group has rows
dt[, qttychange := c(diff(qtty), NA), by = "fisher"]
# Rows with an NA qttychange are dropped by this filter. Filtering on qtty
# also keeps changes that stay within one category (e.g. 45 -> 50)
categorychanges <- dt[qttychange != 0]

# Next year's category, taken within each fisher
dt[, nextcategory := c(tail(category, -1), NA), by = "fisher"]
dt[qttychange == 0, nextcategory := NA]
categorytable <- dt[!is.na(nextcategory), list(freq = .N), by = c("category", "nextcategory")]

Output -

> categorychanges
    ID fisher year qtty category      cpue qttychange
 1:  3      1 2000   13        1 0.6871764          7
 2:  9      1 2006   20        2 0.7380952         25
 3: 20      2 2004   50        3 0.5857708        -40
 4: 25      2 2009   10        1 0.8615834        -10
 5: 30      3 2001   25        2 0.6808454        -15
 6: 34      3 2005   10        1 0.6181238         35
 7: 42      4 2005   45        3 0.9310958          5
 8: 44      4 2007   50        3 0.7867613        -30
 9: 46      4 2009   20        2 0.9368927        -10
10: 47      4 2010   10        1 0.8123333        -10
11: 55      5 2004   20        2 0.7469739        -20
12: 63      6 2006   55        3 0.6181238        -55
> categorytable
    category nextcategory freq
 1:        1            2    1
 2:        2            3    1
 3:        3            1    1
 4:        1            4    2
 5:        2            1    2
 6:        1            3    1
 7:        3            3    1
 8:        3            2    1
 9:        2            4    1
10:        3            4    1

Answer 3:

The output you provide is a bit inconsistent: there are some duplicate rows, and some category2 values in your expected output do not match your transition table.

Also, the last data frame, which defines category2, (i) uses 0, which you have not listed as an ownership category, and (ii) does not provide a category2 for the 1 to 3 transition. So I replaced 0 with 4 and added a category2 entry for the 1 to 3 transition.

I hope I've not misunderstood, but the result looks similar to what you expect:

library(zoo)

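newDF <- do.call(rbind, lapply(split(DF, DF$fisher), 
                   function(x) { 
                       # rows whose category differs from the fisher's next row
                       res <- x[which(diff(x$category) != 0),] ;
                       # consecutive pairs of distinct categories: the "from" and "to" levels
                       cbind(res, rollapply(unique(x$category), width = 2, c)) }))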

newDF$category2 <- unlist(apply(newDF[,c("1", "2")], 1, 
     function(x) trans$category2[grep(paste(x, collapse = " to "), 
            paste(trans$From, trans$to, sep = " to "))]), use.names = F)

newDF
#     ID fisher year qtty category      cpue 1 2 category2
#1.3   3      1 2000   13        1 0.6871764 1 2         1
#1.9   9      1 2006   20        2 0.7380952 2 3         2
#2.20 20      2 2004   50        3 0.5857708 3 1         5
#2.25 25      2 2009   10        1 0.8615834 1 4         6
#3.30 30      3 2001   25        2 0.6808454 2 1         3
#3.34 34      3 2005   10        1 0.6181238 1 3 not given
#4.44 44      4 2007   50        3 0.7867613 3 2         4
#4.46 46      4 2009   20        2 0.9368927 2 1         3
#4.47 47      4 2010   10        1 0.8123333 1 4         6
#5    55      5 2004   20        2 0.7469739 2 4         7
#6    63      6 2006   55        3 0.6181238 3 4         8

Columns 1 and 2 of newDF are the "from - to" transition.
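
One caveat with `unique(x$category)` in the code above: it drops repeated levels, so if a fisher ever returned to a category held earlier (say 1, 2, 1), the from-to pairs would no longer line up with the extracted rows. That does not happen in the data shown here. A possible base R variant that reads the next category directly from the following row is sketched below; `toy` is made-up data and `get_changes` is a hypothetical helper name.

```r
# Hypothetical toy data: fisher 1 goes 1 -> 2 -> 1, returning to level 1
toy <- data.frame(
  fisher   = c(1, 1, 1, 1, 2, 2),
  year     = c(2000, 2001, 2002, 2003, 2000, 2001),
  category = c(1, 2, 2, 1, 3, 4)
)

# Hypothetical helper: for one fisher, keep the row before each change
# and record the category of that row and of the following row
get_changes <- function(x) {
  i <- which(diff(x$category) != 0)
  cbind(x[i, ], From = x$category[i], to = x$category[i + 1])
}

res <- do.call(rbind, lapply(split(toy, toy$fisher), get_changes))
print(res)
```

The `From` and `to` columns produced this way can then be matched against a transition table in the same manner as columns 1 and 2 above.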

DF is your large data frame and trans is your last data frame with the transitions (as I changed it):

DF <- structure(list(ID = 1:64, fisher = c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), year = c(1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 1998L, 1999L, 2000L, 
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 
2011L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L), qtty = c(13L, 
13L, 13L, 20L, 20L, 20L, 20L, 20L, 20L, 45L, 45L, 45L, 45L, 45L, 
45L, 50L, 50L, 50L, 50L, 50L, 10L, 10L, 10L, 10L, 10L, 0L, 25L, 
25L, 25L, 25L, 10L, 10L, 10L, 10L, 45L, 45L, 45L, 45L, 45L, 45L, 
45L, 45L, 50L, 50L, 20L, 20L, 10L, 0L, 20L, 20L, 20L, 20L, 20L, 
20L, 20L, 0L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 0L), category = c(1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 4L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 4L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L), 
    cpue = c(0.5994452, 0.6176183, 0.6871764, 0.3228005, 0.6505336, 
    0.8615834, 0.6871764, 0.7469739, 0.7380952, 0.7516396, 0.6808454, 
    0.6734158, 0.70367, 0.5434572, 0.6181238, 0.5191856, 0.6098226, 
    1.0018519, 1.2049724, 0.5857708, 0.6744186, 0.8123333, 0.3228005, 
    0.6505336, 0.8615834, 0, 0.7469739, 0.7380952, 0.7516396, 
    0.6808454, 0.6734158, 0.70367, 0.5434572, 0.6181238, 0.4698849, 
    1.0714286, 1.242439, 1.0614261, 0.9761391, 1.0041898, 0.9429851, 
    0.9310958, 0.8932985, 0.7867613, 0.7994713, 0.9368927, 0.8123333, 
    0, 0.4698849, 1.0714286, 1.242439, 1.0614261, 0.9761391, 
    1.0041898, 0.7469739, 0.7380952, 0.7516396, 0.6808454, 0.6734158, 
    0.6505336, 0.8615834, 0.6871764, 0.6181238, 0)), .Names = c("ID", 
"fisher", "year", "qtty", "category", "cpue"), class = "data.frame", row.names = c(NA, 
-64L))

trans <- structure(list(From = c("1", "2", "2", "3", "3", "1", "2", "3", 
"1"), to = c("2", "3", "1", "2", "1", "4", "4", "4", "3"), category2 = c("1", 
"2", "3", "4", "5", "6", "7", "8", "not given")), .Names = c("From", 
"to", "category2"), row.names = c(NA, 9L), class = "data.frame")

Source: https://stackoverflow.com/questions/19743957/extracting-row-from-a-data-frame-according-a-criterion-based-if-values-through-r

Tags

function

extract