How to get multiple matrices from large data sets based on year

房东的猫 submitted on 2019-12-31 03:56:04

Question


Before I start, here is a small subset of the data I'm working with. I apologize in advance for it being so large (note this is only the first 20 rows of an extremely large dataset):

mydata <- structure(list(
  ParkName = c("SEP", "CSSP", "SEP", "ONF", "SEP", "ONF", "SEP", "CSSP", "ONF", "SEP",
               "CSSP", "PPRSP", "PPRSP", "SEP", "ONF", "PPRSP", "ONF", "SEP", "SEP", "ONF"),
  Year = c(2001, 2005, 1998, 2011, 1991, 1991, 1991, 1991, 1991, 1992,
           1992, 1992, 1992, 1992, 1992, 1992, 1992, 1993, 1994, 1994),
  LatinName = c("Mola mola", "Clarias batrachus", "Lithobates catesbeianus", "Rana catesbeiana", "Rana catesbeiana",
                "Rana yellowis", "Rana catesbeiana", "Solenopsis sp1", "Rana catesbeiana", "Rana catesbeiana",
                "Pratensis", "Rana catesbeiana", "Rana catesbeiana", "sp2", "Orchidaceae",
                "Rana catesbeiana", "Formica", "Rana catesbeiana", "Rana catesbeiana", "sp2"),
  NumTotal = c(1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100, 2, 1, 2)),
  .Names = c("ParkName", "Year", "LatinName", "NumTotal"),
  row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))

This dataset represents the abundance of different species in different parks over many years. What I want to do is build a species X park matrix for every year in which data was recorded, and then use the 'vegan' package to calculate diversity indices for each park for each year. The dataset is not balanced, as not every park recorded species abundance in every year.

I think I need loops for this: I would need a list of parks per year, and a list of species and their abundances per park per year, in order to create these matrices. I'm not the greatest when it comes to writing loops, and this task is confusing me. For example, I created a separate vector of the unique years in the dataset. I then created an empty list called "parkbyyear" to fill with the parks recorded in each year from the main dataframe:

year<-as.vector(unique(data[,3]))
parkbyyear<-NULL

for (i in 1:year) {
  parkbyyear[i]<- mydata[mydata$ParkName[year == "i"]
}

The loop fails to run. Any help would be appreciated.


Answer 1:


Use by() to slice a dataframe by the needed factor(s) and apply a function to each slice, here returning the ParkName vector for each year:

parkbyyear_list <- by(mydata, mydata$Year, FUN=function(df) df$ParkName)

parkbyyear_list
# mydata$Year: 1991
# [1] "SEP"  "ONF"  "SEP"  "CSSP" "ONF" 
# ---------------------------------------------------------------------------
# mydata$Year: 1992
# [1] "SEP"   "CSSP"  "PPRSP" "PPRSP" "SEP"   "ONF"   "PPRSP" "ONF"  
# --------------------------------------------------------------------------- 
# mydata$Year: 1993
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 1994
# [1] "SEP" "ONF"
# ---------------------------------------------------------------------------
# mydata$Year: 1998
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 2001
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 2005
# [1] "CSSP"
# ---------------------------------------------------------------------------
# mydata$Year: 2011
# [1] "ONF"
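
For comparison, here is a sketch of a loop that does what the question's attempt was aiming for. As written, that attempt never closes its subsetting bracket, iterates over 1:year instead of over the positions of the year vector, and compares against the literal string "i" rather than the current year. The small data frame below is a stand-in retyped from rows 5-9 and 18-20 of the question's sample, so that the block runs on its own; with the full data, only the loop itself would be needed.

```r
# Stand-in for mydata: rows 5-9 and 18-20 of the question's sample
mydata <- data.frame(
  ParkName = c("SEP", "ONF", "SEP", "CSSP", "ONF", "SEP", "SEP", "ONF"),
  Year     = c(1991, 1991, 1991, 1991, 1991, 1993, 1994, 1994),
  stringsAsFactors = FALSE
)

years <- sort(unique(mydata$Year))           # one entry per recorded year
parkbyyear <- vector("list", length(years))  # pre-allocated list, one slot per year
names(parkbyyear) <- years                   # names become "1991", "1993", "1994"

for (i in seq_along(years)) {
  # compare the Year column with the i-th year value, not with the string "i"
  parkbyyear[[i]] <- mydata$ParkName[mydata$Year == years[i]]
}

parkbyyear[["1991"]]
# [1] "SEP"  "ONF"  "SEP"  "CSSP" "ONF"
```

The by() call above gives the same grouping without the bookkeeping, which is why it is usually the more convenient choice.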

For a list of subsetted dataframes by Year, simply use split (or by again):

dfList <- split(mydata, mydata$Year)
# dfList <- by(mydata, mydata$Year, FUN=function(df) df)   # SIMILAR CALL

dfList

# $`1991`
#   ParkName Year        LatinName NumTotal
# 5      SEP 1991 Rana catesbeiana        2
# 6      ONF 1991    Rana yellowis        1
# 7      SEP 1991 Rana catesbeiana        1
# 8     CSSP 1991   Solenopsis sp1        1
# 9      ONF 1991 Rana catesbeiana        1

# $`1992`
#    ParkName Year        LatinName NumTotal
# 10      SEP 1992 Rana catesbeiana        1
# 11     CSSP 1992        Pratensis        1
# 12    PPRSP 1992 Rana catesbeiana        1
# 13    PPRSP 1992 Rana catesbeiana        1
# 14      SEP 1992              sp2        1
# 15      ONF 1992      Orchidaceae        1
# 16    PPRSP 1992 Rana catesbeiana        1
# 17      ONF 1992          Formica      100
# 
# $`1993`
#    ParkName Year        LatinName NumTotal
# 18      SEP 1993 Rana catesbeiana        2
# 
# $`1994`
#    ParkName Year        LatinName NumTotal
# 19      SEP 1994 Rana catesbeiana        1
# 20      ONF 1994              sp2        2
# 
# $`1998`
#   ParkName Year               LatinName NumTotal
# 3      SEP 1998 Lithobates catesbeianus        1
# 
# $`2001`
#   ParkName Year LatinName NumTotal
# 1      SEP 2001 Mola mola        1
# 
# $`2005`
#   ParkName Year         LatinName NumTotal
# 2     CSSP 2005 Clarias batrachus        1
# 
# $`2011`
#   ParkName Year        LatinName NumTotal
# 4      ONF 2011 Rana catesbeiana        1
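
From a list like dfList, one possible way to reach the per-year matrices the question asks for is to cross-tabulate each piece with xtabs. This is only a sketch under some assumptions: parks are placed in rows and species in columns (the layout vegan's diversity() expects by default), repeated park/species rows within a year are summed, and the names matList, shannon and divList are made up for illustration. The Shannon index is computed by hand here so the block runs without extra packages, and the data frame is a stand-in retyped from rows 5-9 and 19-20 of the question's sample.

```r
# Stand-in for mydata: rows 5-9 and 19-20 of the question's sample
mydata <- data.frame(
  ParkName  = c("SEP", "ONF", "SEP", "CSSP", "ONF", "SEP", "ONF"),
  Year      = c(1991, 1991, 1991, 1991, 1991, 1994, 1994),
  LatinName = c("Rana catesbeiana", "Rana yellowis", "Rana catesbeiana",
                "Solenopsis sp1", "Rana catesbeiana", "Rana catesbeiana", "sp2"),
  NumTotal  = c(2, 1, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

dfList <- split(mydata, mydata$Year)

# One park x species abundance matrix per year. xtabs sums NumTotal over
# duplicate park/species rows and fills absent combinations with 0.
matList <- lapply(dfList, function(df) {
  tab <- xtabs(NumTotal ~ ParkName + LatinName, data = df)
  matrix(tab, nrow = nrow(tab), dimnames = dimnames(tab))  # plain numeric matrix
})

# Hand-written Shannon index per park (row): H = -sum(p * log(p))
shannon <- function(m) {
  apply(m, 1, function(x) {
    p <- x[x > 0] / sum(x)
    -sum(p * log(p))
  })
}

divList <- lapply(matList, shannon)

matList[["1991"]]["SEP", "Rana catesbeiana"]  # 3: the two SEP rows (2 + 1) are summed
divList[["1991"]]["ONF"]                      # log(2): two species with one individual each
```

If vegan is installed, replacing shannon(m) with vegan::diversity(m, index = "shannon") should give the same values, and other indices can be requested through the index argument.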


Source: https://stackoverflow.com/questions/47124154/how-to-get-multiple-matrices-from-large-data-sets-based-on-year
