Question
Before I start, here is a small subset of the data I'm working with. I apologize in advance for it being so large (note that this is only the first 20 rows of an extremely large dataset):
mydata<-structure(list(ParkName = c("SEP", "CSSP",
"SEP", "ONF", "SEP",
"ONF", "SEP",
"CSSP", "ONF",
"SEP", "CSSP",
"PPRSP", "PPRSP",
"SEP", "ONF",
"PPRSP", "ONF",
"SEP", "SEP",
"ONF"),
Year = c(2001, 2005, 1998,2011, 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992, 1992,
1992, 1992, 1992, 1993, 1994, 1994),
LatinName = c("Mola mola", "Clarias batrachus", "Lithobates catesbeianus", "Rana catesbeiana", "Rana catesbeiana",
"Rana yellowis", "Rana catesbeiana", "Solenopsis sp1","Rana catesbeiana", "Rana catesbeiana",
"Pratensis", "Rana catesbeiana", "Rana catesbeiana", "sp2", "Orchidaceae",
"Rana catesbeiana","Formica", "Rana catesbeiana", "Rana catesbeiana", "sp2"),
NumTotal = c(1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100, 2, 1, 2)), .Names = c("ParkName", "Year", "LatinName",
"NumTotal"),
row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
This dataset represents the abundance of different species in different parks over many years. What I want is a species x park matrix for every year in which data was recorded, and then to use the 'vegan' package to calculate diversity indices for each park in each year. The dataset is not balanced, since not every park recorded species abundance in every year.
I think I need loops for this: I would need a list of parks per year, and a list of species and their abundances per park per year, in order to build these matrices. I'm not great with loops and this task is confusing me. For example, I created a vector of the unique years in the dataset, then an empty list called "parkbyyear" to fill with the parks for each year from the main data frame:
year<-as.vector(unique(data[,3]))
parkbyyear<-NULL
for (i in 1:year) {
parkbyyear[i]<- mydata[mydata$ParkName[year == "i"]
}
The loop fails to run. Any help would be appreciated.
Answer 1:
Simply use by() to slice the data frame by the needed factor(s) and run an operation on each slice, such as returning a vector:
parkbyyear_list <- by(mydata, mydata$Year, FUN=function(df) df$ParkName)
parkbyyear_list
# mydata$Year: 1991
# [1] "SEP" "ONF" "SEP" "CSSP" "ONF"
# ---------------------------------------------------------------------------
# mydata$Year: 1992
# [1] "SEP" "CSSP" "PPRSP" "PPRSP" "SEP" "ONF" "PPRSP" "ONF"
# ---------------------------------------------------------------------------
# mydata$Year: 1993
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 1994
# [1] "SEP" "ONF"
# ---------------------------------------------------------------------------
# mydata$Year: 1998
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 2001
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 2005
# [1] "CSSP"
# ---------------------------------------------------------------------------
# mydata$Year: 2011
# [1] "ONF"
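For comparison, the loop from the question can also be repaired. The following is only a sketch, assuming a small hypothetical data frame `toy` (a few rows copied from the question) stands in for mydata. The original most likely fails for three reasons: the closing bracket is missing, `1:year` uses only the first element of the year vector, and `"i"` is a literal string rather than the loop index.

```r
# Hypothetical stand-in for mydata: rows 5-9 and 19-20 of the question's data.
toy <- data.frame(
  ParkName  = c("SEP", "ONF", "SEP", "CSSP", "ONF", "SEP", "ONF"),
  Year      = c(1991, 1991, 1991, 1991, 1991, 1994, 1994),
  LatinName = c("Rana catesbeiana", "Rana yellowis", "Rana catesbeiana",
                "Solenopsis sp1", "Rana catesbeiana", "Rana catesbeiana", "sp2"),
  NumTotal  = c(2, 1, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# One list element per distinct year, named by the year.
years <- sort(unique(toy$Year))
parkbyyear <- vector("list", length(years))
names(parkbyyear) <- as.character(years)

# Iterate over positions in `years`, not over 1:years.
for (i in seq_along(years)) {
  # [[i]] assigns a whole vector to one list element.
  parkbyyear[[i]] <- toy$ParkName[toy$Year == years[i]]
}

parkbyyear
```

Like the by() call, this keeps duplicate park names; wrapping the right-hand side in unique() would give each park once per year.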
For a list of subsetted data frames by Year, simply use split() (or by() again):
dfList <- split(mydata, mydata$Year)
# dfList <- by(mydata, mydata$Year, FUN=function(df) df) # SIMILAR CALL
dfList
# $`1991`
# ParkName Year LatinName NumTotal
# 5 SEP 1991 Rana catesbeiana 2
# 6 ONF 1991 Rana yellowis 1
# 7 SEP 1991 Rana catesbeiana 1
# 8 CSSP 1991 Solenopsis sp1 1
# 9 ONF 1991 Rana catesbeiana 1
#
# $`1992`
# ParkName Year LatinName NumTotal
# 10 SEP 1992 Rana catesbeiana 1
# 11 CSSP 1992 Pratensis 1
# 12 PPRSP 1992 Rana catesbeiana 1
# 13 PPRSP 1992 Rana catesbeiana 1
# 14 SEP 1992 sp2 1
# 15 ONF 1992 Orchidaceae 1
# 16 PPRSP 1992 Rana catesbeiana 1
# 17 ONF 1992 Formica 100
#
# $`1993`
# ParkName Year LatinName NumTotal
# 18 SEP 1993 Rana catesbeiana 2
#
# $`1994`
# ParkName Year LatinName NumTotal
# 19 SEP 1994 Rana catesbeiana 1
# 20 ONF 1994 sp2 2
#
# $`1998`
# ParkName Year LatinName NumTotal
# 3 SEP 1998 Lithobates catesbeianus 1
#
# $`2001`
# ParkName Year LatinName NumTotal
# 1 SEP 2001 Mola mola 1
#
# $`2005`
# ParkName Year LatinName NumTotal
# 2 CSSP 2005 Clarias batrachus 1
#
# $`2011`
# ParkName Year LatinName NumTotal
# 4 ONF 2011 Rana catesbeiana 1
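From such a per-year list, one possible way to get to the matrices the question asks for is to cross-tabulate each piece with xtabs(). The sketch below is not part of the original answer. It assumes a hypothetical data frame `toy` (a few rows copied from the question) in place of mydata, puts parks in rows and species in columns, and uses a hand-written helper `shannon` (also hypothetical) so that it runs without extra packages.

```r
# Hypothetical stand-in for mydata: rows 5-9 and 19-20 of the question's data.
toy <- data.frame(
  ParkName  = c("SEP", "ONF", "SEP", "CSSP", "ONF", "SEP", "ONF"),
  Year      = c(1991, 1991, 1991, 1991, 1991, 1994, 1994),
  LatinName = c("Rana catesbeiana", "Rana yellowis", "Rana catesbeiana",
                "Solenopsis sp1", "Rana catesbeiana", "Rana catesbeiana", "sp2"),
  NumTotal  = c(2, 1, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# One park x species abundance matrix per year; xtabs() sums NumTotal
# over repeated park/species rows and fills absent combinations with 0.
mats <- lapply(split(toy, toy$Year), function(df) {
  tab <- xtabs(NumTotal ~ ParkName + LatinName, data = df)
  matrix(as.vector(tab), nrow = nrow(tab), dimnames = dimnames(tab))
})

# Shannon index per park (row), natural log, written out by hand.
# If vegan is installed, vegan::diversity(m, index = "shannon") is the
# packaged equivalent and is expected to give the same values.
shannon <- function(m) {
  apply(m, 1, function(x) {
    p <- x[x > 0] / sum(x)
    -sum(p * log(p))
  })
}

div <- lapply(mats, shannon)
mats[["1991"]]
div[["1991"]]
```

Because each year is tabulated separately, a year's matrix only contains the parks and species recorded in that year, which sidesteps the unbalanced design.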
Source: https://stackoverflow.com/questions/47124154/how-to-get-multiple-matrices-from-large-data-sets-based-on-year