Restructuring a data frame for 3D plots in R

Question

I realize that 3D plots are often not the most effective way to present a dataset, but previous 2D plots I've made for a particular dataset suggest that a 3D plot would help break the information into more distinct clusters for analysis. That said, I've never done this in R, and I'm having trouble restructuring my data frame before making a 3D scatterplot with plot3d().

At the moment, my data frame has 2 columns and a few thousand rows. Column 1 is an identifier (A, B, C, ...) and column 2 is one measured feature for that identifier.

ID Area 
A   1.2
A   3.0
A   2.7
B   1.4
B   2.5
C   4.3
C   2.1
C   1.7

I will plot the area on the y axis. Using a function like table(), I can get the number of times A, B, or C occurs (A=3, B=2, C=3), and this count becomes the x coordinate for every row with that ID. What I would also like is a third column that assigns a unique z to each ID sharing a given x. In other words, Z should count how many IDs with a given X have appeared so far, increasing by 1 for each new ID with that X. The goal is for the area values (y) of all objects within a particular ID to be stacked above each other over a unique (x, z) coordinate. This is where I am stuck. Given the input above, I would want the final data frame to look like this:

ID(x) Area(y)  Z
    3    1.2   1
    3    3.0   1
    3    2.7   1
    2    1.4   1
    2    2.5   1
    3    4.3   2
    3    2.1   2
    3    1.7   2

Answer 1:

We could do this in several ways.

1. base R - aggregate/ave

We can use aggregate to get the number of rows ('IDx') for each 'ID', create the 'Z' column in the resulting dataset ('dfN') by numbering the IDs that share the same 'IDx', and then merge 'dfN' with the original dataset 'df1'.

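# count the rows per ID; 'IDx' becomes the x coordinate
dfN <- aggregate(cbind(IDx=seq_along(ID))~ID, df1, FUN=length)
# number the IDs that share the same count: 1, 2, ...
dfN$Z <- with(dfN, ave(IDx, IDx, FUN=function(x) cumsum(duplicated(x))+1L))
# attach IDx and Z to every row and drop the 'ID' column
merge(df1, dfN, by='ID')[-1]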
 #  Area IDx Z
 #1  1.2   3 1
 #2  3.0   3 1
 #3  2.7   3 1
 #4  1.4   2 1
 #5  2.5   2 1
 #6  4.3   3 2
 #7  2.1   3 2
 #8  1.7   3 2

2. base R - ave/rle

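We can create the 'IDx' column with ave and then use rle/inverse.rle to create the 'Z' column. Because rle works on runs, this relies on the rows of each ID being adjacent, as they are in the example.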

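# rows per ID, repeated on every row of that ID
df1$IDx <- with(df1, ave(seq_along(ID), ID, FUN=length))
# one run per ID; replace each run's value with its rank among runs of equal length
v1 <- with(df1, paste0(ID, IDx))
df1$Z <- inverse.rle(within.list(rle(v1), values <- ave(lengths,
             lengths, FUN=function(x) cumsum(duplicated(x))+1L)))
df1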
 #  ID Area IDx Z
 #1  A  1.2   3 1
 #2  A  3.0   3 1
 #3  A  2.7   3 1
 #4  B  1.4   2 1
 #5  B  2.5   2 1
 #6  C  4.3   3 2
 #7  C  2.1   3 2
 #8  C  1.7   3 2

3. data.table

Convert the 'data.frame' to a 'data.table' (setDT) and create 'IDx', i.e. the number of rows (.N), grouped by 'ID'. Based on the duplicated elements in 'IDx', create the 'Z' column. Set the key to 'ID' (setkey), join with 'df1', and drop the now-unneeded 'ID' column by assigning it to NULL (ID := NULL).

library(data.table)
setkey(setDT(df1)[, list(IDx=.N), by = ID][, IDx1:= IDx][,
     list(ID,Z=cumsum(duplicated(IDx1))+1L) , IDx], ID)[df1][, ID := NULL][]

#   IDx Z Area
#1:   3 1  1.2
#2:   3 1  3.0
#3:   3 1  2.7
#4:   2 1  1.4
#5:   2 1  2.5
#6:   3 2  4.3
#7:   3 2  2.1
#8:   3 2  1.7

4. dplyr

The idea is similar to the above. Instead of merge, we use left_join.

library(dplyr)
# per-ID counts, then number the IDs within each distinct count
counts <- df1 %>%
  group_by(ID) %>%
  summarise(IDx = n()) %>%
  group_by(IDx) %>%
  mutate(Z = cumsum(duplicated(IDx)) + 1L)
left_join(df1, counts, by = 'ID') %>%
  select(-ID)
 #  Area IDx Z
 #1  1.2   3 1
 #2  3.0   3 1
 #3  2.7   3 1
 #4  1.4   2 1
 #5  2.5   2 1
 #6  4.3   3 2
 #7  2.1   3 2
 #8  1.7   3 2
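
With 'IDx', 'Area' and 'Z' in place, the plotting step from the question could look roughly like the sketch below. It assumes the rgl package (which provides plot3d) is installed. 'res' is a hypothetical name for the result shown above, typed in by hand so that the block runs on its own. The axis labels are illustrative.

```r
# 'res' (hypothetical name) reproduces the output shown above
res <- data.frame(
  Area = c(1.2, 3.0, 2.7, 1.4, 2.5, 4.3, 2.1, 1.7),
  IDx  = c(3L, 3L, 3L, 2L, 2L, 3L, 3L, 3L),
  Z    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L)
)

# only open a 3D device in an interactive session with rgl available
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(x = res$IDx, y = res$Area, z = res$Z,
              xlab = "IDx", ylab = "Area", zlab = "Z")
}
```

Each ID then sits on its own (x, z) position, with its area values stacked along y.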

NOTE: These approaches were also tested with a second dataset, 'df2' (see the data section).
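
As a quick illustration of that check, here is a minimal sketch that wraps approach 1 in a function and applies it to 'df2'. The helper name add_xz is hypothetical, and seq_along is swapped in for cumsum(duplicated(x))+1L, which gives the same numbering within each group. 'df2' is repeated from the data section so that the block is self-contained.

```r
# df2 repeated from the data section so this block runs on its own
df2 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C", "C", "C",
         "D", "D", "D", "E", "E", "F"),
  Area = c(1.2, 3, 2.7, 1.4, 2.5, 4.3, 2.1, 1.7,
           1.2, 1.4, 2.1, 1.2, 1.5, 2.3),
  stringsAsFactors = FALSE
)

# add_xz() is a hypothetical helper name wrapping approach 1
add_xz <- function(df) {
  # one row per ID, IDx = number of rows for that ID
  counts <- aggregate(cbind(IDx = seq_along(ID)) ~ ID, df, FUN = length)
  # within each distinct count, number the IDs 1, 2, 3, ...
  counts$Z <- with(counts, ave(IDx, IDx, FUN = seq_along))
  merge(df, counts, by = "ID")
}

res2 <- add_xz(df2)
# one line per ID: A, C and D share IDx = 3, B and E share IDx = 2
unique(res2[c("ID", "IDx", "Z")])
# Z by ID: A=1, B=1, C=2, D=3, E=2, F=1
```

With three IDs sharing a count of 3, 'Z' runs up to 3 for that x value, so no two IDs land on the same (x, z) position.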

data

df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C", "C", "C"), 
Area = c(1.2, 3, 2.7, 1.4, 2.5, 4.3, 2.1, 1.7)), .Names = c("ID", 
"Area"), class = "data.frame", row.names = c(NA, -8L))

df2 <-  structure(list(ID = c("A", "A", "A", "B", "B", "C", "C", "C", 
"D", "D", "D", "E", "E", "F"), Area = c(1.2, 3, 2.7, 1.4, 2.5, 
4.3, 2.1, 1.7, 1.2, 1.4, 2.1, 1.2, 1.5, 2.3)), .Names = c("ID", 
"Area"), class = "data.frame", row.names = c(NA, -14L))

Source: https://stackoverflow.com/questions/29337599/restructuring-a-data-frame-for-3d-plots-in-r

Tags

dataframe

scatter-plot