Colouring a PCA plot by clusters in R

Question

I have some biological data that looks like this, with 2 different types of clusters (A and B):

                Cluster_ID       A1      A2      A3       B1       B2      B3
 5  chr5:100947454..100947489,+   3.31322  7.52365  3.67255  21.15730  8.732710 17.42640
12 chr5:101227760..101227782,+   1.48223  3.76182  5.11534  15.71680  4.426170 13.43560
29 chr5:102236093..102236457,+  15.60700 10.38260 12.46040   6.85094 15.551400  7.18341

I clean up the data:

CAGE <- read.table("CAGE_expression_matrix.txt", header = TRUE)
CAGE_data <- as.data.frame(CAGE)

# Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[, 2:7])) > 0, ]

# Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[rowSums(CAGE_filter[, 2:7] >= 5) >= 3, ]
CAGE_data <- as.data.frame(CAGE_filter_more)

The data size is reduced from 6981 clusters to 599 after this.

I then go on to apply PCA:

#Get data dimensions

dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE) 
summary(PCA.CAGE)

I want to create a PCA plot of the data, marking each sample and colouring the samples by their type (A or B). So the plot should use two colours, with a text label for each sample.

This is what I have tried, all of which produce errors:

qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))

ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()

qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))

The errors are as follows:

  > qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")

  Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]

  > qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data,    data=as.data.frame(PCA.CAGE$x))

  Don't know how to automatically pick scale for object of type data.frame.   Defaulting to continuous
  Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
  Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data

 > ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more,      label=CAGE_filter_more)) + geom_point() + geom_text()

 Error: ggplot2 doesn't know how to deal with data of class

Answer 1:

Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you get 6 principal components, each of which is a (different) linear combination of all 6 variables. Clustering, on the other hand, is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), you need to do that explicitly. Here's an example using the built-in iris dataset.

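library(ggbiplot)

set.seed(1)  # arbitrary seed, added so the k-means result is reproducible
pca   <- prcomp(iris[, 1:4], scale. = TRUE)         # PCA on the four numeric columns
clust <- kmeans(pca$x[, 1:2], centers = 3)$cluster  # cluster the cases on PC1 and PC2
ggbiplot(pca, groups = factor(clust)) + xlim(-3, 3)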

Here we run PCA on the first 4 columns of iris. pca$x is then a matrix containing the principal component scores in its columns, one row per case. We run k-means clustering on the first 2 PCs and extract the cluster numbers into clust. Finally, we use ggbiplot(...) to make the plot, colouring the points by cluster.
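
If ggbiplot is not installed, roughly the same picture can be drawn with ggplot2 alone. The following is a minimal sketch of the steps described above, not a drop-in replacement for ggbiplot (it omits the variable arrows). The seed value, the scores data frame, and the legend title are assumptions introduced for illustration.

```r
library(ggplot2)

set.seed(42)  # k-means starts from random centres; this seed is an arbitrary choice

# PCA on the four numeric iris columns, scaled to unit variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# 4 input variables give 4 PCs; pca$x holds one row of scores per case
dim(pca$rotation)  # loadings: variables x PCs
dim(pca$x)         # scores: cases x PCs

# Cluster the cases (rows) on their PC1/PC2 scores
clust <- kmeans(pca$x[, 1:2], centers = 3)$cluster

# Collect the scores and cluster labels in one data frame (hypothetical name)
scores <- data.frame(pca$x[, 1:2], cluster = factor(clust))

# Scatter plot of PC1 against PC2, coloured by k-means cluster
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point() +
  labs(colour = "k-means cluster")
print(p)
```

Because colour is mapped to a factor column of the same data frame that holds the scores, ggplot2 picks a discrete colour scale and the aesthetic lengths match the data.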

Source: https://stackoverflow.com/questions/32259976/colouring-a-pca-plot-by-clusters-in-r

Tags

pca