Question
I am new to R and to clustering. I am using a shopping dataset and want to extract features from it in order to identify something meaningful.
So far I have learned how to merge files, remove NA values, compute the sum of squared errors, work out mean values, summarise by group, run k-means clustering, and plot the results on X and Y axes.
However, I am confused about how to read these results and how to identify a useful cluster. Am I repeating something or missing something? I also get confused when choosing the X and Y variables to plot.
My code is below; it may well be wrong. Any help would be great.
# Read file
mydata = read.csv(file.choose(), TRUE)
#view the file
View(mydata)
#create new data set
mydata.features = mydata
mydata.features <- na.omit(mydata.features)
wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3)
# get cluster means
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)
results <- kmeans(mydata.features, 3)
plot(mydata[c("DAY","WEEK_NO")], col= results$cluster)
Sample data variables: below are all the variables in my dataset, a shopping dataset collected over 2 years.
- PRODUCT_ID: uniquely identifies each product
- household_key: uniquely identifies each household
- BASKET_ID: uniquely identifies a purchase occasion
- DAY: day when the transaction occurred
- QUANTITY: number of products purchased during the trip
- SALES_VALUE: amount of dollars the retailer receives from the sale
- STORE_ID: identifies unique stores
- RETAIL_DISC: discount applied due to a manufacturer coupon
- TRANS_TIME: time of day when the transaction occurred
- WEEK_NO: week in which the transaction occurred (1-102)
- MANUFACTURER: code that links products from the same manufacturer together
- DEPARTMENT: groups similar products together
- BRAND: indicates private or national label brand
- COMMODITY_DESC: groups similar products together at the lower level
- SUB_COMMODITY_DESC: groups similar products together at the lowest level
Answer 1:
Sample Data
I put together some sample data, so I can help you better:
#generate sample data
sampledata <- matrix(data=rnorm(200,0,1),50,4)
#add ID to data
sampledata <-cbind(sampledata, 1:50)
#show data:
head(sampledata)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.72859559 -2.2864943 -0.5408501 0.1564730 1
[2,] 0.34852943 0.3100891 0.6007349 -0.5985266 2
[3,] -0.04605026 0.5067896 -0.2911211 -1.1617171 3
[4,] -1.88358617 1.3739440 -0.5655383 0.9518367 4
[5,] 0.35528650 -1.7482304 -0.3871520 -0.7837712 5
[6,] 0.38057682 0.1465488 -0.6006462 1.3827544 6
I have a matrix of data points. Each data point has 4 variables (columns 1 to 4) and an ID (column 5).
Apply K-means
After that I apply the kmeans function, but only to columns 1:4, since it doesn't make much sense to cluster on the ID:
#kmeans (4 centers)
result <- kmeans(sampledata[,1:4], 4)
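k-means starts from randomly chosen centers, so repeated runs can number the clusters differently or even group the points differently. Here is a minimal sketch, assuming you want runs you can repeat: fix the random seed and let kmeans try several random starts through its nstart argument. The seed value 42 and nstart = 25 are arbitrary choices for illustration, not part of the answer above.

```r
# Reproducible sample data with the same shape as above (the seed is arbitrary)
set.seed(42)
sampledata <- matrix(data = rnorm(200, 0, 1), 50, 4)
sampledata <- cbind(sampledata, 1:50)

# nstart = 25 runs k-means from 25 random starts and keeps the best solution
result <- kmeans(sampledata[, 1:4], centers = 4, nstart = 25)

# One cluster label per data point
length(result$cluster)
```

With the seed fixed, rerunning the whole block gives the same cluster assignment each time.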
Analyse output
If I want to see which cluster each data point belongs to, I can type:
result$cluster
The result will be for example:
[1] 4 3 2 2 1 2 4 4 3 3 3 3 2 1 4 4 4 2 4 4 4 1 1 1 3 3 3 3 1 3 2 2 4 4 2 4 2 3 1 2 2 2 1 2 1 1 4 1 1 1
This means that data point 1 belongs to cluster 4, data point 2 belongs to cluster 3, and so on. If I want to retrieve all data points that are in cluster 1, I can do the following:
sampledata[result$cluster==1,]
This outputs a matrix with all the values and the data point ID in the last column:
[,1] [,2] [,3] [,4] [,5]
[1,] 0.3552865 -1.748230422 -0.3871520 -0.78377121 5
[2,] 0.5806156 0.479576142 1.1314052 1.60730796 14
[3,] 1.1871472 1.280881477 -1.7227361 -0.89045074 22
[4,] 0.8482060 0.726470349 0.6851352 -0.78526581 23
[5,] -0.5324139 -1.745802580 0.6779943 0.99915708 24
[6,] 0.2472263 -0.006298136 -0.1457003 -0.44789364 29
[7,] 0.1412812 -0.247076976 0.9181507 -0.58570904 39
[8,] 0.1859786 -1.768692166 0.5681229 -0.80618157 43
[9,] -1.1577178 -0.179886998 1.5183880 0.40014071 45
[10,] 1.0667566 -1.602875994 0.6010581 -0.49514049 46
[11,] 0.2464646 1.226129859 -1.3628096 -0.37666716 48
[12,] 1.2660358 0.282688323 0.7650636 0.23442255 49
[13,] -0.2499337 0.855327072 0.2290221 0.03492807 50
If I want to know how many data points are in cluster 1, I can type:
sum(result$cluster==1)
This returns 13, which matches the number of rows in the matrix above.
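To count the members of every cluster at once instead of one cluster at a time, a short sketch along the same lines is shown here. It regenerates the sample data with an arbitrary seed, so the counts will differ from the 13 above.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
sampledata <- cbind(matrix(rnorm(200, 0, 1), 50, 4), 1:50)
result <- kmeans(sampledata[, 1:4], 4)

# Count the members of every cluster in one call
counts <- table(result$cluster)
counts

# kmeans stores the same counts in its size component
result$size
```

Very small clusters are worth a second look, since they may be outliers rather than a meaningful group.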
Finally, some plotting:
First, let's plot the data. Your data has more than two dimensions, but a standard plot shows only two, so select the two variables you want to plot, for example variables 2 and 3 (columns 2 and 3). This corresponds to:
sampledata[,2:3]
To plot this data, simply write:
plot(sampledata[,2:3], col=result$cluster ,main="Affiliation of observations")
Use the argument col (short for colors) to color each data point according to its cluster: col=result$cluster.
If you also want to see the cluster centers in the plot, add the following line:
points(result$centers[,2:3], col=1:4, pch="x", cex=3)
The plot (variable 2 vs. variable 3) then shows the data points as dots and the cluster centers as x marks.
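The cluster centers marked in the plot are the per-cluster means of each variable, and comparing them is one way to describe what sets a cluster apart. Below is a small sketch, assuming the same kind of sample data (regenerated with an arbitrary seed); the name cluster_means is only illustrative.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
sampledata <- cbind(matrix(rnorm(200, 0, 1), 50, 4), 1:50)
result <- kmeans(sampledata[, 1:4], 4)

# Mean of each variable within each cluster
cluster_means <- aggregate(as.data.frame(sampledata[, 1:4]),
                           by = list(cluster = result$cluster),
                           FUN = mean)
cluster_means

# These means are the cluster centers that kmeans reports
result$centers
```

A variable whose mean differs strongly between clusters is one that helps separate them. On random data like this the differences carry no meaning, but on real data they can hint at what each cluster represents.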

Answer 2:
I am not really familiar with the kmeans function, and it's hard to help without any sample data. However, here is something that might help:
kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with at least the following components:
- cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
- centers: A matrix of cluster centres.
- totss: The total sum of squares.
- withinss: Vector of within-cluster sum of squares, one component per cluster.
- tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
- size: The number of points in each cluster.
- iter: The number of (outer) iterations.
- ifault: integer: indicator of a possible algorithm problem – for experts.
See ?kmeans for more details.
You can access these components with the $ operator. For example, to have a look at the clusters:
results$cluster
Or have more details about the centers:
results$centers
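The sum-of-squares components listed above are related to each other, and checking those relations is a quick way to get familiar with the object. Here is a minimal sketch on made-up random data; the matrix x and the seed are assumptions for illustration only.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
x <- matrix(rnorm(200), 50, 4)
results <- kmeans(x, 3)

# Total within-cluster sum of squares is the sum of the per-cluster values
all.equal(results$tot.withinss, sum(results$withinss))

# Between-cluster sum of squares is what remains of the total
all.equal(results$betweenss, results$totss - results$tot.withinss)

# Share of the total variation that lies between the clusters
results$betweenss / results$totss
```

The last ratio lies between 0 and 1. Values closer to 1 mean the clusters are more compact relative to the overall spread, although the ratio always grows as you add clusters, so it is best used to compare solutions with the same number of clusters.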
Source: https://stackoverflow.com/questions/28572746/kmeans-clustering-identifying-knowledge-in-r