Question
I have been running Louvain community detection in R using igraph, with thanks to this answer to my previous query. However, I found that the cluster_louvain method did something strange when assigning group membership, which I think was caused by an error in how I imported my data. Although I think I have resolved this, I would like to understand what the problem was.
I ran Louvain clustering on a 400x400 correlation matrix (i.e. correlation scores for 400 individuals). When I initially imported my data, my correlation matrix had the same individuals' ID numbers (i.e. vertex numbers) as both the row and column headings, as below:
1 2 3 4 ... 400
1 0 0.8 0.7 0.1
2 0.8 0 0.6 0.3
3 0.7 0.6 0 0.9
4 0.1 0.3 0.9 0
...
400
This correlation matrix was saved in a "Correlations.csv" file, which I imported using read.csv. I then used the code below to convert it to a distance matrix, set entries whose correlation was below a threshold (0.33) to zero, turn it into an adjacency matrix for igraph, and run cluster_louvain (this code is also provided in the answer here):
library(igraph)  # graph.adjacency, cluster_louvain
library(psych)   # cor2dist
correlationmatrix <- read.csv("Correlations.csv", header = TRUE,
                              row.names = 1, check.names = FALSE)
distancematrix <- cor2dist(correlationmatrix)
DM2<- as.matrix(distancematrix)
DM2[correlationmatrix < 0.33] = 0
G2 <- graph.adjacency(DM2, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain <- cluster_louvain(G2)
sizes(clusterlouvain)
Community sizes
1 2
200 200
I then wanted to get the cluster number beside each ID number, to know which individual belonged to each community, so I used IDs_cluster <- cbind(V(G2)$name, clusterlouvain$membership). This gave the list of vertex IDs, but the membership beside them was listed as '1 2 1 2 1 2 1 2', which was obviously not right (we would not expect every alternate individual in the dataset to be assigned to a different community):
ID Membership
1 1
2 2
3 1
4 2
5 1
6 2
…
400 2
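For reference, the ID/membership pairing can be sketched on a toy graph as below. This is only an illustration under assumptions: the 4x4 matrix `w`, its values, and the names `g`, `cl` and `ids_cluster` are made up and are not my real data. It uses `data.frame` rather than `cbind`, because `cbind` on a character vector of names and a numeric membership vector coerces everything to character.

```r
library(igraph)

# Hypothetical 4x4 symmetric weight matrix with the same ID labels on rows and columns
ids <- c("1", "2", "3", "4")
w <- matrix(c(0,   0.8, 0,   0,
              0.8, 0,   0,   0,
              0,   0,   0,   0.9,
              0,   0,   0.9, 0),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Build an undirected weighted graph; vertex names come from the matrix dimnames
g <- graph_from_adjacency_matrix(w, mode = "undirected", weighted = TRUE)
cl <- cluster_louvain(g)

# One row per vertex: its ID and the community it was assigned to
ids_cluster <- data.frame(ID = V(g)$name,
                          Membership = as.integer(membership(cl)))
print(ids_cluster)
```

In this toy case the only edges are 1-2 and 3-4, so vertices 1 and 2 should share one community and vertices 3 and 4 another.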
From looking at other datasets, I realised the problem might have been that the headings in my correlation matrix were numerical. So I changed the correlation matrix so that the row headings were still the ID numbers, but the column headings were 'V1-V400':
V1 V2 V3 V4 ... V400
1 0 0.8 0.7 0.1
2 0.8 0 0.6 0.3
3 0.7 0.6 0 0.9
4 0.1 0.3 0.9 0
...
400
I imported this as a .csv file and re-ran ‘cluster_louvain’, as below:
correlationmatrix_V <- read.csv("Correlations_withV.csv", header = TRUE,
                                row.names = 1, check.names = FALSE)
distancematrix_V <- cor2dist(correlationmatrix_V)
DM2_V <- as.matrix(distancematrix_V)
DM2_V[correlationmatrix_V < 0.33] = 0
G2_V <- graph.adjacency(DM2_V, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain_V <- cluster_louvain(G2_V)
Now when I reran cluster_louvain, it generated a more sensible result of three clusters, with group membership to each cluster looking more like what we would expect:
sizes(clusterlouvain_V)
Community sizes
1 2 3
168 52 180
IDs_cluster <- cbind(V(G2_V)$name, clusterlouvain_V$membership)
View(IDs_cluster)
ID Membership
1 1
2 1
3 3
4 2
5 2
6 2
…
400 1
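One way to compare the two assignments side by side is a cross-tabulation. The sketch below uses only the first six memberships listed in the two tables in this question; the names `run_numeric`, `run_v` and `comparison` are made up for illustration.

```r
# First six memberships from each run, as listed in the tables in this question
run_numeric <- c(1, 2, 1, 2, 1, 2)   # numeric column headings
run_v       <- c(1, 1, 3, 2, 2, 2)   # V1-V400 column headings

# Rows: community in the first run; columns: community in the second run
comparison <- table(numeric_headings = run_numeric, V_headings = run_v)
print(comparison)
```

On the full data the same call would presumably be table(clusterlouvain$membership, clusterlouvain_V$membership), assuming both graphs hold the vertices in the same order.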
My question is: can anyone clarify what happened when the row and column headings were the same, such that group membership alternated between individuals (i.e. '1 2 1 2' down the ID list, as in the first example), and why this was resolved by changing the column headings to a non-numerical format (as in the second example)?
This may be a simple mistake: perhaps I did not use the correct read.csv settings when importing the .csv of the correlation matrix, given that my column headings were also numerical.
However, I would like to understand why this led cluster_louvain to assign group membership in the way it did. I am posting this in case it is useful to anyone who makes the same mistake. Any insights would be welcome, and thank you for any advice!
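In case it helps anyone reproduce this, here is a minimal sketch of checks that could be run on the imported object before building the graph. It is not a diagnosis of the problem: the three-row file is a made-up stand-in for "Correlations.csv", and `tmp`, `cm` and `m` are illustrative names.

```r
# Write a tiny stand-in for "Correlations.csv" (hypothetical values, numeric headings)
tmp <- tempfile(fileext = ".csv")
writeLines(c(",1,2,3",
             "1,0,0.8,0.7",
             "2,0.8,0,0.6",
             "3,0.7,0.6,0"), tmp)

cm <- read.csv(tmp, header = TRUE, row.names = 1, check.names = FALSE)

class(cm)                              # read.csv returns a data.frame, not a matrix
identical(rownames(cm), colnames(cm))  # do the row and column labels line up?

m <- as.matrix(cm)                     # explicit numeric matrix for the later steps
dim(m)                                 # expected to be square
isSymmetric(unname(m))                 # a correlation matrix should be symmetric
```

Running the same checks on both "Correlations.csv" and "Correlations_withV.csv" would show whether the two imports differ in anything other than the column labels.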
Source: https://stackoverflow.com/questions/49856205/louvain-community-detection-in-r-using-igraph-assigns-alternating-group-member