Question
This is a follow-up to a problem I've been working on for a while, and I have two questions. The first concerns an algorithm that works on two tweets, which I revised to handle 10 tweets. I'm wondering what my revision is actually measuring. I get a result, but I want it to measure the Jaccard distances between several tweets, not just return one value. Since it returns a single value, I think it's just adding everything up. The second question is about my attempt to create a for loop and assign clusters.
I'm trying to find the Jaccard distance between the tweets in a dataset, then cluster those tweets with the k-means algorithm.
This is where I'm retrieving the data from: http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
What I have so far is this:
install.packages("jsonlite")
library(jsonlite)
# Download the JSON file (one JSON object per line) and read it into a data frame
tf <- tempfile()
download.file("http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json", tf)
# Join the lines into a single JSON array so that fromJSON can parse them
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(tf), collapse = ",")))
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
# *Works. Compares two tweets and measures Jaccard Distance
tweetText <- list(tweet1 = tweet.features$text[1], tweet2 = tweet.features$text[2])
jaccard_i <- function(tw1, tw2){
  # split each tweet into words on spaces and periods
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
All of that measures the Jaccard distance between two specified tweets, which is great.
But now I'm trying to modify it to compare the distances between several tweets, this time 10 random tweets that I retrieved with the sample function in R.
# Generates two sets of 5 random tweets
tweetText <- list(sample(tweet.features$text, replace = FALSE, size = 5), sample(tweet.features$text, replace = FALSE, size = 5))
jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
This gives me a result, but it can't be correct.
I'm trying to build an algorithm that can measure all the tweets, compare their Jaccard distances, then cluster them based on the Jaccard distance with k-means.
So for another attempt, I thought I would write a for loop.
I decided to make 10 cluster centers from 10 random tweets:
c <- sample(tweet.features$text, replace = FALSE, size = 10)
Then I wrote a for loop, hoping to measure the tweets so that I could assign the results to an array and cluster them:
# Algorithm attempt
for(i in tweet.features$text){
  for (j in c){
    i <- length(intersect(i, j))
    u <- length(union(i, j))
    j = i/u
  }
  #assign(my.array)
}
I don't believe that's doing anything useful, but it's an attempt to create a loop to measure the Jaccard distance.
I'm sorry that this is a loaded question. Any help would be appreciated as I'm a bit lost.
Answer 1:
In your first function you call unlist on the word lists of your tweets, so tw1 and tw2 each hold one pooled vector of words from all the tweets in a set, and you cannot use them for a tweet-by-tweet Jaccard. You can fix this by removing the unlist: tw1 and tw2 are then lists of word vectors, one per tweet, and you can compare them pair by pair using mapply. Something like the following:
jaccard_i <- function(tw1, tw2){
  tw1 <- strsplit(tw1, " |\\.")
  tw2 <- strsplit(tw2, " |\\.")
  i <- mapply(function(tw1, tw2) {
    length(intersect(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  u <- mapply(function(tw1, tw2) {
    length(union(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  list(i=i, u=u, j=i/u)
}
Silly example:
> tw1 = c("we yes you no", "we are the people")
> tw2= c("we are the people", "we yes you no")
> tweetText = list(tw1, tw2)
> jaccard_i(tweetText[[1]], tweetText[[2]])
$i
[1] 1 1
$u
[1] 7 7
$j
[1] 0.1428571 0.1428571
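To see why the original version returned a single value for two sets of tweets, it may help to put the pooled computation next to the pairwise one. The following is only a sketch on made-up tweets; pooled_jaccard and pairwise_jaccard are hypothetical names, not functions from any package. It also assumes the usual convention that i/u is the Jaccard similarity and that the Jaccard distance is 1 minus that value.

```r
# Hypothetical toy data standing in for two sets of tweets.
set1 <- c("we yes you no", "we are the people")
set2 <- c("we are the people", "we yes you no")

# What the original function does: unlist() pools every word from all
# tweets in a set, so only one similarity value comes out.
pooled_jaccard <- function(tw1, tw2) {
  w1 <- unlist(strsplit(tw1, " |\\."))
  w2 <- unlist(strsplit(tw2, " |\\."))
  length(intersect(w1, w2)) / length(union(w1, w2))
}

# Pairwise version: one similarity per pair (tw1[k], tw2[k]).
pairwise_jaccard <- function(tw1, tw2) {
  mapply(function(w1, w2) length(intersect(w1, w2)) / length(union(w1, w2)),
         strsplit(tw1, " |\\."), strsplit(tw2, " |\\."))
}

pooled        <- pooled_jaccard(set1, set2)    # a single number
pairwise      <- pairwise_jaccard(set1, set2)  # one number per pair
pairwise_dist <- 1 - pairwise                  # similarity -> distance
```

Here the pooled value is 1 because both sets contain the same seven words overall, even though each individual pair shares only one word.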
As for the second part of your question, the double loop, an easy way to start addressing it would be like this:
tw = c("we yes you no", "we are the people")
lapply(tw, function(tweet1) {
  lapply(tw, function(tweet2) {
    jaccard_i(tweet1, tweet2)
  })
})
With results that look like this:
[[1]]
[[1]][[1]]
[[1]][[1]]$i
[1] 4
[[1]][[1]]$u
[1] 4
[[1]][[1]]$j
[1] 1
[[1]][[2]]
[[1]][[2]]$i
[1] 1
[[1]][[2]]$u
[1] 7
[[1]][[2]]$j
[1] 0.1428571
[[2]]
[[2]][[1]]
[[2]][[1]]$i
[1] 1
[[2]][[1]]$u
[1] 7
[[2]][[1]]$j
[1] 0.1428571
[[2]][[2]]
[[2]][[2]]$i
[1] 4
[[2]][[2]]$u
[1] 4
[[2]][[2]]$j
[1] 1
Of course, you should skip the values on the diagonal, where each tweet is compared with itself. As I said, this is just a starting point.
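One possible way to get from the nested list toward clustering is to collect the pairwise values into a distance matrix. The sketch below rests on several assumptions: the tweets are made up, jaccard_dist and d are hypothetical names, and the distance is taken as 1 minus the similarity. Because kmeans in base R takes a numeric matrix of coordinates rather than pairwise distances, the sketch first embeds the tweets with classical multidimensional scaling (cmdscale) and then runs kmeans on those coordinates. This is one workaround, not necessarily the best one for the real data.

```r
# Hypothetical toy tweets; with the real data this would be tweet.features$text.
tw <- c("we are the people", "we are the champions",
        "cats chase mice", "cats chase dogs")

# Jaccard distance between two single tweets: 1 - |intersection| / |union|.
jaccard_dist <- function(a, b) {
  w1 <- strsplit(a, " |\\.")[[1]]
  w2 <- strsplit(b, " |\\.")[[1]]
  1 - length(intersect(w1, w2)) / length(union(w1, w2))
}

# Full pairwise distance matrix (the diagonal is 0 by construction).
d <- sapply(tw, function(a) sapply(tw, function(b) jaccard_dist(a, b)))
d <- unname(d)  # drop the tweet texts used as row/column names

# Embed the tweets in 2 dimensions (an arbitrary choice for this toy case),
# then cluster the resulting coordinates.
coords <- cmdscale(as.dist(d), k = 2)
set.seed(1)  # kmeans starts from randomly chosen centers
fit <- kmeans(coords, centers = 2, nstart = 10)
fit$cluster  # cluster label for each tweet
```

With the real tweets, both the number of embedding dimensions and the number of clusters would need tuning. An alternative that works directly on the distances is hierarchical clustering, for example hclust on as.dist(d) followed by cutree.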
Hope it helps.
Source: https://stackoverflow.com/questions/36484465/find-jaccard-distance-of-tweets-and-cluster-in-kmeans