text-mining

How to find the closest word to a vector using word2vec

Submitted by 浪子不回头ぞ on 2019-12-03 02:28:36
I have just started using Word2vec and I was wondering how we can find the word closest to a given vector. I have this vector, which is the average vector for a set of vectors: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32). Is there a straightforward way to find the most similar word in my training data to this vector? Or is the only solution to calculate the cosine similarity between this vector and the vector of each word in my training data, and then select the closest one? Thanks.

For the gensim implementation of word2vec there is a most_similar() function that lets you find
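A minimal sketch of the gensim route, assuming a gensim 4.x install (older versions use size instead of vector_size); the corpus file and the words being averaged are illustrative. KeyedVectors also exposes similar_by_vector(), which accepts a raw vector such as the averaged one above:

```python
import numpy as np
from gensim.models import Word2Vec

# Train a small model on a line-per-sentence corpus (hypothetical file name).
sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]
model = Word2Vec(sentences, vector_size=100, min_count=1)

# Average a few word vectors to get a query vector.
words = ["king", "queen"]
query = np.mean([model.wv[w] for w in words], axis=0)

# Find the vocabulary words whose vectors are closest by cosine similarity.
print(model.wv.similar_by_vector(query, topn=5))
```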

“RTextTools” create_matrix throws an error

Submitted by 霸气de小男生 on 2019-12-03 02:24:28
I was running the RTextTools package to build a text classification model. When I prepared the prediction dataset and tried to transform it into a matrix, I got this error:

Error in if (attr(weighting, "Acronym") == "tf-idf") weight <- 1e-09 : argument is of length zero

My code is as below:

table <- read.csv("traintest.csv", header = TRUE)
dtMatrix <- create_matrix(table["COMMENTS"])
container <- create_container(dtMatrix, table$LIKELIHOOD_TO_RECOMMEND, trainSize=1:5000, testSize=5001:10000, virgin=FALSE)
model <- train_model(container, "SVM", kernel="linear", cost=1)
predictionData <- read.csv("rest

How to identify ideas and concepts in a given text

Submitted by 拈花ヽ惹草 on 2019-12-03 01:17:55
I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic or idea is mentioned in a body of text. For instance, if the text contained:

Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, or even better a photograph?

it would be great to be able to detect that the person has asked for a photograph of Mr Jones. I could take a really naïve approach and just look for the word "photo" or "photograph", but this would obviously be no good if they wrote
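For what it's worth, here is a minimal sketch of that naive keyword route in Python with spaCy (assuming the en_core_web_sm model is installed; the concept name and trigger lists are illustrative only), matching on lemmas rather than exact strings so "photo", "photos" and "photograph" all hit:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical concept dictionary: concept name -> trigger lemmas.
CONCEPTS = {
    "photo_request": {"photo", "photograph", "picture", "image", "snapshot"},
}

def detect_concepts(text):
    doc = nlp(text)
    lemmas = {token.lemma_.lower() for token in doc}
    # A concept counts as mentioned if any of its trigger lemmas appear.
    return [name for name, triggers in CONCEPTS.items() if lemmas & triggers]

print(detect_concepts("It would be even better if you could send a photograph of Mr Jones."))
```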

C# Sentiment Analysis [closed]

Submitted by …衆ロ難τιáo~ on 2019-12-03 00:41:57
Does anyone know of a (preferably open source) C# library that can be used to calculate the overall sentiment of some given text?

Take a look at an open source sentiment analysis engine based on Naive Bayes classification at https://github.com/amrishdeep/Dragon . There is also http://rapid-i.com/content/view/26/84/ ; it may not be C#, but I guess you can call it from C#, and perhaps it will do the job for you?

Source: https://stackoverflow.com/questions/494276/c-sharp-sentiment-analysis

How to select stop words using tf-idf? (non-English corpus)

Submitted by 别等时光非礼了梦想. on 2019-12-02 22:53:54
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stop words and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or some threshold that you will have to tune. The best (as in
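A minimal sketch of that document-frequency heuristic with scikit-learn; the toy corpus and the 50% threshold are illustrative and work for any language, since nothing here depends on an English stop-word list:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["el gato come pescado", "el perro come carne", "el gato duerme"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                                  # term counts, shape (n_docs, n_terms)
df = np.asarray((X > 0).sum(axis=0)).ravel() / X.shape[0]    # document frequency per term

terms = np.array(vec.get_feature_names_out())
stopword_candidates = terms[df > 0.5]    # terms appearing in more than 50% of documents
print(stopword_candidates)
```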

Better text document clustering than tf/idf and cosine similarity?

Submitted by ≯℡__Kan透↙ on 2019-12-02 19:21:49
I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad. The main disadvantage of using tf/idf is that it clusters documents that are keyword-similar, so it's only good for identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a
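One commonly suggested alternative is to project the tf-idf vectors into a latent semantic space (LSA via truncated SVD) before clustering, so tweets can group on latent topics rather than exact shared keywords. A rough scikit-learn sketch, with illustrative tweets and parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

tweets = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I love hiking in the mountains.",
    "Mountain trails are great for a weekend hike.",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(tweets)

# Project into a low-dimensional latent semantic space and re-normalize.
lsa = make_pipeline(TruncatedSVD(n_components=2), Normalizer(copy=False))
X_lsa = lsa.fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)
print(labels)
```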

Python or Java for text processing (text mining, information retrieval, natural language processing) [closed]

Submitted by 放肆的年华 on 2019-12-02 18:33:11
I'm soon to start on a new project where I am going to do lots of text-processing tasks like searching, categorization/classification, clustering, and so on. There's going to be a huge number of documents that need to be processed, probably millions. After the initial processing, the system also has to be updated daily with multiple new documents. Can I use Python to do this, or is Python too slow? Is it best to use Java? If possible, I would prefer Python, since that's what I have been using lately. Plus, I would finish the coding part much faster. But it all depends on Python's

How to calculate TF*IDF for a single new document to be classified?

Submitted by 折月煮酒 on 2019-12-02 18:13:11
I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weight for each document vector. Then I can use this matrix to train a model for document classification. I will need to classify new documents in the future. But in order to classify a new document, I need to turn it into a document-term vector first, and that vector should be composed of TF*IDF values, too. My question is: how could I calculate the TF*IDF with just a single document? As far as I understand, TF can be calculated based on the single document itself, but the IDF can only
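The usual answer is to compute the IDF once on the training collection and reuse it for every new document. A minimal scikit-learn sketch (the training texts are illustrative): the fitted TfidfVectorizer keeps the training IDF, and transform() combines it with the TF of the unseen document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "dogs and cats", "the dog barked"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF from training data

new_doc = "the cat barked"
x_new = vectorizer.transform([new_doc])          # TF from the new document, IDF from training
print(x_new.toarray())
```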

Text classification/categorization algorithm [closed]

Submitted by 江枫思渺然 on 2019-12-02 17:40:45
My objective is to [semi-]automatically assign texts to different categories. There is a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm and perhaps a .NET library that implements it?

Ralph M. Rickenbach: Doing this is not trivial. Obviously you can build a dictionary that maps certain keywords to categories; just finding a keyword would suggest a certain category. Yet, in natural language text, the keywords would
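Not a .NET library, but as a sketch of the supervised approach described above, here is the same idea in Python with scikit-learn (the category names and training texts are illustrative): learn from human-labelled examples, then classify new texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Human-defined classification: a few labelled example texts per category.
texts = [
    "the match ended with a late goal",        # sports
    "the team won the championship",           # sports
    "parliament passed the new budget",        # politics
    "the minister resigned after the vote",    # politics
]
labels = ["sports", "sports", "politics", "politics"]

# Vectorize with tf-idf and train a Naive Bayes classifier in one pipeline.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Classify a new, unseen text.
print(clf.predict(["the team scored a late goal"]))   # should print ['sports']
```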

Why is the result 1 when the two vectors are not identical?

Submitted by 只谈情不闲聊 on 2019-12-02 13:29:46
I'm using the cosine similarity formula to calculate the similarity between two vectors. I tried two different vectors like this:

Vector1(-1237373741, 27, 1, 1, 331289590, 1818540802)
Vector2(-1237373741, 49, 1, 1, 331289590, 1818540802)

The two vectors differ slightly, but the result is 1. I don't know why. Can anyone explain this to me? Thanks so much.

For the most part, those two vectors are pointing in the same direction (the larger coordinates dominate the small difference in the other coordinate). A cosine similarity of ~1 is expected (remember that cos(0) = 1).

Source:
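Reproducing the calculation with NumPy makes the point concrete: the huge shared coordinates dominate, so the 27-versus-49 difference barely changes the angle between the vectors:

```python
import numpy as np

v1 = np.array([-1237373741, 27, 1, 1, 331289590, 1818540802], dtype=np.float64)
v2 = np.array([-1237373741, 49, 1, 1, 331289590, 1818540802], dtype=np.float64)

# Cosine similarity: dot product divided by the product of the norms.
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)   # prints a value equal to 1.0 up to floating-point precision
```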