text-mining

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

夙愿已清 submitted on 2019-12-01 03:17:13
Question: I am trying to apply this code:

    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
                  'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, Y_train)
    scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T
    # visualize heat map
    heatmap = mglearn.tools.heatmap(
        scores, xlabel="C", ylabel="ngram_range",
        cmap="viridis", fmt="%.3f", xticklabels…
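The usual cause (not stated in the truncated excerpt, so treat this as an assumption) is that cv_results_ only exists on GridSearchCV from sklearn.model_selection in scikit-learn 0.18 or later, and only after fit() has completed; the older sklearn.grid_search.GridSearchCV exposes grid_scores_ instead. A minimal, self-contained sketch with placeholder training data:

    # Minimal sketch: cv_results_ requires scikit-learn >= 0.18 and a completed fit().
    # text_train / Y_train below are tiny placeholders, not the question's data.
    import sklearn
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    print(sklearn.__version__)            # older releases only have grid_scores_

    text_train = ["good movie", "bad movie", "great film", "terrible film"] * 10
    Y_train = [1, 0, 1, 0] * 10

    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, Y_train)
    scores = grid.cv_results_["mean_test_score"].reshape(-1, 3).T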

sentiment analysis - WordNet, SentiWordNet lexicon

和自甴很熟 submitted on 2019-12-01 02:13:40
Question: I need a list of positive and negative words, with weights assigned according to how strong or weak each word is. I have: 1.) WordNet - it gives a + or - score for every word. 2.) SentiWordNet - it gives positive and negative values in the range [0, 1]. I checked these on a few words. "love" - WordNet gives 0.0 for both the noun and the verb; I don't know why, as I think it should be at least somewhat positive. "repress" - WordNet gives -9.93, while SentiWordNet gives 0.0 for both positive and negative.
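One detail worth noting is that SentiWordNet scores word senses (synsets) rather than words, so a word like "love" has several senses and some of them are scored as neutral. A minimal sketch of reading those scores through NLTK's corpus reader (assuming nltk is installed and its corpora have been downloaded):

    # Look up SentiWordNet scores per synset via NLTK.
    import nltk
    nltk.download("wordnet")
    nltk.download("sentiwordnet")

    from nltk.corpus import sentiwordnet as swn

    # Each SentiSynset carries positive, negative and objective scores in [0, 1].
    for s in swn.senti_synsets("love"):
        print(s.synset.name(), s.pos_score(), s.neg_score(), s.obj_score())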

Inconsistent behaviour with tm_map transformation functions when using multiple cores

折月煮酒 submitted on 2019-12-01 02:11:06
Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417", so if preprocessing is successful then this doc will be transformed to just "n". A sample corpus is below for…

R, merge multiple rows of text data frame into one cell

耗尽温柔 submitted on 2019-11-30 20:53:48
I have a text data frame that looks like below.

    > nrow(gettext.df)
    [1] 3
    > gettext.df
      gettext
    1 hello,
    2 Good to hear back from you.
    3 I've currently written an application and I'm happy about it

I wanted to merge this text data into one cell (to do sentiment analysis), as below:

    > gettext.df
      gettext
    1 hello, Good to hear back from you. I've currently written an application and I'm happy about it

so I collapsed the cell using the code below:

    paste(gettext.df, collapse = " ")

but it seems to turn the text data into one chunk (as one word), so I cannot scan the sentence word by word. Is there any…

Bytes vs Characters vs Words - which granularity for n-grams?

北慕城南 submitted on 2019-11-30 20:34:37
At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider when choosing the "right" representation? Evaluate. The criterion for choosing the representation is whatever works. Indeed, character-level (!=…
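The typo example from the question can be checked directly. A minimal sketch (assuming scikit-learn, which the question does not mention) comparing the cosine similarity of the two sentences under word-level and character-level n-grams:

    # Character n-grams keep the two sentences more similar under the typo
    # than word n-grams do.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["Mary loves dogs", "Mary lpves dogs"]

    word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit_transform(docs)
    char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(docs)

    print("word-level similarity:", cosine_similarity(word_vec)[0, 1])
    print("char-level similarity:", cosine_similarity(char_vec)[0, 1])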

Text clustering using Scipy Hierarchy Clustering in Python

ぐ巨炮叔叔 submitted on 2019-11-30 16:32:33
I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I used to do the clustering:

    # Agglomerative clustering
    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as hac
    tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
    plt.clf()
    hac.dendrogram(tree)
    plt.show()

and I got this plot. Then I cut off the tree at the third level with fcluster():

    from scipy.cluster.hierarchy import fcluster
    clustering = fcluster(tree, 3, 'maxclust')
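For reference, a minimal end-to-end sketch of the same pipeline. It assumes X comes from a TF-IDF vectorizer (the question does not show how X is built) and uses a tiny placeholder corpus; fcluster with criterion="maxclust" returns a flat cluster id per article, which can then be grouped:

    from collections import defaultdict
    from sklearn.feature_extraction.text import TfidfVectorizer
    import scipy.cluster.hierarchy as hac
    from scipy.cluster.hierarchy import fcluster

    articles = ["first article about politics", "second article about politics",
                "an article about dogs", "an article about cats"]   # placeholder corpus
    X = TfidfVectorizer().fit_transform(articles)

    tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
    labels = fcluster(tree, 3, criterion="maxclust")   # at most 3 flat clusters

    clusters = defaultdict(list)
    for doc_id, cluster_id in enumerate(labels):
        clusters[cluster_id].append(doc_id)
    print(dict(clusters))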

R Text mining - how to change texts in R data frame column into several columns with bigram frequencies?

半世苍凉 submitted on 2019-11-30 16:31:23
In addition to the question "R Text mining - how to change texts in R data frame column into several columns with word frequencies?", I am wondering how I can manage to make columns with bigram frequencies instead of just word frequencies. Again, many thanks in advance! This is the example data frame (thanks to Tyler Rinker):

      person sex adult state code
    1 sam m 0 Computer is fun. Not too fun. K1
    2 greg m 0 No it's not, it's dumb. K2
    3 teacher m 1 What should we do? K3
    4 sam m 0 You liar, it stinks! K4
    5 greg m 0 I am telling the truth! K5
    6 sally f 0 How can we be certain? K6
    7 greg m 0 There is no…
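The question asks for an R solution; purely to illustrate the underlying idea of one-column-per-bigram, here is a sketch in Python using pandas and scikit-learn (both are assumptions, not part of the original question):

    # Turn a text column into one frequency column per bigram.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame({
        "person": ["sam", "greg", "teacher"],
        "state": ["Computer is fun. Not too fun.",
                  "No it's not, it's dumb.",
                  "What should we do?"],
    })

    vec = CountVectorizer(ngram_range=(2, 2))          # bigrams only
    counts = vec.fit_transform(df["state"])
    bigram_cols = pd.DataFrame(counts.toarray(),
                               columns=vec.get_feature_names_out(),  # get_feature_names() on older scikit-learn
                               index=df.index)
    print(pd.concat([df, bigram_cols], axis=1))        # original columns plus bigram frequencies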

text-mine PDF files with Python?

末鹿安然 submitted on 2019-11-30 15:48:50
Question: Is there a package/library for Python that would allow me to open a PDF and search the text for certain words? Answer 1: Using PyPDF2 you can use the extractText() method to extract the PDF text and work on it. (Update: changed the text to refer to PyPDF2, thanks to @Aditya Kumar for the heads-up.) Answer 2: I don't think you can do it in one step, but you can certainly get the text out of a PDF with pdfminer. Then you can apply whatever text search you like to that recovered data. Source: https://stackoverflow.com/questions/1672202/text-mine-pdf-files-with-python
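Along the lines of Answer 1, a minimal sketch of opening a PDF and searching each page for keywords (assuming PyPDF2 >= 2.x, where the extractText() method mentioned above was renamed to extract_text(); the file name and keywords are placeholders):

    from PyPDF2 import PdfReader

    reader = PdfReader("example.pdf")        # placeholder file name
    keywords = {"mining", "corpus"}          # placeholder search terms

    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        hits = [kw for kw in keywords if kw in text]
        if hits:
            print(f"page {page_number}: found {hits}")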

R Text Mining: Counting the number of times a specific word appears in a corpus?

独自空忆成欢 submitted on 2019-11-30 15:40:46
I have seen this question answered in other languages but not in R. [Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to count the number of times these phrases appear in another corpus. Is there a way to do this with the tm package (or another related package)? For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple of thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB. As always, I appreciate all your help! Ain't perfect…
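The question is about R's tm package, but the counting itself is easy to sketch. Purely as an illustration, a Python version with placeholder lists standing in for the phrases from CorpusA and the documents of CorpusB:

    import re

    tags = ["text mining", "machine learning"]            # phrases from CorpusA (placeholder)
    corpus_b = ["Text mining and machine learning overlap.",
                "More text mining examples.",
                "Nothing relevant here."]                 # CorpusB (placeholder)

    # Total occurrences of each phrase across all documents (case-insensitive).
    counts = {
        phrase: sum(len(re.findall(re.escape(phrase), doc, flags=re.IGNORECASE))
                    for doc in corpus_b)
        for phrase in tags
    }
    print(counts)   # {'text mining': 2, 'machine learning': 1}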
