text-mining

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

夙愿已清 submitted on 2019-12-01 03:17:13
Question: I am trying to apply this code:

    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
                  'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, Y_train)
    scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T
    # visualize heat map
    heatmap = mglearn.tools.heatmap(
        scores, xlabel="C", ylabel="ngram_range",
        cmap="viridis", fmt="%.3f", xticklabels…
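The usual cause (not stated in the truncated excerpt, so treat this as an assumption) is that cv_results_ only exists on GridSearchCV from sklearn.model_selection in scikit-learn 0.18 or later, and only after fit() has completed; the older sklearn.grid_search.GridSearchCV exposes grid_scores_ instead. A minimal, self-contained sketch with placeholder training data:

    # Minimal sketch: cv_results_ requires scikit-learn >= 0.18 and a completed fit().
    # text_train / Y_train below are tiny placeholders, not the question's data.
    import sklearn
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    print(sklearn.__version__)            # older releases only have grid_scores_

    text_train = ["good movie", "bad movie", "great film", "terrible film"] * 10
    Y_train = [1, 0, 1, 0] * 10

    pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
    param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
                  "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(text_train, Y_train)
    scores = grid.cv_results_["mean_test_score"].reshape(-1, 3).T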

sentiment analysis - WordNet, SentiWordNet lexicon

和自甴很熟 submitted on 2019-12-01 02:13:40
Question: I need a list of positive and negative words, with weights assigned according to how strong or weak each word is. I have: 1.) WordNet - it gives a + or - score for every word. 2.) SentiWordNet - it gives positive and negative values in the range [0, 1]. I checked these on a few words. "love" - WordNet gives 0.0 for both the noun and the verb; I don't know why, as I think it should be at least somewhat positive. "repress" - WordNet gives -9.93, while SentiWordNet gives 0.0 for both positive and negative.
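One detail worth noting is that SentiWordNet scores word senses (synsets) rather than words, so a word like "love" has several senses and some of them are scored as neutral. A minimal sketch of reading those scores through NLTK's corpus reader (assuming nltk is installed and its corpora have been downloaded):

    # Look up SentiWordNet scores per synset via NLTK.
    import nltk
    nltk.download("wordnet")
    nltk.download("sentiwordnet")

    from nltk.corpus import sentiwordnet as swn

    # Each SentiSynset carries positive, negative and objective scores in [0, 1].
    for s in swn.senti_synsets("love"):
        print(s.synset.name(), s.pos_score(), s.neg_score(), s.obj_score())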

Inconsistent behaviour with tm_map transformation functions when using multiple cores

折月煮酒 submitted on 2019-12-01 02:11:06
Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417", so if preprocessing is successful then this doc will be transformed to just "n". A sample corpus is below for…

R, merge multiple rows of text data frame into one cell

耗尽温柔 submitted on 2019-11-30 20:53:48
I have a text data frame that looks like below.

    > nrow(gettext.df)
    [1] 3
    > gettext.df
      gettext
    1 hello,
    2 Good to hear back from you.
    3 I've currently written an application and I'm happy about it

I wanted to merge this text data into one cell (to do sentiment analysis), as below:

    > gettext.df
      gettext
    1 hello, Good to hear back from you. I've currently written an application and I'm happy about it

so I collapsed the cell using the code below:

    paste(gettext.df, collapse = " ")

but it seems to turn the text data into one chunk (as one word), so I cannot scan the sentence word by word. Is there any…

Bytes vs Characters vs Words - which granularity for n-grams?

北慕城南 submitted on 2019-11-30 20:34:37
At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider when choosing the "right" representation? Evaluate. The criterion for choosing the representation is whatever works. Indeed, character-level (!=…
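The typo example from the question can be checked directly. A minimal sketch (assuming scikit-learn, which the question does not mention) comparing the cosine similarity of the two sentences under word-level and character-level n-grams:

    # Character n-grams keep the two sentences more similar under the typo
    # than word n-grams do.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["Mary loves dogs", "Mary lpves dogs"]

    word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit_transform(docs)
    char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(docs)

    print("word-level similarity:", cosine_similarity(word_vec)[0, 1])
    print("char-level similarity:", cosine_similarity(char_vec)[0, 1])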

Text clustering using Scipy Hierarchy Clustering in Python

ぐ巨炮叔叔 submitted on 2019-11-30 16:32:33
I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I used to do the clustering:

    # Agglomerative clustering
    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as hac
    tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
    plt.clf()
    hac.dendrogram(tree)
    plt.show()

and I got this plot. Then I cut off the tree at the third level with fcluster():

    from scipy.cluster.hierarchy import fcluster
    clustering = fcluster(tree, 3, 'maxclust')
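For reference, a minimal end-to-end sketch of the same pipeline. It assumes X comes from a TF-IDF vectorizer (the question does not show how X is built) and uses a tiny placeholder corpus; fcluster with criterion="maxclust" returns a flat cluster id per article, which can then be grouped:

    from collections import defaultdict
    from sklearn.feature_extraction.text import TfidfVectorizer
    import scipy.cluster.hierarchy as hac
    from scipy.cluster.hierarchy import fcluster

    articles = ["first article about politics", "second article about politics",
                "an article about dogs", "an article about cats"]   # placeholder corpus
    X = TfidfVectorizer().fit_transform(articles)

    tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
    labels = fcluster(tree, 3, criterion="maxclust")   # at most 3 flat clusters

    clusters = defaultdict(list)
    for doc_id, cluster_id in enumerate(labels):
        clusters[cluster_id].append(doc_id)
    print(dict(clusters))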

R Text mining - how to change texts in R data frame column into several columns with bigram frequencies?

半世苍凉 submitted on 2019-11-30 16:31:23
In addition to the question "R Text mining - how to change texts in R data frame column into several columns with word frequencies?", I am wondering how I can manage to make columns with bigram frequencies instead of just word frequencies. Again, many thanks in advance! This is the example data frame (thanks to Tyler Rinker):

      person sex adult state code
    1 sam m 0 Computer is fun. Not too fun. K1
    2 greg m 0 No it's not, it's dumb. K2
    3 teacher m 1 What should we do? K3
    4 sam m 0 You liar, it stinks! K4
    5 greg m 0 I am telling the truth! K5
    6 sally f 0 How can we be certain? K6
    7 greg m 0 There is no…
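The question asks for an R solution; purely to illustrate the underlying idea of one-column-per-bigram, here is a sketch in Python using pandas and scikit-learn (both are assumptions, not part of the original question):

    # Turn a text column into one frequency column per bigram.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame({
        "person": ["sam", "greg", "teacher"],
        "state": ["Computer is fun. Not too fun.",
                  "No it's not, it's dumb.",
                  "What should we do?"],
    })

    vec = CountVectorizer(ngram_range=(2, 2))          # bigrams only
    counts = vec.fit_transform(df["state"])
    bigram_cols = pd.DataFrame(counts.toarray(),
                               columns=vec.get_feature_names_out(),  # get_feature_names() on older scikit-learn
                               index=df.index)
    print(pd.concat([df, bigram_cols], axis=1))        # original columns plus bigram frequencies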

text-mine PDF files with Python?

末鹿安然 submitted on 2019-11-30 15:48:50
Question: Is there a package/library for Python that would allow me to open a PDF and search the text for certain words? Answer 1: Using PyPDF2 you can use the extractText() method to extract the PDF text and work on it. (Update: changed the text to refer to PyPDF2, thanks to @Aditya Kumar for the heads-up.) Answer 2: I don't think you can do it in one step, but you can certainly get the text out of a PDF with pdfminer. Then you can apply whatever text search you like to that recovered data. Source: https://stackoverflow.com/questions/1672202/text-mine-pdf-files-with-python
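Along the lines of Answer 1, a minimal sketch of opening a PDF and searching each page for keywords (assuming PyPDF2 >= 2.x, where the extractText() method mentioned above was renamed to extract_text(); the file name and keywords are placeholders):

    from PyPDF2 import PdfReader

    reader = PdfReader("example.pdf")        # placeholder file name
    keywords = {"mining", "corpus"}          # placeholder search terms

    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        hits = [kw for kw in keywords if kw in text]
        if hits:
            print(f"page {page_number}: found {hits}")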

R Text Mining: Counting the number of times a specific word appears in a corpus?

独自空忆成欢 submitted on 2019-11-30 15:40:46
I have seen this question answered in other languages but not in R. [Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to count the number of times these phrases appear in another corpus. Is there a way to do this with the tm package (or another related package)? For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple of thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB. As always, I appreciate all your help! Ain't perfect…
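The question is about R's tm package, but the counting itself is easy to sketch. Purely as an illustration, a Python version with placeholder lists standing in for the phrases from CorpusA and the documents of CorpusB:

    import re

    tags = ["text mining", "machine learning"]            # phrases from CorpusA (placeholder)
    corpus_b = ["Text mining and machine learning overlap.",
                "More text mining examples.",
                "Nothing relevant here."]                 # CorpusB (placeholder)

    # Total occurrences of each phrase across all documents (case-insensitive).
    counts = {
        phrase: sum(len(re.findall(re.escape(phrase), doc, flags=re.IGNORECASE))
                    for doc in corpus_b)
        for phrase in tags
    }
    print(counts)   # {'text mining': 2, 'machine learning': 1}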
