text-mining

Memory error in Python using numpy array

∥☆過路亽.° submitted on 2019-12-02 11:11:15
I am getting a memory error for this code:

```python
model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
model.fit(X)
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print("\n")
n = 15
doc_topic = model.doc_topic_
for i in range(15):
    print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax()))
    topic_csharp = np.zeros(shape=[1, n])
    np.copyto(topic_csharp, doc_topic[0][i])
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}' …
```
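
No answer is excerpted above, but memory errors around document-term matrices usually come from holding a large dense count matrix in RAM. Below is a minimal sketch of one common mitigation, not the accepted fix: keep the counts sparse and in a small integer dtype. That `lda.LDA.fit` accepts SciPy sparse input is an assumption worth verifying against the lda package's documentation.

```python
# A minimal sketch, not the asker's solution: store the document-term counts
# as a sparse int32 matrix instead of a dense float array before fitting.
import numpy as np
import scipy.sparse as sp
import lda

X_dense = np.random.randint(0, 3, size=(1000, 5000))  # stand-in count matrix
X = sp.csr_matrix(X_dense, dtype=np.int32)            # much smaller when mostly zeros
model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
model.fit(X)                                          # assumption: sparse input is supported
```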

Unable to process accented words using NLTK tokeniser

帅比萌擦擦* submitted on 2019-12-02 08:33:34
Question: I'm trying to compute the frequencies of words in a UTF-8 encoded text file with the following code. Having successfully tokenized the file content and looped through the words, my program is not able to read the accented characters.

```python
import csv
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

print "computing word frequency..."
if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop …
```
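
The excerpt is Python 2, where the usual culprit for this symptom is mixing byte strings with Unicode: the stopwords are encoded to UTF-8 bytes while the tokens are (or should be) Unicode. A minimal sketch of the standard approach, assuming a hypothetical UTF-8 file corpus_fr.txt: decode on read and stay in Unicode throughout.

```python
# A minimal sketch, assuming a UTF-8 input file: work in Unicode end to end
# so accented characters tokenize and compare against stopwords correctly.
import io
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

with io.open("corpus_fr.txt", encoding="utf-8") as f:  # hypothetical file name
    text = f.read()                                    # Unicode text, not bytes

stop = set(stopwords.words("french"))                  # keep as Unicode; do not .encode()
tokenizer = RegexpTokenizer(r"\w+")                    # NLTK compiles this with re.UNICODE
words = [w.lower() for w in tokenizer.tokenize(text) if w.lower() not in stop]
print(len(words))
```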

Information Gain Calculation for a text file?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-02 06:57:31
I'm working on text categorization using information gain, PCA, and a genetic algorithm, but after performing preprocessing (stemming, stopword removal, TF-IDF) on the documents I'm confused about how to move ahead with the information-gain part. My output file contains words and their TF-IDF values, like:

```
WORD - TFIDF VALUE
together (word) - 0.235 (tfidf value)
come (word) - 0.2548 (tfidf value)
```

When using Weka for information gain (InfoGainAttributeEval.java), it requires an .arff file as input. Is there any way to convert a text file into the .arff format, or any other way to perform information gain without Weka? Is …
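
Weka aside, information gain for a term is just the drop in class entropy when the documents are split on that term's presence. A short sketch with made-up data, showing the arithmetic rather than any particular library's API:

```python
# A minimal sketch with invented data: information gain of a binary term
# feature, IG = H(class) - sum_v P(v) * H(class | term == v).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(term_present, labels):
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for flag, lab in zip(term_present, labels) if flag == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# A term that appears in exactly the "spam" documents carries one full bit:
print(information_gain([True, True, False, False], ["spam", "spam", "ham", "ham"]))  # 1.0
```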

text mining with tm package in R, remove words starting with [http] or any other specific word

微笑、不失礼 submitted on 2019-12-02 04:55:14
I am new to R and text mining. I made a word cloud out of a Twitter feed related to some term. The problem I'm facing is that the word cloud shows http:... or htt.... How do I deal with this issue? I tried using the metacharacter * but I still doubt I'm applying it right:

```r
tw.text = removeWords(tw.text, c(stopwords("en"), "rt", "http\\*"))
```

Somebody into text mining, please help me with this.

If you are looking to remove URLs from your string, you may use:

```r
gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
```

where x would be:

```r
x <- c("some text http://idontwantthis.com", "same problem again http:/ …
```
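
For readers doing the same cleanup outside R, a rough Python equivalent is sketched below with hypothetical tweets. The pattern is a looser variant that consumes everything up to the next whitespace, not the answer's exact regex:

```python
# A minimal sketch (hypothetical input): strip URLs from tweets before
# building a word cloud, using a permissive "up to whitespace" pattern.
import re

tweets = ["some text http://idontwantthis.com", "same problem again https://example.org ok"]
cleaned = [re.sub(r"(f|ht)tps?://\S+", "", t).strip() for t in tweets]
print(cleaned)  # ['some text', 'same problem again  ok'] (inner spaces untouched)
```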

Remove all punctuation from text including apostrophes for tm package

寵の児 submitted on 2019-12-02 04:18:03
I have a vector consisting of tweets (just the message text) that I am cleaning for text-mining purposes. I have used removePunctuation from the tm package like so:

```r
clean_tweet_text = removePunctuation(tweet_text)
```

This has resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching apostrophes are not registered. For example, one of my keywords is climate, but if a tweet has 'climate it won't be counted. How can I remove all the apostrophes/single quotes from my vector? Here is the header from dput for a …
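
One plausible explanation, offered as an assumption rather than the accepted answer: tweets often contain typographic quotes (‘ ’), which fall outside the ASCII [:punct:] class that removePunctuation matches by default, so they survive the cleaning. The Python sketch below strips punctuation by Unicode category, which catches straight and curly apostrophes alike:

```python
# A minimal sketch (hypothetical tweets): drop every character whose Unicode
# category starts with "P" (all punctuation classes, including curly quotes).
import unicodedata

def strip_punct(s):
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))

tweets = ["\u2018climate change is real", "it's getting warmer!"]
print([strip_punct(t) for t in tweets])  # ['climate change is real', 'its getting warmer']
```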

R: Problems reading a text file

自闭症网瘾萝莉.ら submitted on 2019-12-02 03:50:17
I want to read text files in R. The code used to work, but when I retested it, it didn't.

```r
# There are several text files in folder 'Obama' and folder 'Romney'
candidates <- c("Obama", "Romney")
pathname <- "C:/txt"
s.dir <- sprintf("%s/%s", pathname, candidates)
article <- Corpus(DirSource(directory=s.dir, encoding="ANSI"))
```

The error it displays is:

```
Error in iconv(readLines(x, warn = FALSE), encoding, "UTF-8", "byte") :
  unsupported conversion from 'ANSI' to 'UTF-8' in codepage 936
```

Also, when I use the code below to try to read a single text file:

```r
m <- "C:/txt/Romney/1.txt"
cc <- Corpus(DirSource(directory=m …
```
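
One observation grounded in the error message itself: "ANSI" is a Windows alias, not an encoding name iconv understands, and codepage 936 corresponds to GBK (Simplified Chinese), so passing a real name such as encoding="GBK" to DirSource is the usual substitution. A quick Python check of the same idea, reusing the question's path hypothetically:

```python
# A minimal sketch: codepage 936 is the GBK codec ("cp936" in Python), so
# decode with that real name rather than the Windows alias "ANSI".
with open("C:/txt/Romney/1.txt", encoding="cp936") as f:  # path from the question
    print(f.read()[:200])  # prints readable text if GBK is the right guess
```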

Python Regex - Extract text between (multiple) expressions in a text file

孤者浪人 submitted on 2019-12-02 03:44:23
I am a Python beginner and would be very thankful if you could help me with my text-extraction problem. I want to extract all the text that lies between two expressions in a text file (the beginning and end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze this for a bunch of files; find below an example of what such a text file looks like -> I want to extract all text starting from "Dear" up to "Douglas". In cases where the "letter_end" has no …
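
The excerpt cuts off before the example file, but the core mechanic is a regex alternation built from both marker lists. A hedged sketch with invented sample text; only "Dear", "to our", and "Douglas" come from the question, while the other marker and the sample string are placeholders:

```python
# A minimal sketch (invented sample text): grab everything from the first
# begin-marker through the first end-marker that follows it.
import re

letter_begin = ["Dear", "to our"]      # markers named in the question
letter_end = ["Douglas", "Sincerely"]  # "Douglas" from the question; "Sincerely" assumed

pattern = re.compile(
    r"(?:{}).*?(?:{})".format(
        "|".join(map(re.escape, letter_begin)),
        "|".join(map(re.escape, letter_end)),
    ),
    re.DOTALL,                         # let .*? cross line breaks
)

text = "boilerplate Dear shareholders, results were strong. Douglas trailing boilerplate"
match = pattern.search(text)
if match:
    print(match.group(0))  # Dear shareholders, results were strong. Douglas
```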

Error faced while using the tm package's VCorpus in R

最后都变了- submitted on 2019-12-01 16:59:40
I am facing the error below while working with the tm package in R.

```r
> library("tm")
Loading required package: NLP
Warning messages:
1: package ‘tm’ was built under R version 3.4.2
2: package ‘NLP’ was built under R version 3.4.1
> corpus <- VCorpus(DataframeSource(data))
Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
```

I have tried various things, like reinstalling the package and updating to a new version of R, but the error persists. For the same data file, the same code runs on another system with the same version of R.

Eva: I met the same problem when I updated the tm package to …
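
The failing assertion names the requirement directly: newer versions of tm's DataframeSource expect the data frame to contain columns called doc_id and text. The pandas sketch below (invented data) only mirrors that precondition to make the expected shape concrete; it is not tm code:

```python
# A minimal sketch with invented data: the shape tm's DataframeSource asserts,
# mirrored in pandas - a "doc_id" column plus a "text" column.
import pandas as pd

data = pd.DataFrame({
    "doc_id": [1, 2],
    "text": ["first document", "second document"],
})
assert all(col in data.columns for col in ("doc_id", "text"))  # tm's check, mirrored
print(data)
```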