text-analysis

How do I use sklearn CountVectorizer with both 'word' and 'char' analyzer? - python

依然范特西╮ submitted on 2019-12-18 03:39:21
Question: How do I use sklearn CountVectorizer with both the 'word' and 'char' analyzers? http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html I can extract text features by word or by char separately, but how do I create a charword_vectorizer? Is there a way to combine the vectorizers, or to use more than one analyzer?

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> word_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2),
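The snippet above is cut off by the excerpt. One way to get a combined char+word vectorizer, sketched here with sklearn's FeatureUnion (not taken from the thread; docs is a made-up corpus), is to fit both vectorizers on the same raw text and stack their count matrices side by side:

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this is a foobar', 'have a nice day']  # hypothetical corpus

word_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 3))

# FeatureUnion runs both vectorizers on the same input and
# concatenates their columns into one feature matrix.
charword_vectorizer = FeatureUnion([('word', word_vectorizer),
                                    ('char', char_vectorizer)])
X = charword_vectorizer.fit_transform(docs)
print(X.shape)  # (2, n_word_ngrams + n_char_ngrams)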

Tag generation from small text content (such as tweets)

核能气质少年 submitted on 2019-12-17 23:13:14
Question: I have already asked a similar question earlier, but I have noticed a big constraint: I am working on small texts, such as user tweets, to generate tags (keywords), and the accepted suggestion there (the pointwise mutual information algorithm) seems meant to work on bigger documents. With this constraint (working on small sets of texts), how can I generate tags? Regards

Answer 1: Two-Stage Approach for Multiword Tags. You could pool all the tweets into a single larger document and then
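A minimal sketch of that pooling idea, assuming NLTK's PMI-based collocation finder (the tweets below are invented for illustration):

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tweets = ['machine learning is fun',
          'deep learning beats shallow machine learning',
          'machine learning on tweets is hard']  # hypothetical input

# Stage 1: pool the short texts so PMI has enough counts to work with.
tokens = ' '.join(tweets).lower().split()

# Stage 2: rank bigrams by PMI; the top ones are multiword tag candidates.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen only once
print(finder.nbest(BigramAssocMeasures.pmi, 5))  # e.g. [('machine', 'learning')]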

Stemmers vs Lemmatizers

大城市里の小女人 submitted on 2019-12-17 01:39:12
Question: Natural Language Processing (NLP), especially for English, has evolved to a stage where stemming would become an archaic technology if "perfect" lemmatizers existed, because stemmers change the surface form of a word/token into meaningless stems. Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.

Stemmers [in]: having [out]: hav
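The contrast in a minimal NLTK sketch (assuming the Lancaster stemmer, which produces the truncated stem quoted above, against the WordNet lemmatizer):

import nltk
from nltk.stem import LancasterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # lexicon the lemmatizer relies on

print(LancasterStemmer().stem('having'))                 # 'hav'  -- a meaningless stem
print(WordNetLemmatizer().lemmatize('having', pos='v'))  # 'have' -- a dictionary form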

Beyond SOUNDEX & DIFFERENCE - SQL Server

烈酒焚心 submitted on 2019-12-13 15:30:50
Question: I am using the SOUNDEX and DIFFERENCE functions to do some analysis on the data in the table, but they fail on the kind of data below, where 'ITEM TYPE' and 'ITEM SIZE' are completely different:

SELECT SOUNDEX('ITEM TYPE'), SOUNDEX('ITEM SIZE')

Output: I350 I350. For DIFFERENCE the output is 4. I understand that not every analysis the human mind does can be coded; still, I would like to ask: do there exist any other functions in SQL Server that will help me with my next level of analysis?

Answer 1: You can use an
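Since the answer above is cut off, here is one hedged alternative sketched outside SQL Server: a character-level similarity ratio (Python's difflib) that distinguishes the pair SOUNDEX collapses to the same code:

from difflib import SequenceMatcher

a, b = 'ITEM TYPE', 'ITEM SIZE'
score = SequenceMatcher(None, a, b).ratio()  # 0.0 = unrelated, 1.0 = identical
print(round(score, 2))  # ~0.67: similar strings, but clearly not the same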

How to comma-separate words when using the PyPDF2 library

不打扰是莪最后的温柔 submitted on 2019-12-13 04:34:40
Question: I'm converting PDF to text using PyPDF2, and during this conversion some words run together. The code is shown below:

filename = 'CS1.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    print(pageObj)
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method
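PyPDF2's extractText is known to lose or merge whitespace on some PDFs. One common workaround, sketched here with pdfminer.six instead (assuming the same CS1.pdf):

from pdfminer.high_level import extract_text

# pdfminer.six infers spaces from character layout, which usually
# avoids the run-together words that PyPDF2's extractText produces.
text = extract_text('CS1.pdf')
print(text[:500])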

Splitting strings in R

我与影子孤独终老i submitted on 2019-12-12 00:53:07
Question: I have the following line:

x <- "CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

I want to extract "CUST_Id_8", "Mr. Praveen Kumar", and anything written after DOB:, Mother's Name:, Contact Num:, and so on, stored in variables such as Customer Id, Name, and DOB. Please help. I used strsplit(x, ":"), but the result is a list containing the texts, and I need blanks if there is nothing after the
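The underlying idea, split on the known field labels rather than on every colon so that empty fields come back as blanks, can be sketched as follows (in Python for illustration, since the thread above is about R; the label list is read off the string itself):

import re

x = ("CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:"
     "Email address:Owns Car:Products held with Bank:Company Name:"
     "Salary per. month:Background:")

labels = ["Name", "DOB", "Mother's Name", "Contact Num", "Email address",
          "Owns Car", "Products held with Bank", "Company Name",
          "Salary per. month", "Background"]

# Split on the labels themselves; fields with no value survive as '' blanks.
pattern = '(' + '|'.join(re.escape(l) + ':' for l in labels) + ')'
values = re.split(pattern, x)[::2]
print(values)  # ['CUST_Id_8', 'Mr.Praveen Kumar', '', '', ..., '']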

How to identify stopwords with BigQuery?

删除回忆录丶 submitted on 2019-12-11 15:13:52
Question: I'm looking at reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?

Answer 1: One approach to identifying stopwords is to look at the words that show up in the most documents. Steps in this query:

- Filter posts for relevance and quality (choose your subreddits, a minimum score, a minimum length).
- Unescape reddit's HTML-encoded values.
- Decide what counts as a word (in this case r'[a-z]{1,20}\'?[a-z]+').
- Each word
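The same document-frequency idea in miniature, outside BigQuery (a Python sketch over made-up comments, reusing the thread's word pattern):

import re
from collections import Counter

comments = ['I think this is great', 'this is not what I think',
            'great comment, this helps']  # hypothetical documents

WORD = re.compile(r"[a-z]{1,20}'?[a-z]+")

# For each word, count the number of documents it appears in.
df = Counter()
for doc in comments:
    df.update(set(WORD.findall(doc.lower())))

# Words present in a large share of the documents are stopword candidates.
print([w for w, n in df.items() if n / len(comments) > 0.6])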

Extract text between two delimiters from a text file

ぃ、小莉子 submitted on 2019-12-11 02:47:05
Question: I'm currently writing my master's thesis on CEO narcissism. To measure it, I have to run a text analysis on earnings calls. Following the answers available in this link, I wrote Python code that extracts the Question and Answer section from an earnings-call transcript. The file (called 'testoestratto.txt') is like this:

..............................
Delimiter [1]
..............................
A text that I don't need
..............................
Delimiter
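A minimal sketch of the extraction step (assuming, since the example above is cut off, that the wanted text sits between a numbered 'Delimiter [n]' line and the next 'Delimiter'):

import re

with open('testoestratto.txt', encoding='utf-8') as f:
    content = f.read()

# Non-greedy capture up to the next 'Delimiter' (or end of file);
# re.DOTALL lets '.' match across newlines.
sections = re.findall(r'Delimiter \[\d+\](.*?)(?=Delimiter|\Z)', content, re.DOTALL)
for s in sections:
    print(s.strip())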

NLP to classify/label the content of a sentence (Ruby binding necessary)

会有一股神秘感。 submitted on 2019-12-09 13:36:27
Question: I am analysing a few million emails. My aim is to be able to classify them into groups, which could be, for example:

- Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
- Customer service problems (slow email response time, impolite responses, etc.)
- Return issues (slow handling of return requests, lack of helpfulness from customer service, etc.)
- Pricing complaints (hidden fees discovered, etc.)

In order to perform this classification, I need a
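A common baseline for this kind of grouping is a supervised pipeline trained on a hand-labelled sample. Sketched here with Python's scikit-learn (the asker wants Ruby bindings; the emails and labels below are invented):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny labelled sample; in practice, label a few thousand emails per group.
emails = ['my parcel arrived two weeks late',
          'nobody answered my support email for days',
          'the return request is still unhandled',
          'I discovered a hidden fee on my invoice']
labels = ['delivery', 'customer_service', 'returns', 'pricing']

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(emails, labels)
print(clf.predict(['why was there an extra charge on my bill?']))  # one of the four labels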

Error using “TermDocumentMatrix” and “Dist” functions in R

会有一股神秘感。 submitted on 2019-12-08 02:11:38
Question: I have been trying to replicate the example here: but I have had some problems along the way. Everything worked fine until this point:

docsTDM <- TermDocumentMatrix(docs8)

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

I was able to fix that error by modifying the previous step, changing this: docs8 <- tm
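The error indicates that TermDocumentMatrix received plain character data rather than a corpus. For comparison only (a Python sketch, not the thread's R fix, which is truncated above), scikit-learn builds the same terms-by-documents matrix directly from plain strings:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat', 'the dog sat', 'the cat and the dog']  # hypothetical documents

vec = CountVectorizer()
tdm = vec.fit_transform(docs).T  # transpose documents x terms into terms x documents
print(vec.get_feature_names_out())
print(tdm.toarray())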