text-analysis

NLP to classify/label the content of a sentence (Ruby binding necessary)

Submitted by 一世执手 on 2019-12-03 17:22:55
I am analysing a few million emails. My aim is to be able to classify them into groups. Groups could be, e.g.: delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.); customer service problems (slow email response time, impolite responses, etc.); return issues (slow handling of return requests, lack of helpfulness from customer service, etc.); pricing complaints (hidden fees discovered, etc.). In order to perform this classification, I need an NLP tool that can recognize combinations of word groups like: "[they|the company|the firm|the website|the
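The asker wants a Ruby binding, but the grouping task itself is standard supervised text classification. A minimal sketch in Python with scikit-learn, using entirely hypothetical toy sentences and the four group labels named above, just to show the shape of the approach:

```python
# Sketch: classify complaint emails into the four groups from the question.
# All training sentences below are hypothetical toy data, not real emails.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "the delivery was very slow and arrived late",
    "shipping took three weeks before dispatch",
    "customer service never answered my email",
    "the support agent was impolite on the phone",
    "my return request has not been handled for a month",
    "returning the item was unhelpful and slow",
    "I discovered hidden fees on my invoice",
    "the final price was higher than advertised",
]
train_labels = [
    "delivery", "delivery",
    "customer_service", "customer_service",
    "returns", "returns",
    "pricing", "pricing",
]

# Word/bigram TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["they charged me a hidden fee"])[0])
```

With a few million labelled emails, the same pipeline scales; the hard part in practice is obtaining the labels, not the model.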

Create dfm step by step with quanteda

Submitted by 末鹿安然 on 2019-12-03 14:02:48
Question: I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures. I would like this sequence to be implemented: 1) remove the

How to combine TFIDF features with other features

Submitted by 本秂侑毒 on 2019-12-03 06:20:46
I have a classic NLP problem: I have to classify a news article as fake or real. I have created two sets of features: A) bigram term frequency-inverse document frequency; B) approximately 20 features associated with each document, obtained using pattern.en ( https://www.clips.uantwerpen.be/pages/pattern-en ), such as subjectivity of the text, polarity, #stopwords, #verbs, #subjects, grammatical relations, etc. What is the best way to combine the TF-IDF features with the other features for a single prediction? Thanks a lot to everyone. Not sure if you're asking technically how to combine two objects in code or
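One common way to combine the two feature sets, assuming the scikit-learn/scipy stack: keep the bigram TF-IDF matrix sparse and horizontally stack the ~20 dense document-level features onto it. A sketch with hypothetical toy documents and two stand-in pattern.en scores:

```python
# Sketch: concatenate sparse TF-IDF features with dense per-document features.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this shocking story is totally unbelievable",
    "the agency published its quarterly report today",
]  # hypothetical toy documents
extra_features = np.array([
    [0.9, -0.3],   # e.g. subjectivity, polarity for doc 1 (made-up values)
    [0.1,  0.2],   # e.g. subjectivity, polarity for doc 2 (made-up values)
])

tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)  # bigram TF-IDF
X = hstack([tfidf, csr_matrix(extra_features)]).tocsr()          # one combined matrix

print(X.shape)  # rows = documents, columns = bigrams + 2 extra features
```

The combined `X` can be fed to any scikit-learn classifier. It usually helps to scale the dense features (e.g. with MaxAbsScaler) so their magnitudes are comparable to the TF-IDF weights.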

Use brain.js neural network to do text analysis

Submitted by 青春壹個敷衍的年華 on 2019-12-03 03:03:45
I'm trying to do some text analysis to determine whether a given string is... talking about politics. I'm thinking I could create a neural network where the input is either a string or a list of words (ordering might matter?) and the output is whether the string is about politics. However, the brain.js library only takes inputs that are a number between 0 and 1 or an array of numbers between 0 and 1. How can I coerce my data in such a way that I can achieve the task? new brain.recurrent.LSTM(); does the trick for you. Example: var brain = require('brain.js') var net = new brain.recurrent.LSTM(); net

Trying to get tf-idf weighting working in R

Submitted by 折月煮酒 on 2019-12-03 02:52:10
Question: I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know). R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3 Relevant bit of code: library('tm')

Trying to get tf-idf weighting working in R

Submitted by 隐身守侯 on 2019-12-02 17:33:25
I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know). R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3 Relevant bit of code: library('tm') corpus <- Corpus(DirSource('.')) dtm <- DocumentTermMatrix(corpus, control=list(weight=weightTfIdf)) str

ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]

Submitted by 我的梦境 on 2019-12-01 20:46:39
Here is my code: from sklearn.svm import SVC from sklearn.grid_search import GridSearchCV from sklearn.cross_validation import KFold from sklearn.feature_extraction.text import TfidfVectorizer from sklearn import datasets import numpy as np newsgroups = datasets.fetch_20newsgroups( subset='all', categories=['alt.atheism', 'sci.space'] ) X = newsgroups.data y = newsgroups.target TD_IF = TfidfVectorizer() y_scaled = TD_IF.fit_transform(newsgroups, y) grid = {'C': np.power(10.0, np.arange(-5, 6))} cv = KFold(y_scaled.size, n_folds=5, shuffle=True, random_state=241) clf = SVC(kernel='linear',
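The "inconsistent numbers of samples" error above comes from passing the raw `newsgroups` bunch object to `fit_transform` instead of the list of documents (`newsgroups.data`). A minimal sketch of the corrected vectorization step, using hypothetical stand-in documents so it runs without downloading the 20newsgroups data:

```python
# Sketch of the fix: vectorize the list of texts, not the dataset bunch object.
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for newsgroups.data and newsgroups.target (hypothetical toy data).
docs = ["atheism and belief systems", "rockets reach outer space", "the space station orbits"]
y = [0, 1, 1]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)  # one row per document, matching len(y)

print(X_tfidf.shape)
```

With the shapes consistent, the grid search in the question can then cross-validate over `X_tfidf` and `y`.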

Error faced while using TM package's VCorpus in R

Submitted by 最后都变了- on 2019-12-01 16:59:40
I am facing the error below while working with the tm package in R. library("tm") Loading required package: NLP Warning messages: 1: package ‘tm’ was built under R version 3.4.2 2: package ‘NLP’ was built under R version 3.4.1 corpus <- VCorpus(DataframeSource(data)) Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE I have tried various things, like reinstalling the package and updating to a new version of R, but the error still persists. For the same data file, the same code runs on another system with the same version of R. Eva: I met the same problem when I updated the tm package to

Convert sparse matrix (csc_matrix) to pandas dataframe

Submitted by 戏子无情 on 2019-11-30 12:03:36
I want to convert this matrix into a pandas dataframe. csc_matrix The first number in the bracket should be the index, the second number the column, and the number at the end the data. I want to do this to do feature selection in text analysis: the first number represents the document, the second the word feature, and the last number the TF-IDF score. Getting a dataframe helps me transform the text analysis problem into a data analysis one. from scipy.sparse import csc_matrix csc = csc_matrix(np.array( [[0, 0, 4, 0, 0, 0], [1, 0, 0, 0, 2, 0], [2, 0, 0, 1, 0, 0], [0, 0, 0, 0
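A sketch of the conversion (the last row of the snippet above is cut off, so a hypothetical final row `[0, 0, 0, 0, 0, 1]` is used here to make the example runnable):

```python
# Sketch: convert a scipy csc_matrix to a pandas DataFrame, dense or sparse.
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

csc = csc_matrix(np.array([[0, 0, 4, 0, 0, 0],
                           [1, 0, 0, 0, 2, 0],
                           [2, 0, 0, 1, 0, 0],
                           [0, 0, 0, 0, 0, 1]]))  # last row is a made-up completion

df_dense = pd.DataFrame(csc.toarray())              # fine for small matrices
df_sparse = pd.DataFrame.sparse.from_spmatrix(csc)  # stays sparse for large TF-IDF matrices

print(df_dense.shape)
```

For a real document-term matrix with many features, the sparse variant avoids materializing millions of zeros; column names from a vectorizer's vocabulary can be passed via the `columns` argument.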

Any tutorial or code for Tf Idf in java

Submitted by Deadly on 2019-11-30 07:02:49
Question: I am looking for a simple Java class that can do the tf-idf calculation. I want to do a similarity test on 2 documents. I found so many BIG APIs that use a tf-idf class, but I do not want to use a big jar file just to do my simple test. Please help! Or at least, if someone can tell me how to find TF and IDF, I will calculate the results :) Or if you can point me to a good Java tutorial for this. Please do not tell me to look on Google; I already did for 3 days and couldn't find anything :( Please
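The TF and IDF formulas themselves are tiny and need no library. A sketch of one common variant (raw term frequency, smoothed logarithmic IDF — other variants exist), written in Python for brevity but trivial to port to Java:

```python
# Sketch: tf-idf from scratch for a toy corpus of tokenized documents.
import math

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat"]]  # hypothetical toy corpus

def tf(term, doc):
    """Raw term frequency: occurrences of term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Smoothed inverse document frequency: log(N / (1 + df)) + 1."""
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1 + n_containing)) + 1

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("cat", docs[0], docs))
```

With per-term tf-idf scores in hand, the similarity test the asker wants is then just the cosine of the two documents' weight vectors.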