text-analysis

NLP to classify/label the content of a sentence (Ruby binding necessary)

Submitted by 一世执手 on 2019-12-03 17:22:55
I am analysing a few million emails. My aim is to be able to classify them into groups. Groups could be, e.g.: delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.); customer service problems (slow email response time, impolite responses, etc.); return issues (slow handling of return requests, lack of helpfulness from customer service, etc.); pricing complaints (hidden fees discovered, etc.). In order to perform this classification, I need an NLP tool that can recognize combinations of word groups like: "[they|the company|the firm|the website|the
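The asker wants a Ruby binding, but the grouping task itself is standard supervised text classification. A minimal sketch in Python with scikit-learn, using entirely hypothetical toy sentences and the four group labels named above, just to show the shape of the approach:

```python
# Sketch: classify complaint emails into the four groups from the question.
# All training sentences below are hypothetical toy data, not real emails.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "the delivery was very slow and arrived late",
    "shipping took three weeks before dispatch",
    "customer service never answered my email",
    "the support agent was impolite on the phone",
    "my return request has not been handled for a month",
    "returning the item was unhelpful and slow",
    "I discovered hidden fees on my invoice",
    "the final price was higher than advertised",
]
train_labels = [
    "delivery", "delivery",
    "customer_service", "customer_service",
    "returns", "returns",
    "pricing", "pricing",
]

# Word/bigram TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["they charged me a hidden fee"])[0])
```

With a few million labelled emails, the same pipeline scales; the hard part in practice is obtaining the labels, not the model.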

Create dfm step by step with quanteda

Submitted by 末鹿安然 on 2019-12-03 14:02:48
Question: I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures. I would like this sequence to be implemented: 1) remove the

How to combine TFIDF features with other features

Submitted by 本秂侑毒 on 2019-12-03 06:20:46
I have a classic NLP problem: I have to classify a news article as fake or real. I have created two sets of features: A) bigram term frequency-inverse document frequency; B) approximately 20 features associated with each document, obtained using pattern.en ( https://www.clips.uantwerpen.be/pages/pattern-en ), such as subjectivity of the text, polarity, #stopwords, #verbs, #subjects, grammatical relations, etc. What is the best way to combine the TF-IDF features with the other features for a single prediction? Thanks a lot to everyone. Not sure if you're asking technically how to combine two objects in code or
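One common way to combine the two feature sets, assuming the scikit-learn/scipy stack: keep the bigram TF-IDF matrix sparse and horizontally stack the ~20 dense document-level features onto it. A sketch with hypothetical toy documents and two stand-in pattern.en scores:

```python
# Sketch: concatenate sparse TF-IDF features with dense per-document features.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this shocking story is totally unbelievable",
    "the agency published its quarterly report today",
]  # hypothetical toy documents
extra_features = np.array([
    [0.9, -0.3],   # e.g. subjectivity, polarity for doc 1 (made-up values)
    [0.1,  0.2],   # e.g. subjectivity, polarity for doc 2 (made-up values)
])

tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)  # bigram TF-IDF
X = hstack([tfidf, csr_matrix(extra_features)]).tocsr()          # one combined matrix

print(X.shape)  # rows = documents, columns = bigrams + 2 extra features
```

The combined `X` can be fed to any scikit-learn classifier. It usually helps to scale the dense features (e.g. with MaxAbsScaler) so their magnitudes are comparable to the TF-IDF weights.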

Use brain.js neural network to do text analysis

Submitted by 青春壹個敷衍的年華 on 2019-12-03 03:03:45
I'm trying to do some text analysis to determine whether a given string is... talking about politics. I'm thinking I could create a neural network where the input is either a string or a list of words (ordering might matter?) and the output is whether the string is about politics. However, the brain.js library only takes inputs that are a number between 0 and 1 or an array of numbers between 0 and 1. How can I coerce my data in such a way that I can achieve the task? new brain.recurrent.LSTM(); does the trick for you. Example: var brain = require('brain.js') var net = new brain.recurrent.LSTM(); net

Trying to get tf-idf weighting working in R

Submitted by 折月煮酒 on 2019-12-03 02:52:10
Question: I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know). R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3 Relevant bit of code: library('tm')

Trying to get tf-idf weighting working in R

Submitted by 隐身守侯 on 2019-12-02 17:33:25
I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know). R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3 Relevant bit of code: library('tm') corpus <- Corpus(DirSource('.')) dtm <- DocumentTermMatrix(corpus, control=list(weight=weightTfIdf)) str

ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]

Submitted by 我的梦境 on 2019-12-01 20:46:39
Here is my code: from sklearn.svm import SVC from sklearn.grid_search import GridSearchCV from sklearn.cross_validation import KFold from sklearn.feature_extraction.text import TfidfVectorizer from sklearn import datasets import numpy as np newsgroups = datasets.fetch_20newsgroups( subset='all', categories=['alt.atheism', 'sci.space'] ) X = newsgroups.data y = newsgroups.target TD_IF = TfidfVectorizer() y_scaled = TD_IF.fit_transform(newsgroups, y) grid = {'C': np.power(10.0, np.arange(-5, 6))} cv = KFold(y_scaled.size, n_folds=5, shuffle=True, random_state=241) clf = SVC(kernel='linear',
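The "inconsistent numbers of samples" error above comes from passing the raw `newsgroups` bunch object to `fit_transform` instead of the list of documents (`newsgroups.data`). A minimal sketch of the corrected vectorization step, using hypothetical stand-in documents so it runs without downloading the 20newsgroups data:

```python
# Sketch of the fix: vectorize the list of texts, not the dataset bunch object.
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for newsgroups.data and newsgroups.target (hypothetical toy data).
docs = ["atheism and belief systems", "rockets reach outer space", "the space station orbits"]
y = [0, 1, 1]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(docs)  # one row per document, matching len(y)

print(X_tfidf.shape)
```

With the shapes consistent, the grid search in the question can then cross-validate over `X_tfidf` and `y`.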

Error faced while using TM package's VCorpus in R

Submitted by 最后都变了- on 2019-12-01 16:59:40
I am facing the error below while working with the tm package in R. library("tm") Loading required package: NLP Warning messages: 1: package ‘tm’ was built under R version 3.4.2 2: package ‘NLP’ was built under R version 3.4.1 corpus <- VCorpus(DataframeSource(data)) Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE I have tried various things, like reinstalling the package and updating to a new version of R, but the error still persists. For the same data file, the same code runs on another system with the same version of R. Eva: I met the same problem when I updated the tm package to

Convert sparse matrix (csc_matrix) to pandas dataframe

Submitted by 戏子无情 on 2019-11-30 12:03:36
I want to convert this matrix into a pandas dataframe. csc_matrix The first number in the bracket should be the index, the second number the column, and the number at the end the data. I want to do this to do feature selection in text analysis: the first number represents the document, the second the word feature, and the last number the TF-IDF score. Getting a dataframe helps me transform the text analysis problem into a data analysis one. from scipy.sparse import csc_matrix csc = csc_matrix(np.array( [[0, 0, 4, 0, 0, 0], [1, 0, 0, 0, 2, 0], [2, 0, 0, 1, 0, 0], [0, 0, 0, 0
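A sketch of the conversion (the last row of the snippet above is cut off, so a hypothetical final row `[0, 0, 0, 0, 0, 1]` is used here to make the example runnable):

```python
# Sketch: convert a scipy csc_matrix to a pandas DataFrame, dense or sparse.
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

csc = csc_matrix(np.array([[0, 0, 4, 0, 0, 0],
                           [1, 0, 0, 0, 2, 0],
                           [2, 0, 0, 1, 0, 0],
                           [0, 0, 0, 0, 0, 1]]))  # last row is a made-up completion

df_dense = pd.DataFrame(csc.toarray())              # fine for small matrices
df_sparse = pd.DataFrame.sparse.from_spmatrix(csc)  # stays sparse for large TF-IDF matrices

print(df_dense.shape)
```

For a real document-term matrix with many features, the sparse variant avoids materializing millions of zeros; column names from a vectorizer's vocabulary can be passed via the `columns` argument.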

Any tutorial or code for Tf Idf in java

Submitted by Deadly on 2019-11-30 07:02:49
Question: I am looking for a simple Java class that can do the tf-idf calculation. I want to do a similarity test on 2 documents. I found so many BIG APIs that use a tf-idf class, but I do not want to use a big jar file just to do my simple test. Please help! Or at least, if someone can tell me how to find TF and IDF, I will calculate the results :) Or if you can point me to a good Java tutorial for this. Please do not tell me to look on Google; I already did for 3 days and couldn't find anything :( Please
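The TF and IDF formulas themselves are tiny and need no library. A sketch of one common variant (raw term frequency, smoothed logarithmic IDF — other variants exist), written in Python for brevity but trivial to port to Java:

```python
# Sketch: tf-idf from scratch for a toy corpus of tokenized documents.
import math

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat"]]  # hypothetical toy corpus

def tf(term, doc):
    """Raw term frequency: occurrences of term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Smoothed inverse document frequency: log(N / (1 + df)) + 1."""
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1 + n_containing)) + 1

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("cat", docs[0], docs))
```

With per-term tf-idf scores in hand, the similarity test the asker wants is then just the cosine of the two documents' weight vectors.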