countvectorizer

Encoding text in ML classifier

不羁的心 submitted on 2020-12-25 10:54:45
Question: I am trying to build an ML model. However, I am having difficulties understanding where to apply the encoding. Please see below the steps and functions to replicate the process I have been following. First I split the dataset into train and test:
# Import the resampling package
from sklearn.naive_bayes import MultinomialNB
import string
from nltk.corpus import stopwords
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
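A minimal sketch of the usual pattern, assuming X is a list or Series of raw strings and y the matching labels (neither appears in the excerpt above): the vectorizer is fitted on the training split only and then reused to transform the test split, so no information from the test text is used when learning the vocabulary.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# X: iterable of raw text strings, y: labels (assumed, not shown in the question)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = CountVectorizer(stop_words='english')
X_train_enc = vectorizer.fit_transform(X_train)  # learn the vocabulary on the training text only
X_test_enc = vectorizer.transform(X_test)        # reuse that vocabulary on the test text

clf = MultinomialNB().fit(X_train_enc, y_train)
print(clf.score(X_test_enc, y_test))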

In count vectorizer which axis to use?

杀马特。学长 韩版系。学妹 submitted on 2020-03-25 05:53:57
Question: I want to create a document-term matrix. In my case it is not documents x words but sentences x words, so the sentences will act as the documents. I am applying 'l2' normalization after the doc-term matrix is created. The term count is important for me to create a summarization using SVD in further steps. My query is which axis is appropriate for applying the 'l2' normalization. With sufficient research I understood: Axis=1: will give me the importance of the word in a sentence (column wise
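For reference, a small sketch of what the two axes do with sklearn.preprocessing.normalize on a sentences-x-words count matrix (the sentences here are hypothetical stand-ins):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

sentences = ["the cat sat", "the dog sat", "cats and dogs sat"]  # hypothetical input
X = CountVectorizer().fit_transform(sentences)                   # rows = sentences, columns = words

X_rows = normalize(X, norm='l2', axis=1)  # axis=1: each sentence vector (row) scaled to unit length
X_cols = normalize(X, norm='l2', axis=0)  # axis=0: each term vector (column) scaled to unit length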

How to subclass a vectorizer in scikit-learn without repeating all parameters in the constructor

别来无恙 submitted on 2020-02-02 16:06:50
Question: I am trying to create a custom vectorizer by subclassing CountVectorizer. The vectorizer will stem all the words in the sentence before counting the word frequency. I then use this vectorizer in a pipeline, which works fine when I do pipeline.fit(X,y). However, when I try to set a parameter with pipeline.set_params(rf__verbose=1).fit(X,y), I get the following error: RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no
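One common workaround for that error (a sketch, not the asker's exact code) is to leave __init__ alone and override build_analyzer instead, so the signature that get_params()/set_params() inspect is still CountVectorizer's own:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

class StemmedCountVectorizer(CountVectorizer):
    # No custom __init__: all parent parameters keep working with get_params/set_params.
    def build_analyzer(self):
        stemmer = SnowballStemmer("english")
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]

vec = StemmedCountVectorizer(ngram_range=(1, 2))  # parent parameters still accepted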

TypeError: expected string or bytes-like object HashingVectorizer

独自空忆成欢 submitted on 2020-01-15 23:54:19
Question: I have been facing this issue while fitting the dataset. Everything seems fine; I don't know where the problem is. Since I'm a beginner, could anyone please tell me what I am doing wrong or whether I am missing something? The problem seems to be in the data preprocessing part. The error trace and the dataframe's head are attached as images below.
train = pd.read_csv('train.txt', sep='\t', dtype=str, header=None)
test = pd.read_csv('test.txt', sep='\t', dtype=str, header=None)
X_train = train.iloc[:,1:]
y
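A guess at the usual cause, since the attached images are not reproduced here: the vectorizer expects a one-dimensional iterable of strings, so passing a whole DataFrame slice (or a column containing NaN) raises this TypeError. A hedged sketch of the fix, assuming column 0 holds the label and column 1 the text:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

train = pd.read_csv('train.txt', sep='\t', dtype=str, header=None)

X_train = train[1].fillna("")  # a single column as a Series of strings (assumed layout)
y_train = train[0]             # the label column (assumed layout)

vectorizer = HashingVectorizer(n_features=2**16)
X_train_vec = vectorizer.fit_transform(X_train)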

Nested dict of lists to pandas DataFrame

筅森魡賤 submitted on 2020-01-06 07:14:05
Question: I have a rather messy nested dictionary that I am trying to convert to a pandas DataFrame. The data is stored in a dictionary of lists contained in a broader dictionary, where each key/value breakdown is as follows: {userID_key: {postID_key: [list of hash tags]}} Here's a more specific example of what the data looks like:
{'user_1': {'postID_1': ['#fitfam', '#gym', '#bro'], 'postID_2': ['#swol', '#anotherhashtag']}, 'user_2': {'postID_78': ['#ripped', '#bro', '#morehashtags'], 'postID_1': ['#buff'
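A sketch of one way to flatten that shape into tidy rows, assuming the {user: {post: [hashtags]}} structure shown above (one row per user/post pair):

import pandas as pd

data = {'user_1': {'postID_1': ['#fitfam', '#gym', '#bro'],
                   'postID_2': ['#swol', '#anotherhashtag']},
        'user_2': {'postID_78': ['#ripped', '#bro', '#morehashtags']}}

rows = [{"user": user, "post": post, "hashtags": tags}
        for user, posts in data.items()
        for post, tags in posts.items()]
df = pd.DataFrame(rows)
print(df)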

Scala Spark - split vector column into separate columns in a Spark DataFrame

心已入冬 submitted on 2019-12-29 08:01:47
Question: I have a Spark DataFrame with a column containing Vector values. The vector values are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each of which corresponds to one element in the vector.

some_columns... | Features
...             | [0,1,0,..., 0]

to

some_columns... | f1 | f2 | f3 | ... | fn
...             | 0  | 1  | 0  | ... | 0

What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row
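The question is asked in Scala, but a PySpark sketch of the same idea may help (assuming Spark >= 3.0 and an existing DataFrame df with the Features column): convert the Vector column to an array, then select one aliased element per name.

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array

names = ["f1", "f2", "f3"]  # one name per vector element, as in the question

df = df.withColumn("_arr", vector_to_array("Features"))
df = df.select(
    *[c for c in df.columns if c not in ("Features", "_arr")],
    *[F.col("_arr")[i].alias(name) for i, name in enumerate(names)],
)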

How to use bigrams + trigrams + word-marks vocabulary in countVectorizer?

流过昼夜 submitted on 2019-12-11 15:17:10
Question: I'm using text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of: bigrams + trigrams + word-marks vocabulary. By word-marks he means the words that are specific to a certain dialect. How can I tweak those parameters in CountVectorizer? [image: examples of word marks] Those are examples of word marks, but they aren't what I have, because mine are Arabic, so I translated them: word_marks=['love', 'funny', 'happy',
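One way to combine the two feature sets (a sketch under the assumption that word_marks is the translated list and docs/labels are the training data, none of which are fully shown) is to union an n-gram CountVectorizer with a second CountVectorizer restricted to the word-mark vocabulary:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

word_marks = ['love', 'funny', 'happy']  # stands in for the translated Arabic list

features = FeatureUnion([
    ('ngrams', CountVectorizer(ngram_range=(2, 3))),    # bigrams + trigrams
    ('marks', CountVectorizer(vocabulary=word_marks)),  # fixed word-mark vocabulary
])

model = Pipeline([('features', features), ('nb', MultinomialNB())])
# model.fit(docs, labels)  # docs: list of strings, labels: dialect tags (assumed)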

Using countVectorizer to compute word occurrence for my own vocabulary in python

倾然丶 夕夏残阳落幕 submitted on 2019-12-05 18:30:24
Doc1: ['And that was the fallacy. Once I was free to talk with staff members']
Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']
Doc3: ['Another reality makes emotional intelligence ever more crucial']
Doc4: ['The globalization of the workforce puts a particular premium on emotional']
Doc5: ['As business changes, so do the traits needed to excel. Data tracking']
and this is a sample of my vocabulary:
my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']
The point is every word in my vocabulary is
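A sketch using the documents and vocabulary shown above: because the vocabulary entries are multi-word phrases, ngram_range has to span their lengths (here 1 to 3 words). Note that the default tokenizer splits on punctuation, so a hyphenated entry such as 'stripped-down' will only be counted if it is written the way the analyzer produces it ('stripped down').

from sklearn.feature_extraction.text import CountVectorizer

docs = ['And that was the fallacy. Once I was free to talk with staff members',
        'In the new, stripped-down, every-job-counts business climate, these human',
        'Another reality makes emotional intelligence ever more crucial',
        'The globalization of the workforce puts a particular premium on emotional',
        'As business changes, so do the traits needed to excel. Data tracking']
my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more',
                 'of the workforce', 'the traits needed']

vectorizer = CountVectorizer(vocabulary=my_vocabulary, ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)   # rows = documents, columns = vocabulary phrases
print(X.toarray())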