text-classification

Scalable or online out-of-core multi-label classifiers

时光怂恿深爱的人放手 submitted on 2019-12-02 23:31:23
I have been racking my brains over this problem for the past 2-3 weeks. I have a multi-label (not multi-class) problem in which each sample can belong to several labels. I have around 4.5 million text documents as training data and around 1 million as test data, with around 35K labels. I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not scalable enough for the number of documents I have. vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words=
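One way to keep this setup out-of-core is to pair the stateless HashingVectorizer with a learner that supports `partial_fit`, training one binary model per label on each batch. The following is only a minimal sketch under assumptions: the toy batches, the `iter_batches` helper, the three labels and the choice of `SGDClassifier` are illustrative and not taken from the question, and one model per label may itself be too heavy for 35K labels.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

N_LABELS = 3  # hypothetical; the question mentions around 35K labels


def iter_batches():
    """Hypothetical stand-in for reading (documents, label matrix) batches from disk."""
    yield (["cheap flights to paris", "python pandas dataframe"],
           np.array([[1, 0, 0], [0, 1, 0]]))
    yield (["paris hotel deals", "numpy array python error"],
           np.array([[1, 0, 1], [0, 1, 0]]))


# HashingVectorizer keeps no vocabulary, so each batch can be transformed on its own.
vect = HashingVectorizer(n_features=2**18, alternate_sign=False)

# One binary linear model per label, updated incrementally.
models = [SGDClassifier(random_state=0) for _ in range(N_LABELS)]

for docs, Y in iter_batches():
    X = vect.transform(docs)
    for j, clf in enumerate(models):
        # classes must be passed so a batch with only one class present is accepted
        clf.partial_fit(X, Y[:, j], classes=np.array([0, 1]))


def predict(docs):
    """Return a (n_docs, N_LABELS) 0/1 indicator matrix."""
    X = vect.transform(docs)
    return np.column_stack([clf.predict(X) for clf in models])


pred = predict(["python numpy question"])
```

Only one batch is held in memory at a time, so memory use depends on the batch size and the number of label models rather than on the corpus size.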

How to classify URLs? What are URL features? How to select and extract features from a URL?

人盡茶涼 submitted on 2019-12-02 22:52:46
I have just started to work on a classification problem. It is a two-class problem: my trained machine-learning model will have to decide whether to allow a URL or block it. My question is very specific. How do I classify URLs? Should I use normal text analysis methods? What are the features of a URL? How do I select and extract features from a URL? I assume you do not have access to the content of the URL, so you can only extract features from the URL string itself. Otherwise it makes more sense to use the content of the URL. Here are some features I would try. See this paper for more ideas: All url
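As an illustration of extracting features from the URL string alone, here is a small sketch using only the Python standard library. The specific features (length, digit count, path depth, character entropy and so on) are assumptions chosen for the example; the answer's own feature list is cut off above and may differ.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def url_features(url):
    """Compute a few lexical features from the URL string only (illustrative choices)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    counts = Counter(url)
    # Shannon entropy of the characters; random-looking URLs tend to score higher.
    entropy = -sum((c / len(url)) * math.log2(c / len(url))
                   for c in counts.values()) if url else 0.0
    return {
        "length": len(url),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "has_query": bool(parsed.query),
        "uses_https": parsed.scheme == "https",
        # Crude IPv4 check: the host contains only digits and dots.
        "host_is_ip": host != "" and host.replace(".", "").isdigit(),
        "char_entropy": entropy,
    }


features = url_features("https://example.com/a/b?x=1")
```

The resulting dictionaries can be turned into a numeric matrix (for example with scikit-learn's `DictVectorizer`) and passed to any binary classifier.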

Multilabel Text Classification using TensorFlow

不问归期 submitted on 2019-12-02 14:18:07
The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ...., 0]. The i-th element indicates the frequency of the i-th word in a text. The ground-truth label data is also represented as a vector with 4,000 elements, like [0, 0, 1, 0, 1, ...., 0]. The i-th element indicates whether the i-th label is a positive label for a text. The number of labels differs from text to text. I have code for single-label text classification. How can I edit the following code for multi-label text classification? In particular, I would like to know the following points. How to compute accuracy
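On the accuracy question, one common convention for multi-label outputs is to apply a sigmoid to each label's logit independently, threshold it, and then report either per-label accuracy or exact-match accuracy. The sketch below uses NumPy rather than TensorFlow, and the 0.5 threshold and toy arrays are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def multilabel_accuracy(logits, y_true, threshold=0.5):
    """Return (per-label accuracy, exact-match accuracy) for 0/1 label matrices."""
    # Each label is decided independently, unlike softmax + argmax in the single-label case.
    y_pred = (sigmoid(logits) >= threshold).astype(int)
    per_label = float((y_pred == y_true).mean())
    exact_match = float((y_pred == y_true).all(axis=1).mean())
    return per_label, exact_match


# Toy example: 2 texts, 3 labels.
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, 1.5, -0.2]])
y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])
per_label, exact_match = multilabel_accuracy(logits, y_true)
```

With 4,000 mostly-zero labels, per-label accuracy can look high even for a poor model, so exact-match accuracy or precision/recall per label may be more informative.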

How to assign a new observation to existing K-means clusters based on nearest cluster centroid logic in Python?

こ雲淡風輕ζ submitted on 2019-12-02 11:13:41
I used the code below to create k-means clusters using scikit-learn. kmean = KMeans(n_clusters=nclusters,n_jobs=-1,random_state=2376,max_iter=1000,n_init=1000,algorithm='full',init='k-means++') kmean_fit = kmean.fit(clus_data) I have also saved the centroids using kmean_fit.cluster_centers_. I then pickled the k-means object. filename = pickle_path+'\\'+'_kmean_fit.sav' pickle.dump(kmean_fit, open(filename, 'wb')) This is so that I can load the same k-means pickle object and apply it to new data when it arrives, using kmean_fit.predict(). Questions: Will the approach of loading kmeans pickle object and
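The nearest-centroid logic itself can be sketched directly from saved centroids: compute the distance from each new observation to every centroid and take the closest one. This is a minimal NumPy sketch assuming Euclidean distance; the toy centroids and the `assign_to_nearest_centroid` name are hypothetical, and any scaling applied to the training data would need to be applied to new data as well.

```python
import numpy as np


def assign_to_nearest_centroid(X_new, centroids):
    """Return, for each row of X_new, the index of the closest centroid (Euclidean)."""
    X_new = np.atleast_2d(X_new)
    # Squared distances, shape (n_samples, n_clusters), via broadcasting.
    d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# Hypothetical centroids, standing in for a saved kmean_fit.cluster_centers_ array.
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0]])
new_points = np.array([[1.0, 2.0],
                       [9.0, 8.0]])
labels = assign_to_nearest_centroid(new_points, centroids)
```

Because the centroids are not updated, new observations never change the existing clusters; they are only labelled by them.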

Cannot freeze TensorFlow models into a frozen (.pb) file

a 夏天 submitted on 2019-12-02 07:21:33
Question: I am referring to (here) for freezing models into a .pb file. My model is a CNN for text classification. I am using the (Github) link to train the CNN for text classification and export it in the form of models. I have trained the models for 4 epochs, and my checkpoints folder looks as follows: I want to freeze this model into a .pb file. For that I am using the following script: import os, argparse import tensorflow as tf # The original freeze_graph function # from tensorflow.python.tools.freeze_graph import freeze_graph dir

Cannot freeze TensorFlow models into a frozen (.pb) file

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-02 06:27:37
I am referring to (here) for freezing models into a .pb file. My model is a CNN for text classification. I am using the (Github) link to train the CNN for text classification and export it in the form of models. I have trained the models for 4 epochs, and my checkpoints folder looks as follows: I want to freeze this model into a .pb file. For that I am using the following script: import os, argparse import tensorflow as tf # The original freeze_graph function # from tensorflow.python.tools.freeze_graph import freeze_graph dir = os.path.dirname(os.path.realpath(__file__)) def freeze_graph(model_dir, output_node_names): ""

inconsistent shape error MultiLabelBinarizer on y_test, sklearn multi-label classification

旧巷老猫 submitted on 2019-12-02 04:41:54
import numpy as np import pandas as pd from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import CountVectorizer from sklearn.svm import LinearSVC from sklearn.linear_model import SGDClassifier from sklearn.feature_extraction.text import TfidfTransformer from sklearn.multiclass import OneVsRestClassifier from sklearn.metrics import accuracy_score, classification_report, confusion_matrix from sklearn.model_selection import train_test_split from sklearn import preprocessing from sklearn.svm import SVC data = r'C:\Users\...\Downloads\news_v1.xlsx' df = pd.read_excel(data)
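The title points at a shape mismatch between the binarized training and test labels. One common cause, assumed here because the rest of the code is cut off, is fitting a separate MultiLabelBinarizer on y_test so that its columns differ from those of y_train. A minimal sketch of fitting once and reusing the same binarizer, with hypothetical toy labels:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists; the question's real data comes from an Excel file.
y_train = [["sports"], ["politics", "economy"], ["economy"]]
y_test = [["sports", "economy"], ["politics"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the label set and column order
Y_test = mlb.transform(y_test)        # reuses the same columns; do not refit here

classes = list(mlb.classes_)
```

Both matrices then have one column per label seen in training, so metrics such as `classification_report` compare like with like.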

expected dense to have shape but got array with shape

廉价感情. submitted on 2019-12-01 16:02:57
I am getting the following error when calling the model.predict function while running a text classification model in Keras. I have searched everywhere, but nothing I found works for me. ValueError: Error when checking input: expected dense_1_input to have shape (100,) but got array with shape (1,) My data has 5 classes and a total of only 15 examples. Below is the dataset query tags 0 hi intro 1 how are you wellb 2 hello intro 3 what's up wellb 4 how's life wellb 5 bye gb 6 see you later gb 7 good bye gb 8 thanks gratitude 9 thank you gratitude 10 that's helpful gratitude 11 I am great
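The error message says the first layer expects 100 features per example but received an array whose examples have a single element. One plausible cause, which is an assumption since the model code is not shown, is passing the raw or partially encoded query to predict instead of a vector built the same way as the training inputs, with a leading batch dimension. The sketch below uses NumPy only; the `encode` helper and the tiny vocabulary are hypothetical stand-ins for whatever preprocessing was used at training time.

```python
import numpy as np

INPUT_DIM = 100  # taken from the error message: expected shape (100,)


def encode(query, vocab):
    """Hypothetical bag-of-words encoder producing a fixed-length vector."""
    vec = np.zeros(INPUT_DIM, dtype="float32")
    for token in query.lower().split():
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] += 1.0
    return vec


vocab = {"hi": 0, "hello": 1, "bye": 2}  # hypothetical; must match training
x = encode("hi", vocab)    # shape (100,): one example
batch = x[np.newaxis, :]   # shape (1, 100): a batch containing one example

# With the asker's trained model (not defined here), the call would be:
# model.predict(batch)
```

The key point is that predict receives an array of shape (n_examples, 100), even when n_examples is 1.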

expected dense to have shape but got array with shape

我只是一个虾纸丫 submitted on 2019-12-01 14:38:16
Question: I am getting the following error when calling the model.predict function while running a text classification model in Keras. I have searched everywhere, but nothing I found works for me. ValueError: Error when checking input: expected dense_1_input to have shape (100,) but got array with shape (1,) My data has 5 classes and a total of only 15 examples. Below is the dataset query tags 0 hi intro 1 how are you wellb 2 hello intro 3 what's up wellb 4 how's life wellb 5 bye gb 6 see you later gb 7

R: LIME returns error on different feature numbers when it's not the case

心不动则不痛 submitted on 2019-12-01 12:08:44
I'm building a text classifier of Clinton & Trump tweets (data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package: library(dplyr) library(stringr) library(quanteda) library(lime) #data prep tweet_csv <- read_csv("tweets.csv") tweet_data <- tweet_csv %>% select(author = handle, text, retweet_count, favorite_count, source_url, timestamp = time) %>% mutate(date = as_date(str_sub(timestamp, 1, 10)), hour = hour(hms(str_sub(timestamp, 12, 19))), tweet_num = row_number()) %>% select(-timestamp) # creating corpus and dfm tweet_corpus <- corpus(tweet_data) edited_dfm <- dfm