sklearn-pandas

Create Sparse Matrix in Python

最后都变了- posted on 2019-12-23 04:58:17
Question: I'm working with data and would like to create a sparse matrix to use later for clustering.

    fileHandle = open('data', 'r')
    for line in fileHandle:
        json_list = []
        fields = line.split('\t')
        json_list.append(fields[0])
        json_list.append(fields[1])
        json_list.append(fields[3])

Right now the data looks like this:

    term, ids, quantity
    ['buick', '123,234', '500']
    ['chevy', '345,456', '300']
    ['suv','123', '100']

The output I need would look like this:

    term, quantity, '123', '234', '345',
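The desired output is cut off in the excerpt, so the following is only a sketch under one reading of it: that each distinct id should become its own indicator column. It uses scikit-learn's `MultiLabelBinarizer` with `sparse_output=True`; the `rows` list simply repeats the three sample rows from the question, and the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Sample rows copied from the question: term, comma-separated ids, quantity.
rows = [
    ["buick", "123,234", "500"],
    ["chevy", "345,456", "300"],
    ["suv", "123", "100"],
]

terms = [r[0] for r in rows]
quantities = [int(r[2]) for r in rows]
id_lists = [r[1].split(",") for r in rows]  # "123,234" -> ["123", "234"]

# sparse_output=True returns a SciPy sparse matrix with one column per distinct id.
mlb = MultiLabelBinarizer(sparse_output=True)
id_matrix = mlb.fit_transform(id_lists)

print(list(mlb.classes_))   # column order of the id indicators
print(id_matrix.toarray())  # dense view, only sensible for small data
```

The `terms` and `quantities` lists stay aligned with the matrix rows, so they can be kept alongside it or stacked in with `scipy.sparse.hstack` if the quantity should be a feature.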

Train multiple different sklearn models in parallel

本小妞迷上赌 posted on 2019-12-22 13:53:23
Question: Is it possible to train multiple different sklearn models in parallel? For example, I'd like to train one SVM, one RandomForest and one Linear Regression model at the same time. The desired output would be a list of the objects returned by the .fit method.

Answer 1:
"Is it possible to train in parallel multiple different sklearn models?"
Training multiple models? YES.
Training multiple models in true-[PARALLEL] scheduling fashion? NO.
Training one particular model, using some sort of low-level, fine
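The answer above is cut off, so the following is not its method. It is a minimal sketch of one common way to fit several different estimators concurrently, using `joblib.Parallel` with worker processes. The synthetic data and the regressor variants (`SVR`, `RandomForestRegressor`) are assumptions chosen so that all three models share one regression target.

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Synthetic data standing in for the asker's dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

models = [
    SVR(),
    RandomForestRegressor(n_estimators=20, random_state=0),
    LinearRegression(),
]

# Each fit call is dispatched to a separate joblib worker. fit() returns the
# estimator itself, so the result is a list of fitted models in input order.
fitted = Parallel(n_jobs=3)(delayed(m.fit)(X, y) for m in models)

for m in fitted:
    print(type(m).__name__, round(m.score(X, y), 3))
```

How much wall-clock time this saves depends on the number of cores and on whether the individual estimators already use several threads internally.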

Read multiple CSV files in Pandas in chunks

删除回忆录丶 posted on 2019-12-22 10:29:28
Question: How can I import and read multiple CSV files in chunks when the total size of all the files is around 20 GB? I don't want to use Spark because I want to use a model in sklearn, so I want a solution in pandas itself. My code is:

    allFiles = glob.glob(os.path.join(path, "*.csv"))
    df = pd.concat((pd.read_csv(f,sep=",") for f in allFiles))
    df.reset_index(drop=True, inplace=True)

But this fails because the total size of all the CSV files in my path is 17 GB. I want to read it in chunks but I
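A minimal sketch of chunked reading across several files, assuming the files share the same columns. `pd.read_csv` with `chunksize` returns an iterator of DataFrames rather than one large frame. The helper name `iter_csv_chunks` and the tiny temporary files are illustrative only.

```python
import glob
import os
import tempfile

import pandas as pd


def iter_csv_chunks(pattern, chunksize):
    """Yield DataFrame chunks from every file matching the glob pattern."""
    for filename in sorted(glob.glob(pattern)):
        # With chunksize set, read_csv yields DataFrames of at most that many rows.
        for chunk in pd.read_csv(filename, sep=",", chunksize=chunksize):
            yield chunk


# Demo on two small throwaway files of 5 rows each.
with tempfile.TemporaryDirectory() as path:
    for i in range(2):
        pd.DataFrame({"a": range(5), "b": range(5)}).to_csv(
            os.path.join(path, f"part{i}.csv"), index=False
        )

    total_rows = 0
    n_chunks = 0
    for chunk in iter_csv_chunks(os.path.join(path, "*.csv"), chunksize=2):
        total_rows += len(chunk)  # replace with real per-chunk processing
        n_chunks += 1

print(total_rows, n_chunks)
```

Only one chunk is held in memory at a time. For model training, this pairs naturally with scikit-learn estimators that support `partial_fit`, such as `SGDClassifier`, which can be updated one chunk at a time.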

Reverse Label Encoding giving error

﹥>﹥吖頭↗ posted on 2019-12-22 09:45:47
Question: I label-encoded my categorical data into numerical data using LabelEncoder:

    data['Resi'] = LabelEncoder().fit_transform(data['Resi'])

But when I try to find out how the values are mapped internally using

    list(LabelEncoder.inverse_transform(data['Resi']))

I get the error below:

    TypeError Traceback (most recent call last)
    <ipython-input-67-419ab6db89e2> in <module>()
    ----> 1 list(LabelEncoder.inverse_transform(data['Resi']))
    TypeError: inverse_transform() missing 1 required positional argument: 'y'
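The traceback is consistent with `inverse_transform` being called on the `LabelEncoder` class instead of on a fitted instance, so the column is bound to `self` and `y` is missing. A minimal sketch of the usual fix is to keep the fitted encoder in a variable. The sample values in `data` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the asker's column.
data = pd.DataFrame({"Resi": ["yes", "no", "yes", "maybe"]})

# Keep a reference to the fitted encoder instead of discarding it.
encoder = LabelEncoder()
data["Resi"] = encoder.fit_transform(data["Resi"])

# classes_ holds the original labels; the position of each label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
original = list(encoder.inverse_transform(data["Resi"]))

print(mapping)
print(original)
```

An encoder that was created inline and thrown away, as in the question, cannot be recovered afterwards; the column has to be encoded again from the original values.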

Scikit K-means clustering performance measure

拟墨画扇 posted on 2019-12-20 19:39:34
Question: I'm trying to do clustering with the K-means method, and I would like to measure the performance of my clustering. I'm not an expert, but I am eager to learn more about clustering. Here is my code:

    import pandas as pd
    from sklearn import datasets

    # loading the dataset
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data)

    # K-Means
    from sklearn import cluster
    k_means = cluster.KMeans(n_clusters=3)
    k_means.fit(df)  # K-means training
    y_pred = k_means.predict(df)

    # We store the K-means results in a
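As a rough sketch of how such a clustering could be scored, the block below computes two metrics from `sklearn.metrics` on the same iris data: the silhouette score, which needs no ground truth, and the adjusted Rand index, which compares the clusters with the known species labels. The `n_init` and `random_state` values are assumptions added for reproducibility.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score, silhouette_score

iris = datasets.load_iris()
X = iris.data

k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = k_means.fit_predict(X)

# Internal metric: uses only the data and the cluster assignments (range -1 to 1).
sil = silhouette_score(X, y_pred)

# External metric: needs true labels, which iris happens to provide.
ari = adjusted_rand_score(iris.target, y_pred)

print("silhouette:", sil)
print("adjusted Rand index:", ari)
print("inertia:", k_means.inertia_)
```

When no true labels exist, only internal measures such as the silhouette score or `inertia_` are available, and they are best used to compare different values of `n_clusters` rather than read as absolute quality.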

sklearn stratified sampling based on a column

泄露秘密 posted on 2019-12-20 09:56:36
Question: I have a fairly large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but while doing so I want to ensure that the split data proportionally represents the values of one column (Categories), i.e. all the different review categories are present in both the train and test data in proportion. The data looks like this:

    ReviewerID  ReviewText    Categories  ProductId
    1212        good product  Mobile      14444425
    1233        will
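A minimal sketch of a stratified 80-20 split using the `stratify` parameter of `train_test_split`. The DataFrame below is synthetic; only the `Categories` column name comes from the question, and the category values other than `Mobile` are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the review data: 60% Mobile, 30% Books, 10% Toys.
df = pd.DataFrame({
    "ReviewText": [f"review {i}" for i in range(100)],
    "Categories": ["Mobile"] * 60 + ["Books"] * 30 + ["Toys"] * 10,
})

# stratify keeps the category proportions the same in both parts.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["Categories"], random_state=42
)

print(train["Categories"].value_counts())
print(test["Categories"].value_counts())
```

Stratification needs at least two rows for every category, so very rare categories may have to be merged or dropped before splitting.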

Indexing a CSV running into inconsistent number of samples for logistic regression

南笙酒味 posted on 2019-12-14 03:27:09
Question: I'm currently indexing a CSV with the values below and running into this error:

    ValueError: Found input variables with inconsistent numbers of samples: [1, 514]

It is treating the data as 1 row with 514 columns, which suggests that I have passed a parameter incorrectly. Or is it due to my removing NaNs (which most of the data would default to)?

    "Classification","DGMLEN","IPLEN","TTL","IP"
    "1","0.000000","192.168.1.5","185.60.216.35","TLSv1.2"
    "2","0.000160","192.168.1.5","185.60.216.35","TCP"
    "3","0
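The asker's code is not shown in the excerpt, so the actual cause is unknown. As an illustration only, the sketch below reproduces the same message by passing a feature array of shape `(1, 514)` alongside 514 labels, then fixes it by reshaping to `(514, 1)`. The random data is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
values = rng.normal(size=514)   # one feature, 514 observations
y = (values > 0).astype(int)    # 514 labels

# Wrong orientation: one sample with 514 features.
X_wrong = values.reshape(1, -1)
try:
    LogisticRegression().fit(X_wrong, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Right orientation: 514 samples with one feature each.
X_right = values.reshape(-1, 1)
model = LogisticRegression().fit(X_right, y)
print(X_right.shape, y.shape)
```

Printing `X.shape` and `y.shape` just before the `fit` call is usually the quickest way to see which of the two inputs has the unexpected orientation.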

Number of features of the model must match the input. Model n_features is 40 and input n_features is 38

一曲冷凌霜 posted on 2019-12-13 23:53:02
Question: I am getting this error. Please give me any suggestions for resolving it. Here is my code. I am taking the training data from train.csv and the testing data from another file, test.csv. I am new to machine learning, so I cannot understand what the problem is.

    import quandl, math
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib import style
    import datetime
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import
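The code is cut off before any preprocessing, so the reason for the 40 versus 38 mismatch cannot be confirmed. One frequent cause, assumed here purely for illustration, is one-hot encoding the train and test files separately, so that categories missing from the test file produce fewer columns. The sketch aligns the test columns to the training columns with `DataFrame.reindex`; the toy data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical train/test frames; the test set lacks two of the colors.
train_raw = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
test_raw = pd.DataFrame({"color": ["red", "red"], "size": [4, 5]})

# Encoding each frame on its own yields different column sets.
X_train = pd.get_dummies(train_raw)
X_test = pd.get_dummies(test_raw)
print(X_train.shape[1], X_test.shape[1])

# Reindex the test frame to the training columns: absent dummies become 0,
# and any column unseen during training is dropped.
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
print(list(X_test_aligned.columns))
```

Comparing `set(X_train.columns) - set(X_test.columns)` shows exactly which features are missing before the model is asked to predict.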

Is there a way to see the order of nodes categorizing data in decision trees when not allowed to install graphviz nor pydotplus?

江枫思渺然 posted on 2019-12-13 03:18:36
Question: I need to know the order of the nodes and the scores for each one once I have run the decision tree model. As I'm working on my office computer, installations are very restricted, and I'm not allowed to download graphviz or pydotplus. It doesn't matter that there is no graphic representation of the model; I just want to know the classification order/process the algorithm is using. I'm using sklearn.tree, sklearn.metrics, and sklearn.cross_validation.

Answer 1: You can make use of plot_tree
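The answer is cut off after naming `plot_tree`, which draws the tree with matplotlib. As a text-only sketch that is not taken from the answer, `sklearn.tree.export_text` prints the split rules, and the fitted `tree_` attribute exposes each node's feature, threshold and impurity. Both need neither graphviz nor pydotplus. `export_text` requires a reasonably recent scikit-learn (0.21 or later), which may not match the version implied by `sklearn.cross_validation`. The iris data and `max_depth=2` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Human-readable rules, in the order the tree applies them.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Lower-level view: walk the node arrays directly.
tree = clf.tree_
for node_id in range(tree.node_count):
    if tree.children_left[node_id] == -1:  # -1 marks a leaf
        print(f"node {node_id}: leaf, impurity={tree.impurity[node_id]:.3f}")
    else:
        name = iris.feature_names[tree.feature[node_id]]
        print(f"node {node_id}: {name} <= {tree.threshold[node_id]:.2f}, "
              f"impurity={tree.impurity[node_id]:.3f}")
```

The `tree_` arrays are available in older releases as well, so the second half is the fallback when `export_text` is missing.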

Sklearn LabelEncoder throws TypeError in sort

走远了吗. posted on 2019-12-12 10:36:08
Question: I am learning machine learning using the Titanic dataset from Kaggle. I am using sklearn's LabelEncoder to transform text data into numeric labels. The following code works fine for "Sex" but not for "Embarked".

    encoder = preprocessing.LabelEncoder()
    features["Sex"] = encoder.fit_transform(features["Sex"])
    features["Embarked"] = encoder.fit_transform(features["Embarked"])

This is the error I got:

    Traceback (most recent call last):
      File "../src/script.py", line 20, in <module>
        features["Embarked"] =
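The traceback is cut off, so the cause is not confirmed by the excerpt. A likely explanation, assumed here, is that the Titanic `Embarked` column contains missing values, and sorting a mix of strings and float `NaN` raises a `TypeError`. The sketch fills the gaps before encoding and uses one encoder per column so that each mapping stays available. The four-row frame is a hypothetical stand-in for the Kaggle data.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical miniature of the Titanic features, with one missing port.
features = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Fill missing ports with the most frequent value (one possible choice).
most_common = features["Embarked"].mode()[0]
features["Embarked"] = features["Embarked"].fillna(most_common)

# A separate encoder per column keeps both label mappings.
sex_encoder = preprocessing.LabelEncoder()
embarked_encoder = preprocessing.LabelEncoder()
features["Sex"] = sex_encoder.fit_transform(features["Sex"])
features["Embarked"] = embarked_encoder.fit_transform(features["Embarked"])

print(list(embarked_encoder.classes_))
print(features)
```

Checking `features["Embarked"].isnull().sum()` on the real data before encoding shows whether missing values are in fact present.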