categorical-data

Reveal k-modes cluster features

独自空忆成欢 submitted on 2019-12-02 21:24:51
I'm performing a cluster analysis on categorical data, hence using the k-modes approach. My data comes from a preference survey: how do you like hair and eyes? Each respondent picks one answer from a fixed (multiple-choice) set of 4 possibilities. I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA. My code looks like:

    import numpy as np
    import pandas as pd
    from kmodes import kmodes

    df_dummy = pd.get_dummies(df)
    # transform into numpy array
    x = df_dummy.reset_index().values
    km = kmodes.KModes(n_clusters=3, init='Huang', n
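One way to see which features characterise each cluster is to look at the most frequent answer per question within each cluster. The sketch below is only an illustration under stated assumptions: the survey rows and the cluster labels are made up, the labels stand in for whatever a k-modes fit would return, and the helper name `cluster_modes` is hypothetical. It works on the original categorical answers rather than on the dummy columns.

```python
from collections import Counter, defaultdict

# Hypothetical survey answers (not from the post): one dict per respondent.
rows = [
    {"hair": "blonde", "eyes": "blue"},
    {"hair": "blonde", "eyes": "green"},
    {"hair": "brown", "eyes": "brown"},
    {"hair": "brown", "eyes": "brown"},
    {"hair": "black", "eyes": "brown"},
]
# Hypothetical cluster labels, standing in for the output of a k-modes fit.
labels = [0, 0, 1, 1, 1]


def cluster_modes(rows, labels):
    """Return {cluster: {column: most frequent value in that cluster}}."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for column, value in row.items():
            counts[label][column][value] += 1
    return {
        label: {col: counter.most_common(1)[0][0] for col, counter in cols.items()}
        for label, cols in counts.items()
    }


modes = cluster_modes(rows, labels)
for label in sorted(modes):
    print(label, modes[label])
```

Comparing these per-cluster modes side by side shows which answers separate the clusters, which a 2D PCA plot of dummy columns does not show directly.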

Crosstab with multiple items

蓝咒 submitted on 2019-12-02 19:46:27
In SPSS, it is (relatively) easy to create a cross tab with multiple variables using the factors (or values) as the table heading. So, something like the following (made-up data, etc.). Q1, Q2, and Q3 each have either a 1, a 2, or a 3 for each person. I just left these as numbers, but they could be factors; neither seemed to help solve the problem.

                      1 (Very Often)   2 (Rarely)   3 (Never)
    Q1. Likes it            12             15           13
    Q2. Recommends it       22             11           10
    Q3. Used it             22             12            9

In SPSS, one can even request row, column, or total percentages. I've tried table(), ftable(), xtab(), CrossTable() from gmodels, and CrossTable
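The table layout described above (one row per item, one column per response value) amounts to counting each item's answers separately and stacking the counts. Here is a minimal Python sketch of that idea; the responses are invented for illustration, and the helper names `items_by_levels` and `row_percentages` are hypothetical.

```python
from collections import Counter

# Hypothetical responses (not the data from the post): each person answers 1, 2 or 3.
responses = [
    {"Q1": 1, "Q2": 1, "Q3": 2},
    {"Q1": 2, "Q2": 1, "Q3": 1},
    {"Q1": 3, "Q2": 2, "Q3": 1},
    {"Q1": 1, "Q2": 3, "Q3": 1},
]
levels = [1, 2, 3]


def items_by_levels(responses, items, levels):
    """Count, for every item, how many people chose each response level."""
    table = {}
    for item in items:
        counts = Counter(person[item] for person in responses)
        table[item] = [counts[level] for level in levels]
    return table


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for item, counts in table.items():
        total = sum(counts)
        result[item] = [100.0 * c / total if total else 0.0 for c in counts]
    return result


table = items_by_levels(responses, ["Q1", "Q2", "Q3"], levels)
percentages = row_percentages(table)
for item in table:
    print(item, table[item], percentages[item])
```

Column or total percentages follow the same pattern, dividing by the column sums or by the grand total instead of the row sums.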

Problems with a binary one-hot (one-of-K) coding in python

时光怂恿深爱的人放手 submitted on 2019-12-02 19:45:45
Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me. Pandas and get_dummies in the
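To make the encoding described above concrete, here is a small pure-Python sketch. It assumes that 'unknown' (and any value not seen when the categories were collected) should map to all zeros, which matches the three-column example in the question. The function names `fit_categories` and `one_hot` are hypothetical and are not part of pandas or scikit-learn.

```python
def fit_categories(values, ignore=("unknown",)):
    """Collect the distinct categories in first-seen order, skipping ignored values."""
    categories = []
    for value in values:
        if value not in ignore and value not in categories:
            categories.append(value)
    return categories


def one_hot(values, categories, name):
    """Return column names and one 0/1 row per value; unseen values become all zeros."""
    columns = [f"{name}={category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return columns, rows


train = ["red", "blue", "yellow", "unknown", "red"]
categories = fit_categories(train)
columns, encoded = one_hot(train, categories, "color")

# Reusing the same categories for new data keeps the columns aligned with training.
_, new_encoded = one_hot(["blue", "green"], categories, "color")

print(columns)
print(encoded)
print(new_encoded)
```

Separating the "collect categories" step from the "encode" step is the design point: the column set is fixed once on the training data and then reused unchanged on any later data.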

Python equivalent of daisy() in the cluster package of R

放肆的年华 submitted on 2019-12-02 19:30:53
I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

    if(!require("cluster")) { install.packages("cluster"); require("cluster") }
    data(flower)
    as.matrix(daisy(flower, metric = "gower"))

This uses the Gower metric to deal with the nominal variables. Is there a Python equivalent of the daisy() function in R? Or maybe any other module function that
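For reference, the basic Gower dissimilarity can be written in a few lines of plain Python. This is a simplified sketch and not a drop-in replacement for daisy(): it assumes no missing values, equal weights, numeric columns scaled by their range, and every other column treated as nominal (ordinal columns are not given special treatment here). The data and the function name `gower_matrix` are hypothetical.

```python
def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarity for a list of dicts.

    numeric_cols: names of columns treated as numeric (range-scaled absolute
    difference); every other column is treated as nominal (0 if equal, else 1).
    """
    columns = list(rows[0].keys())
    ranges = {}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        ranges[col] = max(values) - min(values)

    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for col in columns:
                a, b = rows[i][col], rows[j][col]
                if col in numeric_cols:
                    # A constant column contributes no dissimilarity.
                    total += abs(a - b) / ranges[col] if ranges[col] else 0.0
                else:
                    total += 0.0 if a == b else 1.0
            matrix[i][j] = matrix[j][i] = total / len(columns)
    return matrix


# Hypothetical mixed-type observations.
observations = [
    {"height": 10, "color": "red"},
    {"height": 20, "color": "red"},
    {"height": 30, "color": "blue"},
]
distances = gower_matrix(observations, numeric_cols={"height"})
for row in distances:
    print(row)
```

Each entry is the average of the per-column contributions, so the result always lies between 0 and 1.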

Convert multiple binary columns to single categorical column [duplicate]

痴心易碎 submitted on 2019-12-02 11:38:56
Question: This question already has answers here: For each row return the column name of the largest value (7 answers). Closed last year. I have a table full of binary variables that I would like to condense down to categorical variables. Very simplistically, what I have is a data frame like this:

    data <- data.frame(id=c(1,2,3,4,5,6,7,8,9),
                       red=c("1","0","0","0","1","0","0","0","0"),
                       blue=c("0","1","1","1","0","1","1","1","0"),
                       yellow=c("0","0","0","0","0","0","0","0","1"))
    data
      id red blue yellow
    1  1   1    0      0

R's caret training errors when y is not a factor

家住魔仙堡 submitted on 2019-12-02 04:52:33
Question: I am using RStudio with Kaggle's forest cover data and keep getting an error when trying to use the knn3 function in caret. Here is my code:

    library(caret)
    train <- read.csv("C:/data/forest_cover/train.csv", header=T)
    trainingRows <- createDataPartition(train$Cover_Type, p=0.8, list=F)
    head(trainingRows)
    train_train <- train[trainingRows,]
    train_test <- train[-trainingRows,]
    knnfit <- knn3(train_train[,-56], train_train$Cover_Type)

This last line gives me this in the console: Error

Convert multiple binary columns to single categorical column [duplicate]

╄→гoц情女王★ submitted on 2019-12-02 04:12:21
This question already has an answer here: For each row return the column name of the largest value (7 answers). I have a table full of binary variables that I would like to condense down to categorical variables. Very simplistically, what I have is a data frame like this:

    data <- data.frame(id=c(1,2,3,4,5,6,7,8,9),
                       red=c("1","0","0","0","1","0","0","0","0"),
                       blue=c("0","1","1","1","0","1","1","1","0"),
                       yellow=c("0","0","0","0","0","0","0","0","1"))
    data
      id red blue yellow
    1  1   1    0      0
    2  2   0    1      0
    3  3   0    1      0
    4  4   0    1      0
    5  5   1    0      0
    6  6   0    1      0
    7  7   0    1      0
    8  8   0    1      0
    9  9   0    0      1

And what I would like to get back would be:
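The desired output is cut off in the post, so the following is only one plausible reading: for each row, return the name of the single indicator column that holds "1". The sketch is in Python rather than R, uses the indicator values from the question, and the helper name `collapse_indicators` is hypothetical. Rows with no "1" or more than one "1" are mapped to None, which is an assumption the post does not state.

```python
# Indicator columns copied from the data frame in the question (values are strings).
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
indicators = {
    "red": ["1", "0", "0", "0", "1", "0", "0", "0", "0"],
    "blue": ["0", "1", "1", "1", "0", "1", "1", "1", "0"],
    "yellow": ["0", "0", "0", "0", "0", "0", "0", "0", "1"],
}


def collapse_indicators(indicators):
    """For each row, return the name of the one column equal to "1", else None."""
    names = list(indicators)
    n_rows = len(indicators[names[0]])
    result = []
    for i in range(n_rows):
        hits = [name for name in names if indicators[name][i] == "1"]
        # Assumption: ambiguous or empty rows get None instead of a guess.
        result.append(hits[0] if len(hits) == 1 else None)
    return result


colour = collapse_indicators(indicators)
for row_id, value in zip(ids, colour):
    print(row_id, value)
```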

Is there an advantage to ordering a categorical variable?

ぐ巨炮叔叔 submitted on 2019-12-02 03:10:06
Question: I have been advised that it is best to order categorical variables where appropriate (e.g. short less than medium less than long). I am wondering, what is the specific advantage of treating a categorical variable as ordered, as opposed to just simple categorical, in the context of modelling it as an explanatory variable? What does it mean mathematically (in lay terms preferably!)? Many thanks!

Answer 1: Among other things, it allows you to compare values from those factors:

    > ord.fac <- ordered(c(
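The R snippet in the answer is cut off, so here is a separate Python sketch of the same point: once the levels carry an order, comparisons, sorting and min/max become meaningful. It uses an `IntEnum` whose integer codes are an assumption that only encodes rank (short < medium < long); the spacing between the codes is not meant to carry any meaning.

```python
from enum import IntEnum


class Length(IntEnum):
    """Ordered levels; the integers only encode rank, not distance."""
    SHORT = 1
    MEDIUM = 2
    LONG = 3


# Hypothetical observations of an ordered categorical variable.
values = [Length.LONG, Length.SHORT, Length.MEDIUM, Length.SHORT]

sorted_values = sorted(values)                 # sorting follows the level order
longest = max(values)                          # max/min are well defined
at_least_medium = [v for v in values if v >= Length.MEDIUM]

print([v.name for v in sorted_values])
print(longest.name)
print([v.name for v in at_least_medium])
```

With an unordered categorical variable, none of these operations has a defined meaning; only equality tests do.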

R's caret training errors when y is not a factor

狂风中的少年 submitted on 2019-12-02 02:12:01
I am using RStudio with Kaggle's forest cover data and keep getting an error when trying to use the knn3 function in caret. Here is my code:

    library(caret)
    train <- read.csv("C:/data/forest_cover/train.csv", header=T)
    trainingRows <- createDataPartition(train$Cover_Type, p=0.8, list=F)
    head(trainingRows)
    train_train <- train[trainingRows,]
    train_test <- train[-trainingRows,]
    knnfit <- knn3(train_train[,-56], train_train$Cover_Type)

This last line gives me this in the console:

    Error in knn3.matrix(x, y = y, k = k, ...) : y must be a factor

As the error message states, y must be a

R: Expanding an R factor into dummy columns for every factor level

拥有回忆 submitted on 2019-12-02 01:55:48
I have quite a big data frame in R with two columns. I am trying to create dummy variables from the Code column (factor type with 858 levels). The problem is that RStudio always crashes when I try to do that.

    > str(d)
    'data.frame': 649226 obs. of 2 variables:
     $ User: int 210 210 210 210 269 317 317 317 317 326 ...
     $ Code : Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...

The User column is not unique, meaning that there can be several rows with the same User. Doesn't matter if in the end the amount of rows remains the same or the rows with the
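A dense dummy matrix for this data would have 649226 × 858, roughly 557 million, cells, almost all of them zero. Whether that is what makes RStudio crash is an assumption, but it suggests storing only the non-zero entries. The Python sketch below illustrates that idea on made-up (User, Code) pairs, aggregating per user, which is one of the two outcomes the post appears to allow. The helper names `sparse_dummies` and `dense_row` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (User, Code) pairs in the shape of the data frame from the post.
pairs = [
    (210, "AA02"),
    (210, "AA03"),
    (269, "AA02"),
    (317, "AB10"),
    (317, "AA03"),
    (317, "AA03"),
]


def sparse_dummies(pairs):
    """Return the sorted levels and, per user, the set of active column indices."""
    levels = sorted({code for _, code in pairs})
    column_index = {code: j for j, code in enumerate(levels)}
    per_user = defaultdict(set)
    for user, code in pairs:
        per_user[user].add(column_index[code])
    return levels, dict(per_user)


def dense_row(active, n_levels):
    """Expand one user's active indices into a full 0/1 row, only when needed."""
    return [1 if j in active else 0 for j in range(n_levels)]


levels, per_user = sparse_dummies(pairs)
print(levels)
for user in sorted(per_user):
    print(user, dense_row(per_user[user], len(levels)))
```

Memory use here grows with the number of (User, Code) pairs rather than with rows times levels, and a dense row is materialised only for the users that are actually inspected.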