categorical-data | 易学教程

Reveal k-modes cluster features

I'm performing a cluster analysis on categorical data, hence using the k-modes approach. My data comes from a preference survey: How do you like hair and eyes? The respondent picks one answer from a fixed (multiple-choice) set of 4 possibilities. I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA. My code looks like:

import numpy as np
import pandas as pd
from kmodes import kmodes
df_dummy = pd.get_dummies(df)
# transform into numpy array
x = df_dummy.reset_index().values
km = kmodes.KModes(n_clusters=3, init='Huang', n
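One way to see what characterizes each cluster is to go back to the original (non-dummy) answers and take the most frequent category per question within each cluster. The following is a minimal plain-Python sketch of that idea. The answers and cluster labels are made up, and `cluster_modes` is a hypothetical helper, not part of the kmodes package.

```python
from collections import Counter, defaultdict

def cluster_modes(rows, labels):
    """Return {cluster: {column: most frequent value}}.

    rows   -- list of dicts, one per respondent (original categorical answers)
    labels -- cluster label for each row, e.g. the labels a k-modes fit assigned
    """
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    modes = {}
    for label, members in grouped.items():
        modes[label] = {
            col: Counter(m[col] for m in members).most_common(1)[0][0]
            for col in members[0]
        }
    return modes

# Made-up survey answers and made-up cluster labels (illustration only).
rows = [
    {"hair": "blonde", "eyes": "blue"},
    {"hair": "blonde", "eyes": "green"},
    {"hair": "blonde", "eyes": "blue"},
    {"hair": "black", "eyes": "brown"},
    {"hair": "black", "eyes": "brown"},
]
labels = [0, 0, 0, 1, 1]

modes = cluster_modes(rows, labels)
for label in sorted(modes):
    print(label, modes[label])
```

Reading the per-cluster modes next to each other shows which answers separate the clusters, which a 2D PCA plot of dummy columns does not show directly.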

Crosstab with multiple items

In SPSS, it is (relatively) easy to create a cross tab with multiple variables using the factors (or values) as the table heading. So, something like the following (made-up data, etc.). Q1, Q2, and Q3 each have either a 1, a 2, or a 3 for each person. I just left these as numbers, but they could be factors; neither seemed to help solve the problem.

                     1 (Very often)   2 (Rarely)   3 (Never)
Q1. Likes it               12             15           13
Q2. Recommends it          22             11           10
Q3. Used it                22             12            9

In SPSS, one can even request row, column, or total percentages. I've tried table(), ftable(), xtab(), CrossTable() from gmodels, and CrossTable
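The table described here is one row per item and one column per response level, with optional row percentages. A minimal sketch of that layout in plain Python follows. The answers are made up, and `item_crosstab` is a hypothetical helper name, not an SPSS or R function.

```python
from collections import Counter

def item_crosstab(responses, levels):
    """Count how often each level was chosen for each item (question).

    responses -- {item name: list of answers, one per person}
    levels    -- the response codes to use as column headings
    Returns {item: {level: (count, row percentage)}}.
    """
    table = {}
    for item, answers in responses.items():
        counts = Counter(answers)
        total = len(answers)
        table[item] = {
            lvl: (counts[lvl], 100.0 * counts[lvl] / total) for lvl in levels
        }
    return table

# Made-up answers: 1 = very often, 2 = rarely, 3 = never.
responses = {
    "Q1. Likes it": [1, 1, 2, 3, 3],
    "Q2. Recommends it": [1, 2, 2, 2, 3],
}
table = item_crosstab(responses, levels=[1, 2, 3])
for item, row in table.items():
    cells = "  ".join(f"{n} ({pct:.0f}%)" for n, pct in row.values())
    print(f"{item:<20}{cells}")
```

Passing the levels explicitly keeps a column in the table even when nobody picked that level.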

Problems with a binary one-hot (one-of-K) coding in python

Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me. Pandas and get_dummies in the
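The coding described above can be sketched in plain Python. The point of the sketch is that the list of categories is fixed up front, so the resulting columns do not depend on which values happen to appear in a given dataset. The function name `one_hot` and the sample values are assumptions for illustration, not a pandas or scikit-learn API.

```python
def one_hot(values, categories, prefix):
    """One binary column per entry of `categories`, in a fixed order.

    Values that are not in `categories` (here 'unknown') get all zeros,
    so the column set never depends on which values happen to be present.
    """
    columns = [f"{prefix}={c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return columns, rows

categories = ["red", "blue", "yellow"]  # decided once, reused for every dataset
columns, rows = one_hot(["red", "unknown", "yellow"], categories, "color")

print(columns)
for r in rows:
    print(r)
```

Note that 'blue' still gets a column although it does not occur in the sample, and 'unknown' maps to a row of zeros.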

Python equivalent of daisy() in the cluster package of R

I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

if(!require("cluster")) { install.packages("cluster"); require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))

This uses the Gower metric to deal with the nominal variables. Is there a Python equivalent of the daisy() function in R? Or maybe any other module function that
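To make the Gower idea concrete, here is a simplified plain-Python sketch. It is not a port of daisy(): it handles only numeric and nominal columns, with no ordinal treatment, no missing values, and no weights. The helper name `gower_matrix` and the sample records are made up for illustration.

```python
def gower_matrix(rows, numeric_cols, nominal_cols):
    """Pairwise Gower dissimilarity for records with numeric and nominal fields.

    Numeric columns contribute |a - b| / range, nominal columns contribute
    0 if equal else 1; the result is the unweighted mean over all columns.
    Simplification: no missing values, no ordinal handling, no weights.
    """
    ranges = {}
    for col in numeric_cols:
        vals = [r[col] for r in rows]
        ranges[col] = max(vals) - min(vals)

    n_cols = len(numeric_cols) + len(nominal_cols)
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for col in numeric_cols:
                if ranges[col] > 0:  # a constant column contributes 0 here
                    total += abs(rows[i][col] - rows[j][col]) / ranges[col]
            for col in nominal_cols:
                total += 0.0 if rows[i][col] == rows[j][col] else 1.0
            dist[i][j] = dist[j][i] = total / n_cols
    return dist

# Made-up records with one numeric and one nominal attribute.
rows = [
    {"height": 10.0, "colour": "red"},
    {"height": 20.0, "colour": "red"},
    {"height": 30.0, "colour": "blue"},
]
d = gower_matrix(rows, numeric_cols=["height"], nominal_cols=["colour"])
for line in d:
    print(line)
```

Each entry lies between 0 and 1, and the matrix is symmetric with a zero diagonal, which is the shape a clustering routine that accepts precomputed dissimilarities would expect.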

Convert multiple binary columns to single categorical column [duplicate]

This question already has answers here: For each row return the column name of the largest value (7 answers)

I have a table full of binary variables that I would like to condense down to categorical variables. Very simplistically, what I have is a data frame like this:

data <- data.frame(id=c(1,2,3,4,5,6,7,8,9), red=c("1","0","0","0","1","0","0","0","0"),blue=c("0","1","1","1","0","1","1","1","0"),yellow=c("0","0","0","0","0","0","0","0","1"))
data
  id red blue yellow
1  1   1    0      0
2  2   0    1      0
3  3   0    1      0
4  4   0    1      0
5  5   1    0      0
6  6   0    1      0
7  7   0    1      0
8  8   0    1      0
9  9   0    0      1

And what I would like to get back would be:
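The transformation being asked for is "for each row, return the name of the column that holds the 1". A minimal plain-Python sketch of that logic follows, using a few rows from the example data frame. `collapse_dummies` is a hypothetical helper name, and the sketch assumes exactly one flag is set per row.

```python
def collapse_dummies(rows, columns):
    """For each row, return the name of the single column holding "1".

    Assumes exactly one of `columns` is set per row and raises otherwise.
    """
    result = []
    for row in rows:
        hits = [c for c in columns if str(row[c]) == "1"]
        if len(hits) != 1:
            raise ValueError(f"expected exactly one flag set, got {hits} in {row}")
        result.append(hits[0])
    return result

# Rows 1, 2, 3 and 9 of the example data frame.
rows = [
    {"id": 1, "red": "1", "blue": "0", "yellow": "0"},
    {"id": 2, "red": "0", "blue": "1", "yellow": "0"},
    {"id": 3, "red": "0", "blue": "1", "yellow": "0"},
    {"id": 9, "red": "0", "blue": "0", "yellow": "1"},
]
colours = collapse_dummies(rows, ["red", "blue", "yellow"])
print(colours)
```

Raising on rows with zero or several flags is a design choice of this sketch; silently picking the first match would hide inconsistent data.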

Is there an advantage to ordering a categorical variable?

Question: I have been advised that it is best to order categorical variables where appropriate (e.g. short less than medium less than long). I am wondering, what is the specific advantage of treating a categorical variable as ordered as opposed to just simple categorical, in the context of modelling it as an explanatory variable? What does it mean mathematically (in lay terms preferably!)? Many thanks! Answer 1: Among other things, it allows you to compare values from those factors: > ord.fac <- ordered(c(
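The comparison point made in the answer can be illustrated with a loose Python analogy. This is not R's ordered() and says nothing about how a model treats the variable; it only shows that once categories carry a rank, comparisons, sorting, and min/max become meaningful. The class name `Length` is made up for illustration.

```python
from enum import IntEnum

class Length(IntEnum):
    """Ordered categories: the integer encodes the rank, not a measurement."""
    SHORT = 1
    MEDIUM = 2
    LONG = 3

values = [Length.LONG, Length.SHORT, Length.MEDIUM]

print(Length.SHORT < Length.MEDIUM)     # comparison is defined by rank
print([v.name for v in sorted(values)])  # sorting follows the category order
print(max(values).name)                  # so do min and max
```

With unordered categories, none of these three operations would have a defined meaning.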

R's caret training errors when y is not a factor

I am using RStudio with Kaggle's forest cover data, and I keep getting an error when trying to use the knn3 function in caret. Here is my code:

library(caret)
train <- read.csv("C:/data/forest_cover/train.csv", header=T)
trainingRows <- createDataPartition(train$Cover_Type, p=0.8, list=F)
head(trainingRows)
train_train <- train[trainingRows,]
train_test <- train[-trainingRows,]
knnfit <- knn3(train_train[,-56], train_train$Cover_Type)

This last line gives me this in the console:

Error in knn3.matrix(x, y = y, k = k, ...) : y must be a factor

As the error message states, y must be a

R: Expanding an R factor into dummy columns for every factor level

I have quite a big data frame in R with two columns. I am trying to turn the Code column (factor type with 858 levels) into dummy variables. The problem is that RStudio always crashes when I try to do that.

> str(d)
'data.frame': 649226 obs. of 2 variables:
 $ User: int 210 210 210 210 269 317 317 317 317 326 ...
 $ Code : Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...

The User column is not unique, meaning that there can be several rows with the same User. It doesn't matter if in the end the number of rows remains the same or the rows with the
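A dense dummy matrix for this data would have 649226 × 858 cells, roughly 557 million, of which only one per row is non-zero, which is a plausible source of the memory trouble. A common way around that is to store only the positions of the ones. The following plain-Python sketch shows the idea. `sparse_dummies` is a hypothetical helper and the codes are made-up sample values.

```python
def sparse_dummies(codes):
    """Encode a categorical column as sparse (row, column) positions.

    Returns (levels, entries): `levels` is the sorted list of distinct codes
    (one dummy column each) and `entries` lists the (row, column) pairs whose
    dummy value is 1. Every other cell is implicitly 0 and is never stored.
    """
    levels = sorted(set(codes))
    col_of = {code: j for j, code in enumerate(levels)}
    entries = [(i, col_of[code]) for i, code in enumerate(codes)]
    return levels, entries

codes = ["AA03", "AA02", "AA03", "AB10"]  # made-up values for illustration
levels, entries = sparse_dummies(codes)

print(levels)
print(entries)
```

Storage grows with the number of rows rather than rows times levels, and the (row, column) pairs are the form that sparse-matrix constructors typically accept.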