dummy-variable

Separating categories within one column in my dataframe

为君一笑 submitted on 2021-02-11 16:44:07
Question: I need to find out which movie genres are the most cost-efficient. My problem is that the genres are all provided within one string. This gives me about 300 different unique categories. How can I split these into about 12 original dummy genre columns so I can analyse each main genre?

Answer 1: Thanks to Yong Wang, who suggested the get_dummies function in pandas. We can shorten the code significantly:

df = pd.DataFrame({
    'movie_id': range(5),
    'gernes': [ 'Action|Adventure|Fantasy
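The answer's code is cut off in this listing. As a minimal sketch of the approach it names, and not necessarily the answerer's exact code, pandas can split a delimited string column into 0/1 columns with `Series.str.get_dummies`. The column name `genres` and the three sample rows below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name and rows are assumptions.
df = pd.DataFrame({
    "movie_id": range(3),
    "genres": ["Action|Adventure|Fantasy", "Action|Thriller", "Comedy"],
})

# Split each string on "|" and build one 0/1 column per distinct genre.
genre_dummies = df["genres"].str.get_dummies(sep="|")

# Attach the dummy columns to the movie identifiers.
result = df[["movie_id"]].join(genre_dummies)
print(result)
```

Each genre then has its own column, so a per-genre analysis can group or filter on a single column instead of on roughly 300 combined strings.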

Separating categories within one column in my dataframe

不想你离开。 submitted on 2021-02-11 16:43:41
Question: I need to find out which movie genres are the most cost-efficient. My problem is that the genres are all provided within one string. This gives me about 300 different unique categories. How can I split these into about 12 original dummy genre columns so I can analyse each main genre?

Answer 1: Thanks to Yong Wang, who suggested the get_dummies function in pandas. We can shorten the code significantly:

df = pd.DataFrame({
    'movie_id': range(5),
    'gernes': [ 'Action|Adventure|Fantasy

R: Generate a dummy variable based on the existence of one column's value in another column

扶醉桌前 submitted on 2021-02-08 03:39:36
Question: I have a data frame like this:

A               B
2012,2013,2014  2011
2012,2013,2014  2012
2012,2013,2014  2013
2012,2013,2014  2014
2012,2013,2014  2015

I want to create a dummy variable that indicates whether the value in column B exists in column A: 1 indicates that it exists, and 0 that it does not. The result should look like this:

A               B     dummy
2012,2013,2014  2011  0
2012,2013,2014  2012  1
2012,2013,2014  2013  1
2012,2013,2014  2014  1
2012,2013,2014  2015  0

I have tried to use %in% to achieve this:

df$dummy <- ifelse(df$B %in%
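The question is about R and its attempted code is cut off here. Purely as an illustration of the row-wise logic, the following is a pandas sketch (not the asker's code and not an R solution) that splits each string in A on commas and checks whether that row's B is among the parts. It assumes A is stored as text and B as integers.

```python
import pandas as pd

# Data from the question; A is assumed to be text, B integers.
df = pd.DataFrame({
    "A": ["2012,2013,2014"] * 5,
    "B": [2011, 2012, 2013, 2014, 2015],
})

# Compare each B only against the years listed in the same row of A.
df["dummy"] = [int(str(b) in a.split(",")) for a, b in zip(df["A"], df["B"])]
print(df)
```

The point of the sketch is that the membership test is done per row against the split string, rather than against the whole column at once.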

R: Generate a dummy variable based on the existence of one column's value in another column

◇◆丶佛笑我妖孽 submitted on 2021-02-08 03:37:16
Question: I have a data frame like this:

A               B
2012,2013,2014  2011
2012,2013,2014  2012
2012,2013,2014  2013
2012,2013,2014  2014
2012,2013,2014  2015

I want to create a dummy variable that indicates whether the value in column B exists in column A: 1 indicates that it exists, and 0 that it does not. The result should look like this:

A               B     dummy
2012,2013,2014  2011  0
2012,2013,2014  2012  1
2012,2013,2014  2013  1
2012,2013,2014  2014  1
2012,2013,2014  2015  0

I have tried to use %in% to achieve this:

df$dummy <- ifelse(df$B %in%

Thoughts on Generating an Age Variable Based on Years

不想你离开。 submitted on 2021-02-05 11:30:06
Question: I am trying to create a dummy variable for years. Currently, my data has a birth_date and a program start_date for each observation. I have been able to create a variable measuring an individual's age in days, but what I am actually looking for is a variable, age_join_date, that tells me the following:

Individual  birth_date  start_date  age_at_join_date
A           1990-12-31  2010-12-31  20 yrs old
B           1990-12-31  2011-12-31  21 yrs old

Essentially what I care about is one's age at the time they joined the
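The question does not say which language is in use, and it is cut off before any code. As a hedged sketch in pandas, age in completed years can be computed by subtracting the birth year from the start year and then subtracting one more when the birthday has not yet been reached in the start year. The column names follow the table in the question; everything else is an assumption.

```python
import pandas as pd

# The two example rows from the question.
df = pd.DataFrame({
    "individual": ["A", "B"],
    "birth_date": pd.to_datetime(["1990-12-31", "1990-12-31"]),
    "start_date": pd.to_datetime(["2010-12-31", "2011-12-31"]),
})

birth, start = df["birth_date"], df["start_date"]

# True where the start date falls before that year's birthday.
before_birthday = (start.dt.month < birth.dt.month) | (
    (start.dt.month == birth.dt.month) & (start.dt.day < birth.dt.day)
)

# Completed years at the start date.
df["age_at_join_date"] = start.dt.year - birth.dt.year - before_birthday.astype(int)
print(df)
```

Dividing the age in days by 365 or 365.25 can give an off-by-one result near birthdays, which is why the sketch compares calendar fields instead.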

Create numerically encoded dummy variables efficiently in R?

。_饼干妹妹 submitted on 2021-02-04 19:58:55
Question: How can we transform data of the form

df <- structure(list(customer_number = c(3, 3, 1, 1, 3),
                     item = c("milkshake", "burger", "apple", "burger", "water")),
                row.names = c(NA, -5L), class = "data.frame")
#   customer_number      item
# 1               3 milkshake
# 2               3    burger
# 3               1     apple
# 4               1    burger
# 5               3     water

into numerically encoded dummy variables, like this

data.frame(customer_number=c(1,3), item_milkshake=c(0,1), item_burger=c(1,1), item_apple=c(1,0), item_water=c(0,1))
# customer_number item_milkshake item
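The question asks for an R solution. For comparison only, the same reshaping can be sketched in pandas with `pd.crosstab`, which counts item occurrences per customer. Thresholding the counts gives 0/1 indicators. Note that this sketch orders the item columns alphabetically, which differs from the column order shown in the question.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "customer_number": [3, 3, 1, 1, 3],
    "item": ["milkshake", "burger", "apple", "burger", "water"],
})

# Count each item per customer, then reduce counts to 0/1 indicators.
counts = pd.crosstab(df["customer_number"], df["item"])
dummies = (counts > 0).astype(int).add_prefix("item_").reset_index()
dummies.columns.name = None  # drop the leftover "item" axis label
print(dummies)
```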

Create dummy variable of multiple columns with python

两盒软妹~` submitted on 2021-02-04 18:31:10
Question: I am working with a dataframe containing two columns with ID numbers. For further research I want to create dummy variables from these ID numbers (using both ID columns). My code, however, does not merge the columns from the two dataframes. How can I merge the columns from the two dataframes and create the dummy variables?

Dataframe

import pandas as pd
import numpy as np
d = {'ID1': [1,2,3], 'ID2': [2,3,4]}
df = pd.DataFrame(data=d)

Current code

pd.get_dummies(df, prefix = ['ID1',

NA values when regressing with dummy variable interaction term

风流意气都作罢 submitted on 2020-12-30 03:36:05
Question: I'm trying to estimate the factors that determine the difference in happiness level between people living in New York and Chicago. The data look like this:

   Happiness  City      Gender  Employment  Worktype    Holiday
1  60         New York  0       0           Unemployed  Unemployed
2  80         Chicago   1       1           Whitecolor  1 day a week
3  39         Chicago   0       0           Unemployed  Unemployed
4  40         New York  1       0           Unemployed  Unemployed
5  69         Chicago   1       1           Bluecolor   2 day a week
6  90         Chicago   1       1           Bluecolor   2 day a week
7  100        New York  0       1           Whitecolor  2 day a week
8  30         New York
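The question is cut off before the model is shown, so the actual cause of the NA values is not known from this listing. One common cause of NA coefficients in a linear regression is perfect collinearity among the predictors. In the rows shown, Worktype is "Unemployed" exactly when Employment is 0. The sketch below checks the rank of a design matrix built from the seven complete rows and is an illustration under that assumption only.

```python
import numpy as np
import pandas as pd

# Employment and Worktype from the seven complete rows in the question.
df = pd.DataFrame({
    "Employment": [0, 1, 0, 0, 1, 1, 1],
    "Worktype": ["Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                 "Bluecolor", "Bluecolor", "Whitecolor"],
})

# Dummy-code Worktype, dropping the first level as the reference category.
worktype = pd.get_dummies(df["Worktype"], drop_first=True).astype(float)

# Design matrix: intercept, Employment, Worktype dummies.
X = np.column_stack([np.ones(len(df)), df["Employment"], worktype])

rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```

A rank below the number of columns means at least one coefficient cannot be estimated separately. Here Employment plus the Unemployed dummy equals the intercept column.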

One-hot encoding for words which occur in multiple columns

感情迁移 submitted on 2020-12-13 04:50:05
Question: I want to create one-hot encoded data from categorical data, which you can see here:

   Label1          Label2      Label3
0  Street fashion  Clothing    Fashion
1  Clothing        Outerwear   Jeans
2  Architecture    Property    Clothing
3  Clothing        Black       Footwear
4  White           Photograph  Beauty

The problem (for me) is that one specific label (e.g. Clothing) can be in Label1, Label2 or Label3. I tried pd.get_dummies, but this created data like:

   Label1_Clothing  Label2_Clothing  Label3_Clothing
0  0                1                0
1  1                0                0
2  0                0                1

Is there a way to only
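The question is truncated before it states the exact output wanted. Assuming the goal is one column per label regardless of which of the three source columns it appears in, one possible sketch is to stack the columns into a single Series, encode that, and collapse back to one row per original index.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# Stack to one long Series indexed by (row, source column).
stacked = df.stack()

# Encode every label, then take the per-row maximum across source columns.
one_hot = pd.get_dummies(stacked).groupby(level=0).max().astype(int)
print(one_hot)
```

With this layout there is a single `Clothing` column that is 1 for rows 0 to 3 and 0 for row 4.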

Standardized regression coefficients with dummy variables in R vs. SPSS

谁说我不能喝 submitted on 2020-06-17 02:03:07
Question: I came across a puzzling difference in the standardized (beta) coefficients of a linear regression model computed with R and with SPSS using dummy-coded variables. I used the hsb2 data set and created a contrast (dummy coding) so that the third category is the reference. Here is the R code:

# Read the data
hsb2 <- read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = ",")
# Create a factor variable with respondents' race
hsb2$race.f <- factor(hsb2$race, labels = c(
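The question is cut off before the model, so the source of the R and SPSS discrepancy cannot be read from this listing. As a reference point, one common definition of a standardized coefficient is the raw slope multiplied by sd(x)/sd(y), applied to every design column including the dummy columns. The sketch below uses synthetic data (an assumption, not hsb2) to show that this rescaling matches refitting on z-scored columns.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 3, size=n)      # three-level category
d1 = (group == 0).astype(float)         # dummy for level 0
d2 = (group == 1).astype(float)         # dummy for level 1 (level 2 is the reference)
x = rng.normal(size=n)
y = 1.0 + 2.0 * d1 - 1.0 * d2 + 0.5 * x + rng.normal(size=n)

# Ordinary least squares on the raw columns.
X = np.column_stack([np.ones(n), d1, d2, x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Route 1: rescale the raw slopes by sd(x) / sd(y).
beta_rescaled = b[1:] * X[:, 1:].std(axis=0, ddof=1) / y.std(ddof=1)

# Route 2: z-score the outcome and every predictor column, then refit.
def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

Z = np.column_stack([np.ones(n), zscore(X[:, 1:])])
beta_refit = np.linalg.lstsq(Z, zscore(y), rcond=None)[0][1:]

print(beta_rescaled)
print(beta_refit)
```

If a tool standardizes only the numeric predictors and leaves the factor's dummy columns unscaled, the coefficients on the dummies will differ from the values computed here. Whether that explains the discrepancy in the question is an assumption.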