dummy-variable

Create a matrix of dummy variables from my data frame; use `NA` for missing values

旧城冷巷雨未停 posted on 2019-12-24 01:05:12
Question: I have data covering several years, with each year repeated several times. I want the output to have one column per year. The purpose is to create a dummy for each year separately. For example, the output column for year 2000 must have the value "1" wherever the main data has a non-NA observation for year 2000, and "0" otherwise. Moreover, NA must remain NA. Below is a small sample of the input data:

df:
2000  NA
2001  NA
2002  -1.3
2000  1.1
2001  0
2002  NA

Speed up this loop to create dummy columns with data.table and set in R [duplicate]

南楼画角 posted on 2019-12-19 08:28:52
Question: This question already has an answer here: Creating dummy variables in R data.table (1 answer). Closed 3 years ago. I have a data table and I want to create a new column for each unique day, then assign a 1 in each row where the day matches the column name. I have done this using a for loop, but I was wondering whether there is any way to optimise it using data.table and set. Here is an example: dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",

Creating categorical variables from mutually exclusive dummy variables

让人想犯罪 __ posted on 2019-12-19 05:45:38
Question: My question elaborates on a previously answered question about combining multiple dummy variables into a single categorical variable. In that earlier question, the categorical variable was created from dummy variables that were NOT mutually exclusive. In my case, the dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2x2 between-subjects factorial design (that also has a within-subjects component which I'm not addressing here

Keep same dummy variable in training and testing data

ぐ巨炮叔叔 posted on 2019-12-17 07:11:59
Question: I am building a prediction model in Python with two separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], and also string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...]. To train the model, I first use 'pd.get_dummies' to get dummy variables for these columns, and then fit the model with the transformed training data. I do the same transformation on my test data and predict
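One common way to keep the dummy columns identical across the two sets is to encode the training data first, remember its column layout, and reindex the encoded test data onto that layout. The sketch below assumes pandas; the frames, column names, and values are made up for illustration, and categories that appear only in the test set are simply dropped (the row ends up with zeros in every dummy column).

```python
import pandas as pd

# Hypothetical sample data standing in for the real training and test sets.
train = pd.DataFrame({"zip": [91521, 23151, 12355],
                      "city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"zip": [91521, 99999],
                     "city": ["Chicago", "Boston"]})

cat_cols = ["zip", "city"]

# Encode the training set and remember the resulting column layout.
train_enc = pd.get_dummies(train, columns=cat_cols)
train_columns = train_enc.columns

# Encode the test set, then force it onto the training layout:
# columns missing from the test set are added and filled with 0,
# columns never seen during training are dropped.
test_enc = pd.get_dummies(test, columns=cat_cols).reindex(
    columns=train_columns, fill_value=0
)
```

After this step `test_enc` has exactly the columns the model was fitted on, in the same order, so it can be passed to `predict` directly.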

Dummify character column and find unique values [duplicate]

橙三吉。 posted on 2019-12-17 02:32:10
Question: This question already has an answer here: Split a column into multiple binary dummy columns [duplicate] (1 answer). Closed last year. I have a dataframe with the following structure: test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;')) Now I want to create a dataframe from this which contains a named column for each of the unique values in the test dataframe. A unique value is a value terminated by the ';' character and preceded by a space (the space itself is not part of the value). Then for each of the

Creating dummy variables in R based on multiple chr values within each cell

為{幸葍}努か posted on 2019-12-13 07:15:49
Question: I'm trying to create multiple dummy variables based on one column called 'Tags' in my df (14 rows, 2 columns: Score and Tags). My problem is that each cell can contain any number of chr values (up to about 30). When I ask for: str(df$Tags) R returns: chr [1:14] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolera"| __truncated__ ... And when I ask for: df$Tags[1] R returns: [1] "\"biologische gerechten\

How to apply linear regression of sklearn for some string variable

北城余情 posted on 2019-12-13 06:40:58
Question: I am going to predict the box office of a movie using logistic regression. I have some training data including the actors and directors. This is my data: Director1|Actor1|300 million Director2|Actor2|500 million. I am going to encode the directors and actors as integers: 1|1|300 million 2|2|500 million. That means X={[1,1],[2,2]}, y=[300,500], and fit(X,y). Does that work? Answer 1: You cannot use categorical variables in linear regression like that. Linear regression treats all variables like
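The usual alternative to integer codes is one indicator column per category, so the model never sees an artificial ordering between directors or actors. The sketch below uses plain Python to stay dependency-free; the `one_hot` helper and the `feature_names` labels are hypothetical names introduced for illustration, and in practice `pd.get_dummies` or scikit-learn's `OneHotEncoder` would do the same job.

```python
# Hypothetical rows in the same shape as the question's data:
# (director, actor, box office in millions).
rows = [
    ("Director1", "Actor1", 300),
    ("Director2", "Actor2", 500),
]


def one_hot(values):
    """Return (categories, vectors) with one 0/1 indicator per distinct value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        vec = [0] * len(categories)
        vec[index[value]] = 1  # mark the position of this row's category
        vectors.append(vec)
    return categories, vectors


director_cats, director_vecs = one_hot([r[0] for r in rows])
actor_cats, actor_vecs = one_hot([r[1] for r in rows])

# Concatenate the indicator blocks into the feature matrix X.
X = [d + a for d, a in zip(director_vecs, actor_vecs)]
y = [r[2] for r in rows]
feature_names = (
    [f"director={c}" for c in director_cats]
    + [f"actor={c}" for c in actor_cats]
)
```

Here `X` replaces the integer-coded `{[1,1],[2,2]}` from the question and can be handed to `fit(X, y)` in the same way.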

Highlighting periods using a dummy variable in ggplot2

大憨熊 posted on 2019-12-13 03:48:29
Question: I have two time series, yield and fx, and a dummy. How can I plot the two series in ggplot and highlight (shade) the areas where the dummy is 1? The head of the data set is below.

date       dummy  yield   fx
1/1/1990   0      10.029  1.261184049
1/2/1990   0      10.036  1.261008068
1/3/1990   0      10.119  1.258932591
1/4/1990   0      10.02   1.261410528
1/5/1990   0      10.013  1.261586847
1/6/1990   1      10.066  1.260255526
1/7/1990   1      10.057  1.260481006
1/8/1990   1      10.057  1.260481006
1/9/1990   1      10.067  1.260230488
1/10/1990  1      10.186  1.257272051

transform date into dummy variable in R [duplicate]

时光总嘲笑我的痴心妄想 posted on 2019-12-13 02:58:01
Question: This question already has answers here: Find the day of a week (7 answers), Generate a dummy-variable (16 answers). Closed last year. I have this dataset: df=structure(list(Data = structure(c(4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L), .Label = c("01.01.2018", "02.01.2018", "03.01.2018", "25.12.2017", "26.12.2017", "27.12.2017", "28.12.2017", "29.12.2017", "30.12.2017", "31.12.2017"), class = "factor"), Y = 1:10), .Names = c("Data", "Y"), class = "data.frame", row.names = c(NA, -10L)) Date is

ValueError: Columns must be same length as key

自闭症网瘾萝莉.ら posted on 2019-12-12 19:24:39
Question: I have a problem running the code below. data is my dataframe, X is the list of columns for the training data, and L is a list of categorical features with numeric values. I want to one-hot encode my categorical features, so I do as follows. But a "ValueError: Columns must be same length as key" is thrown (for the last line), and I still don't understand why after long research. def turn_dummy(df, prop): dummies = pd.get_dummies(df[prop], prefix=prop, sparse=True) df.drop(prop, axis=1, inplace=True)
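The failing line is cut off in this excerpt, so the exact cause cannot be confirmed. In pandas, this message generally appears when a multi-column result is assigned to a column key of a different length. A sketch that avoids column-wise assignment altogether is to let `pd.get_dummies` expand all the listed columns in one call; the frame and column names below are invented for illustration, with `L` reused from the question.

```python
import pandas as pd

# Hypothetical frame: two numeric-coded categorical columns plus one feature
# that should be left untouched.
data = pd.DataFrame({
    "age": [31, 45, 27],
    "grade": [1, 2, 1],
    "region": [10, 10, 20],
})
L = ["grade", "region"]  # categorical features with numeric values

# Expand every column in L into indicator columns in a single call.
# The original columns in L are removed, and no assignment into `data`
# is needed, so key and value lengths cannot get out of step.
encoded = pd.get_dummies(data, columns=L, prefix=L)
```

The result keeps `age` as is and adds `grade_1`, `grade_2`, `region_10`, and `region_20`.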