categorical-data

How (and why) do you use contrasts?

笑着哭i submitted on 2019-11-26 15:41:37
Question: In what cases do you create contrasts in your analysis? How is it done, and what is it used for? I checked ?contrasts and ?C; both lead to "Chapter 2 of Statistical Models in S", which is not readily available to me.
Answer 1: Contrasts are needed when you fit linear models with factors (i.e. categorical variables) as explanatory variables. The contrast specifies how the levels of the factors will be coded into a family of numeric dummy variables for fitting the model. Here are some good notes
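To make the idea of "coding levels into dummy variables" concrete, here is a minimal sketch in plain Python of two common coding schemes. It is an illustration only, not R's implementation: the function names `treatment_contrasts` and `sum_contrasts` are hypothetical, chosen to mirror what R's `contr.treatment` and `contr.sum` produce for a factor with k levels (k - 1 columns per level).

```python
def treatment_contrasts(levels, reference=None):
    """Treatment (dummy) coding: the reference level is all zeros,
    every other level gets a single 1 in its own column."""
    levels = list(levels)
    if reference is None:
        reference = levels[0]  # assumption: first level is the default reference
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    others = [lv for lv in levels if lv != reference]
    return {lv: [1 if lv == other else 0 for other in others] for lv in levels}


def sum_contrasts(levels):
    """Sum-to-zero coding: the last level is coded -1 in every column,
    so each column sums to zero across levels."""
    levels = list(levels)
    k = len(levels)
    coding = {}
    for i, lv in enumerate(levels):
        if i < k - 1:
            coding[lv] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            coding[lv] = [-1] * (k - 1)
    return coding


levels = ["a", "b", "c"]
treatment = treatment_contrasts(levels)                    # reference level "a"
releveled = treatment_contrasts(levels, reference="c")     # reference level "c"
deviation = sum_contrasts(levels)

for name, coding in [("treatment", treatment), ("sum", deviation)]:
    print(name)
    for level, row in coding.items():
        print(" ", level, row)
```

Under treatment coding each coefficient is read as a difference from the reference level; under sum coding it is read as a deviation from the mean of the level means. Changing the `reference` argument changes which comparison the coefficients describe, not the fitted values.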

How to force R to use a specified factor level as reference in a regression?

徘徊边缘 submitted on 2019-11-26 14:56:32
Question: How can I tell R to use a certain level as the reference if I use binary explanatory variables in a regression? It's just using some level by default. lm(x ~ y + as.factor(b)) with b {0, 1, 2, 3, 4}. Let's say I want to use 3 instead of the zero that is used by R.
Answer: See the relevel() function. Here is an example:
    set.seed(123)
    x <- rnorm(100)
    DF <- data.frame(x = x, y = 4 + (1.5*x) + rnorm(100, sd = 2), b = gl(5, 20))
    head(DF)
    str(DF)
    m1 <- lm(y ~ x + b, data = DF)
    summary(m1)
Now alter the factor b in DF by use of the relevel() function:
    DF <- within(DF, b <- relevel(b, ref = 3))
    m2 <- lm(y ~ x +

How to handle categorical features with spark-ml?

删除回忆录丶 submitted on 2019-11-26 14:27:08
Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts

One-Hot Encoding in [R] | Categorical to Dummy Variables [duplicate]

☆樱花仙子☆ submitted on 2019-11-26 11:06:51
Question: (This question already has an answer here: All Levels of a Factor in a Model Matrix in R, 10 answers.) I need to create a new data frame nDF that binarizes all categorical variables and at the same time retains all other variables in a data frame DF. For example, I have the following feature variables: RACE (4 types) and AGE, and an output variable called CLASS.
DF =
            RACE        AGE (BELOW 21)   CLASS
    Case 1  HISPANIC    0                A
    Case 2  ASIAN       1                A
    Case 3  HISPANIC    1                D
    Case 4  CAUCASIAN   1                B
I want to convert this
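The question is about R, but the operation itself (binarize the categorical column, keep everything else) can be sketched in Python with pandas, the other library this page covers. This is an illustration under assumptions, not the linked answer: the column name `AGE_BELOW_21` is a renamed stand-in for "AGE (BELOW 21)", and only the three RACE values visible in the sample are used, although the question mentions four types.

```python
import pandas as pd

# Sample data copied from the question; AGE_BELOW_21 is an assumed column name.
DF = pd.DataFrame(
    {
        "RACE": ["HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN"],
        "AGE_BELOW_21": [0, 1, 1, 1],
        "CLASS": ["A", "A", "D", "B"],
    },
    index=["Case 1", "Case 2", "Case 3", "Case 4"],
)

# Encode only RACE; the remaining columns are kept unchanged and the
# indicator columns are appended, one per level (no level is dropped).
nDF = pd.get_dummies(DF, columns=["RACE"], dtype=int)

print(nDF)
```

Keeping all levels is convenient for inspection, but a regression with an intercept needs one level left out; `pd.get_dummies` accepts `drop_first=True` for that case.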

Plotting with ggplot2: “Error: Discrete value supplied to continuous scale” on categorical y-axis

家住魔仙堡 submitted on 2019-11-26 09:41:01
Question: The plotting code below gives "Error: Discrete value supplied to continuous scale". What's wrong with this code? It works fine until I try to change the scale, so the error is there... I tried to figure out a solution from similar problems but couldn't. This is the head of my data:
    > dput(head(df))
    structure(list(`10` = c(0, 0, 0, 0, 0, 0), `33.95` = c(0, 0, 0, 0, 0, 0), `58.66` = c(0, 0, 0, 0, 0, 0), `84.42` = c(0, 0, 0, 0, 0, 0), `110.21` = c(0, 0, 0, 0, 0, 0), `134.16` = c(0, 0, 0, 0, 0, 0),

Factorize a column of strings in pandas

非 Y 不嫁゛ submitted on 2019-11-26 06:49:15
Question: As the question says, I have a data frame df_original which is quite large but looks like:
    ID Count Column 2 Column 3 Column 4
    RowX 1 234. 255. yes. 452
    RowY 1 123. 135. no. 342
    RowW 1 234. 235. yes. 645
    RowJ 1 123. 115. no. 342
    RowA 1 234. 285. yes. 233
    RowR 1 123. 165. no. 342
    RowX 2 234. 255. yes. 234
    RowY 2 123. 135. yes. 342
    RowW 2 234. 235. yes. 233
    RowJ 2 123. 115. yes. 342
    RowA 2 234. 285. yes. 312
    RowR 2 123. 165. no. 342
    . . .
    RowX 1233 234. 255. yes. 133
    RowY 1233 123. 135. no. 342

R error “sum not meaningful for factors”

痞子三分冷 submitted on 2019-11-26 05:37:49
Question: I have a file called rRna_RDP_taxonomy_phylum with the following data:
    364 "Firmicutes" 39.31
    244 "Proteobacteria" 26.35
    218 "Actinobacteria" 23.54
    65 "Bacteroidetes" 7.02
    22 "Fusobacteria" 2.38
    6 "Thermotogae" 0.65
    3 unclassified_Bacteria 0.32
    2 "Spirochaetes" 0.22
    1 "Tenericutes" 0.11
    1 Cyanobacteria 0.11
And I'm using this code to create a pie chart in R:
    if(file.exists("rRna_RDP_taxonomy_phylum")){
    family <- read.table("rRna_RDP_taxonomy_phylum", sep="\t")

Pandas: convert categories to numbers

这一生的挚爱 submitted on 2019-11-26 01:46:14
Question: Suppose I have a dataframe with countries that goes as:
    cc | temp
    US | 37.0
    CA | 12.0
    US | 35.0
    AU | 20.0
I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices instead, such that I will get cc_index = [1,2,1,3]. I'm assuming that there is a faster way than using get_dummies along with a numpy where clause, as shown below:
    [np.where(x) for x in df.cc.get_dummies().values]
This is somewhat easier to do in R using 'factors', so I'm hoping pandas has something similar.
Answer: First, change the type of the column
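The answer above is cut off, so the following is only a sketch of two standard pandas ways to turn a column of labels into integer indices; it is not necessarily what the truncated answer goes on to describe. The column names `cc_code` and `cc_index` are chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"cc": ["US", "CA", "US", "AU"], "temp": [37.0, 12.0, 35.0, 20.0]}
)

# Option 1: convert to the categorical dtype and read the integer codes.
# Categories are sorted, so AU -> 0, CA -> 1, US -> 2.
df["cc_code"] = df["cc"].astype("category").cat.codes

# Option 2: pd.factorize numbers the labels in order of first appearance,
# so US -> 0, CA -> 1, AU -> 2. Adding 1 gives the 1-based indices
# requested in the question.
codes, uniques = pd.factorize(df["cc"])
df["cc_index"] = codes + 1

print(df)
print(list(uniques))
```

The two options differ only in how labels are ordered: the categorical dtype is closest to an R factor (sorted levels by default), while `pd.factorize` reproduces the `[1, 2, 1, 3]` ordering the question asks for.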