categorical-data | 易学教程

How (and why) do you use contrasts?

Question: In what cases do you create contrasts in your analysis? How is it done, and what is it used for? I checked ?contrasts and ?C; both lead to "Chapter 2 of Statistical Models in S", which is not readily available to me.

Answer 1: Contrasts are needed when you fit linear models with factors (i.e. categorical variables) as explanatory variables. The contrast specifies how the levels of each factor are coded into a set of numeric dummy variables for fitting the model. Here are some good notes
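To make the idea of coding factor levels as numeric dummy variables concrete, here is a minimal sketch of treatment (dummy) coding in plain Python. It is an illustration only, not R's implementation: the function name treatment_coding and its reference parameter are hypothetical, and levels are simply sorted to pick a default baseline.

```python
def treatment_coding(values, reference=None):
    """Code a categorical variable as 0/1 dummy columns.

    One level (the reference) gets no column of its own, so it is
    represented by a row of zeros. Hypothetical helper for illustration.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first sorted level is the baseline
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in data")
    # One dummy column per non-reference level.
    columns = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in columns] for v in values]
    return columns, rows


columns, rows = treatment_coding(["a", "b", "c", "a"])
print(columns)  # dummy columns for the non-reference levels
print(rows)     # one 0/1 row per observation
```

With k levels this yields k - 1 columns, and each fitted coefficient is then read as a difference from the reference level. Passing a different reference, for example treatment_coding([0, 1, 2, 3, 4], reference=3), changes which level is absorbed into the intercept.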

How to force R to use a specified factor level as reference in a regression?

Question: How can I tell R to use a certain level as the reference if I use binary explanatory variables in a regression? It's just using some level by default.

    lm(x ~ y + as.factor(b))

with b in {0, 1, 2, 3, 4}. Let's say I want to use 3 as the reference instead of the 0 that R uses.

Answer 1: See the relevel() function. Here is an example:

    set.seed(123)
    x <- rnorm(100)
    DF <- data.frame(x = x, y = 4 + (1.5*x) + rnorm(100, sd = 2), b = gl(5, 20))
    head(DF)
    str(DF)
    m1 <- lm(y ~ x + b, data = DF)
    summary(m1)

Now alter the factor b in DF by use of the relevel() function:

    DF <- within(DF, b <- relevel(b, ref = 3))
    m2 <- lm(y ~ x +

How to handle categorical features with spark-ml?

Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts

One-Hot Encoding in [R] | Categorical to Dummy Variables [duplicate]

Question: I need to create a new data frame nDF that binarizes all categorical variables and at the same time retains all other variables in a data frame DF. For example, I have the following feature variables: RACE (4 types) and AGE, and an output variable called CLASS.

    DF =
             RACE        AGE (BELOW 21)   CLASS
    Case 1   HISPANIC    0                A
    Case 2   ASIAN       1                A
    Case 3   HISPANIC    1                D
    Case 4   CAUCASIAN   1                B

I want to convert this

Plotting with ggplot2: “Error: Discrete value supplied to continuous scale” on categorical y-axis

Question: The plotting code below gives "Error: Discrete value supplied to continuous scale". What's wrong with this code? It works fine until I try to change the scale, so the error is there. I tried to work out a solution from similar questions but couldn't. This is the head of my data:

    > dput(head(df))
    structure(list(`10` = c(0, 0, 0, 0, 0, 0), `33.95` = c(0, 0, 0, 0, 0, 0), `58.66` = c(0, 0, 0, 0, 0, 0), `84.42` = c(0, 0, 0, 0, 0, 0), `110.21` = c(0, 0, 0, 0, 0, 0), `134.16` = c(0, 0, 0, 0, 0, 0),

Factorize a column of strings in pandas

Question: As the question says, I have a data frame df_original which is quite large but looks like:

    ID Count Column 2 Column 3 Column 4
    RowX 1 234. 255. yes. 452
    RowY 1 123. 135. no. 342
    RowW 1 234. 235. yes. 645
    RowJ 1 123. 115. no. 342
    RowA 1 234. 285. yes. 233
    RowR 1 123. 165. no. 342
    RowX 2 234. 255. yes. 234
    RowY 2 123. 135. yes. 342
    RowW 2 234. 235. yes. 233
    RowJ 2 123. 115. yes. 342
    RowA 2 234. 285. yes. 312
    RowR 2 123. 165. no. 342
    . . .
    RowX 1233 234. 255. yes. 133
    RowY 1233 123. 135. no. 342

R error “sum not meaningful for factors”

Question: I have a file called rRna_RDP_taxonomy_phylum with the following data:

    364 "Firmicutes" 39.31
    244 "Proteobacteria" 26.35
    218 "Actinobacteria" 23.54
    65 "Bacteroidetes" 7.02
    22 "Fusobacteria" 2.38
    6 "Thermotogae" 0.65
    3 unclassified_Bacteria 0.32
    2 "Spirochaetes" 0.22
    1 "Tenericutes" 0.11
    1 Cyanobacteria 0.11

And I'm using this code for creating a pie chart in R:

    if(file.exists("rRna_RDP_taxonomy_phylum")){
    family <- read.table("rRna_RDP_taxonomy_phylum", sep="\t")

Pandas: convert categories to numbers

Question: Suppose I have a dataframe with countries that goes as:

    cc | temp
    US | 37.0
    CA | 12.0
    US | 35.0
    AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to indices instead, such that I will get cc_index = [1,2,1,3]. I'm assuming that there is a faster way than using get_dummies along with a numpy where clause as shown below:

    [np.where(x) for x in df.cc.get_dummies().values]

This is somewhat easier to do in R using 'factors', so I'm hoping pandas has something similar.

Answer 1: First, change the type of the column
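The answer excerpt is cut off after its first step, so the following is only a sketch of two common ways to get integer indices in pandas, assuming "change the type of the column" refers to the category dtype. The column names cc_code and cc_index are illustrative, and the data frame is rebuilt from the question's sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Option 1: convert to the category dtype and read the integer codes.
# Codes follow the sorted category order, so they are 0-based and
# alphabetical here (AU=0, CA=1, US=2).
df["cc_code"] = df["cc"].astype("category").cat.codes

# Option 2: pd.factorize numbers values in order of first appearance
# (US=0, CA=1, AU=2); adding 1 gives the 1-based indices from the question.
codes, uniques = pd.factorize(df["cc"])
df["cc_index"] = codes + 1

print(df)
```

The two options differ only in how codes are assigned: the category dtype keeps a reusable mapping on the column itself, while factorize returns the codes and the unique values as separate objects.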