categorical-data

How (and why) do you use contrasts?

二次信任 submitted on 2019-11-27 11:40:21
In what cases do you create contrasts in your analysis? How is it done and what is it used for? I checked ?contrasts and ?C - both lead to "Chapter 2 of Statistical Models in S", which is not readily available to me.

Contrasts are needed when you fit linear models with factors (i.e. categorical variables) as explanatory variables. The contrast specifies how the levels of each factor are coded into a set of numeric dummy variables for fitting the model. Here are some good notes on the different varieties of contrasts: http://www.unc.edu/courses/2006spring/ecol/145/001/docs
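To make "coded into dummy variables" concrete, here is a minimal sketch of treatment (dummy) coding, the scheme R's contr.treatment produces for unordered factors. It is written in plain Python purely for illustration; the function name treatment_contrasts and the factor levels are hypothetical and do not come from the original answer.

```python
def treatment_contrasts(levels):
    """Map each level to its row of indicator values under treatment coding.

    The first level is the baseline (all zeros); every other level gets
    exactly one indicator column, so k levels yield k - 1 columns.
    """
    non_baseline = levels[1:]
    return {
        level: [1 if level == other else 0 for other in non_baseline]
        for level in levels
    }


# Hypothetical factor with three levels; "low" acts as the baseline.
levels = ["low", "medium", "high"]
coding = treatment_contrasts(levels)

# Build design-matrix rows for a few observations; the leading 1 is the intercept.
observations = ["medium", "low", "high", "medium"]
design_rows = [[1] + coding[obs] for obs in observations]
```

With this coding, each fitted coefficient compares one level against the baseline. Other contrast schemes (sum, Helmert, polynomial) change only the numbers in the coding table, not the idea.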

How to subplot seaborn catplot (kind='count') on-top of catplot (kind='violin') with sharex=True

十年热恋 submitted on 2019-11-27 08:11:14
Question: So far I have tried the following code:

    # Import to handle plotting
    import seaborn as sns
    # Import pyplot, figures inline, set style, plot pairplot
    import matplotlib.pyplot as plt
    # Make the figure space
    fig = plt.figure(figsize=(2,4))
    gs = fig.add_gridspec(2, 4)
    ax1 = fig.add_subplot(gs[0, :])
    ax2 = fig.add_subplot(gs[1, :])
    # Load the example car crash dataset
    tips = sns.load_dataset("tips")
    # Plot the frequency counts grouped by time
    sns.catplot(x='sex', hue='smoker', kind='count', col=

Reduce number of levels for large categorical variables

删除回忆录丶 submitted on 2019-11-27 07:29:52
Question: Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of large categorical factors? I want to achieve something similar to R: "Binning" categorical variables, but encoding into the top-k most frequent levels plus "other".

Answer 1: Here is an example in R that uses data.table a bit, but it should be easy without data.table as well.

    # Load data.table
    require(data.table)
    # Some data
    set.seed(1)
    dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace
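Since the question also asks about Python, the top-k idea can be sketched with only the standard library. This is an illustration under stated assumptions, not the approach of the truncated R answer; the function name keep_top_k and the sample data are hypothetical.

```python
from collections import Counter


def keep_top_k(values, k, other="other"):
    """Keep the k most frequent levels and relabel everything else as `other`."""
    counts = Counter(values)
    # most_common(k) returns the k highest counts; ties are broken by
    # first appearance in the data.
    top = {level for level, _ in counts.most_common(k)}
    return [v if v in top else other for v in values]


# Hypothetical data: four levels with frequencies 5, 3, 2 and 1.
values = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
reduced = keep_top_k(values, k=2)
```

Here levels "C" and "D" are merged, so the reduced variable has three levels: "A", "B" and "other".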

Combining low frequency counts

江枫思渺然 submitted on 2019-11-27 07:07:35
Question: I am trying to collapse a nominal categorical vector by combining low-frequency levels into an 'Other' category. The data (a column of a data frame) looks like this, and contains information for all 50 states:

    California
    Florida
    Alabama
    ...

table(colname)/length(colname) correctly returns the frequencies, and what I'm trying to do is lump anything below a given threshold (say f=0.02) together. What is the correct approach?

Answer 1: From the sounds of it, something like the following should work for
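The thresholding step described in the question can be sketched as follows. This is a plain-Python illustration, not the truncated R answer; the function name lump_rare and the state counts are made up for the example.

```python
from collections import Counter


def lump_rare(values, threshold=0.02, other="Other"):
    """Relabel every level whose relative frequency is below `threshold`."""
    n = len(values)
    counts = Counter(values)
    rare = {level for level, count in counts.items() if count / n < threshold}
    return [other if v in rare else v for v in values]


# Hypothetical counts: Alabama makes up 1% of the rows, below f = 0.02.
states = ["California"] * 60 + ["Florida"] * 39 + ["Alabama"] * 1
lumped = lump_rare(states, threshold=0.02)
```

The relative frequencies computed inside lump_rare correspond to table(colname)/length(colname) in the question.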

Factorize a column of strings in pandas

北城以北 submitted on 2019-11-27 05:39:02
As the question says, I have a data frame df_original which is quite large but looks like:

          ID    Count  Column 2  Column 3  Column 4
    RowX  1     234.   255.      yes.      452
    RowY  1     123.   135.      no.       342
    RowW  1     234.   235.      yes.      645
    RowJ  1     123.   115.      no.       342
    RowA  1     234.   285.      yes.      233
    RowR  1     123.   165.      no.       342
    RowX  2     234.   255.      yes.      234
    RowY  2     123.   135.      yes.      342
    RowW  2     234.   235.      yes.      233
    RowJ  2     123.   115.      yes.      342
    RowA  2     234.   285.      yes.      312
    RowR  2     123.   165.      no.       342
    . . .
    RowX  1233  234.   255.      yes.      133
    RowY  1233  123.   135.      no.       342
    RowW  1233  234.   235.      no.       253
    RowJ  1233  123.   115.      yes.      342
    RowA  1233  234.   285.      yes.      645
    RowR  1233  123.

Plotting with ggplot2: “Error: Discrete value supplied to continuous scale” on categorical y-axis

不问归期 submitted on 2019-11-27 04:28:53
The plotting code below gives "Error: Discrete value supplied to continuous scale". What's wrong with this code? It works fine until I try to change the scale, so the error must be there. I tried to work out a solution from similar problems but couldn't. This is the head of my data:

    > dput(head(df))
    structure(list(`10` = c(0, 0, 0, 0, 0, 0), `33.95` = c(0, 0, 0, 0, 0, 0),
        `58.66` = c(0, 0, 0, 0, 0, 0), `84.42` = c(0, 0, 0, 0, 0, 0),
        `110.21` = c(0, 0, 0, 0, 0, 0), `134.16` = c(0, 0, 0, 0, 0, 0),
        `164.69` = c(0, 0, 0, 0, 0, 0), `199.1` = c(0, 0, 0, 0, 0, 0),
        `234.35` = c(0, 0, 0, 0, 0, 0), `257.19` = c

One-Hot Encoding in [R] | Categorical to Dummy Variables [duplicate]

别来无恙 submitted on 2019-11-27 04:15:21
This question already has an answer here: All Levels of a Factor in a Model Matrix in R (10 answers)

I need to create a new data frame nDF that binarizes all categorical variables and at the same time retains all other variables in a data frame DF. For example, I have the following feature variables: RACE (4 types) and AGE, and an output variable called CLASS.

    DF =
            RACE       AGE (BELOW 21)  CLASS
    Case 1  HISPANIC   0               A
    Case 2  ASIAN      1               A
    Case 3  HISPANIC   1               D
    Case 4  CAUCASIAN  1               B

I want to convert this into nDF with five (5) variables, or even four (4):

            RACE.1  RACE.2  RACE.3  AGE (BELOW 21)  CLASS
    Case 1  0       0       0
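The conversion being asked for (indicator columns for the categorical variable, all other columns kept) can be sketched in plain Python as below. This is only an illustration, not the linked R answer: the helper name one_hot is hypothetical, the rows are the four cases shown in the question, and the indicator columns are named after the levels (RACE.ASIAN, ...) instead of the numbered RACE.1, RACE.2, ... in the question.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator per observed level.

    All other columns are copied through unchanged.
    """
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}.{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


# The four example cases from the question.
DF = [
    {"RACE": "HISPANIC", "AGE (BELOW 21)": 0, "CLASS": "A"},
    {"RACE": "ASIAN", "AGE (BELOW 21)": 1, "CLASS": "A"},
    {"RACE": "HISPANIC", "AGE (BELOW 21)": 1, "CLASS": "D"},
    {"RACE": "CAUCASIAN", "AGE (BELOW 21)": 1, "CLASS": "B"},
]
nDF = one_hot(DF, "RACE")
```

This version emits one column per level that appears in the data. Dropping one of them as a baseline gives the smaller variant mentioned in the question, which is the usual choice when the columns feed a linear model with an intercept.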

Legend of a raster map with categorical data

十年热恋 submitted on 2019-11-26 18:17:16
Question: I would like to plot a raster containing 4 different values (1) with a categorical text legend describing the categories, such as (2), but with colour boxes. I've tried using legend, for example:

    legend(1, -20, legend = c("land", "ocean/lake", "rivers", "water bodies"))

but I don't know how to associate each value with its displayed colour. Is there a way to retrieve the colours displayed by 'plot' and use them in the legend?

Answer 1: The rasterVis package includes a Raster method for levelplot(), which

Add extra level to factors in dataframe

大兔子大兔子 submitted on 2019-11-26 15:59:55
Question: I have a data frame with numeric and ordered factor columns. I have a lot of NA values, so no level is assigned to them. I changed NA to "No Answer", but the levels of the factor columns don't contain that level. Here is how I started, but I don't know how to finish it in an elegant way:

    addNoAnswer = function(df) {
      factorOrNot = sapply(df, is.factor)
      levelsList = lapply(df[, factorOrNot], levels)
      levelsList = lapply(levelsList, function(x) c(x, "No Answer"))
      ...

Is there a way to directly apply

Make Frequency Histogram for Factor Variables

北城余情 submitted on 2019-11-26 15:45:33
Question: I am very new to R, so I apologize for such a basic question. I spent an hour googling this issue, but couldn't find a solution. Say I have some categorical data in my data set about common pet types. I input it as a character vector in R that contains the names of different types of animals. I created it like this:

    animals <- c("cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird")

I turn it into a factor for use with other vectors in my data frame:

    animalFactor <- as