categorical-data | 易学教程

How (and why) do you use contrasts?

In what cases do you create contrasts in your analysis? How is it done, and what is it used for? I checked ?contrasts and ?C; both lead to "Chapter 2 of Statistical Models in S", which is not readily available to me.

Contrasts are needed when you fit linear models with factors (i.e. categorical variables) as explanatory variables. The contrast specifies how the levels of each factor are coded into a set of numeric dummy variables for fitting the model. Here are some good notes on the different varieties of contrasts: http://www.unc.edu/courses/2006spring/ecol/145/001/docs
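To make the idea of "coding levels into dummy variables" concrete, here is a minimal sketch in plain Python of treatment (dummy) coding, the scheme R applies by default to unordered factors. The function name `treatment_contrasts` is hypothetical and chosen only for illustration; it is not R's `contr.treatment`, and it assumes the first level is the reference.

```python
def treatment_contrasts(levels):
    """Map each level to a row of len(levels) - 1 dummy columns.

    Assumption: the first level is the reference and is coded as all zeros.
    """
    levels = list(levels)
    non_reference = levels[1:]
    return {
        level: [1 if level == other else 0 for other in non_reference]
        for level in levels
    }


coding = treatment_contrasts(["A", "B", "C"])
for level, row in coding.items():
    print(level, row)
```

With this coding, each model coefficient for a dummy column compares one level against the reference level. Other contrast schemes change only the rows of this table.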

How to subplot seaborn catplot (kind='count') on-top of catplot (kind='violin') with sharex=True

Question: So far I have tried the following code:

    # Import to handle plotting
    import seaborn as sns
    # Import pyplot
    import matplotlib.pyplot as plt
    # Make the figure space
    fig = plt.figure(figsize=(2,4))
    gs = fig.add_gridspec(2, 4)
    ax1 = fig.add_subplot(gs[0, :])
    ax2 = fig.add_subplot(gs[1, :])
    # Load the example tips dataset
    tips = sns.load_dataset("tips")
    # Plot the frequency counts grouped by time
    sns.catplot(x='sex', hue='smoker', kind='count', col=

Reduce number of levels for large categorical variables

Question: Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of large categorical factors? I want to achieve something similar to R: "Binning" categorical variables, but encoding into the top-k most frequent levels plus "other".

Answer 1: Here is an example in R that uses data.table a bit, but it should be easy without data.table too.

    # Load data.table
    require(data.table)
    # Some data
    set.seed(1)
    dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace
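Since the question also asks about Python, the top-k idea can be sketched with only the standard library. This is an illustrative sketch, not the data.table answer above; the function name `lump_top_k` and the default label "other" are assumptions made for the example.

```python
from collections import Counter


def lump_top_k(values, k, other="other"):
    """Keep the k most frequent levels and replace all the rest with `other`."""
    counts = Counter(values)
    # most_common(k) returns the k (level, count) pairs with the highest counts
    keep = {level for level, _ in counts.most_common(k)}
    return [value if value in keep else other for value in values]


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
lumped = lump_top_k(values, k=2)
print(Counter(lumped))
```

Ties at the cut-off are broken by whichever level `Counter` saw first, so a real pipeline may want an explicit tie rule.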

Combining low frequency counts

Question: I am trying to collapse a nominal categorical vector by combining low-frequency counts into an 'Other' category. The data (a column of a data frame) looks like this and contains information for all 50 states:

    California
    Florida
    Alabama
    ...

table(colname)/length(colname) correctly returns the frequencies, and what I'm trying to do is lump anything below a given threshold (say f=0.02) together. What is the correct approach?

Answer 1: From the sounds of it, something like the following should work for
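The threshold rule described in the question can be sketched as follows. This is a plain-Python illustration rather than the R answer, which is cut off above; the function name `lump_rare` and the sample data are assumptions made for the example.

```python
from collections import Counter


def lump_rare(values, threshold, other="Other"):
    """Replace every level whose relative frequency is below `threshold`."""
    counts = Counter(values)
    total = len(values)
    # Relative frequency = count / total, the analogue of table(x) / length(x)
    rare = {level for level, n in counts.items() if n / total < threshold}
    return [other if value in rare else value for value in values]


# Hypothetical data: Alabama makes up 1% of the rows, below f = 0.02
states = ["California"] * 60 + ["Florida"] * 39 + ["Alabama"] * 1
lumped = lump_rare(states, threshold=0.02)
print(Counter(lumped))
```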

Factorize a column of strings in pandas

As the question says, I have a data frame df_original which is quite large but looks like:

    ID    Count  Column 2  Column 3  Column 4
    RowX  1      234.      255.      yes.      452
    RowY  1      123.      135.      no.       342
    RowW  1      234.      235.      yes.      645
    RowJ  1      123.      115.      no.       342
    RowA  1      234.      285.      yes.      233
    RowR  1      123.      165.      no.       342
    RowX  2      234.      255.      yes.      234
    RowY  2      123.      135.      yes.      342
    RowW  2      234.      235.      yes.      233
    RowJ  2      123.      115.      yes.      342
    RowA  2      234.      285.      yes.      312
    RowR  2      123.      165.      no.       342
    . . .
    RowX  1233   234.      255.      yes.      133
    RowY  1233   123.      135.      no.       342
    RowW  1233   234.      235.      no.       253
    RowJ  1233   123.      115.      yes.      342
    RowA  1233   234.      285.      yes.      645
    RowR  1233   123.

Plotting with ggplot2: “Error: Discrete value supplied to continuous scale” on categorical y-axis

The plotting code below gives "Error: Discrete value supplied to continuous scale". What's wrong with this code? It works fine until I try to change the scale, so the error must be there. I tried to work out a solution from similar problems but couldn't. This is the head of my data:

    > dput(head(df))
    structure(list(`10` = c(0, 0, 0, 0, 0, 0), `33.95` = c(0, 0, 0, 0, 0, 0),
    `58.66` = c(0, 0, 0, 0, 0, 0), `84.42` = c(0, 0, 0, 0, 0, 0),
    `110.21` = c(0, 0, 0, 0, 0, 0), `134.16` = c(0, 0, 0, 0, 0, 0),
    `164.69` = c(0, 0, 0, 0, 0, 0), `199.1` = c(0, 0, 0, 0, 0, 0),
    `234.35` = c(0, 0, 0, 0, 0, 0), `257.19` = c

One-Hot Encoding in [R] | Categorical to Dummy Variables [duplicate]

This question already has an answer here: All Levels of a Factor in a Model Matrix in R (10 answers)

I need to create a new data frame nDF that binarizes all categorical variables and at the same time retains all other variables in a data frame DF. For example, I have the following feature variables: RACE (4 types) and AGE, and an output variable called CLASS.

    DF =
            RACE       AGE (BELOW 21)  CLASS
    Case 1  HISPANIC   0               A
    Case 2  ASIAN      1               A
    Case 3  HISPANIC   1               D
    Case 4  CAUCASIAN  1               B

I want to convert this into nDF with five (5) variables, or even four (4):

            RACE.1  RACE.2  RACE.3  AGE (BELOW 21)  CLASS
    Case 1  0       0       0
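The transformation being asked for (replace one categorical column with 0/1 indicator columns and keep every other column) can be sketched in plain Python as below. The question is about R, so this is only an illustration of the technique: the function name `one_hot`, the `drop_first` option, the dictionary-per-row layout, and the `RACE.<level>` column naming are assumptions made for the example.

```python
def one_hot(rows, column, drop_first=False):
    """Replace `column` in each row with one 0/1 indicator per level.

    With drop_first=True the first level (in sorted order) becomes the
    reference and gets no indicator column of its own.
    """
    levels = sorted({row[column] for row in rows})
    if drop_first:
        levels = levels[1:]
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}.{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


# The four cases shown in the question
df = [
    {"RACE": "HISPANIC", "AGE_BELOW_21": 0, "CLASS": "A"},
    {"RACE": "ASIAN", "AGE_BELOW_21": 1, "CLASS": "A"},
    {"RACE": "HISPANIC", "AGE_BELOW_21": 1, "CLASS": "D"},
    {"RACE": "CAUCASIAN", "AGE_BELOW_21": 1, "CLASS": "B"},
]
for row in one_hot(df, "RACE"):
    print(row)
```

Keeping all indicator columns gives the "five variables" form for these three observed levels; `drop_first=True` gives the shorter form with one level as the implicit reference.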

Legend of a raster map with categorical data

Question: I would like to plot a raster containing 4 different values (1) with a categorical text legend describing the categories, such as 2 but with colour boxes. I've tried using legend, such as:

    legend(1, -20, legend = c("land", "ocean/lake", "rivers", "water bodies"))

but I don't know how to associate each value with its displayed colour. Is there a way to retrieve the colours displayed by 'plot' and use them in the legend?

Answer 1: The rasterVis package includes a Raster method for levelplot(), which

Add extra level to factors in dataframe

Question: I have a data frame with numeric and ordered factor columns. I have a lot of NA values, so no level is assigned to them. I changed NA to "No Answer", but the levels of the factor columns don't contain that level. Here is how I started, but I don't know how to finish it in an elegant way:

    addNoAnswer = function(df) {
      factorOrNot = sapply(df, is.factor)
      levelsList = lapply(df[, factorOrNot], levels)
      levelsList = lapply(levelsList, function(x) c(x, "No Answer"))
      ...

Is there a way to directly apply

Make Frequency Histogram for Factor Variables

Question: I am very new to R, so I apologize for such a basic question. I spent an hour googling this issue but couldn't find a solution. Say I have some categorical data in my data set about common pet types. I input it as a character vector in R that contains the names of different types of animals. I created it like this:

    animals <- c("cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird")

I turn it into a factor for use with other vectors in my data frame:

    animalFactor <- as
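A frequency histogram of a categorical variable is a count per level followed by one bar per count. The counting step can be sketched in plain Python as below, using the same animals vector as in the question. This illustrates the idea rather than giving an R solution, and the text bars stand in for a real plot.

```python
from collections import Counter

animals = ["cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog",
           "cat", "cat", "bird"]

# Count how often each level occurs
counts = Counter(animals)

# Draw one text bar per level, most frequent first
for animal, n in counts.most_common():
    print(f"{animal:<5} {'#' * n} ({n})")
```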