categorical-data | 易学教程

Pandas DataFrame sort by categorical column but by specific class ordering

I would like to select the top entries in a Pandas dataframe based on the entries of a specific column by using df_selected = df_targets.head(N). Each entry has a target value (by order of importance): Likely Supporter, GOTV, Persuasion, Persuasion+GOTV. Unfortunately, if I do df_targets = df_targets.sort("target") the ordering will be alphabetical (GOTV, Likely Supporter, ...). I was hoping for a keyword like list_ordering, as in: my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"] df_targets = df_targets.sort("target", list_ordering=my_list) To deal with this issue I
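One way to get the ordering the question asks for is to turn the column into an ordered categorical before sorting. The sketch below is a minimal illustration, not the answer from the original thread: the sample rows and the "name" column are made up, and it uses sort_values because DataFrame.sort has been removed from current pandas.

```python
import pandas as pd

# Ordering taken from the question; the rows below are hypothetical sample data.
my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
df_targets = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "target": ["GOTV", "Persuasion+GOTV", "Likely Supporter", "Persuasion", "GOTV"],
})

# An ordered categorical sorts by the position of each value in my_list,
# not alphabetically.
df_targets["target"] = pd.Categorical(
    df_targets["target"], categories=my_list, ordered=True
)

# mergesort is stable, so rows with the same target keep their original order.
df_sorted = df_targets.sort_values("target", kind="mergesort")

N = 3
df_selected = df_sorted.head(N)
print(df_selected)
```

Values that are missing from my_list would become NaN in the categorical column, so the list should cover every target value that occurs.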

Reduce number of levels for large categorical variables

Are there any ready-to-use libraries or packages for Python or R to reduce the number of levels of large categorical factors? I want to achieve something similar to R: "Binning" categorical variables, but encode into the top-k most frequent levels and "other". Here is an example in R using data.table a bit, but it should be easy without data.table also. # Load data.table require(data.table) # Some data set.seed(1) dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = T)), weight = rnorm(n = 10e3, mean = 70, sd = 20)) # Decide the minimum frequency a level needs... min
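The minimum-frequency idea from the R snippet can be sketched in pandas as follows. The helper name lump_rare_levels, its parameters and the sample data are hypothetical and only illustrate the approach.

```python
import pandas as pd

def lump_rare_levels(values, min_freq, other_label="other"):
    """Replace levels that occur fewer than min_freq times with other_label."""
    counts = values.value_counts()
    keep = counts[counts >= min_freq].index
    # where() keeps values that satisfy the condition and replaces the rest.
    return values.where(values.isin(keep), other_label)

# Hypothetical sample: "C" and "D" each occur once, below the threshold of 3.
s = pd.Series(["A"] * 5 + ["B"] * 3 + ["C", "D"])
lumped = lump_rare_levels(s, min_freq=3)
print(lumped.value_counts())
```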

How can I ensure that a partition has representative observations from each level of a factor?

I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable? test.df <- data.frame(a = sample(c(0,1),100, rep = T), b = factor(sample(letters, 100, rep = T)), c = factor(sample(c("apple", "orange"), 100, rep = T))) set.seed(123) partition
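The question is about R, but the underlying idea can be sketched in Python with pandas: first reserve one row per level of every factor column for the training set, then fill the rest of the training set at random. The function below is a hypothetical illustration of that idea, not the partition() function from the question, and its parameter names are assumptions.

```python
import numpy as np
import pandas as pd

def partition(df, factor_cols, train_frac=0.7, seed=123):
    """Split df so that every level of each factor column appears in the training set."""
    rng = np.random.default_rng(seed)

    # Reserve one randomly chosen row per level of each factor column.
    forced = set()
    for col in factor_cols:
        for idx in df.groupby(col).groups.values():
            forced.add(int(rng.choice(np.asarray(idx))))

    # Fill the training set up to the requested size with random remaining rows.
    n_train = max(int(round(train_frac * len(df))), len(forced))
    rest = [i for i in df.index if i not in forced]
    extra = rng.choice(rest, size=n_train - len(forced), replace=False)
    train_idx = sorted(forced | {int(i) for i in extra})

    return df.loc[train_idx], df.drop(index=train_idx)

# Hypothetical data shaped like the example in the question.
gen = np.random.default_rng(0)
test_df = pd.DataFrame({
    "a": gen.integers(0, 2, 100),
    "b": gen.choice(list("abcdefghijklmnopqrstuvwxyz"), 100),
    "c": gen.choice(["apple", "orange"], 100),
})
train, test = partition(test_df, ["b", "c"])
print(len(train), len(test))
```

This sketch assumes a default integer index. Levels with a single observation end up only in the training set, so the test set can still lack some levels.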

Generating Multiple Plots in ggplot by Factor

I have a data set that I want to generate multiple plots for based on one of the columns. That is, I want to be able to use ggplot to make a separate plot for each variety of that factor. Here's some quick sample data: Variety = as.factor(c("a","b","a","b","a","b","a","b","a","b")) Var1 = runif(10) Var2 = runif(10) mydata = as.data.frame(cbind(Variety,Var1,Var2)) I'd like to generate two separate plots of Var1 over Var2, one for Variety A and a second for Variety B, preferably in a single command, but if there's a way to do it without splitting the table, that would be OK as well. You can use

Rename the less frequent categories by “OTHER” python

Question: In my dataframe I have some categorical columns with over 100 different categories. I want to rank the categories by frequency, keep the 9 most frequent ones, and automatically rename the less frequent categories to OTHER. Example: Here is my df: print(df) Employee_number Jobrol 0 1 Sales Executive 1 2 Research Scientist 2 3 Laboratory Technician 3 4 Sales Executive 4 5 Research Scientist 5 6 Laboratory Technician 6 7 Sales Executive 7 8 Research Scientist 8 9
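A minimal sketch of the top-k approach the question describes, using the Jobrol column name from the question. The helper keep_top_k and the sample rows are hypothetical, and k is set to 2 here only to keep the example small (the question uses 9).

```python
import pandas as pd

def keep_top_k(values, k=9, other_label="OTHER"):
    """Keep the k most frequent categories and rename all others to other_label."""
    # value_counts() sorts by frequency in descending order.
    top = values.value_counts().index[:k]
    return values.where(values.isin(top), other_label)

# Hypothetical sample with distinct frequencies for the two top categories.
df = pd.DataFrame({
    "Jobrol": ["Sales Executive"] * 3
              + ["Research Scientist"] * 2
              + ["Laboratory Technician", "Manager"],
})
df["Jobrol"] = keep_top_k(df["Jobrol"], k=2)
print(df["Jobrol"].value_counts())
```

When several categories tie at the cut-off, which of them is kept depends on the order returned by value_counts().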

Issue with OneHotEncoder for categorical features

I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do so, as follows: from sklearn import preprocessing cat_features = ['color', 'director_name', 'actor_2_name'] enc = preprocessing.OneHotEncoder(categorical_features=cat_features) enc.fit(dataset.values) However, I couldn't proceed because I am getting this error: array = np.array(array, dtype=dtype, order=order, copy=copy) ValueError: could not convert string to float: PG I am surprised that it complains about the string, since the encoder is supposed to convert it! Am I
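In recent scikit-learn releases, OneHotEncoder no longer accepts categorical_features, and the usual way to encode only some columns is ColumnTransformer. The sketch below assumes such a release and a pandas DataFrame as input. The column names come from the question, while the row values and the numeric "duration" column are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows; only the three categorical column names come from the question.
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["X", "Y", "X"],
    "actor_2_name": ["P", "Q", "R"],
    "duration": [120, 95, 140],
})
cat_features = ["color", "director_name", "actor_2_name"]

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough",  # leave the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(dataset)
print(encoded.shape)
```

Passing the DataFrame itself, rather than dataset.values, lets the transformer select columns by name.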

Legend of a raster map with categorical data

I would like to plot a raster containing 4 different values with a categorical text legend describing the categories, but with colour boxes. I've tried using legend, such as: legend(1, -20, legend = c("land", "ocean/lake", "rivers", "water bodies")) but I don't know how to associate each value with the displayed colour. Is there a way to retrieve the colours displayed by 'plot' and to use them in the legend? The rasterVis package includes a Raster method for levelplot(), which plots categorical variables and produces an appropriate legend: library(raster) library(rasterVis) ## Example

Create dummies from column with multiple values in pandas

I am looking for a pythonic way to handle the following problem. The pandas.get_dummies() method is great for creating dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly. Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them, so that a row could have multiple 1s. Is there a way to handle this in a pythonic way? I could only think of some
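For a column whose values combine several labels with a separator, pandas offers Series.str.get_dummies, which splits each value on the separator before building the indicator columns. A minimal sketch using the 'label' values from the question:

```python
import pandas as pd

df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each value on "*" and create one indicator column per distinct label.
dummies = df["label"].str.get_dummies(sep="*")
print(dummies)
```

The result has four columns (A, B, C, D), and the rows for 'A*C' and 'C*D' each contain two 1s.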

Add extra level to factors in dataframe

I have a data frame with numeric and ordered factor columns. I have a lot of NA values, so no level is assigned to them. I changed NA to "No Answer", but the levels of the factor columns don't contain that level, so here is how I started, but I don't know how to finish it in an elegant way: addNoAnswer = function(df) { factorOrNot = sapply(df, is.factor) levelsList = lapply(df[, factorOrNot], levels) levelsList = lapply(levelsList, function(x) c(x, "No Answer")) ... Is there a way to directly apply new levels to factor columns, for example, something like this: df[, factorOrNot] = lapply(df[,
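The question concerns R factors. For comparison, the same idea in pandas would be to add the new category to each categorical column first and only then fill the missing values. This is a pandas analogue under assumed data, not the R solution the question asks for, and the column names below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one numeric and one ordered categorical column.
df = pd.DataFrame({
    "score": [1.0, 2.0, 3.0],
    "rating": pd.Categorical(
        ["low", None, "high"], categories=["low", "high"], ordered=True
    ),
})

# Add the extra level to every categorical column, then fill missing values with it.
for col in df.select_dtypes("category").columns:
    df[col] = df[col].cat.add_categories("No Answer").fillna("No Answer")

print(df["rating"].cat.categories.tolist())
```

add_categories appends the new level at the end, so in an ordered column "No Answer" ranks above the existing levels.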

Make Frequency Histogram for Factor Variables

I am very new to R, so I apologize for such a basic question. I spent an hour googling this issue, but couldn't find a solution. Say I have some categorical data in my data set about common pet types. I input it as a character vector in R that contains the names of different types of animals. I created it like this: animals <- c("cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird") I turn it into a factor for use with other vectors in my data frame: animalFactor <- as.factor(animals) I now want to create a histogram that shows the frequency of each variable on the y
