categorical-data

Pandas DataFrame sort by categorical column but by specific class ordering

落花浮王杯 submitted on 2019-11-29 10:50:44
I would like to select the top entries in a Pandas dataframe based on the entries of a specific column by using df_selected = df_targets.head(N). Each entry has a target value (by order of importance): Likely Supporter, GOTV, Persuasion, Persuasion+GOTV. Unfortunately, if I do df_targets = df_targets.sort("target") the ordering will be alphabetical (GOTV, Likely Supporter, ...). I was hoping for a keyword like list_ordering, as in: my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"] df_targets = df_targets.sort("target", list_ordering=my_list) To deal with this issue I
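One way to get the requested ordering, sketched here under assumptions, is to convert the column to an ordered categorical whose categories follow my_list, then sort on it. The sample rows and the "name" column are hypothetical; sort_values is used because DataFrame.sort is no longer available in current pandas.

```python
import pandas as pd

# Hypothetical sample data; only the "target" labels come from the question.
df_targets = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "target": ["GOTV", "Persuasion+GOTV", "Likely Supporter", "Persuasion", "GOTV"],
})

my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# An ordered categorical sorts by the position of each label in `categories`,
# not alphabetically.
df_targets["target"] = pd.Categorical(
    df_targets["target"], categories=my_list, ordered=True
)

N = 3
# mergesort is stable, so rows with the same target keep their original order.
df_selected = df_targets.sort_values("target", kind="mergesort").head(N)
print(df_selected)
```

Any label that is missing from my_list becomes NaN during the conversion, so the list should cover every value present in the column.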

Reduce number of levels for large categorical variables

岁酱吖の submitted on 2019-11-28 13:03:22
Are there some ready to use libraries or packages for python or R to reduce the number of levels for large categorical factors? I want to achieve something similar to R: "Binning" categorical variables but encode into the most frequently top-k factors and "other". Here is an example in R using data.table a bit, but it should be easy without data.table also. # Load data.table require(data.table) # Some data set.seed(1) dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = T)), weight = rnorm(n = 10e3, mean = 70, sd = 20)) # Decide the minimum frequency a level needs... min
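In Python, the minimum-frequency idea from the R example can be sketched with plain pandas. This is an illustration, not a ready-made library: the helper name lump_rare_levels, the min_share parameter, and the sample data are all hypothetical, and the column is assumed to hold plain strings rather than a pandas categorical dtype.

```python
import pandas as pd

def lump_rare_levels(values: pd.Series, min_share: float = 0.05,
                     other: str = "other") -> pd.Series:
    """Replace levels whose relative frequency is below min_share with `other`.

    Hypothetical helper; assumes an object/string Series.
    """
    shares = values.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    # Keep values that belong to a frequent level, overwrite everything else.
    return values.where(values.isin(keep), other)

# Made-up data: two frequent levels and two rare ones.
s = pd.Series(["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2)
lumped = lump_rare_levels(s, min_share=0.05)
print(lumped.value_counts())
```

Missing values are not counted by value_counts, so with this sketch they also end up in the "other" bucket; filter them out first if that is not wanted.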

How can I ensure that a partition has representative observations from each level of a factor?

时光总嘲笑我的痴心妄想 submitted on 2019-11-28 10:28:39
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable? test.df <- data.frame(a = sample(c(0,1),100, rep = T), b = factor(sample(letters, 100, rep = T)), c = factor(sample(c("apple", "orange"), 100, rep = T))) set.seed(123) partition

Generating Multiple Plots in ggplot by Factor

那年仲夏 submitted on 2019-11-28 08:28:20
I have a data set that I want to generate multiple plots for based on one of the columns. That is, I want to be able to use ggplot to make a separate plot for each variety of that factor. Here's some quick sample data: Variety = as.factor(c("a","b","a","b","a","b","a","b","a","b")) Var1 = runif(10) Var2 = runif(10) mydata = as.data.frame(cbind(Variety,Var1,Var2)) I'd like to generate two separate plots of Var1 over Var2, one for Variety A, a second for Variety B, preferably in a single command, but if there's a way to do it without splitting the table, that would be ok as well. You can use

Rename the less frequent categories by “OTHER” python

放肆的年华 submitted on 2019-11-28 04:14:19
Question: In my dataframe I have some categorical columns with over 100 different categories. I want to rank the categories by frequency, keep the 9 most frequent categories, and automatically rename the less frequent ones to OTHER. Example: here is my df: print(df)
   Employee_number                 Jobrol
0                1        Sales Executive
1                2     Research Scientist
2                3  Laboratory Technician
3                4        Sales Executive
4                5     Research Scientist
5                6  Laboratory Technician
6                7        Sales Executive
7                8     Research Scientist
8                9
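A minimal sketch of the top-9 rule for a single column, assuming the Jobrol column holds plain strings. The role names and counts below are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical data: 12 made-up roles, where role_i appears (12 - i) times.
roles = [f"role_{i}" for i in range(12) for _ in range(12 - i)]
df = pd.DataFrame({"Employee_number": range(1, len(roles) + 1), "Jobrol": roles})

# value_counts() sorts by descending frequency, so the first 9 index entries
# are the 9 most frequent categories.
top9 = df["Jobrol"].value_counts().index[:9]

# Keep the frequent categories and overwrite the rest with "OTHER".
df["Jobrol"] = df["Jobrol"].where(df["Jobrol"].isin(top9), "OTHER")
print(df["Jobrol"].value_counts())
```

To treat several categorical columns the same way, the last two statements can be repeated in a loop over the column names.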

Issue with OneHotEncoder for categorical features

时光怂恿深爱的人放手 submitted on 2019-11-27 21:07:25
I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do so, as follows: from sklearn import preprocessing cat_features = ['color', 'director_name', 'actor_2_name'] enc = preprocessing.OneHotEncoder(categorical_features=cat_features) enc.fit(dataset.values) However, I couldn't proceed as I am getting this error: array = np.array(array, dtype=dtype, order=order, copy=copy) ValueError: could not convert string to float: PG I am surprised that it complains about the string, as it is supposed to convert it!! Am I
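For context, the categorical_features argument expected column indices or a boolean mask rather than column names, and it has since been removed from OneHotEncoder. With a recent scikit-learn, one possible approach is to select the columns by name through ColumnTransformer, as sketched below. The small dataset frame and its "duration" column are hypothetical stand-ins for the asker's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the asker's `dataset`.
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["A", "B", "A"],
    "actor_2_name": ["X", "Y", "Z"],
    "duration": [120, 95, 110],
})
cat_features = ["color", "director_name", "actor_2_name"]

# One-hot encode only the named columns; pass the remaining columns through.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough",
)

# Fit on the DataFrame itself (not dataset.values) so columns can be
# selected by name.
encoded = ct.fit_transform(dataset)
print(encoded.shape)
```

handle_unknown="ignore" is a design choice for this sketch: categories unseen during fitting are encoded as all zeros at transform time instead of raising an error.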

Legend of a raster map with categorical data

与世无争的帅哥 submitted on 2019-11-27 20:16:37
I would like to plot a raster containing 4 different values (see [1]) with a categorical text legend describing the categories, such as in [2] but with colour boxes. I've tried using legend, such as: legend(1, -20, legend = c("land", "ocean/lake", "rivers", "water bodies")) but I don't know how to associate one value with the displayed colour. Is there a way to retrieve the colour displayed with 'plot' and to use it in the legend? The rasterVis package includes a Raster method for levelplot(), which plots categorical variables and produces an appropriate legend: library(raster) library(rasterVis) ## Example

Create dummies from column with multiple values in pandas

对着背影说爱祢 submitted on 2019-11-27 17:31:54
I am looking for a pythonic way to handle the following problem. The pandas.get_dummies() method is great for creating dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly. Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them, so that a row could have multiple 1s. Is there a way to handle this in a pythonic way? I could only think of some
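One approach that seems to fit this case is the string accessor's own get_dummies, which splits each value on a separator before building the indicator columns. The sketch below uses the 'label' values from the question; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each value on "*" and build one indicator column per distinct token,
# so a combined value such as "A*C" sets both the A and the C column to 1.
dummies = df["label"].str.get_dummies(sep="*")
print(dummies)

# Optionally attach the indicator columns to the original frame.
result = df.join(dummies)
```

This yields the four columns A, B, C and D, and rows holding combined values carry more than one 1.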

Add extra level to factors in dataframe

允我心安 submitted on 2019-11-27 12:25:51
I have a data frame with numeric and ordered factor columns. I have a lot of NA values, so no level is assigned to them. I changed NA to "No Answer", but the levels of the factor columns don't contain that level, so here is how I started, but I don't know how to finish it in an elegant way: addNoAnswer = function(df) { factorOrNot = sapply(df, is.factor) levelsList = lapply(df[, factorOrNot], levels) levelsList = lapply(levelsList, function(x) c(x, "No Answer")) ... Is there a way to directly apply new levels to factor columns, for example, something like this: df[, factorOrNot] = lapply(df[,

Make Frequency Histogram for Factor Variables

自古美人都是妖i submitted on 2019-11-27 11:47:13
I am very new to R, so I apologize for such a basic question. I spent an hour googling this issue, but couldn't find a solution. Say I have some categorical data in my data set about common pet types. I input it as a character vector in R that contains the names of different types of animals. I created it like this: animals <- c("cat", "dog", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "bird") I turn it into a factor for use with other vectors in my data frame: animalFactor <- as.factor(animals) I now want to create a histogram that shows the frequency of each variable on the y