categorical-data

Is there an advantage to ordering a categorical variable?

Posted 2019-12-02 01:20:29
I have been advised that it is best to order categorical variables where appropriate (e.g. short < medium < long). I am wondering: what is the specific advantage of treating a categorical variable as ordered, as opposed to simply categorical, in the context of modelling it as an explanatory variable? What does it mean mathematically (in lay terms, preferably)? Many thanks!

Among other things, it allows you to compare values from those factors:

> ord.fac <- ordered(c("small", "medium", "large"), levels=c("small", "medium", "large"))
> fac <- factor(c("small", "medium", "large"), levels=c("small", "medium", "large"))
…
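The same advantage exists in pandas: an ordered `Categorical` supports comparisons and order-aware reductions that a plain categorical does not. A minimal sketch with made-up size data:

```python
import pandas as pd

# An ordered categorical knows that small < medium < large
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)
s = pd.Series(sizes)

# Comparisons and min/max respect the declared order, not alphabetical order
print((s < "large").tolist())  # [True, False, True, True]
print(s.min(), s.max())        # small large
```

With `ordered=False`, the same comparisons raise a `TypeError`, which is the practical difference the question is asking about.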

A Faster Way of Removing Unused Categories in Pandas?

Posted 2019-12-01 22:41:20
I'm running some models in Python, with data subset on categories. For memory usage and preprocessing, all the categorical variables are stored as the category data type. For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset. I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column; the others are not taking as much time (I guess because there are not as many levels) …
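To make the behaviour concrete, here is a small sketch (toy data, not the asker's) of what remove_unused_categories() does on a subset, plus one alternative worth benchmarking: rebuilding the categorical from the values actually present.

```python
import pandas as pd

df = pd.DataFrame({
    "group": pd.Categorical(["a", "a", "b", "b", "c"]),
    "x": [1, 2, 3, 4, 5],
})

sub = df[df["group"] != "c"].copy()
# The subset still carries every original category:
print(sub["group"].cat.categories.tolist())   # ['a', 'b', 'c']

# remove_unused_categories() drops the levels no longer present
trimmed = sub["group"].cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())        # ['a', 'b']

# An alternative (assumption: faster in some cases, measure on your own
# data): rebuild the categorical from the observed values directly
rebuilt = pd.Categorical(sub["group"].astype(object))
print(sorted(rebuilt.categories.tolist()))    # ['a', 'b']
```

Which variant wins depends on pandas version and the number of levels, so profiling on the real 'group by' column is the only reliable test.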

Can sklearn DecisionTreeClassifier truly work with categorical data?

Posted 2019-12-01 17:41:47
While working with the DecisionTreeClassifier I visualized it using graphviz, and I have to say, to my astonishment, it seems to take categorical data and use it as continuous data. All my features are categorical, and for example you can see the following tree (please note that the first feature, X[0], has 6 possible values: 0, 1, 2, 3, 4, 5). From what I found here, the class uses a tree class which is a binary tree, so it is a limitation of sklearn. Does anyone know a way, that I am missing, to use the tree categorically? (I know it is not better for the task, but as I need categories, currently I …
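A common workaround (a sketch, not the only option) is to one-hot encode the categorical features first, so each binary split the tree makes becomes an "is category k or not" test rather than a numeric threshold over the category codes:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Toy data: one categorical feature with 6 integer-coded levels
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 0, 1])

# One-hot encode, then fit the tree on indicator columns
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
clf.fit(X, y)
print(clf.score(X, y))  # 1.0 on this tiny training set
```

This avoids splits like "X[0] <= 2.5" that impose an artificial ordering on the category codes, at the cost of a wider feature matrix.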

Matplotlib: how to plot categorical data on the y-axis?

Posted 2019-12-01 14:38:15
Let's say that I have the following code, which comes from here:

gender = ['male','male','female','male','female']

import matplotlib.pyplot as plt
from collections import Counter

c = Counter(gender)
men = c['male']
women = c['female']
bar_heights = (men, women)
x = (1, 2)

fig, ax = plt.subplots()
width = 0.4
ax.bar(x, bar_heights, width)
ax.set_xlim((0, 3))
ax.set_ylim((0, max(men, women)*1.1))
ax.set_xticks([i+width/2 for i in x])
ax.set_xticklabels(['male', 'female'])
plt.show()

How could the categories male and female be plotted on the y-axis, as opposed to the x-axis? Perhaps you're …
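One straightforward answer (a sketch using the same toy data) is to switch from `ax.bar` to `ax.barh`, which draws horizontal bars and therefore puts the categories on the y-axis:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs in a script
import matplotlib.pyplot as plt
from collections import Counter

gender = ['male', 'male', 'female', 'male', 'female']
c = Counter(gender)
labels = ['male', 'female']
counts = [c[label] for label in labels]

fig, ax = plt.subplots()
y = range(len(labels))
ax.barh(y, counts, height=0.4)  # horizontal bars: categories on the y-axis
ax.set_yticks(list(y))
ax.set_yticklabels(labels)
ax.set_xlabel('count')
fig.savefig('gender_counts.png')
```

The `width` argument of `bar` becomes `height` in `barh`, and the tick setup moves from `set_xticks`/`set_xticklabels` to their y-axis counterparts.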

predict.glm() with three new categories in the test data (r)(error)

Posted 2019-12-01 14:13:34
I have a data set called data which has 481,092 rows. I split data into two equal halves: the first half (rows 1–240,546) is called train and was used for the glm(); the second half (rows 240,547–481,092) is called test and should be used to validate the model. Then I started the regression:

testreg <- glm(train$returnShipment ~ train$size + train$color + train$price + train$manufacturerID + train$salutation + train$state + train$age + train$deliverytime, family=binomial(link="logit"), data=train)

Now the prediction:

prediction <- predict.glm(testreg, newdata=test, type="response")

gives …
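The underlying problem is general, not R-specific: a model fit on the training levels cannot score rows whose category value never appeared in training. A sketch of one workaround in pandas (hypothetical data), filtering the test set down to the levels the model has seen:

```python
import pandas as pd

train = pd.DataFrame({"state": ["CA", "NY", "CA"], "y": [0, 1, 0]})
test = pd.DataFrame({"state": ["NY", "TX", "CA"], "y": [1, 0, 1]})

# Keep only test rows whose category level occurred in training;
# "TX" was never seen, so its row is dropped before scoring
seen = set(train["state"])
scorable = test[test["state"].isin(seen)]
print(scorable["state"].tolist())  # ['NY', 'CA']
```

Whether dropping, pooling into an "other" level, or refitting with the combined levels is appropriate depends on how common the unseen categories are.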

How can I one hot encode multiple variables with big data in R?

Posted 2019-12-01 06:09:32
Question: I currently have a dataframe with 260,000 rows and 50 columns, where 3 columns are numeric and the rest are categorical. I want to one-hot encode the categorical columns in order to perform PCA and use regression to predict the class. How can I accomplish the example below in R?

Example: V1 V2 V3 V4 V5 .... VN-1 VN to V1_a V1_b V2_a V2_b V2_c V3_a V3_b and so on

Answer 1: You can use model.matrix or sparse.model.matrix. Something like this:

sparse.model.matrix(~ . - 1, data = your_data)
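For comparison, the equivalent expansion in pandas (a sketch with a tiny made-up frame) is `get_dummies`, which also supports a sparse output for wide results:

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 2.0, 3.0],   # numeric column, left untouched
    "V1": ["a", "b", "a"],    # categorical columns to expand
    "V2": ["x", "x", "y"],
})

# sparse=True stores the indicator columns as sparse arrays,
# which matters when many levels produce a wide, mostly-zero matrix
encoded = pd.get_dummies(df, columns=["V1", "V2"], sparse=True)
print(list(encoded.columns))
# ['num', 'V1_a', 'V1_b', 'V2_x', 'V2_y']
```

The `V1_a`, `V1_b`, … naming mirrors the example layout in the question.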

pd.get_dummies() slow on large levels

Posted 2019-12-01 05:05:26
Question: I'm unsure if this is already the fastest possible method, or if I'm doing this inefficiently. I want to one-hot encode a particular categorical column which has 27k+ possible levels. The column has different values in 2 different datasets, so I combined the levels first before using get_dummies():

def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column …
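The snippet above is cut off by the scrape, so here is a sketch of the same combine-then-encode idea under an assumed helper name: give both frames the same category set, then one-hot encode each, so the resulting dummy columns line up across datasets.

```python
import pandas as pd

def hot_encode_shared(column_name, df, df2, sparse=True):
    # Union of levels from both frames, in a deterministic order
    combined = sorted(set(df[column_name]) | set(df2[column_name]))
    for frame in (df, df2):
        frame[column_name] = pd.Categorical(frame[column_name],
                                            categories=combined)
    # get_dummies emits one column per *category*, used or not,
    # so both frames end up with identical columns
    return (pd.get_dummies(df, columns=[column_name], sparse=sparse),
            pd.get_dummies(df2, columns=[column_name], sparse=sparse))

a = pd.DataFrame({"city": ["rome", "oslo"]})
b = pd.DataFrame({"city": ["oslo", "lima"]})
ea, eb = hot_encode_shared("city", a, b)
print(list(ea.columns))                      # ['city_lima', 'city_oslo', 'city_rome']
print(list(eb.columns) == list(ea.columns))  # True
```

With 27k+ levels, `sparse=True` keeps the memory cost proportional to the number of nonzero entries rather than rows × levels.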