categorical-data

How do I run the Spark decision tree with a categorical feature set using Scala?

喜夏-厌秋 submitted on 2019-12-04 02:57:19
I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data, and LabeledPoint requires (Double, Vector), where the Vector itself requires Doubles.

    val LP = featureSet.map(x => LabeledPoint(classMap(x(0)), Vectors.dense(x.tail)))
    // Run training algorithm to build the model
    val maxDepth: Int = 3
    val isMulticlassWithCategoricalFeatures: Boolean = true
    val numClassesForClassification: Int = countPossibilities(labelCol)
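One way to read the LabeledPoint requirement: each categorical value is replaced by a category index stored as a double, and categoricalFeaturesInfo records, per feature position, how many categories that feature has. The sketch below shows only that encoding step, in plain Python rather than Scala or Spark. The rows, build_index, and every variable name are hypothetical; it illustrates the data shape, not the DecisionTree call itself.

```python
# Hypothetical raw rows: the label comes first, then two categorical features.
rows = [
    ["yes", "red", "small"],
    ["no", "blue", "large"],
    ["yes", "green", "small"],
    ["no", "red", "medium"],
]


def build_index(values):
    """Map each distinct value to 0.0, 1.0, ... in sorted order."""
    return {v: float(i) for i, v in enumerate(sorted(set(values)))}


label_index = build_index(r[0] for r in rows)
n_features = len(rows[0]) - 1
feature_indexes = [build_index(r[j + 1] for r in rows) for j in range(n_features)]

# (label, features) pairs made only of doubles, the shape LabeledPoint expects.
encoded = [
    (label_index[r[0]], [feature_indexes[j][r[j + 1]] for j in range(n_features)])
    for r in rows
]

# Feature position -> number of categories, mirroring categoricalFeaturesInfo.
categorical_features_info = {j: len(ix) for j, ix in enumerate(feature_indexes)}
```

In Scala the same idea would feed the encoded doubles into Vectors.dense and pass the position-to-arity map as categoricalFeaturesInfo.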

how to check for correlation among continuous and categorical variables in python?

感情迁移 submitted on 2019-12-03 14:47:20
I have a dataset that includes binary categorical variables and continuous variables. I'm trying to fit a linear regression model to predict a continuous variable. How can I check for correlation between the categorical variables and the continuous target variable? Current code:

    import pandas as pd
    # Raw string so the backslashes in the Windows path are not read as escapes
    df_hosp = pd.read_csv(r'C:\Users\LAPPY-2\Desktop\LengthOfStay.csv')
    data = df_hosp[['lengthofstay', 'male', 'female', 'dialysisrenalendstage', 'asthma', \
                    'irondef', 'pneum', 'substancedependence', \
                    'psychologicaldisordermajor', 'depress', 'psychother', \
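For binary (0/1) columns, the ordinary Pearson correlation with a continuous column is the point-biserial correlation, so DataFrame.corr can be used directly. A minimal sketch, assuming the binary columns are already coded 0/1; the numbers below are made up and only the column names come from the question.

```python
import pandas as pd

# Hypothetical values; only the column names are taken from the question.
df = pd.DataFrame({
    "lengthofstay": [3, 7, 2, 8, 4, 9],
    "male": [0, 1, 0, 1, 0, 1],
    "asthma": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of each 0/1 column with the continuous target.
# For a binary column this equals the point-biserial correlation.
corr_with_target = df.corr()["lengthofstay"].drop("lengthofstay")
print(corr_with_target)
```

Columns that are exact complements of each other, such as male and female, carry the same information with opposite sign, so only one of them is needed.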

Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn

时光怂恿深爱的人放手 submitted on 2019-12-03 12:32:23
Question: The main goals are as follows: 1) Apply StandardScaler to continuous variables 2) Apply LabelEncoder and OneHotEncoder to categorical variables. The continuous variables need to be scaled, but at the same time, a couple of categorical variables are also of integer type. Applying StandardScaler to everything would therefore have undesired effects: it would also scale the integer-based categorical variables, which is not what we want. Since continuous variables and categorical ones are

R coxph() warning: Loglik converged before variable

ぐ巨炮叔叔 submitted on 2019-12-03 12:20:08
I'm having some trouble using coxph(). I have two categorical variables, Sex and Probable Cause, that I want to use as predictor variables. Sex is just the typical male/female, but Probable Cause has 5 options. I don't understand the warning message. Why do the confidence intervals run from 0 to Inf, and why are the p-values so high? Here's the code and the output:

    > my_coxph <- coxph(Surv(tempo,status) ~ factor(Sexo)+ factor(Causa.provavel) , data=ceabn)
    Warning message:
    In fitter(X, Y, strats, offset, init, control, weights = weights, :
      Loglik converged before variable 2,3,5,6 ; beta
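A possible cause worth ruling out (an assumption, since the excerpt does not confirm it) is a factor level in which no events occur, which can push the corresponding coefficient toward infinity. The sketch below counts events per level with a cross-tabulation. It is written in Python with pandas rather than R, and the data values are invented; only the column names follow the question.

```python
import pandas as pd

# Hypothetical data; column names follow the question, values are made up.
ceabn = pd.DataFrame({
    "Sexo": ["M", "F", "M", "F", "M", "F"],
    "Causa.provavel": ["A", "A", "B", "B", "C", "C"],
    "status": [1, 0, 1, 1, 0, 0],
})

# Rows: levels of the predictor; columns: status (0 = censored, 1 = event).
events_by_level = pd.crosstab(ceabn["Causa.provavel"], ceabn["status"])

# Levels with zero observed events are candidates for unstable coefficients.
no_event_levels = events_by_level.index[events_by_level[1] == 0].tolist()
print(events_by_level)
print(no_event_levels)
```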

How to correlate an Ordinal Categorical column in pandas?

时间秒杀一切 submitted on 2019-12-03 11:31:18
Question: I have a DataFrame df with a non-numerical column CatColumn.

              A         B    CatColumn
    0  381.1396  7.343921       Medium
    1  481.3268  6.786945       Medium
    2  263.3766  7.628746         High
    3  177.2400  5.225647  Medium-High

I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis.

Answer 1: I am going to strongly disagree with the other comments. They miss the main point of correlation: How much

pandas dataframe convert column type to string or categorical

*爱你&永不变心* submitted on 2019-12-03 06:38:19
Question: How do I convert a single column of a pandas DataFrame to type string? In the df of housing data below, I need to convert zipcode to string so that when I run linear regression, zipcode is treated as categorical and not numeric. Thanks!

    df = pd.DataFrame({'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
                       'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
                       'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
                       'bedrooms':
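A minimal sketch of the conversion, using only the two complete columns from the question's data. Series.astype(str) gives strings and Series.astype("category") gives a categorical dtype; the new column names zipcode_str and zipcode_cat are hypothetical and exist only to show both results side by side.

```python
import pandas as pd

# The two columns that appear in full in the question.
df = pd.DataFrame({
    "zipcode": {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    "bathrooms": {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
})

# Convert one column to strings, leaving the rest of the frame untouched.
df["zipcode_str"] = df["zipcode"].astype(str)

# Alternatively, mark the column as categorical directly.
df["zipcode_cat"] = df["zipcode"].astype("category")

print(df.dtypes)
```

Whether a given regression routine treats a string or category column as categorical depends on the library, so it is worth checking how the model encodes it.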

Any way to get mappings of a label encoder in Python pandas?

痞子三分冷 submitted on 2019-12-03 04:50:43
I am converting strings to categorical values in my dataset using the following piece of code.

    data['weekday'] = pd.Categorical.from_array(data.weekday).labels

For example:

    index  weekday
    0      Sunday
    1      Sunday
    2      Wednesday
    3      Monday
    4      Monday
    5      Thursday
    6      Tuesday

After encoding the weekday, my dataset appears like this:

    index  weekday
    0      3
    1      3
    2      6
    3      1
    4      1
    5      4
    6      5

Is there any way I can know that Sunday has been mapped to 3, Wednesday to 6 and so on?

The best way of doing this is to use the label encoder from the sklearn library. Something like this:

    from sklearn import preprocessing
    le = preprocessing.LabelEncoder
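The mapping can also be read straight off the pandas Categorical, since its codes are positions in its categories. This is a sketch under two assumptions: it uses pd.Categorical(...) and .codes in place of the from_array(...).labels call in the question, and it passes all seven day names as categories (the all_days list is hypothetical) so that the codes match the ones shown above.

```python
import pandas as pd

data = pd.DataFrame({
    "weekday": ["Sunday", "Sunday", "Wednesday", "Monday",
                "Monday", "Thursday", "Tuesday"]
})

# Assumption: the full dataset contains all seven days, sorted alphabetically.
all_days = sorted(["Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday"])
cat = pd.Categorical(data["weekday"], categories=all_days)

# Each code is the position of the label in cat.categories.
code_to_label = dict(enumerate(cat.categories))
label_to_code = {label: code for code, label in code_to_label.items()}

data["weekday_code"] = cat.codes
print(label_to_code)
```

With this setup label_to_code["Sunday"] is 3 and label_to_code["Wednesday"] is 6, matching the encoded column in the question.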

Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn

纵饮孤独 submitted on 2019-12-03 03:01:39
The main goals are as follows: 1) Apply StandardScaler to continuous variables 2) Apply LabelEncoder and OneHotEncoder to categorical variables. The continuous variables need to be scaled, but at the same time, a couple of categorical variables are also of integer type. Applying StandardScaler to everything would therefore have undesired effects: it would also scale the integer-based categorical variables, which is not what we want. Since continuous variables and categorical ones are mixed in a single pandas DataFrame, what's the recommended workflow to approach this kind of problem?
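One workflow that fits this description is scikit-learn's ColumnTransformer, which applies a different transformer to each named group of columns. The sketch below is an illustration, not the answer from the original thread; the DataFrame and the column names age, income, and region_code are hypothetical. OneHotEncoder is applied directly to the integer-coded column, so no separate LabelEncoder step is used here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame: two continuous columns and one integer-coded category.
df = pd.DataFrame({
    "age": [25.0, 35.0, 45.0, 55.0],
    "income": [30.0, 50.0, 70.0, 90.0],
    "region_code": [1, 2, 3, 1],
})
continuous_cols = ["age", "income"]
categorical_cols = ["region_code"]

# Scale only the continuous columns; one-hot encode only the categorical ones.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), continuous_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

features = preprocess.fit_transform(df)
# The result may be sparse depending on density; densify for inspection.
features = features.toarray() if hasattr(features, "toarray") else np.asarray(features)
print(features.shape)
```

The explicit column lists are what keep the integer-typed categorical column away from the scaler, regardless of its dtype.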

How to correlate an Ordinal Categorical column in pandas?

独自空忆成欢 submitted on 2019-12-03 02:07:33
I have a DataFrame df with a non-numerical column CatColumn.

              A         B    CatColumn
    0  381.1396  7.343921       Medium
    1  481.3268  6.786945       Medium
    2  263.3766  7.628746         High
    3  177.2400  5.225647  Medium-High

I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis.

I am going to strongly disagree with the other comments. They miss the main point of correlation: How much does variable 1 increase or decrease as variable 2 increases or decreases? So in the very first place, order
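Since the answer's argument turns on order, one way to act on it is to give the categories an explicit order, turn them into integer codes, and use a rank correlation. This is a sketch, not the answer's own code: the level order Medium < Medium-High < High is an assumption (the excerpt shows only these three values), and CatCode is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Assumed ordering of the levels, lowest to highest.
levels = ["Medium", "Medium-High", "High"]
df["CatCode"] = pd.Categorical(
    df["CatColumn"], categories=levels, ordered=True
).codes

# Spearman works on ranks, so only the order of the codes matters.
rank_corr = df[["A", "B", "CatCode"]].corr(method="spearman")
print(rank_corr)
```

Spearman is chosen here because the spacing between ordinal levels is arbitrary; Pearson on the same codes would treat the gaps as equal.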

R - convert from categorical to numeric for KNN

孤人 submitted on 2019-12-02 21:58:00
Question: I'm trying to use the caret package in R to apply KNN to the "abalone" dataset from the UCI Machine Learning Repository (link to the data). However, it does not allow KNN when there are categorical values. How do I convert the categorical values (in this dataset: "M", "F", "I") to numeric values, such as 1, 2, 3, respectively?

Answer 1: When data are read in via read.table, the data in the first column are factors. Then data$iGender = as.integer(data$Gender) would work. If they are character, a detour
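The same conversion can be sketched in Python with pandas for comparison; this is an illustration, not part of the R answer, and the sample values are invented. Listing the levels explicitly fixes which value gets which number. Note that in R, as.integer on a factor with default levels numbers them alphabetically (F = 1, I = 2, M = 3), so the levels would need to be set in the desired order to get M = 1, F = 2, I = 3.

```python
import pandas as pd

# Hypothetical sample of the Gender column.
data = pd.DataFrame({"Gender": ["M", "F", "I", "M", "I"]})

# Explicit level order, so that M -> 1, F -> 2, I -> 3 as in the question.
levels = ["M", "F", "I"]
data["iGender"] = pd.Categorical(data["Gender"], categories=levels).codes + 1

print(data)
```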