categorical-data | 易学教程

Scikit-learn's LabelBinarizer vs. OneHotEncoder

Question: What is the difference between the two? Both seem to create new columns, one per unique category in the feature, and then assign 0 or 1 to each data point depending on which category it belongs to.

Answer 1: A simple example that encodes an array using LabelEncoder, OneHotEncoder, and LabelBinarizer is shown below. I see that OneHotEncoder (at least in scikit-learn versions before 0.20) needs the data in integer-encoded form first in order to convert it into its respective encoding, which is not required in the case of

Is it possible to read categorical columns with pandas' read_csv?

I have tried passing the dtype parameter to read_csv as dtype={n: pandas.Categorical}, but this does not work properly (the result is an object column). The manual is unclear.

Since version 0.19.0 you can pass dtype='category' to read_csv:

    from io import StringIO
    import pandas as pd

    data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
    df = pd.read_csv(StringIO(data), dtype='category')
    print(df)
      col1 col2 col3
    0    a    b    1
    1    a    b    2
    2    c    d    3

    print(df.dtypes)
    col1    category
    col2    category
    col3    category
    dtype: object

If you want to make only specific columns categorical, pass a dictionary as dtype:

    df = pd.read_csv(StringIO(data), dtype={'col1':
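The excerpt cuts off before the dictionary form is shown. A minimal sketch of what a per-column dtype could look like, assuming only col1 should become categorical (that choice is an assumption for illustration, not taken from the answer):

```python
import io

import pandas as pd

# Same sample data as in the excerpt above.
data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# Assumption: only col1 is made categorical; the other columns keep
# whatever dtype pandas infers for them.
df = pd.read_csv(io.StringIO(data), dtype={"col1": "category"})

print(df.dtypes)
print(df["col1"].cat.categories)
```

Columns not named in the dictionary are left to normal type inference, so col3 is read as an integer column here.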

Mosaic plot with labels in each box showing a name and percentage of all observations

Question: I would like to create a mosaic plot (R package vcd, see e.g. http://cran.r-project.org/web/packages/vcd/vignettes/residual-shadings.pdf) with labels inside the plot. The labels should show either a combination of the various factors or some custom label, together with the percentage of total observations in that combination of categories (see e.g. http://i.usatoday.net/communitymanager/_photos/technology-live/2011/07/28/nielsen0728x-large.jpg, although that is not quite a mosaic plot). I suspect

Pandas DataFrame sort by categorical column but by specific class ordering

Question: I would like to select the top entries in a pandas DataFrame based on the entries of a specific column, using df_selected = df_targets.head(N). Each entry has a target value (in order of importance): Likely Supporter, GOTV, Persuasion, Persuasion+GOTV. Unfortunately, if I do df_targets = df_targets.sort("target"), the ordering will be alphabetical (GOTV, Likely Supporter, ...). I was hoping for a keyword like list_ordering, as in: my_list = ["Likely Supporter", "GOTV", "Persuasion",
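One common way to get this kind of ordering is to turn the column into an ordered categorical and sort on it. A minimal sketch, assuming a small made-up frame (the name column and its rows are hypothetical; only the four target labels come from the question). It uses sort_values, since DataFrame.sort has been removed from current pandas:

```python
import pandas as pd

# Hypothetical rows; only the target labels are taken from the question.
df_targets = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "target": ["GOTV", "Persuasion+GOTV", "Likely Supporter", "Persuasion", "GOTV"],
})

# Desired order of importance, most important first.
priority = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# An ordered categorical sorts by category position, not alphabetically.
df_targets["target"] = pd.Categorical(
    df_targets["target"], categories=priority, ordered=True
)

# mergesort is stable, so rows with equal targets keep their original order.
df_sorted = df_targets.sort_values("target", kind="mergesort")
df_selected = df_sorted.head(3)
print(df_selected)
```

Any value not listed in priority would become NaN in the categorical column, so the list needs to cover every label that appears in the data.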

How to sort pandas dataframe by custom order on string index

Question: I have the following data frame:

    import pandas as pd

    # Create DataFrame
    df = pd.DataFrame(
        {'id': [2967, 5335, 13950, 6141, 6169],
         'Player': ['Cedric Hunter', 'Maurice Baker',
                    'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
         'Year': [1991, 2004, 2001, 2009, 1997],
         'Age': [27, 25, 22, 34, 31],
         'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
         'G': [6, 7, 60, 52, 81]})
    df.set_index('Player', inplace=True)

It shows:

    Out[128]:
                     Age   G   Tm  Year     id
    Player
    Cedric Hunter     27   6  CHH  1991   2967
    Maurice Baker     25
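For a custom order on a string index, one option is to pass the desired label order to reindex. A minimal sketch using the data frame from the question; the particular order in custom_order is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [2967, 5335, 13950, 6141, 6169],
    "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda",
               "Ryan Bowen", "Adrian Caldwell"],
    "Year": [1991, 2004, 2001, 2009, 1997],
    "Age": [27, 25, 22, 34, 31],
    "Tm": ["CHH", "VAN", "TOT", "OKC", "DAL"],
    "G": [6, 7, 60, 52, 81],
}).set_index("Player")

# Hypothetical custom order; every index label appears exactly once.
custom_order = ["Ratko Varda", "Adrian Caldwell", "Cedric Hunter",
                "Ryan Bowen", "Maurice Baker"]

# reindex returns the rows in exactly the order of the given labels.
df_sorted = df.reindex(custom_order)
print(df_sorted)
```

Labels in custom_order that are missing from the index would produce rows of NaN, so this approach fits best when the full set of labels is known in advance.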

Can sklearn DecisionTreeClassifier truly work with categorical data?

Question: While working with the DecisionTreeClassifier I visualized it using graphviz, and to my astonishment it seems to take categorical data and treat it as continuous data. All my features are categorical, and for example you can see the following tree (please note that the first feature, X[0], has 6 possible values: 0, 1, 2, 3, 4, 5). From what I found here, the class uses a tree class which is a binary tree, so it is a limitation in sklearn. Does anyone know a way that I am missing to
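The behaviour described in the question can be reproduced on a toy example. A minimal sketch, assuming a single feature with integer codes 0 to 5 and made-up labels, that inspects the split chosen at the root of the fitted tree:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature holding category codes 0..5.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The fitted tree stores, per node, the feature index and a numeric threshold.
root_feature = clf.tree_.feature[0]
root_threshold = clf.tree_.threshold[0]
print(f"root split: X[{root_feature}] <= {root_threshold}")
```

The root split is a numeric threshold that falls between two codes, which means the codes are treated as ordered numbers rather than as unordered categories. One-hot encoding the feature before fitting is a commonly used workaround when the codes carry no order.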

How to generate pandas DataFrame column of Categorical from string column?

Question: I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column, it seems to get converted right back to a Series of str:

    train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])

    >>> type(pd.Categorical.from_array(train['LocationNormalized']))
    <class 'pandas.core.categorical.Categorical'>
    # however it got converted back to...
    >>> type(train['LocationNFactor'][2])
    <type 'str'>
    >>> train['LocationNFactor'][2]
    'Hampshire'

Guessing
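pd.Categorical.from_array has since been removed from pandas; in current releases the usual route is astype('category'). A minimal sketch, assuming a small made-up train frame (the location values other than 'Hampshire' are hypothetical), that also shows why the check in the question is misleading: individual elements are still str, while the column dtype is category.

```python
import pandas as pd

# Hypothetical stand-in for the question's train DataFrame.
train = pd.DataFrame(
    {"LocationNormalized": ["Hampshire", "London", "Hampshire", "Surrey"]}
)

# Convert the string column to a categorical column.
train["LocationNFactor"] = train["LocationNormalized"].astype("category")

# The column dtype is categorical...
print(train["LocationNFactor"].dtype)
# ...but each element is still returned as a plain Python str.
print(type(train["LocationNFactor"][2]))
# The integer codes behind the categories are available through .cat.codes.
print(train["LocationNFactor"].cat.codes.tolist())
```

Checking the dtype of the column, rather than the type of a single element, is the more reliable test for whether the conversion worked.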

Scikit-learn's LabelBinarizer vs. OneHotEncoder

What is the difference between the two? Both seem to create new columns, one per unique category in the feature, and then assign 0 or 1 to each data point depending on which category it belongs to.

A simple example that encodes an array using LabelEncoder, OneHotEncoder, and LabelBinarizer is shown below. I see that OneHotEncoder (at least in scikit-learn versions before 0.20) needs the data in integer-encoded form first in order to convert it into its respective encoding, which is not required in the case of LabelBinarizer.

    from numpy import array
    from numpy import argmax
    from sklearn.preprocessing import LabelEncoder
    from
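The answer's own example is cut off in this excerpt. As a separate, minimal sketch (not the original answer's code), the two encoders can be compared on a small made-up array, assuming scikit-learn 0.20 or later so that OneHotEncoder accepts strings directly:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Hypothetical sample values, chosen only for illustration.
values = np.array(["cold", "warm", "hot", "cold"])

# LabelBinarizer is aimed at target labels and takes 1-D input.
lb = LabelBinarizer()
lb_out = lb.fit_transform(values)

# OneHotEncoder is aimed at feature matrices and takes 2-D input
# (n_samples, n_features); by default it returns a sparse matrix.
ohe = OneHotEncoder()
ohe_out = ohe.fit_transform(values.reshape(-1, 1)).toarray()

print(lb.classes_)
print(lb_out)
print(ohe_out)
```

For a single column with three or more categories, both produce the same indicator matrix. The practical differences are in the intended use: LabelBinarizer works on one 1-D label array (and yields a single column when there are only two classes), while OneHotEncoder handles several feature columns at once.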

How to manually set colours to a categorical variables using ggplot()? [duplicate]

This question already has an answer here: Manually setting group colors for ggplot2 (1 answer)

This is my sample data:

    table1
      xaxis  yaxis                  ae    work
    1     5  35736 Attending_Education Working
    2     6  72286 Attending_Education Working
    3     7 133316 Attending_Education Working
    4     8 252520 Attending_Education Working
    5     9 228964 Attending_Education Working
    6    10 504676 Attending_Education Working

This is the code I used:

    p<-ggplot(table1,aes(x=table1$xaxis,y=table1$yaxis))
    Economic_Activity<-factor(table1$work)
    Education_Status<-factor(table1$ae)
    p<-p+geom_point(aes(colour=Education_Status,shape=Economic_Activity

Rename the less frequent categories by “OTHER” python

In my dataframe I have some categorical columns with over 100 different categories. I want to rank the categories by frequency, keep the 9 most frequent ones, and automatically rename all less frequent categories to OTHER.

Example: here is my df:

    print(df)
        Employee_number                 Jobrol
    0                 1        Sales Executive
    1                 2     Research Scientist
    2                 3  Laboratory Technician
    3                 4        Sales Executive
    4                 5     Research Scientist
    5                 6  Laboratory Technician
    6                 7        Sales Executive
    7                 8     Research Scientist
    8                 9  Laboratory Technician
    9                10        Sales Executive
    10               11     Research Scientist
    11               12  Laboratory Technician
    12               13  Sales
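One possible approach is to take the most frequent values from value_counts and replace everything else with where. A minimal sketch, assuming a small made-up frame (the "Manager" row and the helper name lump_rare are hypothetical) and keeping the top 2 instead of the top 9 so the effect is visible on a few rows:

```python
import pandas as pd

# Hypothetical sample in the spirit of the question's df.
df = pd.DataFrame({
    "Employee_number": range(1, 11),
    "Jobrol": [
        "Sales Executive", "Research Scientist", "Laboratory Technician",
        "Sales Executive", "Research Scientist", "Laboratory Technician",
        "Sales Executive", "Research Scientist", "Sales Executive",
        "Manager",
    ],
})


def lump_rare(series, n_keep, other="OTHER"):
    """Keep the n_keep most frequent values and replace the rest with other."""
    top = series.value_counts().nlargest(n_keep).index
    # where() keeps values that satisfy the condition and replaces the others.
    return series.where(series.isin(top), other)


# Use n_keep=9 for the case described in the question.
df["Jobrol_lumped"] = lump_rare(df["Jobrol"], n_keep=2)
print(df["Jobrol_lumped"].value_counts())
```

This sketch assumes the column holds plain strings. For a column that already has the category dtype, "OTHER" would first need to be added as a category before the replacement.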