Question

I'm trying to solve the Titanic survival problem from Kaggle. It's my first step in actually learning machine learning. I have a problem where the gender column causes an error. The stack trace says: could not convert string to float: 'female'. How have you dealt with this issue? I don't want solutions. I just want a practical approach to this problem, because I do need the gender column to build my model.

This is my code:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)

x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)

val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

Answer 1:

There are a couple of ways to deal with this, and which one fits depends on what you're looking for:

You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number, or

dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.

In many machine learning applications, factors are better handled as dummy codes, since plain integer codes can suggest an ordering between levels that doesn't exist.

Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).

Encoding Categories to numeric:

Method 1: pd.factorize

pd.factorize is a simple, fast way of encoding to numeric:

For example, if your column gender looks like this:

>>> df
   gender
0  Female
1    Male
2    Male
3    Male
4  Female
5  Female
6    Male
7  Female
8  Female
9  Female

df['gender_factor'] = pd.factorize(df.gender)[0]

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0
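pd.factorize actually returns two things: the integer codes and the unique values those codes refer to. Here is a minimal sketch of keeping both so the codes can be mapped back to labels later. The small DataFrame is made-up example data, and the names codes, uniques and recovered are just illustrative:

```python
import pandas as pd

# Made-up example data, shaped like the gender column above
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

# factorize returns (codes, uniques); codes are numbered in order of first appearance
codes, uniques = pd.factorize(df["gender"])
df["gender_factor"] = codes

# Indexing the uniques with the codes recovers the original labels
recovered = uniques[codes]

print(codes.tolist())     # [0, 1, 1, 0]
print(list(uniques))      # ['Female', 'Male']
print(list(recovered))    # ['Female', 'Male', 'Male', 'Female']
```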

Method 2: categorical dtype

Another way would be to use category dtype:

df['gender_factor'] = df['gender'].astype('category').cat.codes

For this example, this gives the same output as pd.factorize.
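One thing to keep in mind: the two methods only agree here because Female happens to come first both in the data and alphabetically. pd.factorize numbers levels in order of first appearance, while category codes follow the sorted order of the categories. A small sketch with a made-up Series that starts with Male shows the difference:

```python
import pandas as pd

# Made-up example where the first value is not the alphabetically first level
s = pd.Series(["Male", "Female", "Male"])

# Order of first appearance: Male -> 0, Female -> 1
factorize_codes = pd.factorize(s)[0]

# Sorted categories: Female -> 0, Male -> 1
category_codes = s.astype("category").cat.codes

print(factorize_codes.tolist())   # [0, 1, 0]
print(category_codes.tolist())    # [1, 0, 1]
```

So if the exact number assigned to each level matters to you, check the mapping rather than assuming it.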

Method 3: sklearn.preprocessing.LabelEncoder

This method comes with some bonuses, such as easy back transforming:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0

# Easy to back transform:

df['gender_factor'] = le.inverse_transform(df.gender_factor)

>>> df
   gender gender_factor
0  Female        Female
1    Male          Male
2    Male          Male
3    Male          Male
4  Female        Female
5  Female        Female
6    Male          Male
7  Female        Female
8  Female        Female
9  Female        Female
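Another bonus is that a fitted encoder remembers its mapping, so you can apply the same codes to other data (for example a validation set). A rough sketch, using made-up labels; the variable names new_codes and unseen_error are only for illustration:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit on made-up training labels; classes_ holds the learned levels in sorted order
le.fit(["Female", "Male", "Male", "Female"])
print(list(le.classes_))          # ['Female', 'Male']

# Reuse the same mapping on new data
new_codes = le.transform(["Male", "Female"])
print(new_codes.tolist())         # [1, 0]

# A label that was not seen during fit raises a ValueError
unseen_error = None
try:
    le.transform(["Unknown"])
except ValueError as err:
    unseen_error = err
print(type(unseen_error).__name__)
```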

Dummy Coding:

Method 1: pd.get_dummies

# dtype=int gives 0/1 columns (pandas 2.0+ returns True/False by default)
df.join(pd.get_dummies(df.gender, dtype=int))

   gender  Female  Male
0  Female       1     0
1    Male       0     1
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8  Female       1     0
9  Female       1     0

Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:

# dtype=int gives 0/1 columns (pandas 2.0+ returns True/False by default)
df.join(pd.get_dummies(df.gender, drop_first=True, dtype=int))

   gender  Male
0  Female     0
1    Male     1
2    Male     1
3    Male     1
4  Female     0
5  Female     0
6    Male     1
7  Female     0
8  Female     0
9  Female     0
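To tie this back to the question, here is a rough sketch of how the dummy-coded column could feed the decision tree. It is not a full solution: the rows are invented (not real Titanic records), the names features and target are my own, and I've left Survived out of the feature columns since it is what the model is predicting.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Invented rows shaped like the Titanic training data (not real records)
data = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Dummy code only the Sex column; drop_first leaves a single Sex_male column
features = pd.get_dummies(
    data[["Pclass", "Sex", "Age"]],
    columns=["Sex"],
    drop_first=True,
    dtype=int,
)
target = data["Survived"]

# All feature columns are numeric now, so fit no longer hits the string-to-float error
model = DecisionTreeRegressor(random_state=0)
model.fit(features, target)

print(features.columns.tolist())   # ['Pclass', 'Age', 'Sex_male']
predictions = model.predict(features)
```

The same idea works with any of the numeric encodings above; the only requirement is that no string column reaches fit.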

Source: https://stackoverflow.com/questions/50996623/could-not-convert-string-to-float-error-from-the-titanic-competition

Tags

python