Question
I am having a problem with this NN regression model in Keras. I am working on a cars dataset to predict the price based on 13 dimensions. In short, I read it as a pandas DataFrame, converted the numeric values to float, scaled them, and then one-hot encoded the categorical values, which created a lot of new columns, but that does not concern me much at this point. What concerns me is that the accuracy is practically 0%, and I cannot figure out why. The dataset can be found here: https://www.kaggle.com/CooperUnion/cardataset/data. Below is the code:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import to_categorical
# load dataset
# Columns : Make, Model, Year, Engine Fuel Type, Engine HP, Engine Cylinders, Transmission Type, Driven_Wheels, Number of Doors, Vehicle Size, Vehicle Style, highway MPG, city mpg, Popularity, MSRP
import pandas as pd
dataframe = pd.read_csv("cars.csv", header = 'infer', names=['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP', 'Engine Cylinders', 'Transmission Type', 'Driven_Wheels', 'Number of Doors', 'Vehicle Size', 'Vehicle Style', 'highway MPG', 'city mpg', 'Popularity', 'MSRP'])
#convert data columns to float
dataframe[['Engine HP', 'highway MPG', 'city mpg', 'Popularity', 'MSRP']] = dataframe[['Engine HP', 'highway MPG', 'city mpg', 'Popularity', 'MSRP']].apply(pd.to_numeric)
#normalize the values - divide by their max value
dataframe["Engine HP"] = dataframe["Engine HP"] / dataframe["Engine HP"].max()
dataframe["highway MPG"] = dataframe["highway MPG"] / dataframe["highway MPG"].max()
dataframe["city mpg"] = dataframe["city mpg"] / dataframe["city mpg"].max()
dataframe["Popularity"] = dataframe["Popularity"] / dataframe["Popularity"].max()
dataframe["MSRP"] = dataframe["MSRP"] / dataframe["MSRP"].max()
#split input and label
x = dataframe.iloc[:,0:14]
y = dataframe.iloc[:,14]
#one-hot encoding for categorical values
def one_hot(df, cols):
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
#columns to transform
cols_to_tran = ['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine Cylinders', 'Transmission Type', 'Driven_Wheels', 'Number of Doors', 'Vehicle Size', 'Vehicle Style']
d = one_hot(x, cols_to_tran)
list(d.columns.values)
#drop first original 11 columns
e = d.drop(d.columns[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], axis=1)
list(e.columns.values)
#create train and test datasets - 80% for train and 20% for validation
t = len(e)*0.8
t = int(t)
train_data = e[0:t]
train_targets = y[0:t]
test_data = e[t:]
test_targets = y[t:]
#convert to numpy array
train_data = train_data.values
train_targets = train_targets.values
test_data = test_data.values
test_targets = test_targets.values
# Sample Multilayer Perceptron Neural Network in Keras
from keras.models import Sequential
from keras.layers import Dense
import numpy
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
model.add(Dense(32, activation='relu'))
#model.add(Dense(1, activation='sigmoid'))
model.add(Dense(1))
# 2. compile the network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# 3. fit the network
history = model.fit(train_data, train_targets, epochs=100, batch_size=50)
# 4. evaluate the network
loss, accuracy = model.evaluate(test_data, test_targets)
print("\nLoss: %.2f, Accuracy: %.2f%%" % (loss, accuracy*100))
# 5. make predictions
probabilities = model.predict(test_data)
predictions = [float(x) for x in probabilities]
accuracy = numpy.mean(predictions == test_targets)
print("Prediction Accuracy: %.2f%%" % (accuracy*100))
And the result is as per below:
Any help would be appreciated.
Answer 1:
Accuracy is a classification metric; it makes no sense to use it for regression. There is no actual problem.
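To illustrate the point, here is a minimal NumPy sketch with made-up numbers (both arrays are hypothetical, not taken from the dataset). Exact-match accuracy on continuous values comes out as zero even when every prediction is close, while a regression metric such as mean absolute error reflects how close the predictions are.

```python
import numpy as np

# Hypothetical scaled prices and model predictions (illustrative values only).
targets = np.array([0.10, 0.25, 0.40, 0.80])
predictions = np.array([0.11, 0.24, 0.42, 0.79])

# "Accuracy" as computed at the end of the question: fraction of exact matches.
exact_match_accuracy = np.mean(predictions == targets)

# A regression metric instead: mean absolute error in the target's units.
mean_absolute_error = np.mean(np.abs(predictions - targets))

print("Exact-match accuracy: %.2f%%" % (exact_match_accuracy * 100))
print("Mean absolute error: %.4f" % mean_absolute_error)
```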
Answer 2:
First of all, you should consider cleaning up your code before posting questions on Stack Overflow. I tried to replicate your code and ran into some errors before I could get your dataset cleaned into the NumPy arrays train_data, train_targets, test_data and test_targets.
Focusing on machine learning theory: if you don't shuffle your dataset, it is going to be much harder for your regression model to train. Try shuffling your dataset using random.shuffle() before splitting it into train and test subsets.
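One way to do this, sketched here with a NumPy index permutation rather than random.shuffle(), is to draw a single random ordering and apply it to both the features and the targets, so each row stays paired with its price. The arrays and the seed below are hypothetical stand-ins, not the real dataset.

```python
import numpy as np

# Hypothetical stand-ins for the encoded features and the targets.
features = np.arange(20, dtype=float).reshape(10, 2)
targets = np.arange(10, dtype=float)

# One permutation of the row indices, applied to both arrays so that each
# feature row stays paired with its target. The seed is arbitrary.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(features))
features, targets = features[order], targets[order]

# 80% for training and 20% for testing, as in the question.
split = int(len(features) * 0.8)
train_data, test_data = features[:split], features[split:]
train_targets, test_targets = targets[:split], targets[split:]
```

With a pandas DataFrame, the same idea would be to shuffle the rows before separating inputs from the label, so that the two cannot get out of step.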
As Matias's answer states, if you are working on a regression problem (rather than a classification one), it makes no sense to use the accuracy metric.
Furthermore, binary cross-entropy loss is only suitable for classification, so it makes no sense here either. The typical loss for regression models is mean squared error. Consider changing your Keras model compilation to:
model.compile(loss='mean_squared_error', optimizer='adam')
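With this loss, the reported value is a mean squared error in the scaled units of the target. Because the question divides MSRP by its maximum, that error can be converted back to dollars. A rough NumPy sketch, where max_msrp and both arrays are hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical values: the maximum MSRP used for scaling, plus scaled
# targets and predictions.
max_msrp = 2_000_000.0
targets = np.array([0.010, 0.020, 0.015, 0.050])
predictions = np.array([0.012, 0.018, 0.015, 0.046])

# Mean squared error, the quantity minimised by loss='mean_squared_error'.
mse = np.mean((predictions - targets) ** 2)

# RMSE is in the same scaled units as the target; multiplying by the
# scaling constant converts it back to dollars.
rmse_dollars = np.sqrt(mse) * max_msrp

print("MSE (scaled): %.8f" % mse)
print("RMSE (dollars): %.0f" % rmse_dollars)
```

Note that when the model is compiled without metrics, model.evaluate() returns a single loss value, so the two-value unpacking used in the question would need to change accordingly.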
Hope this helps!
Source: https://stackoverflow.com/questions/50140500/keras-nn-regression-model-gives-low-loss-and-0-acuracy