Issue encoding non-numeric features as numeric in Spark and IPython

Question

I am working on a problem where I have to predict a numeric value (monthly employee spending) from non-numeric features. I am using Spark MLlib's Random Forest algorithm. My feature data is in a dataframe that looks like this:

   _1      _2      _3             _4
0  Level1  Male    New York       New York
1  Level1  Male    San Fransisco  California
2  Level2  Male    New York       New York
3  Level1  Male    Columbus       Ohio
4  Level3  Male    New York       New York
5  Level4  Male    Columbus       Ohio
6  Level5  Female  Stamford       Connecticut
7  Level1  Female  San Fransisco  California
8  Level3  Male    Stamford       Connecticut
9  Level6  Female  Columbus       Ohio

The columns are employee level, gender, city, and state. These are the features I want to use to predict employee monthly spending (the label, in $).

The training label set looks like this:

Since the features are non-numeric, I need to encode them as numbers. I am following this link to encode categorical data into numbers, and I wrote this code following the process described in the linked article:

import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV
import pandas as pd
def extract(line):
    return (line[1],line[2],line[3],line[7],line[9],line[10],line[22])

inputfile = sc.textFile('file1.csv').zipWithIndex().filter(lambda (line,rownum): rownum>0).map(lambda (line, rownum): line)


input_data = (inputfile
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) >1 )
    .map(extract)) # Map to tuples

(train_data, test_data) = input_data.randomSplit([0.8, 0.2])

# converting RDD to dataframe
train_dataframe = train_data.toDF()
# converting to pandas dataframe
train_pandas = train_dataframe.toPandas()
# filtering features
train_pandas_features = train_pandas.iloc[:,:6]
# filtering label
train_pandas_label = train_pandas.iloc[:,6]

train_pandas_features_dict = train_pandas_features.T.to_dict().values()

# encoding features to numeric
vectorizer = DV( sparse = False )
vec_train = vectorizer.fit_transform( train_pandas_features_dict )

When I run print vec_train, all I see is 0. in every feature column, something like this:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

I think I am making a mistake somewhere that causes this encoding to produce incorrect results. What am I doing wrong? And is there a better way to encode non-numeric features as numeric for the case I described at the top (predicting numeric monthly expenditure from non-numeric employee data)?

Answer 1:

Generally speaking, if your data can be processed with Pandas data frames and scikit-learn, using Spark seems to be serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start with indexing your features:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler

label_col = "x3"  # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings as doubles
string_indexers = [
   StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))

   # For classification problems
   #   - if you want to use ML you should index the label as well
   #   - if you want to use MLlib it is not necessary
   # For regression problems you should omit the label from the indexing,
   # as shown below
   for x in df.columns if x not in {label_col} # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)
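
To make the indexing step more concrete, here is a minimal plain-Python sketch of what StringIndexer does by default: each distinct string gets an index ordered by descending frequency, so the most common value becomes 0.0. The helper name string_index and the alphabetical tie-breaking are assumptions made for illustration and are not part of Spark's API. The sample values are the gender column from the question.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a double, most frequent first.

    Hypothetical helper that mimics StringIndexer's default ordering;
    ties are broken alphabetically here (an assumption, for determinism).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Gender column from the sample data in the question
gender = ["Male", "Male", "Male", "Male", "Male",
          "Male", "Female", "Female", "Male", "Female"]

mapping = string_index(gender)
indexed_gender = [mapping[v] for v in gender]
```

Note that the resulting numbers are only category identifiers. Their order carries no meaning, which is why tree-based learners need to be told which features are categorical.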

The pipeline defined above will create the following data frame:

indexed.printSchema()
## root
##  |-- x0: string (nullable = true)
##  |-- x1: string (nullable = true)
##  |-- x2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- idx_x0: double (nullable = true)
##  |-- idx_x1: double (nullable = true)
##  |-- idx_x2: double (nullable = true)
##  |-- features: vector (nullable = true)

where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
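
The MLlib tree learners take a categoricalFeaturesInfo dictionary that maps a feature's position in the vector to its number of categories; features left out of it are treated as continuous. Below is a rough plain-Python sketch of how such a dictionary could be derived from the raw rows. It assumes that every column is categorical and that the column order matches the assembler's input order. The helper build_categorical_info is hypothetical, and the rows are the sample data from the question.

```python
def build_categorical_info(rows):
    """Return {feature position: number of distinct values}.

    Hypothetical helper: assumes every column in `rows` is categorical and
    that column order matches the order of the assembled feature vector.
    """
    n_features = len(rows[0])
    return {i: len({row[i] for row in rows}) for i in range(n_features)}

# Sample rows from the question: (level, gender, city, state)
rows = [
    ("Level1", "Male", "New York", "New York"),
    ("Level1", "Male", "San Fransisco", "California"),
    ("Level2", "Male", "New York", "New York"),
    ("Level1", "Male", "Columbus", "Ohio"),
    ("Level3", "Male", "New York", "New York"),
    ("Level4", "Male", "Columbus", "Ohio"),
    ("Level5", "Female", "Stamford", "Connecticut"),
    ("Level1", "Female", "San Fransisco", "California"),
    ("Level3", "Male", "Stamford", "Connecticut"),
    ("Level6", "Female", "Columbus", "Ohio"),
]

categorical_info = build_categorical_info(rows)
# maxBins given to the tree learner must be at least the largest arity
min_max_bins = max(categorical_info.values())
```

The dictionary could then be passed as the categoricalFeaturesInfo argument of RandomForest.trainRegressor, with a maxBins value no smaller than min_max_bins.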

You can create labeled points from it as follows:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    .rdd  # DataFrame.map was removed in Spark 2.0; go through the RDD
    .map(lambda row: LabeledPoint(row.label, row.features)))

Source: https://stackoverflow.com/questions/33981740/issue-in-encoding-non-numeric-feature-to-numeric-in-spark-and-ipython

Tags

python

apache-spark

machine-learning

dataframe

pyspark