问题
Given the following rdd
training_rdd = rdd.select(
# Categorical features
col('device_os'), # 'ios', 'android'
# Numeric features
col('30day_click_count'),
col('30day_impression_count'),
np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
# label
col('did_click').alias('label')
)
I am confused about the syntax to train a gradient boosting classifer.
I am following the this tutorial. https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier
However, I am unsure about how to get my 4 feature columns into a vector. Because VectorIndexer assumes that all the features are already in one column.
回答1:
You can use VectorAssembler
to generate the feature vector. Please note that you will have to convert your rdd
to a DataFrame
first.
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
vectorizer.setInputCols(["device_os",
"30day_click_count",
"30day_impression_count",
"30day_click_through_rate"])
vectorizer.setOutputCol("features")
And consequently, you will need to put vectorizer
as the first stage into the Pipeline
:
pipeline = Pipeline([vectorizer, ...])
来源:https://stackoverflow.com/questions/39504718/how-to-train-sparkml-gradient-boosting-classifer-given-a-rdd