How to train a SparkML gradient boosting classifier given an RDD

Question

Given the following RDD:

training_rdd = rdd.select(
    # Categorical features
    col('device_os'), # 'ios', 'android'

    # Numeric features
    col('30day_click_count'), 
    col('30day_impression_count'),
    np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),

    # label
    col('did_click').alias('label')
)

I am confused about the syntax for training a gradient boosting classifier.

I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

However, I am unsure how to get my 4 feature columns into a single vector, because VectorIndexer assumes that all the features are already in one column.

Answer 1:

You can use VectorAssembler to generate the feature vector. Note that you will have to convert your RDD to a DataFrame first.

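from pyspark.ml.feature import VectorAssembler

# VectorAssembler accepts only numeric, boolean, and vector columns, so the string
# column device_os has to be converted first, e.g. with StringIndexer.
# "device_os_index" is an assumed name for that indexed column.
vectorizer = VectorAssembler()

vectorizer.setInputCols(["device_os_index",
                         "30day_click_count",
                         "30day_impression_count",
                         "30day_click_through_rate"])

vectorizer.setOutputCol("features")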

You will then need to put vectorizer into the Pipeline as a stage ahead of the classifier:

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorizer, ...])

Source: https://stackoverflow.com/questions/39504718/how-to-train-sparkml-gradient-boosting-classifer-given-a-rdd

Tags

apache-spark

pyspark

apache-spark-ml
