Issue with Spark MLlib that causes probability and prediction to be the same for everything


Question


I'm learning how to use machine learning with Spark MLlib, with the goal of doing sentiment analysis of tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

That dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment label and the fourth column contains the tweet text.
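A quick way to confirm that layout before parsing is to print the header and the first data row; a minimal sketch, reusing the file path from the code below:

raw = sc.textFile("/home/omar/sentiment-train.csv")
# Expect the header followed by one data row; the sentiment label should
# be in the second field and the tweet text in the fourth.
for line in raw.take(2):
    print(line)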

This is my current PySpark code:

import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Load the CSV and drop the header row
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)

# Parse each partition with the csv module and keep (tweet, label)
r = rdd.mapPartitions(lambda x: csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))

# Build a DataFrame and take a random sample of 10,000 rows
parts = r2.map(lambda x: Row(sentence=x[0], label=x[1]))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)

# Split each tweet into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)

# Drop stop words
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)

train_data_raw = base_words.select("base_words", "label")

# Embed each tweet as the average of 100-dimensional word vectors
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")

model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")

# Train an elastic-net regularized logistic regression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)

lrModel.transform(final_train_data).show()
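(As an aside, the same stages could be chained into a single Pipeline, which keeps the transformations consistent between training and inference; a minimal sketch reusing the tokenizer, remover, word2Vec, and lr objects defined above:

from pyspark.ml import Pipeline

# Fit all four stages in one pass over the raw DataFrame
pipeline = Pipeline(stages=[tokenizer, remover, word2Vec, lr])
pipelineModel = pipeline.fit(partsDF)
pipelineModel.transform(partsDF).select("label", "probability", "prediction").show(5)

This doesn't change the result, only the plumbing.)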

I'm executing this in the PySpark interactive shell, which I launch with this command:

pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'

(FYI: I have an HDFS cluster of 10 VMs running YARN, Spark, etc.)

The last line of code produces this output:

>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

If I do the same with a smaller dataset that I created manually, it works. I don't know what is happening; I have been working on this all day.

Any suggestions?

Thanks for your time!


Answer 1:


TL;DR: Ten iterations is far too low for any real-life application. On large, non-trivial datasets it can take a thousand or more iterations (as well as tuning the remaining parameters) to converge.
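As a first step you can simply raise maxIter and lighten the penalty: with elasticNetParam=0.8 and regParam=0.3, the L1 term can shrink every coefficient to zero, leaving an intercept-only model, which would explain why all rows get the same rawPrediction. A minimal sketch against the question's final_train_data (the parameter values are illustrative starting points, not tuned results):

from pyspark.ml.classification import LogisticRegression

# More iterations, much weaker regularization
lr = LogisticRegression(maxIter=100, regParam=0.01, elasticNetParam=0.0)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()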

A binomial LogisticRegressionModel has a summary attribute, which gives you access to a LogisticRegressionTrainingSummary object. Among other useful metrics, it contains objectiveHistory, which can be used to debug the training process:

import matplotlib.pyplot as plt

# The "..." stands for whatever other parameters you pass
lrm = LogisticRegression(..., family="binomial").fit(df)

# One objective (loss) value per iteration; a curve that is still falling
# steeply at the last iteration means training has not converged.
plt.plot(lrm.summary.objectiveHistory)
plt.show()
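Once the objective history confirms convergence, the remaining parameters can be searched systematically instead of guessed; a sketch using Spark ML's built-in cross-validation (the grid values below are illustrative assumptions, not recommendations):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(maxIter=100, family="binomial")

# Cross over regularization strength and the L1/L2 mix
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),  # areaUnderROC by default
                    numFolds=3)

cvModel = cv.fit(final_train_data)
bestModel = cvModel.bestModel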


Source: https://stackoverflow.com/questions/44480077/issue-with-spark-mllib-that-causes-probability-and-prediction-to-be-the-same-for
