Question
I don't know why I receive the message
WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
when I try to use Spark KMeans:
df_Part = assembler.transform(df_Part)
df_Part.cache()
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1
It says that my input (DataFrame) is not cached!
I tried printing df_Part.is_cached and got True, which means my DataFrame is cached, so why does Spark still warn me about this?
Answer 1:
This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.

Internally, o.a.s.ml.clustering.KMeans:

- Converts the DataFrame to an RDD[o.a.s.mllib.linalg.Vector].
- Executes o.a.s.mllib.clustering.KMeans.

While you cache the DataFrame, the RDD which is used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.
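If the warning really bothers you, one possible workaround is to bypass the ml wrapper and feed the mllib API an explicitly cached RDD. This is only a minimal sketch, assuming Spark 2.x and that the assembler in the question writes its output to a column named "features" (the assembler definition is not shown in the question, so that column name is an assumption):

from pyspark.mllib.clustering import KMeans as MLlibKMeans
from pyspark.mllib.linalg import Vectors

# Pull the ml vectors out of the assumed "features" column, convert them to
# mllib vectors, and cache the resulting RDD, so the algorithm's actual input
# is cached and the warning no longer applies.
features_rdd = df_Part.rdd.map(lambda row: Vectors.fromML(row["features"])).cache()

model = MLlibKMeans.train(features_rdd, k)
wssse = model.computeCost(features_rdd)

The trade-off is that you give up the DataFrame-based ml API and its pipeline integration, which is usually not worth it just to silence a warning.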
Answer 2:
This was fixed in Spark 2.2.0. Here is the relevant ticket: SPARK-18356.
The discussion there also suggests this is not a big deal, but the fix may reduce runtime slightly, as well as avoiding the warning.
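If you are not sure which version your cluster runs, you can check it before deciding whether the warning still matters (a trivial check; spark here is assumed to be the active SparkSession):

# The fix described above ships with Spark 2.2.0 and later.
print(spark.version)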
Source: https://stackoverflow.com/questions/40406166/pyspark-2-kmeans-the-input-data-is-not-directly-cached