PySpark 2: KMeans The input data is not directly cached

Submitted by 那年仲夏 on 2019-12-10 17:56:05

Question


I don't know why I receive the message

WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.

when I try to use Spark KMeans:

df_Part = assembler.transform(df_Part)
df_Part.cache()

while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1

It says that my input (DataFrame) is not cached!

I tried printing df_Part.is_cached and got True, which means my DataFrame is cached. So why does Spark still warn me about this?


Answer 1:


This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.

Internally, o.a.s.ml.clustering.KMeans:

  • Converts the DataFrame to an RDD[o.a.s.mllib.linalg.Vector].
  • Executes o.a.s.mllib.clustering.KMeans on that RDD.

While you cache the DataFrame, the RDD used internally is not cached. This is why you see the warning. Although it is annoying, I wouldn't worry too much about it.




Answer 2:


This was fixed in Spark 2.2.0; see SPARK-18356.

The discussion there also suggests this is not a big deal, but the fix may reduce runtime slightly, as well as avoid the warning.



Source: https://stackoverflow.com/questions/40406166/pyspark-2-kmeans-the-input-data-is-not-directly-cached
