How do I remove the empty tweets using filter() in pyspark?

Submitted by 人走茶凉 on 2019-12-11 09:15:47

Question


How do I remove the empty tweets using filter() in pyspark? I have done the following:

tweets = sc.textFile(.....)
tweets.count()

The result is 13995. However, when I imported the data from MongoDB, it showed 11186.

I can't seem to apply the filter() command for removing the empty tweets. Help please.


Answer 1:


If your data looks like this:

tweets = sc.parallelize(["title1", "", "title2", "title3", ""])

You can use len(x) as the filter condition:

tweets.filter(lambda x: len(x) > 0).count()
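
Note that len(x) > 0 still keeps lines that contain only spaces or tabs. If such lines should also count as empty (an assumption, since the question does not say what the empty tweets look like), a stricter predicate might be sketched as below. The name is_nonempty is a hypothetical helper, and the sample list is made up for illustration; the predicate is checked here in plain Python so it can be tried without a Spark context.

```python
def is_nonempty(line):
    """Return True when the line has at least one non-whitespace character."""
    return bool(line.strip())

# Made-up sample data: two whitespace-only entries and one truly empty entry.
sample = ["title1", "", "title2", "   ", "title3", "\t"]

# Plain-Python check of the predicate, no Spark needed.
kept = list(filter(is_nonempty, sample))
print(kept)

# With an RDD, the same predicate would be passed to filter():
# tweets.filter(is_nonempty).count()
```

Using a named function instead of a lambda makes the condition easy to test on its own before running it over the full data set.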


Source: https://stackoverflow.com/questions/40504810/how-do-i-remove-the-empty-tweets-using-filter-in-pyspark
