Question
How do I remove the empty tweets using filter() in PySpark? I have done the following:
tweets = sc.textFile(.....)
tweets.count()
The result is 13995. However, when I imported the same data from MongoDB, it showed 11186.
I can't seem to apply filter() to remove the empty tweets. Help please.
Answer 1:
If your data looks like this:
tweets = sc.parallelize(["title1", "", "title2", "title3", ""])
you can use len(x) as the filter condition:
tweets.filter(lambda x: len(x) > 0).count()
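One caveat: a line containing only spaces or tabs has len(x) > 0, so it would still be counted. If the file may contain such lines (an assumption, since the question does not show the data), a predicate that strips whitespace first might be closer to what is wanted. The sketch below checks the idea on a plain Python list so it runs without Spark; the name is_non_empty and the sample data are made up for illustration.

```python
def is_non_empty(line):
    """Return True if the line has at least one non-whitespace character."""
    return bool(line.strip())

# Hypothetical sample: the answer's list plus two whitespace-only entries.
sample = ["title1", "", "title2", "   ", "title3", "\t"]

# Plain-Python check of the predicate. On an RDD, the same function
# could be passed to filter: tweets.filter(is_non_empty).count()
kept = [line for line in sample if is_non_empty(line)]
print(len(kept))  # 3
```

If the count still does not match the MongoDB figure after this, the difference probably has another cause than blank lines, for example records that span several lines in the text file.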
Source: https://stackoverflow.com/questions/40504810/how-do-i-remove-the-empty-tweets-using-filter-in-pyspark