Counting words after grouping records


Question


Note: Although the provided answer works, it can get rather slow on larger data sets. Take a look at this for a faster solution.


I have a data frame consisting of labelled documents, such as this one:

df_ = spark.createDataFrame([
    ('1', 'hello how are are you today'),
    ('1', 'hello how are you'),
    ('2', 'hello are you here'),
    ('2', 'how is it'),
    ('3', 'hello how are you'),
    ('3', 'hello how are you'),
    ('4', 'hello how is it you today')
], schema=['label', 'text'])

I want to group the data frame by label and make a simple word count for each group. My problem is that I'm not sure how to do this in PySpark. As a first step, I would split the text and turn each document into a list of tokens:

from collections import Counter
from pyspark.sql.types import ArrayType, StringType
import pyspark.sql.functions as F

def get_tokens(text):
    # Split a document into a list of lowercased tokens.
    if text is None:
        return list()
    return text.lower().split()

udf_get_tokens = F.udf(get_tokens, ArrayType(StringType()))

df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .show()

which gives:

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    1|[hello, how, are,...|
|    1|[hello, how, are,...|
|    2|[hello, are, you,...|
|    2|[hello, how, is, it]|
|    3|[hello, how, are,...|
|    3|[hello, how, are,...|
|    4|[hello, how, is, ...|
+-----+--------------------+

I know how I can make a word count over the entire data frame, but I don't know how to proceed with groupBy() or reduceByKey().
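
By counting over the entire data frame I mean something along these lines (a rough RDD-based sketch that ignores the label entirely):

df_.rdd \
    .flatMap(lambda row: (row['text'] or '').lower().split()) \
    .map(lambda token: (token, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()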

I was thinking about partially counting the words in the data frame:

df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .rdd.map(lambda x: (x[0], list(Counter(x[1]).items()))) \
    .toDF(schema=['label', 'text'])\
    .show()

which gives:

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    1|[[are,2], [hello,...|
|    1|[[are,1], [hello,...|
|    2|[[are,1], [hello,...|
|    2|[[how,1], [it,1],...|
|    3|[[are,1], [hello,...|
|    3|[[are,1], [hello,...|
|    4|[[you,1], [today,...|
+-----+--------------------+

but how can I aggregate this?


Answer 1:


You should use pyspark.ml.feature.Tokenizer to split the text instead of using a udf. (Also, depending on what you are doing, you may find StopWordsRemover to be useful.)

For example:

from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokens = tokenizer.transform(df_)
tokens.show(truncate=False)
+-----+---------------------------+----------------------------------+
|label|text                       |tokens                            |
+-----+---------------------------+----------------------------------+
|1    |hello how are are you today|[hello, how, are, are, you, today]|
|1    |hello how are you          |[hello, how, are, you]            |
|2    |hello are you here         |[hello, are, you, here]           |
|2    |how is it                  |[how, is, it]                     |
|3    |hello how are you          |[hello, how, are, you]            |
|3    |hello how are you          |[hello, how, are, you]            |
|4    |hello how is it you today  |[hello, how, is, it, you, today]  |
+-----+---------------------------+----------------------------------+
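
(As an aside on the StopWordsRemover mentioned above: it could be chained after the tokenizer roughly like this. This is only a sketch, the "filtered" output column name is arbitrary, and the rest of this answer keeps working on the raw tokens column.)

from pyspark.ml.feature import StopWordsRemover

# Drop common English stop words from the token lists
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
remover.transform(tokens).select("label", "filtered").show(truncate=False)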

Then you can explode() the tokens, and do a groupBy() to get the count for each word:

import pyspark.sql.functions as f
token_counts = tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token").count()\
    .orderBy("label", "token")
token_counts.show(truncate=False, n=10)
+-----+-----+-----+
|label|token|count|
+-----+-----+-----+
|1    |are  |3    |
|1    |hello|2    |
|1    |how  |2    |
|1    |today|1    |
|1    |you  |2    |
|2    |are  |1    |
|2    |hello|1    |
|2    |here |1    |
|2    |how  |1    |
|2    |is   |1    |
+-----+-----+-----+
only showing top 10 rows

If you want all of the tokens and counts on one row per label, just do another groupBy() with pyspark.sql.functions.collect_list() and concatenate the token and count columns using pyspark.sql.functions.struct():

tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token")\
    .count()\
    .groupBy("label")\
    .agg(f.collect_list(f.struct(f.col("token"), f.col("count"))).alias("text"))\
    .orderBy("label")\
    .show(truncate=False)
+-----+----------------------------------------------------------------+
|label|text                                                            |
+-----+----------------------------------------------------------------+
|1    |[[hello,2], [how,2], [are,3], [today,1], [you,2]]               |
|2    |[[you,1], [hello,1], [here,1], [are,1], [it,1], [how,1], [is,1]]|
|3    |[[are,2], [you,2], [how,2], [hello,2]]                          |
|4    |[[today,1], [hello,1], [it,1], [you,1], [how,1], [is,1]]        |
+-----+----------------------------------------------------------------+
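
If a map column is more convenient than an array of structs, and you are on Spark 2.4 or later, pyspark.sql.functions.map_from_entries can turn the collected structs into a token-to-count map. A sketch (not part of the original answer; the "counts" column name is arbitrary):

tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token")\
    .count()\
    .groupBy("label")\
    .agg(f.map_from_entries(f.collect_list(f.struct("token", "count"))).alias("counts"))\
    .orderBy("label")\
    .show(truncate=False)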


Source: https://stackoverflow.com/questions/49923359/counting-words-after-grouping-records
