pyspark

Dividing dataframes in pyspark

不想你离开。 Submitted on 2021-01-29 05:33:59
Question: Following up this question and its dataframes, I am trying to convert this into this (I know it looks the same, but refer to the next code line to see the difference). In pandas, I used the line

teste_2 = (value/value.groupby(level=0).sum())

and in pyspark I tried several solutions. The first one was:

df_2 = (df/df.groupby(["age"]).sum())

However, I am getting the following error:

TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

The second one was: df_2 = (df.filter
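
A sketch of one way to reproduce value/value.groupby(level=0).sum() in pyspark (not taken from the original thread): compute the per-group sum with a window and divide. The column names are assumptions, with 'age' as the grouping key (following the attempt above) and 'value' as the numeric column being normalised.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-group sum via a window, then divide each value by its group's total.
w = Window.partitionBy('age')
df_2 = df.withColumn('value', F.col('value') / F.sum('value').over(w))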

How to read HDFS files with a wildcard character in pyspark

这一生的挚爱 Submitted on 2021-01-29 05:18:32
Question: There are some parquet file paths:

/a/b/c='str1'/d='str'
/a/b/c='str2'/d='str'
/a/b/c='str3'/d='str'

I want to read the parquet files like this:

df = spark.read.parquet('/a/b/c='*'/d='str')

but it doesn't work with the "*" wildcard character. How can I do that? Thank you for helping.

Answer 1: You need to escape the single quotes:

df = spark.read.parquet('/a/b/c=\'*\'/d=\'str\'')

... or just use double quotes:

df = spark.read.parquet("/a/b/c='*'/d='str'")

Source: https://stackoverflow.com/questions

Categorise text in a column using keywords

牧云@^-^@ Submitted on 2021-01-29 04:09:47
Question: I have a table column that contains the description of the treatment done to resolve an issue; this text contains keywords. In another list, I have the list of categories, with the different keywords that help to identify each one. For example:

Category | keywords
AAAA | keyword1
AAAA | keyword2 and keyword3
AAAA | keyword3 and not keyword4
BBBB | keyword4
BBBB | keyword5 and keyword6
BBBB | keyword7

How can I fill the category column in my previous table (that contains the description), using the keywords
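
A hedged sketch of one possible pyspark approach (the thread's accepted answer is not shown in this excerpt): encode each keyword rule as a boolean condition over the free-text column and chain them with when/otherwise. The column name 'description' and the hard-coded rules are assumptions that mirror the table above.

from pyspark.sql import functions as F

desc = F.col('description')  # assumed name of the free-text column

df_cat = df.withColumn(
    'category',
    F.when(desc.contains('keyword1'), 'AAAA')
     .when(desc.contains('keyword2') & desc.contains('keyword3'), 'AAAA')
     .when(desc.contains('keyword3') & ~desc.contains('keyword4'), 'AAAA')
     .when(desc.contains('keyword4'), 'BBBB')
     .when(desc.contains('keyword5') & desc.contains('keyword6'), 'BBBB')
     .when(desc.contains('keyword7'), 'BBBB')
     .otherwise(None)  # rows matching no rule stay uncategorised
)

In practice the conditions would more likely be built dynamically from the keyword table, but the chained-when form shows the core idea.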

pyspark - strange behavior of count function inside agg

ぃ、小莉子 Submitted on 2021-01-29 02:52:43
Question: I am using Spark 2.4.0. I am observing a strange behavior while using the count function to aggregate.

from pyspark.sql import functions as F
tst = sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)], schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+

tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+----------
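
The output shown is the documented behaviour of count: F.count(col) counts only the non-null values of that column in each group, which is why group 3 (all nulls) reports 0. A small sketch contrasting it with an all-rows count, using the same toy frame:

from pyspark.sql import functions as F

# 'non_null_col2' skips nulls; 'all_rows' counts every row in the group.
tst.groupby('col1').agg(
    F.count('col2').alias('non_null_col2'),
    F.count(F.lit(1)).alias('all_rows')
).show()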

Count of all elements less than the value in a row

余生颓废 Submitted on 2021-01-28 21:14:33
Question: Given a dataframe

value
-----
0.3
0.2
0.7
0.5

is there a way to build a column that contains, for each row, the count of the elements that are less than or equal to the row value? Specifically:

value  count_less_equal
------------------------
0.3    2
0.2    1
0.7    4
0.5    3

I could groupBy the value column, but I don't know how to filter all the values that are less than that value. I was thinking that maybe it's possible to duplicate the first column, then create a filter so that for each
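
One possible approach (not taken from the original thread): a global window ordered by value whose frame spans everything up to and including the current value, so counting over it gives the "less than or equal" tally, ties included. The column name 'value' follows the example; note that a window without partitionBy pulls all rows into a single partition, so this only scales to modest data sizes.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame = all rows whose value is <= the current row's value.
w = Window.orderBy('value').rangeBetween(Window.unboundedPreceding, Window.currentRow)
df_out = df.withColumn('count_less_equal', F.count('value').over(w))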

Airflow ModuleNotFoundError: No module named 'pyspark'

╄→гoц情女王★ Submitted on 2021-01-28 21:12:13
Question: I installed Airflow on my machine, which works well, and I also have a local Spark installation (which is operational too). I want to use Airflow to orchestrate two Spark tasks: task_spark_datatransform >> task_spark_model_reco. The two pyspark modules associated with these two tasks are tested and work well under Spark. I also created a very simple Airflow DAG using a BashOperator to run each Spark task. For example, for the task task_spark_datatransform I have: task_spark_datatransform = BashOperator(task
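
A common cause of this error is that bash_command runs the script with the worker's bare Python interpreter, which has no pyspark package installed. A hedged sketch of one workaround, launching the script through spark-submit so Spark supplies pyspark itself (the DAG name and script path are hypothetical; alternatively, pip-install pyspark into the environment Airflow runs in):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in Airflow 2.x

with DAG('spark_orchestration', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # spark-submit sets up the pyspark environment for the script,
    # so the Airflow worker's own Python does not need to import pyspark.
    task_spark_datatransform = BashOperator(
        task_id='task_spark_datatransform',
        bash_command='spark-submit /path/to/datatransform.py',
    )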

BigQuery connector ClassNotFoundException in PySpark on Dataproc

微笑、不失礼 Submitted on 2021-01-28 20:07:28
Question: I'm trying to run a script in PySpark, using Dataproc. The script is kind of a merge between this example and what I need to do, as I wanted to check that everything works. Obviously, it doesn't. The error I get is:

File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io
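
A ClassNotFoundException for com.google.cloud.hadoop.io... usually means the BigQuery Hadoop connector jar is not on the job's classpath; on Dataproc it is typically supplied at submission time (for example with --jars and the Google-hosted connector jar). A hedged pyspark-side sketch of the same idea, where the jar location and the use of spark.jars are assumptions rather than the thread's confirmed fix:

from pyspark.sql import SparkSession

# Assumed jar location (Google's public connector bucket); adjust to the version
# available for your cluster, or pass it with --jars when submitting the job.
spark = (
    SparkSession.builder
    .appName('bigquery-connector-example')
    .config('spark.jars', 'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar')
    .getOrCreate()
)
sc = spark.sparkContext  # the connector classes should then be resolvable by newAPIHadoopRDD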

Appending column name to column value using Spark

柔情痞子 Submitted on 2021-01-28 20:05:44
Question: I have data in a comma separated file, which I have loaded into a Spark data frame. The data looks like:

A B C
1 2 3
4 5 6
7 8 9

I want to transform the above data frame in Spark, using pyspark, into:

A   B   C
A_1 B_2 C_3
A_4 B_5 C_6
--------------

Then convert it to a list of lists using pyspark:

[[A_1, B_2, C_3], [A_4, B_5, C_6]]

And then run the FP-Growth algorithm using pyspark on the above data set. The code that I have tried is below:

from pyspark.sql.functions import col, size
from pyspark.sql
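
A hedged sketch of the transformation described above, assuming the file is already loaded as df with columns A, B and C; the FP-Growth step itself is omitted, but the final select produces the array column that pyspark.ml.fpm.FPGrowth expects.

from pyspark.sql import functions as F

# Prefix every value with its column name, e.g. 1 -> 'A_1'.
prefixed = df.select([
    F.concat(F.lit(c + '_'), F.col(c).cast('string')).alias(c)
    for c in df.columns
])

# List of lists, e.g. [['A_1', 'B_2', 'C_3'], ['A_4', 'B_5', 'C_6'], ...]
rows_as_lists = [list(r) for r in prefixed.collect()]

# Single array column, the shape FPGrowth's itemsCol expects.
items_df = prefixed.select(F.array(*prefixed.columns).alias('items'))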