pyspark

Dividing dataframes in pyspark

不想你离开。 Submitted on 2021-01-29 05:33:59
Question: Following up this question and its dataframes, I am trying to convert this into this (I know it looks the same, but refer to the next code line to see the difference). In pandas, I used the line

teste_2 = (value/value.groupby(level=0).sum())

and in pyspark I tried several solutions. The first one was:

df_2 = (df/df.groupby(["age"]).sum())

However, I am getting the following error:

TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

The second one was: df_2 = (df.filter
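
A sketch of one way to reproduce value/value.groupby(level=0).sum() in pyspark (not taken from the original thread): compute the per-group sum with a window and divide. The column names are assumptions, with 'age' as the grouping key (following the attempt above) and 'value' as the numeric column being normalised.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-group sum via a window, then divide each value by its group's total.
w = Window.partitionBy('age')
df_2 = df.withColumn('value', F.col('value') / F.sum('value').over(w))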

How to read HDFS files with a wildcard character in pyspark

这一生的挚爱 Submitted on 2021-01-29 05:18:32
Question: There are some parquet file paths:

/a/b/c='str1'/d='str'
/a/b/c='str2'/d='str'
/a/b/c='str3'/d='str'

I want to read the parquet files like this:

df = spark.read.parquet('/a/b/c='*'/d='str')

but it doesn't work with the "*" wildcard character. How can I do that? Thank you for helping.

Answer 1: You need to escape the single quotes:

df = spark.read.parquet('/a/b/c=\'*\'/d=\'str\'')

... or just use double quotes:

df = spark.read.parquet("/a/b/c='*'/d='str'")

Source: https://stackoverflow.com/questions

Categorise text in a column using keywords

牧云@^-^@ Submitted on 2021-01-29 04:09:47
Question: I have a table column that contains the description of the treatment done to resolve an issue; this text contains keywords. In another list, I have the list of categories, with the different keywords that help to identify each one. For example:

Category | keywords
AAAA | keyword1
AAAA | keyword2 and keyword3
AAAA | keyword3 and not keyword4
BBBB | keyword4
BBBB | keyword5 and keyword6
BBBB | keyword7

How can I fill the category column in my previous table (that contains the description), using the keywords
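
A hedged sketch of one possible pyspark approach (the thread's accepted answer is not shown in this excerpt): encode each keyword rule as a boolean condition over the free-text column and chain them with when/otherwise. The column name 'description' and the hard-coded rules are assumptions that mirror the table above.

from pyspark.sql import functions as F

desc = F.col('description')  # assumed name of the free-text column

df_cat = df.withColumn(
    'category',
    F.when(desc.contains('keyword1'), 'AAAA')
     .when(desc.contains('keyword2') & desc.contains('keyword3'), 'AAAA')
     .when(desc.contains('keyword3') & ~desc.contains('keyword4'), 'AAAA')
     .when(desc.contains('keyword4'), 'BBBB')
     .when(desc.contains('keyword5') & desc.contains('keyword6'), 'BBBB')
     .when(desc.contains('keyword7'), 'BBBB')
     .otherwise(None)  # rows matching no rule stay uncategorised
)

In practice the conditions would more likely be built dynamically from the keyword table, but the chained-when form shows the core idea.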

pyspark - strange behavior of count function inside agg

ぃ、小莉子 Submitted on 2021-01-29 02:52:43
Question: I am using Spark 2.4.0. I am observing a strange behavior while using the count function to aggregate.

from pyspark.sql import functions as F
tst = sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)], schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+

tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+----------
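
The output shown is the documented behaviour of count: F.count(col) counts only the non-null values of that column in each group, which is why group 3 (all nulls) reports 0. A small sketch contrasting it with an all-rows count, using the same toy frame:

from pyspark.sql import functions as F

# 'non_null_col2' skips nulls; 'all_rows' counts every row in the group.
tst.groupby('col1').agg(
    F.count('col2').alias('non_null_col2'),
    F.count(F.lit(1)).alias('all_rows')
).show()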

Count of all elements less than the value in a row

余生颓废 Submitted on 2021-01-28 21:14:33
Question: Given a dataframe

value
-----
0.3
0.2
0.7
0.5

is there a way to build a column that contains, for each row, the count of the elements that are less than or equal to the row value? Specifically:

value  count_less_equal
------------------------
0.3    2
0.2    1
0.7    4
0.5    3

I could groupBy the value column, but I don't know how to filter all the values that are less than that value. I was thinking that maybe it's possible to duplicate the first column, then create a filter so that for each
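
One possible approach (not taken from the original thread): a global window ordered by value whose frame spans everything up to and including the current value, so counting over it gives the "less than or equal" tally, ties included. The column name 'value' follows the example; note that a window without partitionBy pulls all rows into a single partition, so this only scales to modest data sizes.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame = all rows whose value is <= the current row's value.
w = Window.orderBy('value').rangeBetween(Window.unboundedPreceding, Window.currentRow)
df_out = df.withColumn('count_less_equal', F.count('value').over(w))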

Airflow ModuleNotFoundError: No module named 'pyspark'

╄→гoц情女王★ Submitted on 2021-01-28 21:12:13
Question: I installed Airflow on my machine, which works well, and I also have a local Spark installation (which is operational too). I want to use Airflow to orchestrate two Spark tasks: task_spark_datatransform >> task_spark_model_reco. The two pyspark modules associated with these two tasks are tested and work well under Spark. I also created a very simple Airflow DAG using a BashOperator to run each Spark task. For example, for the task task_spark_datatransform I have: task_spark_datatransform = BashOperator(task
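
A common cause of this error is that bash_command runs the script with the worker's bare Python interpreter, which has no pyspark package installed. A hedged sketch of one workaround, launching the script through spark-submit so Spark supplies pyspark itself (the DAG name and script path are hypothetical; alternatively, pip-install pyspark into the environment Airflow runs in):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in Airflow 2.x

with DAG('spark_orchestration', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # spark-submit sets up the pyspark environment for the script,
    # so the Airflow worker's own Python does not need to import pyspark.
    task_spark_datatransform = BashOperator(
        task_id='task_spark_datatransform',
        bash_command='spark-submit /path/to/datatransform.py',
    )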

BigQuery connector ClassNotFoundException in PySpark on Dataproc

微笑、不失礼 Submitted on 2021-01-28 20:07:28
Question: I'm trying to run a script in PySpark, using Dataproc. The script is kind of a merge between this example and what I need to do, as I wanted to check that everything works. Obviously, it doesn't. The error I get is:

File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io
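
A ClassNotFoundException for com.google.cloud.hadoop.io... usually means the BigQuery Hadoop connector jar is not on the job's classpath; on Dataproc it is typically supplied at submission time (for example with --jars and the Google-hosted connector jar). A hedged pyspark-side sketch of the same idea, where the jar location and the use of spark.jars are assumptions rather than the thread's confirmed fix:

from pyspark.sql import SparkSession

# Assumed jar location (Google's public connector bucket); adjust to the version
# available for your cluster, or pass it with --jars when submitting the job.
spark = (
    SparkSession.builder
    .appName('bigquery-connector-example')
    .config('spark.jars', 'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar')
    .getOrCreate()
)
sc = spark.sparkContext  # the connector classes should then be resolvable by newAPIHadoopRDD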

Appending column name to column value using Spark

柔情痞子 Submitted on 2021-01-28 20:05:44
Question: I have data in a comma separated file, which I have loaded into a Spark data frame. The data looks like:

A B C
1 2 3
4 5 6
7 8 9

I want to transform the above data frame in Spark, using pyspark, into:

A   B   C
A_1 B_2 C_3
A_4 B_5 C_6
--------------

Then convert it to a list of lists using pyspark:

[[A_1, B_2, C_3], [A_4, B_5, C_6]]

And then run the FP-Growth algorithm using pyspark on the above data set. The code that I have tried is below:

from pyspark.sql.functions import col, size
from pyspark.sql
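
A hedged sketch of the transformation described above, assuming the file is already loaded as df with columns A, B and C; the FP-Growth step itself is omitted, but the final select produces the array column that pyspark.ml.fpm.FPGrowth expects.

from pyspark.sql import functions as F

# Prefix every value with its column name, e.g. 1 -> 'A_1'.
prefixed = df.select([
    F.concat(F.lit(c + '_'), F.col(c).cast('string')).alias(c)
    for c in df.columns
])

# List of lists, e.g. [['A_1', 'B_2', 'C_3'], ['A_4', 'B_5', 'C_6'], ...]
rows_as_lists = [list(r) for r in prefixed.collect()]

# Single array column, the shape FPGrowth's itemsCol expects.
items_df = prefixed.select(F.array(*prefixed.columns).alias('items'))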