pyspark

When submitting a job with pyspark, how do I access static files uploaded with the --files argument?

Submitted by 廉价感情 on 2020-02-16 11:31:06
Question: For example, I have a folder:

    /
    - test.py
    - test.yml

and the job is submitted to the Spark cluster with:

    gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file I uploaded:

    with open('test.yml') as test_file:
        logging.info(test_file.read())

but I got the following exception:

    IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded?

Answer 1: Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles.get.
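
A minimal sketch of that approach, assuming the file keeps its original name (test.yml) once shipped to the cluster:

    import logging

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext()

    # SparkFiles.get resolves the local path of any file shipped with
    # --files (or SparkContext.addFile) on the node where the code runs.
    with open(SparkFiles.get('test.yml')) as test_file:
        logging.info(test_file.read())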

PySpark: how to return the average of a column based on the value of another column?

Submitted by 孤者浪人 on 2020-02-16 10:39:08
Question: I wouldn't expect this to be difficult, but I'm having trouble understanding how to take the average of a column in my Spark dataframe. The dataframe looks like:

    +-------+------------+--------+------------------+
    |Private|Applications|Accepted|              Rate|
    +-------+------------+--------+------------------+
    |    Yes|         417|     349|0.8369304556354916|
    |    Yes|        1899|    1720|0.9057398630858347|
    |    Yes|        1732|    1425|0.8227482678983834|
    |    Yes|         494|     313|0.6336032388663968|
    |     No|        3540|    2001|0.5652542372881356|
    |     No|        7313
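
For a per-category average (for example, the mean Rate for each value of Private), a groupBy plus avg is the usual route. A sketch with a few illustrative rows standing in for the real dataframe:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for the dataframe shown above.
    df = spark.createDataFrame(
        [("Yes", 417, 349, 0.8369), ("Yes", 1899, 1720, 0.9057), ("No", 3540, 2001, 0.5653)],
        ["Private", "Applications", "Accepted", "Rate"],
    )

    # Average Rate per value of Private.
    df.groupBy("Private").agg(F.avg("Rate").alias("avg_rate")).show()

    # Average over the whole column, ignoring Private:
    # df.agg(F.avg("Rate")).collect()[0][0]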

Read ORC files directly from Spark shell

Submitted by 陌路散爱 on 2020-02-13 08:54:45
Question: I am having issues reading an ORC file directly from the Spark shell. Note: running Hadoop 1.2 and Spark 1.2, using the pyspark shell; I can also use spark-shell (which runs Scala). I have used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html

    from pyspark.sql import HiveContext
    hiveCtx = HiveContext(sc)
    inputRead = sc.hadoopFile("hdfs://user@server:/file_path", classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc
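
For reference, on newer Spark versions the DataFrame reader handles ORC directly (ORC support arrived in the reader around Spark 1.5, and spark.read works in 2.x and later); a minimal sketch with a hypothetical path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read an ORC file or directory straight into a DataFrame.
    df = spark.read.orc("hdfs://server/path/to/orc_data")  # hypothetical path
    df.printSchema()
    df.show(5)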

Reading a text file with multiple headers in Spark

Submitted by て烟熏妆下的殇ゞ on 2020-02-08 10:05:08
Question: I have a text file with multiple headers, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?

    STN--- WBAN YEARMODA TEMP
    010010 99999 20060101 33.5 23
    010010 99999 20060102 35.3 23
    010010 99999 20060103 34.4 24
    STN--- WBAN YEARMODA TEMP
    010010 99999 20060120 35.2 22
    010010 99999 20060121 32.2 21
    010010 99999 20060122 33.0 22

Answer 1: You can read the text file as a normal text file in an
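
Continuing the quoted idea, a sketch: read every line as plain text, drop the repeated header rows, then split the rest into columns (the file path and column names are assumed from the sample above).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Each line becomes a single "value" column.
    lines = spark.read.text("weather.txt")  # hypothetical path

    # Drop the repeated header rows.
    data = lines.filter(~F.col("value").startswith("STN---"))

    # Split on whitespace into the columns shown in the sample.
    parts = F.split(F.trim(F.col("value")), r"\s+")
    df = data.select(
        parts.getItem(0).alias("STN"),
        parts.getItem(1).alias("WBAN"),
        parts.getItem(2).alias("YEARMODA"),
        parts.getItem(3).cast("double").alias("TEMP"),
        parts.getItem(4).cast("int").alias("TEMP_RECORDINGS"),
    )
    df.show()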

'Column' object is not callable with Regex and Pyspark

Submitted by 删除回忆录丶 on 2020-02-08 09:59:26
Question: I need to extract only the integers from URL strings in the column "Page URL" and append those extracted integers to a new column. I am using PySpark. My code:

    from pyspark.sql.functions import col, regexp_extract
    spark_df_url.withColumn("new_column", regexp_extract(col("Page URL"), "\d+", 1).show())

I get the following error:

    TypeError: 'Column' object is not callable.

Answer 1: You may use

    spark_df_url.withColumn("new_column", regexp_extract("Page URL", "\d+", 0))

Specify the name of the
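
The error comes from calling .show() on the Column returned by regexp_extract rather than on the DataFrame, and group index 1 refers to a capture group that the pattern \d+ does not define. A sketch of the corrected call, with a couple of made-up rows standing in for spark_df_url:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_extract

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for spark_df_url.
    spark_df_url = spark.createDataFrame(
        [("https://example.com/page/42",), ("https://example.com/item/7",)],
        ["Page URL"],
    )

    # Group 0 is the whole match of \d+; call .show() on the DataFrame.
    result = spark_df_url.withColumn(
        "new_column", regexp_extract(col("Page URL"), r"\d+", 0)
    )
    result.show(truncate=False)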

Reshaping Spark RDD

Submitted by 社会主义新天地 on 2020-02-08 05:10:25
Question: I have a Spark RDD as follows:

    rdd = sc.parallelize([('X01','Y01'), ('X01','Y02'), ('X01','Y03'), ('X02','Y01'), ('X02','Y06')])

I would like to convert it into the following format:

    [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))]

Can someone help me achieve this using PySpark?

Answer 1: A simple groupByKey operation is what you need:

    rdd.groupByKey().mapValues(lambda x: tuple(x.data)).collect()

Result:

    [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]

Answer 2: Convert the RDD to
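
Answer 1 put together as a runnable snippet; tuple(x) works just as well as tuple(x.data) on the ResultIterable that groupByKey produces:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(
        [('X01', 'Y01'), ('X01', 'Y02'), ('X01', 'Y03'), ('X02', 'Y01'), ('X02', 'Y06')]
    )

    # groupByKey collects all values per key; mapValues turns each
    # ResultIterable into a plain tuple.
    reshaped = rdd.groupByKey().mapValues(tuple).collect()
    print(reshaped)
    # e.g. [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]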

How to convert a pyspark dataframe column to numpy array

Submitted by 一曲冷凌霜 on 2020-02-07 05:15:09
Question: I am trying to convert a PySpark dataframe column with approximately 90 million rows into a NumPy array. I need the array as input for the scipy.optimize.minimize function. I have tried both converting to pandas and using collect(), but these methods are very time consuming. I am new to PySpark; if there is a faster and better approach, please help. Thanks. This is how my dataframe looks:

    +----------+
    |Adolescent|
    +----------+
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
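
One commonly suggested route, sketched below rather than benchmarked: enable Arrow so toPandas() transfers data in columnar batches, then take the NumPy values from the resulting pandas Series. Whether this is fast enough for ~90 million rows still depends on driver memory.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Arrow-backed transfer (Spark 3.x config key shown; on Spark 2.3/2.4
    # the key is spark.sql.execution.arrow.enabled).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # Illustrative stand-in for the single-column dataframe shown above.
    df = spark.createDataFrame([(0.0,), (0.0,), (1.0,)], ["Adolescent"])

    arr = df.select("Adolescent").toPandas()["Adolescent"].to_numpy()
    print(type(arr), arr.dtype, arr.shape)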

Identify Partition Key Column from a table using PySpark

Submitted by 此生再无相见时 on 2020-02-05 03:46:09
Question: I need help finding the unique partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should return a list of the partition columns for the Hive table. It would be great if the result also included the datatypes of the partitioned columns. Any suggestions will be helpful.

Answer 1: It can be done using desc as shown below:

    df = spark.sql("""desc test_dev_db.partition_date_table""")
    >>> df.show(truncate=False)
    +-----------
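
Building on the quoted desc approach, a sketch that turns its output into the requested list: the rows after the "# Partition Information" marker describe the partition columns and their datatypes (the table name is taken from the quoted answer; the printed result is illustrative).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    rows = spark.sql("desc test_dev_db.partition_date_table").collect()

    # Rows after the "# Partition Information" marker (skipping the
    # "# col_name" header and blank rows) list the partition columns.
    partition_cols = []
    in_partition_block = False
    for row in rows:
        name = (row.col_name or "").strip()
        if name == "# Partition Information":
            in_partition_block = True
            continue
        if in_partition_block and name and not name.startswith("#"):
            partition_cols.append((name, row.data_type))

    print(partition_cols)  # e.g. [('partition_date', 'date')]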