pyspark

When submitting a job with pyspark, how do I access static files uploaded with the --files argument?

Submitted by 廉价感情 on 2020-02-16 11:31:06
Question: For example, I have a folder:

    /
    - test.py
    - test.yml

and the job is submitted to the Spark cluster with:

    gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file I uploaded:

    with open('test.yml') as test_file:
        logging.info(test_file.read())

but I got the following exception:

    IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded?

Answer 1: Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles.get.
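
A minimal sketch of that approach, assuming the file keeps its original name (test.yml) once shipped to the cluster:

    import logging

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext()

    # SparkFiles.get resolves the local path of any file shipped with
    # --files (or SparkContext.addFile) on the node where the code runs.
    with open(SparkFiles.get('test.yml')) as test_file:
        logging.info(test_file.read())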

PySpark: how to return the average of a column based on the value of another column?

Submitted by 孤者浪人 on 2020-02-16 10:39:08
Question: I wouldn't expect this to be difficult, but I'm having trouble understanding how to take the average of a column in my Spark dataframe. The dataframe looks like:

    +-------+------------+--------+------------------+
    |Private|Applications|Accepted|              Rate|
    +-------+------------+--------+------------------+
    |    Yes|         417|     349|0.8369304556354916|
    |    Yes|        1899|    1720|0.9057398630858347|
    |    Yes|        1732|    1425|0.8227482678983834|
    |    Yes|         494|     313|0.6336032388663968|
    |     No|        3540|    2001|0.5652542372881356|
    |     No|        7313
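
For a per-category average (for example, the mean Rate for each value of Private), a groupBy plus avg is the usual route. A sketch with a few illustrative rows standing in for the real dataframe:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for the dataframe shown above.
    df = spark.createDataFrame(
        [("Yes", 417, 349, 0.8369), ("Yes", 1899, 1720, 0.9057), ("No", 3540, 2001, 0.5653)],
        ["Private", "Applications", "Accepted", "Rate"],
    )

    # Average Rate per value of Private.
    df.groupBy("Private").agg(F.avg("Rate").alias("avg_rate")).show()

    # Average over the whole column, ignoring Private:
    # df.agg(F.avg("Rate")).collect()[0][0]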

Read ORC files directly from Spark shell

Submitted by 陌路散爱 on 2020-02-13 08:54:45
Question: I am having issues reading an ORC file directly from the Spark shell. Note: running Hadoop 1.2 and Spark 1.2, using the pyspark shell; I can also use spark-shell (which runs Scala). I have used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html

    from pyspark.sql import HiveContext
    hiveCtx = HiveContext(sc)
    inputRead = sc.hadoopFile("hdfs://user@server:/file_path", classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc
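
For reference, on newer Spark versions the DataFrame reader handles ORC directly (ORC support arrived in the reader around Spark 1.5, and spark.read works in 2.x and later); a minimal sketch with a hypothetical path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read an ORC file or directory straight into a DataFrame.
    df = spark.read.orc("hdfs://server/path/to/orc_data")  # hypothetical path
    df.printSchema()
    df.show(5)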

Reading a text file with multiple headers in Spark

Submitted by て烟熏妆下的殇ゞ on 2020-02-08 10:05:08
Question: I have a text file with multiple headers, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?

    STN--- WBAN YEARMODA TEMP
    010010 99999 20060101 33.5 23
    010010 99999 20060102 35.3 23
    010010 99999 20060103 34.4 24
    STN--- WBAN YEARMODA TEMP
    010010 99999 20060120 35.2 22
    010010 99999 20060121 32.2 21
    010010 99999 20060122 33.0 22

Answer 1: You can read the text file as a normal text file in an
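
Continuing the quoted idea, a sketch: read every line as plain text, drop the repeated header rows, then split the rest into columns (the file path and column names are assumed from the sample above).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Each line becomes a single "value" column.
    lines = spark.read.text("weather.txt")  # hypothetical path

    # Drop the repeated header rows.
    data = lines.filter(~F.col("value").startswith("STN---"))

    # Split on whitespace into the columns shown in the sample.
    parts = F.split(F.trim(F.col("value")), r"\s+")
    df = data.select(
        parts.getItem(0).alias("STN"),
        parts.getItem(1).alias("WBAN"),
        parts.getItem(2).alias("YEARMODA"),
        parts.getItem(3).cast("double").alias("TEMP"),
        parts.getItem(4).cast("int").alias("TEMP_RECORDINGS"),
    )
    df.show()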

'Column' object is not callable with Regex and Pyspark

Submitted by 删除回忆录丶 on 2020-02-08 09:59:26
Question: I need to extract only the integers from URL strings in the column "Page URL" and append those extracted integers to a new column. I am using PySpark. My code:

    from pyspark.sql.functions import col, regexp_extract
    spark_df_url.withColumn("new_column", regexp_extract(col("Page URL"), "\d+", 1).show())

I get the following error:

    TypeError: 'Column' object is not callable.

Answer 1: You may use

    spark_df_url.withColumn("new_column", regexp_extract("Page URL", "\d+", 0))

Specify the name of the
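
The error comes from calling .show() on the Column returned by regexp_extract rather than on the DataFrame, and group index 1 refers to a capture group that the pattern \d+ does not define. A sketch of the corrected call, with a couple of made-up rows standing in for spark_df_url:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_extract

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for spark_df_url.
    spark_df_url = spark.createDataFrame(
        [("https://example.com/page/42",), ("https://example.com/item/7",)],
        ["Page URL"],
    )

    # Group 0 is the whole match of \d+; call .show() on the DataFrame.
    result = spark_df_url.withColumn(
        "new_column", regexp_extract(col("Page URL"), r"\d+", 0)
    )
    result.show(truncate=False)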

Reshaping Spark RDD

Submitted by 社会主义新天地 on 2020-02-08 05:10:25
Question: I have a Spark RDD as follows:

    rdd = sc.parallelize([('X01','Y01'), ('X01','Y02'), ('X01','Y03'), ('X02','Y01'), ('X02','Y06')])

I would like to convert it into the following format:

    [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))]

Can someone help me achieve this using PySpark?

Answer 1: A simple groupByKey operation is what you need:

    rdd.groupByKey().mapValues(lambda x: tuple(x.data)).collect()

Result:

    [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]

Answer 2: Convert the RDD to
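
Answer 1 put together as a runnable snippet; tuple(x) works just as well as tuple(x.data) on the ResultIterable that groupByKey produces:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(
        [('X01', 'Y01'), ('X01', 'Y02'), ('X01', 'Y03'), ('X02', 'Y01'), ('X02', 'Y06')]
    )

    # groupByKey collects all values per key; mapValues turns each
    # ResultIterable into a plain tuple.
    reshaped = rdd.groupByKey().mapValues(tuple).collect()
    print(reshaped)
    # e.g. [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]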

How to convert a pyspark dataframe column to numpy array

Submitted by 一曲冷凌霜 on 2020-02-07 05:15:09
Question: I am trying to convert a PySpark dataframe column with approximately 90 million rows into a NumPy array. I need the array as input for the scipy.optimize.minimize function. I have tried both converting to pandas and using collect(), but these methods are very time consuming. I am new to PySpark; if there is a faster and better approach, please help. Thanks. This is how my dataframe looks:

    +----------+
    |Adolescent|
    +----------+
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
    |       0.0|
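
One commonly suggested route, sketched below rather than benchmarked: enable Arrow so toPandas() transfers data in columnar batches, then take the NumPy values from the resulting pandas Series. Whether this is fast enough for ~90 million rows still depends on driver memory.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Arrow-backed transfer (Spark 3.x config key shown; on Spark 2.3/2.4
    # the key is spark.sql.execution.arrow.enabled).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # Illustrative stand-in for the single-column dataframe shown above.
    df = spark.createDataFrame([(0.0,), (0.0,), (1.0,)], ["Adolescent"])

    arr = df.select("Adolescent").toPandas()["Adolescent"].to_numpy()
    print(type(arr), arr.dtype, arr.shape)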

Identify Partition Key Column from a table using PySpark

Submitted by 此生再无相见时 on 2020-02-05 03:46:09
Question: I need help finding the unique partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should return a list of the partition columns for the Hive table. It would be great if the result also included the datatypes of the partitioned columns. Any suggestions will be helpful.

Answer 1: It can be done using desc as shown below:

    df = spark.sql("""desc test_dev_db.partition_date_table""")
    >>> df.show(truncate=False)
    +-----------
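
Building on the quoted desc approach, a sketch that turns its output into the requested list: the rows after the "# Partition Information" marker describe the partition columns and their datatypes (the table name is taken from the quoted answer; the printed result is illustrative).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    rows = spark.sql("desc test_dev_db.partition_date_table").collect()

    # Rows after the "# Partition Information" marker (skipping the
    # "# col_name" header and blank rows) list the partition columns.
    partition_cols = []
    in_partition_block = False
    for row in rows:
        name = (row.col_name or "").strip()
        if name == "# Partition Information":
            in_partition_block = True
            continue
        if in_partition_block and name and not name.startswith("#"):
            partition_cols.append((name, row.data_type))

    print(partition_cols)  # e.g. [('partition_date', 'date')]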