pyspark

How to run spark-submit in virtualenv for pyspark?

Submitted by 喜欢而已 on 2021-02-11 12:21:57
Question: Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a Python file that uses Python 3 (and some specific libraries) in a virtualenv, to isolate library versions from the rest of the system. I would like to run this file with /bin/spark-submit, but attempting to do so I get:

[me@airflowetl tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                            ^
SyntaxError
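The question entry is cut off above. As a hedged illustration only (not the asker's or an accepted solution), one common way to avoid activating the virtualenv in the shell is to point PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON at the venv's interpreter when calling spark-submit; the paths below are taken from the question and are otherwise assumptions:

# Sketch: launch spark-submit with the virtualenv's interpreter without
# activating the venv in the parent shell. Paths are assumptions based on
# the question (../venv, sparksubmit.test.py, /bin/spark-submit).
import os
import subprocess

venv_python = os.path.abspath("../venv/bin/python")

env = os.environ.copy()
env["PYSPARK_PYTHON"] = venv_python         # interpreter used by executors
env["PYSPARK_DRIVER_PYTHON"] = venv_python  # interpreter used by the driver

subprocess.run(
    ["/bin/spark-submit", "sparksubmit.test.py"],
    env=env,
    check=True,
)

Not activating the venv also keeps helper scripts such as /bin/hdp-select resolving the system Python, which is likely what the SyntaxError above is about (a Python 2 script being run with the venv's Python 3).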

Groupby and collect_list maintaining order based on another column in PySpark

Submitted by 痴心易碎 on 2021-02-11 12:17:49
Question: I have a PySpark dataframe like this:

+----------+------------+------------+------------+
| Name     | dateCol1   | dateCol2   | dateCol3   |
+----------+------------+------------+------------+
| user1    | 2018-01-01 | 2018-01-10 | 2018-01-01 |
| user1    | 2018-01-11 | 2018-01-20 | 2018-01-01 |
| user2    | 2018-01-11 | 2018-01-20 | 2018-01-11 |
| user1    | 2019-01-21 | 2018-01-30 | 2018-01-01 |
+----------+------------+------------+------------+

I want to groupby this dataset on the keys dateCol1 and dateCol2 and
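The question text is truncated above, so the exact grouping keys are unclear. As a hedged sketch of the general technique for an ordered collect_list (collect structs that carry the ordering column, sort the array, then pull out the wanted field), using the sample data and assuming grouping by Name with ordering by dateCol1:

# Sketch: collect_list whose order follows another column, done by sorting
# an array of structs. Grouping key (Name) and ordering column (dateCol1)
# are assumptions; adjust to the actual requirement.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ordered-collect-list").getOrCreate()

df = spark.createDataFrame(
    [
        ("user1", "2018-01-01", "2018-01-10", "2018-01-01"),
        ("user1", "2018-01-11", "2018-01-20", "2018-01-01"),
        ("user2", "2018-01-11", "2018-01-20", "2018-01-11"),
        ("user1", "2019-01-21", "2018-01-30", "2018-01-01"),
    ],
    ["Name", "dateCol1", "dateCol2", "dateCol3"],
)

result = (
    df.groupBy("Name")
    # Each struct carries the ordering column first, so sort_array orders
    # the collected list by dateCol1.
    .agg(F.sort_array(F.collect_list(F.struct("dateCol1", "dateCol2"))).alias("pairs"))
    # Strip the ordering column, keeping only the values in that order.
    .withColumn("dateCol2_ordered", F.col("pairs").getField("dateCol2"))
    .drop("pairs")
)

result.show(truncate=False)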

SQL or Pyspark - Get the last time a column had a different value for each ID

Submitted by 爱⌒轻易说出口 on 2021-02-11 12:14:05
Question: I am using PySpark, so I have tried both PySpark code and SQL. I am trying to get the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:

+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
|  1|      1|      A|  10|
|  2|      1|      B|  15|
|  3|      1|      A|  20|
|  4|      1|      A|  40|
|  5|      1|      A|  45|
+---+-------+-------+----+

The correct new column I would like is as below:

+---+-------+-------+----+---------+
| ID|USER_ID
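The expected output table is cut off above. As a hedged sketch, under the assumption that "the last time ADDRESS had a different value" means, for each row, the most recent TIME at which ADDRESS changed within the USER_ID (ordered by TIME):

# Sketch: per USER_ID, ordered by TIME, carry forward the TIME of the most
# recent ADDRESS change. The exact expected semantics are an assumption
# because the question's sample output is truncated.
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("last-change-time").getOrCreate()

df = spark.createDataFrame(
    [(1, 1, "A", 10), (2, 1, "B", 15), (3, 1, "A", 20),
     (4, 1, "A", 40), (5, 1, "A", 45)],
    ["ID", "USER_ID", "ADDRESS", "TIME"],
)

w = Window.partitionBy("USER_ID").orderBy("TIME")

result = (
    df
    # Flag rows where ADDRESS differs from the previous row's ADDRESS.
    .withColumn("prev_addr", F.lag("ADDRESS").over(w))
    .withColumn(
        "change_time",
        F.when(F.col("ADDRESS") != F.col("prev_addr"), F.col("TIME")),
    )
    # Fill the latest change time forward within each USER_ID.
    .withColumn("last_change", F.last("change_time", ignorenulls=True).over(w))
    .drop("prev_addr", "change_time")
)

result.orderBy("ID").show()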

Write data from pyspark to azure blob?

Submitted by ℡╲_俬逩灬. on 2021-02-11 07:20:45
Question: I want to write a dataframe from PySpark to Azure Blob Storage. Any suggestions or code on how to do it? I have the location and key of the blob.

Answer 1: You could follow this tutorial to connect your Spark dataframe with Azure Blob Storage. Set the connection info:

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)

Then write the data into blob storage:

sdf.write.parquet(
    "wasbs://<container-name>@<storage
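The answer's code is cut off above. A fuller, hedged sketch of the same approach follows, where the storage account, container, key, output path, and the example dataframe are all placeholders (it also assumes the hadoop-azure and azure-storage jars are available to Spark):

# Sketch of the answer's approach with placeholder values; replace the
# account, container, key, and path with real ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-to-blob").getOrCreate()

storage_account = "<storage-account-name>"
container = "<container-name>"
access_key = "<your-storage-account-access-key>"

# Register the account key so the wasbs:// filesystem can authenticate.
spark.conf.set(
    "fs.azure.account.key.{}.blob.core.windows.net".format(storage_account),
    access_key,
)

sdf = spark.range(10)  # stand-in for the dataframe to be written

# Write the dataframe as parquet files into the blob container.
sdf.write.parquet(
    "wasbs://{}@{}.blob.core.windows.net/output/".format(container, storage_account)
)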

Create dataframe with schema provided as JSON file

Submitted by 戏子无情 on 2021-02-11 01:56:22
Question: How can I create a PySpark dataframe from 2 JSON files?

file1: this file has the complete data
file2: this file has only the schema of the file1 data

file1:
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}

file2:
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},
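The schema JSON is truncated above. As a hedged sketch, assuming file2 contains a complete StructType description (a list with one {"fields": [...]} object) and file1 holds one JSON record per line:

# Sketch: build a StructType from the schema JSON in file2 and use it to
# read file1. The exact shape of file2 is an assumption since it is
# truncated in the question.
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-from-json").getOrCreate()

# file2: the schema as JSON, e.g. [{"fields": [...]}]
with open("file2") as f:
    schema = StructType.fromJson(json.load(f)[0])

# file1: the data, one JSON object per line.
df = spark.read.schema(schema).json("file1")
df.show(truncate=False)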

Not able to set number of shuffle partition in pyspark

Submitted by ℡╲_俬逩灬. on 2021-02-10 19:57:54
Question: I know that by default the number of partitions for shuffle tasks is set to 200 in Spark, and I can't seem to change this. I'm running Jupyter with Spark 1.6. I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:

from pyspark.sql.functions import *

sqlContext.sql("set spark.sql.shuffle.partitions=10")
test = sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()

The output confirms 200 tasks. From the activity log, it's spinning up
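The question is cut off above. As a hedged sketch only ('some_table' and 'some_column' are placeholders, not names from the question), two ways the setting is commonly applied in Spark 1.6, plus a check that it only shows up in stages that actually shuffle:

# Sketch: setting spark.sql.shuffle.partitions in Spark 1.6 and checking
# where it takes effect. Table and column names are placeholders.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="shuffle-partitions")
sqlContext = HiveContext(sc)  # needed to read Hive tables in 1.6

# Either via SQL ...
sqlContext.sql("set spark.sql.shuffle.partitions=10")
# ... or via the conf API.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

# Confirm the value actually took.
print(sqlContext.getConf("spark.sql.shuffle.partitions", "200"))

# A plain table scan keeps the input's partitioning; the shuffle setting
# only shows up after an operation that shuffles, such as a groupBy.
test = sqlContext.table("some_table")
print(test.rdd.getNumPartitions())
print(test.groupBy("some_column").count().rdd.getNumPartitions())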