pyspark

How to run spark-submit in virtualenv for pyspark?

Submitted by 喜欢而已 on 2021-02-11 12:21:57
Question: Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a Python file that uses Python 3 (and some specific libraries) in a virtualenv, to isolate library versions from the rest of the system. I would like to run this file with /bin/spark-submit, but attempting to do so I get:

[me@airflowetl tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                            ^
SyntaxError
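The question entry is cut off above. As a hedged illustration only (not the asker's or an accepted solution), one common way to avoid activating the virtualenv in the shell is to point PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON at the venv's interpreter when calling spark-submit; the paths below are taken from the question and are otherwise assumptions:

# Sketch: launch spark-submit with the virtualenv's interpreter without
# activating the venv in the parent shell. Paths are assumptions based on
# the question (../venv, sparksubmit.test.py, /bin/spark-submit).
import os
import subprocess

venv_python = os.path.abspath("../venv/bin/python")

env = os.environ.copy()
env["PYSPARK_PYTHON"] = venv_python         # interpreter used by executors
env["PYSPARK_DRIVER_PYTHON"] = venv_python  # interpreter used by the driver

subprocess.run(
    ["/bin/spark-submit", "sparksubmit.test.py"],
    env=env,
    check=True,
)

Not activating the venv also keeps helper scripts such as /bin/hdp-select resolving the system Python, which is likely what the SyntaxError above is about (a Python 2 script being run with the venv's Python 3).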

Groupby and collect_list maintaining order based on another column in PySpark

Submitted by 痴心易碎 on 2021-02-11 12:17:49
Question: I have a PySpark dataframe like this:

+----------+------------+------------+------------+
| Name     | dateCol1   | dateCol2   | dateCol3   |
+----------+------------+------------+------------+
| user1    | 2018-01-01 | 2018-01-10 | 2018-01-01 |
| user1    | 2018-01-11 | 2018-01-20 | 2018-01-01 |
| user2    | 2018-01-11 | 2018-01-20 | 2018-01-11 |
| user1    | 2019-01-21 | 2018-01-30 | 2018-01-01 |
+----------+------------+------------+------------+

I want to groupby this dataset on the keys dateCol1 and dateCol2 and
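The question text is truncated above, so the exact grouping keys are unclear. As a hedged sketch of the general technique for an ordered collect_list (collect structs that carry the ordering column, sort the array, then pull out the wanted field), using the sample data and assuming grouping by Name with ordering by dateCol1:

# Sketch: collect_list whose order follows another column, done by sorting
# an array of structs. Grouping key (Name) and ordering column (dateCol1)
# are assumptions; adjust to the actual requirement.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ordered-collect-list").getOrCreate()

df = spark.createDataFrame(
    [
        ("user1", "2018-01-01", "2018-01-10", "2018-01-01"),
        ("user1", "2018-01-11", "2018-01-20", "2018-01-01"),
        ("user2", "2018-01-11", "2018-01-20", "2018-01-11"),
        ("user1", "2019-01-21", "2018-01-30", "2018-01-01"),
    ],
    ["Name", "dateCol1", "dateCol2", "dateCol3"],
)

result = (
    df.groupBy("Name")
    # Each struct carries the ordering column first, so sort_array orders
    # the collected list by dateCol1.
    .agg(F.sort_array(F.collect_list(F.struct("dateCol1", "dateCol2"))).alias("pairs"))
    # Strip the ordering column, keeping only the values in that order.
    .withColumn("dateCol2_ordered", F.col("pairs").getField("dateCol2"))
    .drop("pairs")
)

result.show(truncate=False)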

SQL or Pyspark - Get the last time a column had a different value for each ID

Submitted by 爱⌒轻易说出口 on 2021-02-11 12:14:05
Question: I am using PySpark, so I have tried both PySpark code and SQL. I am trying to get the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:

+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
|  1|      1|      A|  10|
|  2|      1|      B|  15|
|  3|      1|      A|  20|
|  4|      1|      A|  40|
|  5|      1|      A|  45|
+---+-------+-------+----+

The correct new column I would like is as below:

+---+-------+-------+----+---------+
| ID|USER_ID
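The expected output table is cut off above. As a hedged sketch, under the assumption that "the last time ADDRESS had a different value" means, for each row, the most recent TIME at which ADDRESS changed within the USER_ID (ordered by TIME):

# Sketch: per USER_ID, ordered by TIME, carry forward the TIME of the most
# recent ADDRESS change. The exact expected semantics are an assumption
# because the question's sample output is truncated.
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("last-change-time").getOrCreate()

df = spark.createDataFrame(
    [(1, 1, "A", 10), (2, 1, "B", 15), (3, 1, "A", 20),
     (4, 1, "A", 40), (5, 1, "A", 45)],
    ["ID", "USER_ID", "ADDRESS", "TIME"],
)

w = Window.partitionBy("USER_ID").orderBy("TIME")

result = (
    df
    # Flag rows where ADDRESS differs from the previous row's ADDRESS.
    .withColumn("prev_addr", F.lag("ADDRESS").over(w))
    .withColumn(
        "change_time",
        F.when(F.col("ADDRESS") != F.col("prev_addr"), F.col("TIME")),
    )
    # Fill the latest change time forward within each USER_ID.
    .withColumn("last_change", F.last("change_time", ignorenulls=True).over(w))
    .drop("prev_addr", "change_time")
)

result.orderBy("ID").show()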

Write data from pyspark to azure blob?

Submitted by ℡╲_俬逩灬. on 2021-02-11 07:20:45
Question: I want to write a dataframe from PySpark to Azure Blob Storage. Any suggestions or code on how to do it? I have the location and key of the blob.

Answer 1: You could follow this tutorial to connect your Spark dataframe with Azure Blob Storage. Set the connection info:

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)

Then write the data into blob storage:

sdf.write.parquet(
    "wasbs://<container-name>@<storage
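The answer's code is cut off above. A fuller, hedged sketch of the same approach follows, where the storage account, container, key, output path, and the example dataframe are all placeholders (it also assumes the hadoop-azure and azure-storage jars are available to Spark):

# Sketch of the answer's approach with placeholder values; replace the
# account, container, key, and path with real ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-to-blob").getOrCreate()

storage_account = "<storage-account-name>"
container = "<container-name>"
access_key = "<your-storage-account-access-key>"

# Register the account key so the wasbs:// filesystem can authenticate.
spark.conf.set(
    "fs.azure.account.key.{}.blob.core.windows.net".format(storage_account),
    access_key,
)

sdf = spark.range(10)  # stand-in for the dataframe to be written

# Write the dataframe as parquet files into the blob container.
sdf.write.parquet(
    "wasbs://{}@{}.blob.core.windows.net/output/".format(container, storage_account)
)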

Create dataframe with schema provided as JSON file

Submitted by 戏子无情 on 2021-02-11 01:56:22
Question: How can I create a PySpark dataframe from 2 JSON files?

file1: this file has the complete data
file2: this file has only the schema of the file1 data

file1:
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}

file2:
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},
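The schema JSON is truncated above. As a hedged sketch, assuming file2 contains a complete StructType description (a list with one {"fields": [...]} object) and file1 holds one JSON record per line:

# Sketch: build a StructType from the schema JSON in file2 and use it to
# read file1. The exact shape of file2 is an assumption since it is
# truncated in the question.
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-from-json").getOrCreate()

# file2: the schema as JSON, e.g. [{"fields": [...]}]
with open("file2") as f:
    schema = StructType.fromJson(json.load(f)[0])

# file1: the data, one JSON object per line.
df = spark.read.schema(schema).json("file1")
df.show(truncate=False)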

Not able to set number of shuffle partition in pyspark

Submitted by ℡╲_俬逩灬. on 2021-02-10 19:57:54
Question: I know that by default the number of partitions for shuffle tasks is set to 200 in Spark, and I can't seem to change this. I'm running Jupyter with Spark 1.6. I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:

from pyspark.sql.functions import *

sqlContext.sql("set spark.sql.shuffle.partitions=10")
test = sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()

The output confirms 200 tasks. From the activity log, it's spinning up
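The question is cut off above. As a hedged sketch only ('some_table' and 'some_column' are placeholders, not names from the question), two ways the setting is commonly applied in Spark 1.6, plus a check that it only shows up in stages that actually shuffle:

# Sketch: setting spark.sql.shuffle.partitions in Spark 1.6 and checking
# where it takes effect. Table and column names are placeholders.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="shuffle-partitions")
sqlContext = HiveContext(sc)  # needed to read Hive tables in 1.6

# Either via SQL ...
sqlContext.sql("set spark.sql.shuffle.partitions=10")
# ... or via the conf API.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

# Confirm the value actually took.
print(sqlContext.getConf("spark.sql.shuffle.partitions", "200"))

# A plain table scan keeps the input's partitioning; the shuffle setting
# only shows up after an operation that shuffles, such as a groupBy.
test = sqlContext.table("some_table")
print(test.rdd.getNumPartitions())
print(test.groupBy("some_column").count().rdd.getNumPartitions())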