pyspark

How to modify one column value in one row using PySpark

有些话、适合烂在心里 submitted on 2020-08-24 09:36:26
Question: I want to update a value when userid=22650984. How can I do this in PySpark? Thank you for helping.

>>> xxDF.select('userid','registration_time').filter('userid="22650984"').show(truncate=False)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 0.1 in stage 57.0 (TID 874, shopee-hadoop-slave89, executor 9): TaskKilled (killed intentionally)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 11.1 in stage 57.0 (TID 875, shopee-hadoop-slave97, executor 16): TaskKilled (killed intentionally)
+------
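There is no in-place update of a single cell in a Spark DataFrame; a common pattern is to rebuild the column with a conditional expression. A minimal sketch using when/otherwise, assuming registration_time is the column to change and the replacement literal is only a placeholder:

```python
from pyspark.sql import functions as F

# Replace registration_time only for the matching userid; every other row keeps
# its original value. The literal "2017-08-20 00:00:00" is a placeholder.
xxDF = xxDF.withColumn(
    "registration_time",
    F.when(F.col("userid") == "22650984", F.lit("2017-08-20 00:00:00"))
     .otherwise(F.col("registration_time")),
)
```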

Spark FileAlreadyExistsException on Stage Failure

爷,独闯天下 submitted on 2020-08-24 07:57:05
Question: I am trying to write a dataframe to an S3 location after re-partitioning, but whenever the write stage fails and Spark retries it, it throws FileAlreadyExistsException. When I re-submit the job, it works fine as long as Spark completes the stage in one attempt. Below is my code block:

df.repartition(<some-value>).write.format("orc").option("compression", "zlib").mode("Overwrite").save(path)

I believe Spark should remove the files from the failed stage before retrying. I understand this will be solved if we set
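The question is truncated at this point, so the configuration it goes on to name is unknown. The sketch below shows two settings that are commonly tried for this symptom, as an illustration only: disabling speculative execution and using FileOutputCommitter algorithm version 2, so retried task attempts are less likely to collide on already-written output files (df and path are as in the question, the partition count is a placeholder):

```python
from pyspark.sql import SparkSession

# Illustrative mitigations, not necessarily the ones the truncated question refers to.
spark = (
    SparkSession.builder
    .config("spark.speculation", "false")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

num_partitions = 200  # placeholder for <some-value> in the question
df.repartition(num_partitions).write \
    .format("orc") \
    .option("compression", "zlib") \
    .mode("overwrite") \
    .save(path)
```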

Why does pyspark fail with “Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.”?

陌路散爱 submitted on 2020-08-24 06:55:08
Question: I am using a standalone cluster of Apache Spark version 2.0.0 with two nodes, and I have not installed Hive. I am getting the following error when creating a dataframe:

from pyspark import SparkContext
from pyspark import SQLContext
sqlContext = SQLContext(sc)
l = [('Alice', 1)]
sqlContext.createDataFrame(l).collect()

---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-9-63bc4f21f23e> in <module>() -
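A minimal sketch of one way to avoid the Hive metastore lookup entirely when Hive is not installed: build the session with Spark's built-in in-memory catalog so no Hive jars are needed for plain DataFrame operations. This is an illustration of the idea, not necessarily the accepted fix for this exact error:

```python
from pyspark.sql import SparkSession

# Use the in-memory catalog instead of the Hive metastore.
spark = (
    SparkSession.builder
    .appName("no-hive")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()
)

spark.createDataFrame([('Alice', 1)]).collect()
```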

Pickling monkey-patched Keras model for use in PySpark

旧城冷巷雨未停 submitted on 2020-08-21 19:50:35
Question: The overall goal of what I am trying to achieve is sending a Keras model to each Spark worker so that I can use the model within a UDF applied to a column of a DataFrame. To do this, the Keras model will need to be picklable. It seems like a lot of people have had success pickling Keras models by monkey-patching the Model class, as shown by the link below: http://zachmoshe.com/2017/04/03/pickling-keras-models.html However, I have not seen any example of how to do this in tandem with Spark.
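A rough sketch of the monkey-patching idea from the linked post combined with Spark: the model is serialized to HDF5 bytes in __getstate__ and rebuilt in __setstate__, so a patched Model can travel inside a broadcast variable or a UDF closure. Function and variable names here are mine, and the patch has to be applied on the workers as well (e.g. in the module that defines the UDF) before unpickling:

```python
import tempfile
import keras

def make_keras_picklable():
    """Patch keras.models.Model so instances pickle via an HDF5 round-trip."""
    def __getstate__(self):
        with tempfile.NamedTemporaryFile(suffix=".h5", delete=True) as fd:
            keras.models.save_model(self, fd.name, overwrite=True)
            return {"model_bytes": fd.read()}

    def __setstate__(self, state):
        with tempfile.NamedTemporaryFile(suffix=".h5", delete=True) as fd:
            fd.write(state["model_bytes"])
            fd.flush()
            model = keras.models.load_model(fd.name)
        self.__dict__ = model.__dict__

    keras.models.Model.__getstate__ = __getstate__
    keras.models.Model.__setstate__ = __setstate__

# On the driver: patch first, then ship the model to the workers via a broadcast.
make_keras_picklable()
bc_model = spark.sparkContext.broadcast(trained_model)  # trained_model is assumed to exist
```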

Get date from two different timestamp formats in one pyspark dataframe [duplicate]

混江龙づ霸主 submitted on 2020-08-19 11:12:28
Question: This question already has an answer here: Cast column containing multiple string date formats to DateTime in Spark (1 answer). Closed 6 days ago.

I have a pyspark dataframe that has a timestamp field, but it contains two types of timestamp format (both are strings):

+------------------------+
|timestamp               |
+------------------------+
|06-06-2019,17:15:46     |
|2020-01-01T06:07:22.000Z

How can I create another "date" column in the same pyspark dataframe that captures only the
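A minimal sketch of the usual pattern for mixed string formats: try each format with to_date and coalesce the results; rows matching neither format come back null. The format patterns below are my reading of the two layouts shown and may need adjusting (date-pattern handling differs slightly between Spark 2.x and 3.x):

```python
from pyspark.sql import functions as F

df = df.withColumn(
    "date",
    F.coalesce(
        F.to_date("timestamp", "dd-MM-yyyy,HH:mm:ss"),         # 06-06-2019,17:15:46
        F.to_date("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"),  # 2020-01-01T06:07:22.000Z
    ),
)
```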

How to write PySpark dataframe to DynamoDB table?

此生再无相见时 submitted on 2020-08-19 10:50:27
Question: How can I write a PySpark dataframe to a DynamoDB table? I did not find much info on this. As per my requirement, I have to write a PySpark dataframe to a DynamoDB table; overall I need to read from and write to DynamoDB from my PySpark code. Thanks in advance.

Answer 1: Ram, there's no way to do that directly from pyspark. If you have pipeline software running, it can be done in a series of steps. Here is how it can be done: create a temporary Hive table like

CREATE TABLE TEMP( column1 type, column2 type...) STORED AS
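As an alternative to the Hive-based pipeline sketched in the (truncated) answer, one approach seen in practice is writing each partition directly with boto3's DynamoDB batch writer from the workers. This is a hedged sketch, not the answer above; the table name, region, and the assumption that every value is a DynamoDB-compatible type are placeholders:

```python
import boto3

def write_partition(rows):
    # One DynamoDB resource per partition; boto3 clients are not shareable across workers.
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my_table")
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row.asDict())  # assumes DynamoDB-compatible value types

df.foreachPartition(write_partition)
```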

How to set `spark.driver.memory` in client mode - pyspark (version 2.3.1)

[亡魂溺海] submitted on 2020-08-19 05:33:05
Question: I'm new to PySpark and I'm trying to use pySpark (ver 2.3.1) on my local computer with Jupyter Notebook. I want to set spark.driver.memory to 9 GB by doing this:

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("test") \
    .config("spark.driver.memory", "9g") \
    .getOrCreate()
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark.sparkContext._conf.getAll()  # check the config

It returns [('spark.driver.memory', '9g'), ('spark.driver.cores', '4')
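In client mode, spark.driver.memory cannot be set through SparkConf in the application itself, because the driver JVM has already started by then; the value shows up in the conf but the actual heap is unchanged. A minimal sketch of one workaround for a notebook session (setting the submit arguments before anything touches pyspark); using --driver-memory with spark-submit or spark-defaults.conf achieves the same thing:

```python
import os

# Must run before the first SparkSession/SparkContext is created in the notebook.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 9g pyspark-shell"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("test")
    .getOrCreate()
)
```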

Installing findspark in a virtual environment

*爱你&永不变心* submitted on 2020-08-11 18:47:07
Question: I am using pyenv to create a virtual environment. My pyenv packages are located under the project bio in /.pyenv/versions/bio/lib/python3.7/site-packages. I installed findspark using

pip install findspark  # it was installed successfully

and I can see the files findspark-1.4.2.dist-info and findspark.py in the packages directory. However, when I launch a Jupyter notebook from the pyenv directory, I get an error message:

import findspark
findspark.init()
ImportError: No module named
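A quick hedged check that often explains this symptom: the Jupyter kernel may be running a different Python than the bio environment where findspark was installed. Running the following inside the notebook shows which interpreter the kernel uses; if it is not under /.pyenv/versions/bio/, registering a kernel from that environment (for example with python -m ipykernel install --user --name bio) is the usual fix:

```python
import sys

# If this path is not inside /.pyenv/versions/bio/, the kernel is not using the
# environment where findspark was installed, so the import will fail.
print(sys.executable)
print([p for p in sys.path if "site-packages" in p])
```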
