pyspark

How to modify one column value in one row using PySpark

有些话、适合烂在心里 submitted on 2020-08-24 09:36:26
Question: I want to update a value when userid=22650984. How can I do this in PySpark? Thank you for helping.

>>> xxDF.select('userid','registration_time').filter('userid="22650984"').show(truncate=False)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 0.1 in stage 57.0 (TID 874, shopee-hadoop-slave89, executor 9): TaskKilled (killed intentionally)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 11.1 in stage 57.0 (TID 875, shopee-hadoop-slave97, executor 16): TaskKilled (killed intentionally)
+------
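There is no in-place update of a single cell in a Spark DataFrame; a common pattern is to rebuild the column with a conditional expression. A minimal sketch using when/otherwise, assuming registration_time is the column to change and the replacement literal is only a placeholder:

```python
from pyspark.sql import functions as F

# Replace registration_time only for the matching userid; every other row keeps
# its original value. The literal "2017-08-20 00:00:00" is a placeholder.
xxDF = xxDF.withColumn(
    "registration_time",
    F.when(F.col("userid") == "22650984", F.lit("2017-08-20 00:00:00"))
     .otherwise(F.col("registration_time")),
)
```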

Spark FileAlreadyExistsException on Stage Failure

爷,独闯天下 submitted on 2020-08-24 07:57:05
Question: I am trying to write a dataframe to an S3 location after re-partitioning, but whenever the write stage fails and Spark retries it, it throws FileAlreadyExistsException. When I re-submit the job, it works fine as long as Spark completes the stage in one attempt. Below is my code block:

df.repartition(<some-value>).write.format("orc").option("compression", "zlib").mode("Overwrite").save(path)

I believe Spark should remove the files from the failed stage before retrying. I understand this will be solved if we set
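The question is truncated at this point, so the configuration it goes on to name is unknown. The sketch below shows two settings that are commonly tried for this symptom, as an illustration only: disabling speculative execution and using FileOutputCommitter algorithm version 2, so retried task attempts are less likely to collide on already-written output files (df and path are as in the question, the partition count is a placeholder):

```python
from pyspark.sql import SparkSession

# Illustrative mitigations, not necessarily the ones the truncated question refers to.
spark = (
    SparkSession.builder
    .config("spark.speculation", "false")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

num_partitions = 200  # placeholder for <some-value> in the question
df.repartition(num_partitions).write \
    .format("orc") \
    .option("compression", "zlib") \
    .mode("overwrite") \
    .save(path)
```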

Why does pyspark fail with “Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.”?

陌路散爱 submitted on 2020-08-24 06:55:08
Question: I am using a standalone cluster of Apache Spark version 2.0.0 with two nodes, and I have not installed Hive. I am getting the following error when creating a dataframe:

from pyspark import SparkContext
from pyspark import SQLContext
sqlContext = SQLContext(sc)
l = [('Alice', 1)]
sqlContext.createDataFrame(l).collect()

---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-9-63bc4f21f23e> in <module>() -
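A minimal sketch of one way to avoid the Hive metastore lookup entirely when Hive is not installed: build the session with Spark's built-in in-memory catalog so no Hive jars are needed for plain DataFrame operations. This is an illustration of the idea, not necessarily the accepted fix for this exact error:

```python
from pyspark.sql import SparkSession

# Use the in-memory catalog instead of the Hive metastore.
spark = (
    SparkSession.builder
    .appName("no-hive")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()
)

spark.createDataFrame([('Alice', 1)]).collect()
```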

Pickling monkey-patched Keras model for use in PySpark

旧城冷巷雨未停 submitted on 2020-08-21 19:50:35
Question: The overall goal of what I am trying to achieve is sending a Keras model to each Spark worker so that I can use the model within a UDF applied to a column of a DataFrame. To do this, the Keras model will need to be picklable. It seems like a lot of people have had success pickling Keras models by monkey-patching the Model class, as shown by the link below: http://zachmoshe.com/2017/04/03/pickling-keras-models.html However, I have not seen any example of how to do this in tandem with Spark.
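A rough sketch of the monkey-patching idea from the linked post combined with Spark: the model is serialized to HDF5 bytes in __getstate__ and rebuilt in __setstate__, so a patched Model can travel inside a broadcast variable or a UDF closure. Function and variable names here are mine, and the patch has to be applied on the workers as well (e.g. in the module that defines the UDF) before unpickling:

```python
import tempfile
import keras

def make_keras_picklable():
    """Patch keras.models.Model so instances pickle via an HDF5 round-trip."""
    def __getstate__(self):
        with tempfile.NamedTemporaryFile(suffix=".h5", delete=True) as fd:
            keras.models.save_model(self, fd.name, overwrite=True)
            return {"model_bytes": fd.read()}

    def __setstate__(self, state):
        with tempfile.NamedTemporaryFile(suffix=".h5", delete=True) as fd:
            fd.write(state["model_bytes"])
            fd.flush()
            model = keras.models.load_model(fd.name)
        self.__dict__ = model.__dict__

    keras.models.Model.__getstate__ = __getstate__
    keras.models.Model.__setstate__ = __setstate__

# On the driver: patch first, then ship the model to the workers via a broadcast.
make_keras_picklable()
bc_model = spark.sparkContext.broadcast(trained_model)  # trained_model is assumed to exist
```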

Get date from two different timestamp formats in one pyspark dataframe [duplicate]

混江龙づ霸主 submitted on 2020-08-19 11:12:28
Question: This question already has an answer here: Cast column containing multiple string date formats to DateTime in Spark (1 answer). Closed 6 days ago.

I have a pyspark dataframe that has a timestamp field, but it contains two types of timestamp format (both are strings):

+------------------------+
|timestamp               |
+------------------------+
|06-06-2019,17:15:46     |
|2020-01-01T06:07:22.000Z

How can I create another "date" column in the same pyspark dataframe that captures only the
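A minimal sketch of the usual pattern for mixed string formats: try each format with to_date and coalesce the results; rows matching neither format come back null. The format patterns below are my reading of the two layouts shown and may need adjusting (date-pattern handling differs slightly between Spark 2.x and 3.x):

```python
from pyspark.sql import functions as F

df = df.withColumn(
    "date",
    F.coalesce(
        F.to_date("timestamp", "dd-MM-yyyy,HH:mm:ss"),         # 06-06-2019,17:15:46
        F.to_date("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"),  # 2020-01-01T06:07:22.000Z
    ),
)
```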

How to write PySpark dataframe to DynamoDB table?

此生再无相见时 submitted on 2020-08-19 10:50:27
Question: How can I write a PySpark dataframe to a DynamoDB table? I did not find much info on this. As per my requirement, I have to write a PySpark dataframe to a DynamoDB table; overall I need to read from and write to DynamoDB from my PySpark code. Thanks in advance.

Answer 1: Ram, there's no way to do that directly from pyspark. If you have pipeline software running, it can be done in a series of steps. Here is how it can be done: create a temporary Hive table like

CREATE TABLE TEMP( column1 type, column2 type...) STORED AS
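As an alternative to the Hive-based pipeline sketched in the (truncated) answer, one approach seen in practice is writing each partition directly with boto3's DynamoDB batch writer from the workers. This is a hedged sketch, not the answer above; the table name, region, and the assumption that every value is a DynamoDB-compatible type are placeholders:

```python
import boto3

def write_partition(rows):
    # One DynamoDB resource per partition; boto3 clients are not shareable across workers.
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my_table")
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row.asDict())  # assumes DynamoDB-compatible value types

df.foreachPartition(write_partition)
```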

How to set `spark.driver.memory` in client mode - pyspark (version 2.3.1)

[亡魂溺海] submitted on 2020-08-19 05:33:05
Question: I'm new to PySpark and I'm trying to use pySpark (ver 2.3.1) on my local computer with Jupyter Notebook. I want to set spark.driver.memory to 9 GB by doing this:

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("test") \
    .config("spark.driver.memory", "9g") \
    .getOrCreate()
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark.sparkContext._conf.getAll()  # check the config

It returns [('spark.driver.memory', '9g'), ('spark.driver.cores', '4')
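In client mode, spark.driver.memory cannot be set through SparkConf in the application itself, because the driver JVM has already started by then; the value shows up in the conf but the actual heap is unchanged. A minimal sketch of one workaround for a notebook session (setting the submit arguments before anything touches pyspark); using --driver-memory with spark-submit or spark-defaults.conf achieves the same thing:

```python
import os

# Must run before the first SparkSession/SparkContext is created in the notebook.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 9g pyspark-shell"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("test")
    .getOrCreate()
)
```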

Installing findspark in a virtual environment

*爱你&永不变心* submitted on 2020-08-11 18:47:07
Question: I am using pyenv to create a virtual environment. My pyenv packages are located under the project bio in /.pyenv/versions/bio/lib/python3.7/site-packages. I installed findspark using

pip install findspark  # it was installed successfully

and I can see the files findspark-1.4.2.dist-info and findspark.py in the packages directory. However, when I launch a Jupyter notebook from the pyenv directory, I get an error message:

import findspark
findspark.init()
ImportError: No module named
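A quick hedged check that often explains this symptom: the Jupyter kernel may be running a different Python than the bio environment where findspark was installed. Running the following inside the notebook shows which interpreter the kernel uses; if it is not under /.pyenv/versions/bio/, registering a kernel from that environment (for example with python -m ipykernel install --user --name bio) is the usual fix:

```python
import sys

# If this path is not inside /.pyenv/versions/bio/, the kernel is not using the
# environment where findspark was installed, so the import will fail.
print(sys.executable)
print([p for p in sys.path if "site-packages" in p])
```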
