pyspark

What is the correct way to sum different dataframe columns in a list in pyspark?

南笙酒味 submitted on 2020-01-01 11:57:30
Question: I want to sum different columns in a Spark dataframe.

Code:

    from pyspark.sql import functions as F

    cols = ["A.p1", "B.p1"]
    df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

    # 1. Works
    df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

    # 2. Doesn't work
    df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

    # 3. Doesn't work
    df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Why aren't approaches #2 and #3 working? I am on Spark
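Why only #1 works, with a minimal sketch that assumes the df defined above: Python's builtin sum folds the list of Column objects with "+", producing a single combined column expression. F.sum is an aggregate over the rows of one column and does not accept a Python list, and df.select(...) returns a DataFrame rather than a column expression, which is why #2 and #3 fail. The explicit equivalent of #1:

    from functools import reduce
    from operator import add

    # Builtin sum works because pyspark Column overloads "+", so the list of columns
    # is folded into one expression; reduce(add, ...) makes that folding explicit.
    cols = ["`A.p1`", "`B.p1`"]
    df = df.withColumn("sum1", reduce(add, [df[c] for c in cols]))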

Selecting values from non-null columns in a PySpark DataFrame

烂漫一生 submitted on 2020-01-01 11:48:35
Question: There is a PySpark dataframe with missing values:

    tbl = sc.parallelize([
        Row(first_name='Alice', last_name='Cooper'),
        Row(first_name='Prince', last_name=None),
        Row(first_name=None, last_name='Lenon')
    ]).toDF()
    tbl.show()

Here's the table:

    +----------+---------+
    |first_name|last_name|
    +----------+---------+
    |     Alice|   Cooper|
    |    Prince|     null|
    |      null|    Lenon|
    +----------+---------+

I would like to create a new column as follows: if the first name is None, take the last name; if the last name is None, take
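A minimal sketch of the standard approach, assuming the tbl above and that the goal is "take whichever value is non-null": F.coalesce returns the first non-null value among its arguments.

    from pyspark.sql import functions as F

    # Picks first_name when it is not null, otherwise falls back to last_name
    tbl = tbl.withColumn("name", F.coalesce(F.col("first_name"), F.col("last_name")))
    tbl.show()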

How to install libraries to python in zeppelin-spark2 in HDP

懵懂的女人 submitted on 2020-01-01 07:27:31
Question: I am using HDP version 2.6.4. Can you provide step-by-step instructions on how to install libraries into the following Python directory under spark2?

sc.version (the Spark version) returns:

    res0: String = 2.2.0.2.6.4.0-91

The spark2 interpreter name and value are as follows:

    zeppelin.pyspark.python: /usr/local/Python-3.4.8/bin/python3.4

The Python version and current libraries are:

    %spark2.pyspark
    import pip
    import sys
    sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed
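One common pattern, shown only as a hedged sketch rather than HDP-specific guidance: install packages with the same interpreter that zeppelin.pyspark.python points to, so they land in that interpreter's site-packages. The package name below is purely illustrative.

    import subprocess
    import sys

    # sys.executable is the interpreter the notebook cell is running
    # (e.g. /usr/local/Python-3.4.8/bin/python3.4), so "-m pip install"
    # puts the package where that interpreter can import it.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])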

How do I set up PySpark with Python 3 using spark-env.sh.template

这一生的挚爱 submitted on 2020-01-01 07:13:11
Question: Because I have this issue in my ipython3 notebook, I guess I have to change "spark-env.sh.template" somehow.

    Exception: Python in worker has different version 2.7 than that in driver 3.4, PySpark cannot run with different minor versions

Answer 1: Spark does not yet work with Python 3. If you wish to use the Python API you will also need a Python interpreter (version 2.6 or newer). I had the same issue when running IPYTHON=1 ./pyspark. OK, quick fix: edit the pyspark script with vim and change PYSPARK_DRIVER_PYTHON=
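For completeness, a hedged sketch of the usual fix on Spark versions that do support Python 3 (1.4 and later): point the driver and the workers at the same interpreter before the SparkContext starts, either in spark-env.sh or from Python. The python3 path below is an assumption.

    import os

    # Must run before the SparkContext/SparkSession is created; driver and workers
    # need the same minor Python version to avoid the mismatch exception above.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()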

How to do a mathematical operation on two columns in a dataframe using pyspark

拈花ヽ惹草 submitted on 2020-01-01 05:40:32
Question: I have a dataframe with three columns, "x", "y" and "z":

    x    y      z
    bn   12452  221
    mb   14521  330
    pl   12563  160
    lo   22516  142

I need to create another column derived by this formula: m = z / (y + z). So the new dataframe should look something like this:

    x    y      z    m
    bn   12452  221  .01743
    mb   14521  330  .02222
    pl   12563  160  .01257
    lo   22516  142  .00626

Answer 1:

    df = sqlContext.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330)], ['x', 'y', 'z'])
    df = df.withColumn('m', df['z'] / (df['y'] + df['z']))
    df.head(2)
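An equivalent sketch using column expressions; the rounding step is only an assumption to match the display precision in the question, not part of the formula.

    from pyspark.sql import functions as F

    # m = z / (y + z), rounded to 5 decimal places for display
    df = df.withColumn("m", F.round(F.col("z") / (F.col("y") + F.col("z")), 5))
    df.show()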

How to assign and use column headers in Spark?

烂漫一生 submitted on 2020-01-01 05:23:05
Question: I am reading a dataset as below:

    f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields and I want to assign column headers to each field to reference later in my script. How do I do that in PySpark? Is DataFrame the way to go here? PS: Newbie to Spark.

Answer 1: Here is how to add column names using a DataFrame. Assume your CSV has the delimiter ','. Prepare the data as follows before transferring it to a DataFrame:

    f = sc.textFile("s3://test/abc.csv")
    data_rdd = f.map(lambda line: [x for
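A shorter alternative sketch using the DataFrame CSV reader instead of sc.textFile. The column names below are hypothetical placeholders; in practice you would list one name per field (all 50+), and set header=True instead if the file already contains a header row.

    # Assumes no header row in the file; inferSchema guesses column types
    df = spark.read.csv("s3://test/abc.csv", header=False, inferSchema=True)

    # Rename the auto-generated _c0, _c1, ... columns to meaningful names
    df = df.toDF("field_1", "field_2", "field_3")  # hypothetical names, one per field
    df.select("field_1").show(5)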

Remove Empty Partitions from Spark RDD

若如初见. submitted on 2020-01-01 04:54:06
Question: I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the number of partitions based on the number of HDFS blocks. This leads to a large number of empty partitions, which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed. Is there any other way to go about this?

Answer 1: There isn't an easy way to
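One workaround sketch, assuming an RDD named rdd (this is an illustration of intent, not the answer's own method): count the records in each partition, then repartition to the number of non-empty partitions. Note that repartition shuffles the data, so this removes empty partitions at the cost of a full shuffle.

    # Per-partition record counts, computed without pulling the data to the driver
    counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    non_empty = sum(1 for c in counts if c > 0)

    # Reshuffle into exactly that many partitions (at least 1)
    rdd = rdd.repartition(max(non_empty, 1))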

Using spark-submit, what is the behavior of the --total-executor-cores option?

北城以北 submitted on 2020-01-01 04:28:05
Question: I am running a Spark cluster over C++ code wrapped in Python. I am currently testing different configurations of multi-threading options (at the Python level or the Spark level). I am using Spark with the standalone binaries, over an HDFS 2.5.4 cluster. The cluster is currently made of 10 slaves, with 4 cores each. From what I can see, by default Spark launches 4 slaves per node (I have 4 Python processes working on a slave node at a time). How can I limit this number? I can see that I have a --total-executor
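A hedged configuration sketch: in standalone mode, --total-executor-cores corresponds to spark.cores.max, the cap on cores across the whole application, while spark.executor.cores controls cores per executor; together they limit how many tasks, and hence Python processes, run concurrently on each node. The values below are illustrative.

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("core-limit-sketch")
        .set("spark.cores.max", "10")      # equivalent of --total-executor-cores 10 (standalone mode)
        .set("spark.executor.cores", "1")  # cores per executor, i.e. concurrent tasks per executor
    )
    sc = SparkContext(conf=conf)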

What is the difference between rowsBetween and rangeBetween?

别来无恙 submitted on 2020-01-01 03:58:14
Question: From the PySpark docs for rangeBetween:

    rangeBetween(start, end)
    Defines the frame boundaries, from start (inclusive) to end (inclusive). Both start and end are relative from the current row. For example, “0” means “current row”, while “-1” means one off before the current row, and “5” means the five off after the current row.
    Parameters:
        start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
        end – boundary end, inclusive. The frame is unbounded if this is
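A small sketch illustrating the difference, assuming an active SparkSession: rowsBetween frames are defined by physical row offsets, while rangeBetween frames are defined by the value of the ordering expression, so rows that tie on the ordering value always fall into the same range frame.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (1,), (2,), (3,)], ["v"])

    w_rows = Window.orderBy("v").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    w_range = Window.orderBy("v").rangeBetween(Window.unboundedPreceding, Window.currentRow)

    df.select(
        "v",
        F.sum("v").over(w_rows).alias("sum_rows"),    # sums the physical rows up to this one
        F.sum("v").over(w_range).alias("sum_range"),  # sums all rows with v <= current v
    ).show()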

pyspark approxQuantile function

為{幸葍}努か submitted on 2020-01-01 03:10:50
Question: I have a dataframe with these columns: id, price, timestamp. I would like to find the median value grouped by id. I am using this code to find it, but it's giving me this error:

    from pyspark.sql import DataFrameStatFunctions as statFunc

    windowSpec = Window.partitionBy("id")
    median = statFunc.approxQuantile("price", [0.5], 0) \
        .over(windowSpec)
    return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to fill values in a new column?

    TypeError: unbound method
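A hedged sketch of the usual workaround: approxQuantile is a DataFrame method that returns plain Python floats, not a Column, so it cannot be combined with .over(). A per-group approximate median can instead be computed with the SQL percentile_approx aggregate and joined back; the alias name below is illustrative.

    from pyspark.sql import functions as F

    # Whole-dataframe approximate median: returns a Python list like [42.0], not a Column
    overall_median = df.approxQuantile("price", [0.5], 0.25)[0]

    # Per-id approximate median via the percentile_approx aggregate, joined back as a column
    medians = df.groupBy("id").agg(F.expr("percentile_approx(price, 0.5)").alias("Median"))
    df = df.join(medians, on="id", how="left")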