pyspark

SparkSQL: conditional sum using two columns

和自甴很熟 submitted on 2019-12-21 05:35:31
Question: I hope you can help me with this. I have a DF as follows:

val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))

I would love to do a groupBy on prodId and aggregate 'value', summing it for ranges of dates
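The preview cuts off before the ranges are defined, but one common pattern for this kind of conditional sum is when/otherwise inside the aggregation. A minimal PySpark sketch (the question itself uses Scala, and the 30/60-day windows below are assumptions for illustration, not from the post):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", "2014-12-01", "2015-01-01", 100),
     (2, "a", "2014-12-01", "2015-01-02", 150),
     (3, "a", "2014-12-01", "2015-01-03", 120),
     (4, "b", "2015-12-15", "2015-01-01", 100)],
    ["id", "prodId", "dateIns", "dateTrans", "value"]
).withColumn("dateIns", F.to_date("dateIns")) \
 .withColumn("dateTrans", F.to_date("dateTrans"))

# Only add 'value' when dateTrans falls within N days of dateIns; otherwise contribute 0.
days = F.datediff("dateTrans", "dateIns")
result = df.groupBy("prodId").agg(
    F.sum(F.when(days <= 30, F.col("value")).otherwise(0)).alias("sum30"),
    F.sum(F.when(days <= 60, F.col("value")).otherwise(0)).alias("sum60"),
)
result.show()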

PySpark explode stringified array of dictionaries into rows

怎甘沉沦 submitted on 2019-12-21 05:25:11
Question: I have a pyspark dataframe with a StringType column ( edges ) which contains a list of dictionaries (see example below). The dictionaries contain a mix of value types, including another dictionary ( nodeIDs ). I need to explode the top-level dictionaries in the edges field into rows; ideally, I should then be able to convert their component values into separate fields. Input:

import findspark
findspark.init()
from pyspark.sql import SparkSession, Row

SPARK = SparkSession.builder.enableHiveSupport() \
    .getOrCreate()

data = [ Row(trace
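The example data is truncated above, but the usual route for a stringified array of dictionaries is from_json with an explicit schema followed by explode. A hedged sketch, in which the field names (weight, nodeIDs) and their types are guesses for illustration only:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; the real 'edges' strings are not shown in full above.
df = spark.createDataFrame(
    [('[{"weight": 1.0, "nodeIDs": {"src": "a", "dst": "b"}}]',)],
    ["edges"],
)

# Schema for one dictionary in the array; adjust to the actual keys and types.
edge_schema = T.ArrayType(T.StructType([
    T.StructField("weight", T.DoubleType()),
    T.StructField("nodeIDs", T.MapType(T.StringType(), T.StringType())),
]))

exploded = (df
    .withColumn("edge", F.explode(F.from_json("edges", edge_schema)))
    .select(F.col("edge.weight").alias("weight"),
            F.col("edge.nodeIDs").alias("nodeIDs")))
exploded.show(truncate=False)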

Spark Python Performance Tuning

て烟熏妆下的殇ゞ submitted on 2019-12-21 04:51:52
Question: I brought up an IPython notebook for Spark development using the command below:

ipython notebook --profile=pyspark

And I created a SparkContext sc using Python code like this:

import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

sconf = SparkConf()
conf =
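The snippet cuts off mid-setup. For context, a minimal sketch of how such a SparkConf/SparkContext is typically completed; the master URL and tuning values below are placeholders, not taken from the post:

from pyspark import SparkContext, SparkConf

# Placeholder values for illustration only; tune for the actual cluster.
sconf = SparkConf() \
    .setAppName("notebook") \
    .setMaster("yarn-client") \
    .set("spark.executor.memory", "4g") \
    .set("spark.executor.instances", "4")

sc = SparkContext(conf=sconf)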

How to calculate mean and standard deviation given a PySpark DataFrame?

不想你离开。 submitted on 2019-12-21 04:44:06
Question: I have a PySpark DataFrame (not pandas) called df that is too large to use collect() on. Therefore the code below is not efficient; it worked with a smaller amount of data, but now it fails.

import numpy as np

myList = df.collect()
total = []
for product, nb in myList:
    for p2, score in nb:
        total.append(score)
mean = np.mean(total)
std = np.std(total)

Is there any way to get mean and std as two variables by using pyspark.sql.functions or similar? from pyspark.sql.functions import
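A hedged sketch of the aggregation-side approach, assuming the nested scores can first be flattened with explode; the column and field names below follow the Python loop above and are assumptions about the real schema:

from pyspark.sql import functions as F

# Assuming df has columns (product, nb) where nb is an array of (p2, score) structs.
scores = df.select(F.explode("nb").alias("item")) \
           .select(F.col("item.score").alias("score"))

stats = scores.agg(
    F.mean("score").alias("mean"),
    F.stddev("score").alias("std"),
).first()

mean, std = stats["mean"], stats["std"]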

Read a bytes column in spark

谁都会走 submitted on 2019-12-21 04:33:10
Question: I have a data set which contains an ID field that is in an unknown (and not friendly) encoding. I can read the single column using plain Python and verify that the values are distinct and consistent across multiple data sets (i.e. it can be used as a primary key for joining). When loading the file using spark.read.csv, it seems that Spark is converting the column to UTF-8. However, some of the multibyte sequences are converted to the Unicode character U+FFFD REPLACEMENT CHARACTER. ( EF BF
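One workaround worth sketching: read the file with a single-byte encoding such as ISO-8859-1, which maps every byte to a distinct code point, so the ID survives round-trips and joins even though it is not human-readable. The path and column name below are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# ISO-8859-1 decodes any byte sequence losslessly (1 byte -> 1 code point),
# avoiding the U+FFFD substitutions introduced by UTF-8 decoding.
df = spark.read \
    .option("header", "true") \
    .option("encoding", "ISO-8859-1") \
    .csv("/path/to/data.csv")

# The ID can still be used for joins; if a readable form is needed, hex-encode it.
df = df.withColumn("id_hex", F.hex(F.encode(F.col("id"), "ISO-8859-1")))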

Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

久未见 submitted on 2019-12-21 04:02:24
Question: I am trying to use a Spark cluster (running on AWS EMR) to link groups of items that have common elements in them. Essentially, I have groups with some elements, and if some of the elements are in multiple groups, I want to make one group that contains the elements from all of those groups. I know about the GraphX library and I tried to use the graphframes package (ConnectedComponents algorithm) to resolve this task, but it seems that the graphframes package is not yet mature enough and is very wasteful
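One common cause of the per-iteration slowdown named in the title is the growing lineage of a DataFrame that is rebuilt in a loop. A hedged sketch of truncating that lineage with checkpointing; the loop body and checkpoint directory are stand-ins, not the original merging algorithm:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed path

df = spark.range(0, 1000).withColumnRenamed("id", "group")  # stand-in for the real groups DataFrame

def merge_step(frame):
    # Placeholder for the real per-iteration group-merging logic.
    return frame.withColumn("group", F.col("group") + 0)

for i in range(10):
    df = merge_step(df)
    if i % 3 == 0:
        # checkpoint() materializes the data and cuts the logical plan, so later
        # iterations do not replay every earlier transformation on the driver.
        df = df.checkpoint(eager=True)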

How can I set the default Spark logging level?

廉价感情. submitted on 2019-12-21 03:57:34
Question: I launch pyspark applications from PyCharm on my own workstation, to an 8-node cluster. This cluster also has settings encoded in spark-defaults.conf and spark-env.sh. This is how I obtain my spark context variable:

spark = SparkSession \
    .builder \
    .master("spark://stcpgrnlp06p.options-it.com:7087") \
    .appName(__SPARK_APP_NAME__) \
    .config("spark.executor.memory", "50g") \
    .config("spark.eventlog.enabled", "true") \
    .config("spark.eventlog.dir", r"/net/share/grid/bin/spark/UAT/SparkLogs/") \
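A hedged sketch of adjusting the level at runtime from the application side (the app name is a placeholder; this does not change the cluster-wide default, which lives in conf/log4j.properties):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("logging-level-demo")  # placeholder app name
         .getOrCreate())

# Applies to log output produced after the context exists; startup messages
# are still governed by the log4j configuration shipped with the cluster.
spark.sparkContext.setLogLevel("WARN")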

Pandas-style transform of grouped data on PySpark DataFrame

回眸只為那壹抹淺笑 submitted on 2019-12-21 03:57:05
Question: If we have a Pandas data frame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:

df["DemeanedValues"] = df.groupby("Category")["Values"].transform(lambda g: g - numpy.mean(g))

As far as I understand, Spark dataframes do not directly offer this group-by/transform operation (I am using PySpark on Spark 1.5.0). So, what is the best way to implement this computation? I have tried using a group-by/join as follows:

df2 = df
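A hedged sketch of the window-function route, which avoids the self-join; the column names follow the Pandas example above, and the sample rows are invented for illustration:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Small stand-in frame with the same column names as the Pandas example.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)],
    ["Category", "Values"],
)

# Subtract the per-category average computed over a partition-by window.
w = Window.partitionBy("Category")
demeaned = df.withColumn("DemeanedValues", F.col("Values") - F.avg("Values").over(w))
demeaned.show()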

Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark

依然范特西╮ submitted on 2019-12-21 03:55:23
Question: I am new to apache-spark. I have tested some applications in Spark standalone mode, but I want to run an application in YARN mode. I am running apache-spark 2.1.0 on Windows. Here is my code:

c:\spark>spark-submit2 --master yarn --deploy-mode client --executor-cores 4 --jars C:\DependencyJars\spark-streaming-eventhubs_2.11-2.0.3.jar,C:\DependencyJars\scalaj-http_2.11-2.3.0.jar,C:\DependencyJars\config-1.3.1.jar,C:\DependencyJars\commons-lang3-3.3.2.jar --conf spark.driver.userClasspathFirst=true --conf
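The error in the title usually means the process launching the application cannot see the YARN configuration directory. A hedged illustration of setting it from Python before creating a YARN-backed session; the path is a placeholder, and when using spark-submit itself the same variable has to be set in that shell's environment instead:

import os

# Placeholder path: point this at the directory holding yarn-site.xml / core-site.xml
# copied from the cluster.
os.environ["HADOOP_CONF_DIR"] = r"C:\hadoop\conf"
os.environ["YARN_CONF_DIR"] = r"C:\hadoop\conf"

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .getOrCreate())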

How does Spark interoperate with CPython

有些话、适合烂在心里 submitted on 2019-12-21 03:42:16
Question: I have an Akka system written in Scala that needs to call out to some Python code, relying on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.

Answer 1: The PySpark architecture is described here: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. As @Holden said, Spark uses Py4J to access Java objects in the JVM from Python. But this is
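As a concrete illustration of the worker-side CPython execution the answer refers to, a hedged sketch of a plain Python function (using NumPy) shipped to the workers as a UDF; the function and column names are invented for the example:

import numpy as np
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Ordinary CPython code: the driver pickles the function, and the JVM executors
# feed rows to python worker processes over sockets to evaluate it.
def log1p_score(x):
    return float(np.log1p(x))

log1p_udf = F.udf(log1p_score, T.DoubleType())

df = spark.range(1, 6).withColumn("score", log1p_udf(F.col("id")))
df.show()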