pyspark

Is there an API function to display “Fraction Cached” for an RDD?

落爺英雄遲暮 submitted on 2021-02-07 10:59:26
Question: On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached. How can I retrieve this percentage programmatically? I can use getStorageLevel() to get some information about RDD caching, but not Fraction Cached. Do I have to calculate it myself?

Answer 1: SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo, which provides information about: memory size, …
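For reference, there is no direct PySpark wrapper for getRDDStorageInfo, but the same "Fraction Cached" ratio can be computed through the JVM gateway. A minimal sketch, assuming the Scala SparkContext is reachable as sc._jsc.sc() and using RDDInfo's numCachedPartitions/numPartitions accessors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Example RDD so there is something on the Storage tab.
rdd = sc.parallelize(range(1000), 10).setName("demo_rdd").cache()
rdd.count()  # materialise the cache

# getRDDStorageInfo lives on the Scala SparkContext; each RDDInfo exposes
# the cached/total partition counts that back the "Fraction Cached" column.
for info in sc._jsc.sc().getRDDStorageInfo():
    fraction_cached = info.numCachedPartitions() / info.numPartitions()
    print(info.name(), f"{fraction_cached:.0%}")
```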

How to transform JSON string with multiple keys, from spark data frame rows in pyspark?

南笙酒味 submitted on 2021-02-07 10:53:15
Question: I'm looking for help parsing a JSON string with multiple keys into a JSON struct; see the required output. The answer below shows how to transform a JSON string with one Id: jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}' (see: How to parse and transform json string from spark data frame rows in pyspark). How can I transform thousands of Ids in jstr1, jstr2, when the number of Ids per JSON string changes in each string? Current code: jstr1 = """ {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [ …
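One way to handle a varying set of top-level ids is to parse the whole string as a map from key to array of structs and then explode it. A minimal sketch, assuming each key maps to an array of {a, b} records as in the snippet; column names such as json_str are illustrative, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import (MapType, StringType, ArrayType,
                               StructType, StructField, IntegerType)

spark = SparkSession.builder.getOrCreate()

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}]}'
df = spark.createDataFrame([(jstr1,)], ["json_str"])

# Treat the document as a map: key -> array of {a, b} structs,
# so it does not matter how many ids each string contains.
schema = MapType(
    StringType(),
    ArrayType(StructType([
        StructField("a", IntegerType()),
        StructField("b", IntegerType()),
    ])),
)

parsed = (
    df.withColumn("parsed", from_json(col("json_str"), schema))
      .select(explode(col("parsed")).alias("id", "records"))   # one row per id
      .select("id", explode(col("records")).alias("record"))   # one row per record
      .select("id", col("record.a").alias("a"), col("record.b").alias("b"))
)
parsed.show()
```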

Find all permutations of values in Spark RDD; python

末鹿安然 submitted on 2021-02-07 10:51:39
Question: I have a Spark RDD (myData) that has been mapped as a list. The output of myData.collect() yields the following: ['x', 'y', 'z']. What operation can I perform on myData to map to, or create, a new RDD containing a list of all permutations of xyz? For example, newData.collect() would output: ['xyz', 'xzy', 'zxy', 'zyx', 'yxz', 'yzx']. I've tried using variations of cartesian(myData), but as far as I can tell, the best that gives is different combinations of two-value pairs.

Answer 1: Doing this all in …
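The answer is cut off, but for a small set of elements one straightforward approach is to enumerate the permutations on the driver and parallelize the result. A minimal sketch, not necessarily the answerer's method:

```python
from itertools import permutations
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

myData = sc.parallelize(['x', 'y', 'z'])

# Collect the (small) set of values, enumerate permutations locally,
# then distribute the joined strings as a new RDD.
values = myData.collect()
newData = sc.parallelize([''.join(p) for p in permutations(values)])

print(newData.collect())  # e.g. ['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']
```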

Spark SQL to Hive table - Datetime Field Hours Bug

孤街浪徒 submitted on 2021-02-07 10:42:14
Question: I face this problem: when I insert data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00! Let me explain: I have a csv file that I read with spark.sql. I read the file, convert it to a dataframe and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The field in Hive that I want to insert it into is of Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala …
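A rough sketch of one way to rule out the usual suspects, assuming the shifted hours come from an implicit string-to-timestamp cast combined with a session time-zone offset (this is an assumption about the cause; the column and table names below are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col

spark = (SparkSession.builder
         .config("spark.sql.session.timeZone", "UTC")  # pin the session time zone
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.csv("/path/to/file.csv", header=True)

# Parse "3/10/2017" explicitly instead of relying on an implicit cast.
df = df.withColumn("event_ts", to_timestamp(col("event_date"), "M/d/yyyy"))

df.write.mode("append").insertInto("my_hive_table")
```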

How can I resolve “SparkException: Exception thrown in Future.get” issue?

时光怂恿深爱的人放手 submitted on 2021-02-07 09:00:20
Question: I'm working on two PySpark dataframes, doing a left-anti join on them to track everyday changes, and then sending an email. The first time I tried: diff = Table_a.join(Table_b, [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2], how='left_anti'). The expected output is a PySpark dataframe with some or no data. This diff dataframe gets its schema from Table_a. The first time I ran it, it showed no data, as expected, with the schema representation. From the next time onwards it just throws …
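The post is cut off before the stack trace, but "Exception thrown in Future.get" often surfaces when a broadcast exchange times out during a join. A hedged sketch of the usual mitigations, assuming that is the cause here (Table_a and Table_b are the dataframes from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Give broadcast exchanges more time to complete (the default is 300 seconds) ...
spark.conf.set("spark.sql.broadcastTimeout", "1200")

# ... or disable automatic broadcast joins so Spark falls back to a
# sort-merge join for the left_anti join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

diff = Table_a.join(
    Table_b,
    [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2],
    how="left_anti",
)
```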

Does Spark Dataframe have an equivalent of Pandas' merge indicator?

假装没事ソ submitted on 2021-02-07 08:17:51
Question: The Python Pandas library contains the following function:

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

The indicator field, combined with Pandas' value_counts() function, can be used to quickly determine how well a join performed. Example:

In [48]: df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
In [49]: df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': …
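Spark has no built-in indicator option, but the same information can be reconstructed with a full outer join and marker columns. A minimal sketch; the column names follow the pandas example above, while the _merge construction itself is an assumption, not a documented Spark feature:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, when, col

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(0, 'a'), (1, 'b')], ['col1', 'col_left'])
df2 = spark.createDataFrame([(1, 'x'), (2, 'y'), (2, 'z')], ['col1', 'col_right'])

# Tag each side before the join, then derive a pandas-style _merge column.
left = df1.withColumn('_in_left', lit(True))
right = df2.withColumn('_in_right', lit(True))

merged = (
    left.join(right, on='col1', how='full_outer')
        .withColumn(
            '_merge',
            when(col('_in_left') & col('_in_right'), 'both')
            .when(col('_in_left'), 'left_only')
            .otherwise('right_only'))
        .drop('_in_left', '_in_right')
)

merged.groupBy('_merge').count().show()  # analogue of value_counts() on the indicator
```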

Cannot save model using PySpark xgboost4j

女生的网名这么多〃 submitted on 2021-02-07 08:12:20
Question: I have a small PySpark program that uses xgboost4j and xgboost4j-spark in order to train a given dataset in Spark dataframe form. The training completes, but it seems I cannot save the model. Current library versions: PySpark 2.4.0, xgboost4j 0.90, xgboost4j-spark 0.90. Spark submit args:

os.environ['PYSPARK_SUBMIT_ARGS'] = "--py-files dist/DNA-0.0.2-py3.6.egg " \
    "--jars dna/resources/xgboost4j-spark-0.90.jar," \
    "dna/resources/xgboost4j-0.90.jar pyspark-shell"

The training process is as …
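The post is truncated before the failing save call. With 0.90-era unofficial PySpark wrappers around xgboost4j-spark, a workaround that is sometimes used is to persist through the underlying JVM model object. A rough sketch, assuming the fitted model is a pyspark.ml-style wrapper exposing its Java handle as model._java_obj; both calls below are assumptions about the question's custom wrapper, not a documented PySpark API:

```python
# `model` is the fitted xgboost4j-spark model produced by the wrapper used in
# the question; `_java_obj` is the py4j handle that pyspark.ml wrappers keep
# for the underlying Scala object.

# Option 1 (assumed): Spark ML persistence on the JVM side (HDFS/S3/local URI).
model._java_obj.write().overwrite().save("hdfs:///models/xgb_model")

# Option 2 (assumed): dump only the native booster in XGBoost's binary format.
model._java_obj.nativeBooster().saveModel("/tmp/xgb_model.bin")
```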