pyspark

Is there an API function to display “Fraction Cached” for an RDD?

落爺英雄遲暮 submitted on 2021-02-07 10:59:26
Question: On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is Fraction Cached. How can I retrieve this percentage programmatically? I can use getStorageLevel() to get some information about RDD caching, but not Fraction Cached. Do I have to calculate it myself?

Answer 1: SparkContext.getRDDStorageInfo is probably the thing you're looking for. It returns an Array of RDDInfo, which provides information about: memory size, …
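For reference, there is no direct PySpark wrapper for getRDDStorageInfo, but the same "Fraction Cached" ratio can be computed through the JVM gateway. A minimal sketch, assuming the Scala SparkContext is reachable as sc._jsc.sc() and using RDDInfo's numCachedPartitions/numPartitions accessors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Example RDD so there is something on the Storage tab.
rdd = sc.parallelize(range(1000), 10).setName("demo_rdd").cache()
rdd.count()  # materialise the cache

# getRDDStorageInfo lives on the Scala SparkContext; each RDDInfo exposes
# the cached/total partition counts that back the "Fraction Cached" column.
for info in sc._jsc.sc().getRDDStorageInfo():
    fraction_cached = info.numCachedPartitions() / info.numPartitions()
    print(info.name(), f"{fraction_cached:.0%}")
```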

How to transform JSON string with multiple keys, from spark data frame rows in pyspark?

南笙酒味 submitted on 2021-02-07 10:53:15
Question: I'm looking for help parsing a JSON string with multiple keys into a JSON struct; see the required output. The answer below shows how to transform a JSON string with one Id: jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}' (see: How to parse and transform json string from spark data frame rows in pyspark). How can I transform thousands of Ids in jstr1, jstr2, when the number of Ids per JSON string changes in each string? Current code: jstr1 = """ {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [ …
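One way to handle a varying set of top-level ids is to parse the whole string as a map from key to array of structs and then explode it. A minimal sketch, assuming each key maps to an array of {a, b} records as in the snippet; column names such as json_str are illustrative, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import (MapType, StringType, ArrayType,
                               StructType, StructField, IntegerType)

spark = SparkSession.builder.getOrCreate()

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], "id_2": [{"a": 5, "b": 6}]}'
df = spark.createDataFrame([(jstr1,)], ["json_str"])

# Treat the document as a map: key -> array of {a, b} structs,
# so it does not matter how many ids each string contains.
schema = MapType(
    StringType(),
    ArrayType(StructType([
        StructField("a", IntegerType()),
        StructField("b", IntegerType()),
    ])),
)

parsed = (
    df.withColumn("parsed", from_json(col("json_str"), schema))
      .select(explode(col("parsed")).alias("id", "records"))   # one row per id
      .select("id", explode(col("records")).alias("record"))   # one row per record
      .select("id", col("record.a").alias("a"), col("record.b").alias("b"))
)
parsed.show()
```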

Find all permutations of values in Spark RDD; python

末鹿安然 submitted on 2021-02-07 10:51:39
Question: I have a Spark RDD (myData) that has been mapped as a list. The output of myData.collect() yields the following: ['x', 'y', 'z']. What operation can I perform on myData to map to, or create, a new RDD containing a list of all permutations of xyz? For example, newData.collect() would output: ['xyz', 'xzy', 'zxy', 'zyx', 'yxz', 'yzx']. I've tried using variations of cartesian(myData), but as far as I can tell, the best that gives is different combinations of two-value pairs.

Answer 1: Doing this all in …
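The answer is cut off, but for a small set of elements one straightforward approach is to enumerate the permutations on the driver and parallelize the result. A minimal sketch, not necessarily the answerer's method:

```python
from itertools import permutations
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

myData = sc.parallelize(['x', 'y', 'z'])

# Collect the (small) set of values, enumerate permutations locally,
# then distribute the joined strings as a new RDD.
values = myData.collect()
newData = sc.parallelize([''.join(p) for p in permutations(values)])

print(newData.collect())  # e.g. ['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']
```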

Spark SQL to Hive table - Datetime Field Hours Bug

孤街浪徒 submitted on 2021-02-07 10:42:14
Question: I face this problem: when I insert data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00! Let me explain: I have a csv file that I read with spark.sql. I read the file, convert it to a dataframe and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The field in Hive that I want to insert it into is of Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala …
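A rough sketch of one way to rule out the usual suspects, assuming the shifted hours come from an implicit string-to-timestamp cast combined with a session time-zone offset (this is an assumption about the cause; the column and table names below are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col

spark = (SparkSession.builder
         .config("spark.sql.session.timeZone", "UTC")  # pin the session time zone
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.csv("/path/to/file.csv", header=True)

# Parse "3/10/2017" explicitly instead of relying on an implicit cast.
df = df.withColumn("event_ts", to_timestamp(col("event_date"), "M/d/yyyy"))

df.write.mode("append").insertInto("my_hive_table")
```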

How can I resolve “SparkException: Exception thrown in Future.get” issue?

时光怂恿深爱的人放手 submitted on 2021-02-07 09:00:20
Question: I'm working on two PySpark dataframes, doing a left-anti join on them to track everyday changes, and then sending an email. The first time I tried: diff = Table_a.join(Table_b, [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2], how='left_anti'). The expected output is a PySpark dataframe with some or no data. This diff dataframe gets its schema from Table_a. The first time I ran it, it showed no data, as expected, with the schema representation. From the next time onwards it just throws …
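The post is cut off before the stack trace, but "Exception thrown in Future.get" often surfaces when a broadcast exchange times out during a join. A hedged sketch of the usual mitigations, assuming that is the cause here (Table_a and Table_b are the dataframes from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Give broadcast exchanges more time to complete (the default is 300 seconds) ...
spark.conf.set("spark.sql.broadcastTimeout", "1200")

# ... or disable automatic broadcast joins so Spark falls back to a
# sort-merge join for the left_anti join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

diff = Table_a.join(
    Table_b,
    [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2],
    how="left_anti",
)
```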

Does Spark Dataframe have an equivalent of Pandas' merge indicator?

假装没事ソ submitted on 2021-02-07 08:17:51
Question: The Python Pandas library contains the following function:

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

The indicator field, combined with Pandas' value_counts() function, can be used to quickly determine how well a join performed. Example:

In [48]: df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
In [49]: df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': …
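Spark has no built-in indicator option, but the same information can be reconstructed with a full outer join and marker columns. A minimal sketch; the column names follow the pandas example above, while the _merge construction itself is an assumption, not a documented Spark feature:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, when, col

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(0, 'a'), (1, 'b')], ['col1', 'col_left'])
df2 = spark.createDataFrame([(1, 'x'), (2, 'y'), (2, 'z')], ['col1', 'col_right'])

# Tag each side before the join, then derive a pandas-style _merge column.
left = df1.withColumn('_in_left', lit(True))
right = df2.withColumn('_in_right', lit(True))

merged = (
    left.join(right, on='col1', how='full_outer')
        .withColumn(
            '_merge',
            when(col('_in_left') & col('_in_right'), 'both')
            .when(col('_in_left'), 'left_only')
            .otherwise('right_only'))
        .drop('_in_left', '_in_right')
)

merged.groupBy('_merge').count().show()  # analogue of value_counts() on the indicator
```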

Cannot save model using PySpark xgboost4j

女生的网名这么多〃 submitted on 2021-02-07 08:12:20
Question: I have a small PySpark program that uses xgboost4j and xgboost4j-spark in order to train a given dataset in Spark dataframe form. The training completes, but it seems I cannot save the model. Current library versions: PySpark 2.4.0, xgboost4j 0.90, xgboost4j-spark 0.90. Spark submit args:

os.environ['PYSPARK_SUBMIT_ARGS'] = "--py-files dist/DNA-0.0.2-py3.6.egg " \
    "--jars dna/resources/xgboost4j-spark-0.90.jar," \
    "dna/resources/xgboost4j-0.90.jar pyspark-shell"

The training process is as …
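The post is truncated before the failing save call. With 0.90-era unofficial PySpark wrappers around xgboost4j-spark, a workaround that is sometimes used is to persist through the underlying JVM model object. A rough sketch, assuming the fitted model is a pyspark.ml-style wrapper exposing its Java handle as model._java_obj; both calls below are assumptions about the question's custom wrapper, not a documented PySpark API:

```python
# `model` is the fitted xgboost4j-spark model produced by the wrapper used in
# the question; `_java_obj` is the py4j handle that pyspark.ml wrappers keep
# for the underlying Scala object.

# Option 1 (assumed): Spark ML persistence on the JVM side (HDFS/S3/local URI).
model._java_obj.write().overwrite().save("hdfs:///models/xgb_model")

# Option 2 (assumed): dump only the native booster in XGBoost's binary format.
model._java_obj.nativeBooster().saveModel("/tmp/xgb_model.bin")
```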