pyspark-sql

Spark ML Pipeline Causes java.lang.Exception: failed to compile … Code … grows beyond 64 KB

Submitted by 不打扰是莪最后的温柔 on 2019-12-01 17:53:27
Using Spark 2.0, I am trying to run a simple VectorAssembler in a pyspark ML pipeline, like so:

    feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                        outputCol="features")
    pipeline = Pipeline(stages=[feature_assembler])
    model = pipeline.fit(df_train)
    model_output = model.transform(df_train)

When I try to look at the output using model_output.select("features").show(1), I get the error:

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-95-7a3e3d4f281c> in <module>()
          2
          3
    ----> 4 model_output.select("features").show(1)

    /usr/local/spark20/python/pyspark/sql
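A hedged sketch of two common workarounds for the 64 KB codegen limit, neither taken from the excerpt above: disable whole-stage code generation, or truncate the plan lineage with a checkpoint before fitting. The checkpoint directory path is an illustrative assumption.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    # Workaround 1: keep the generated Java methods small
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    # Workaround 2: cut the logical-plan lineage before fitting
    # (DataFrame.checkpoint needs Spark >= 2.1; the directory is hypothetical)
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    df_train = df_train.checkpoint()

    feature_assembler = VectorAssembler(inputCols=["category_count", "name_count"],
                                        outputCol="features")
    model = Pipeline(stages=[feature_assembler]).fit(df_train)
    model.transform(df_train).select("features").show(1)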

Fill Pyspark dataframe column null values with average value from same column

Submitted by 走远了吗. on 2019-12-01 15:45:36
Question: With a dataframe like this,

    rdd_2 = sc.parallelize([(0, 10, 223, "201601"), (0, 10, 83, "2016032"),
                            (1, 20, None, "201602"), (1, 20, 3003, "201601"),
                            (1, 20, None, "201603"), (2, 40, 2321, "201601"),
                            (2, 30, 10, "201602"), (2, 61, None, "201601")])
    df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
    df_data.show()

    +---+----+----+-------+
    | id|type|cost|   date|
    +---+----+----+-------+
    |  0|  10| 223| 201601|
    |  0|  10|  83|2016032|
    |  1|  20|null| 201602|
    |  1|  20|3003| 201601|
    |  1|  20|null| 201603|
    |
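A minimal sketch of one way to do this, assuming the whole-column mean is wanted; a per-id variant is shown as an alternative:

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    # Fill nulls in 'cost' with the overall column average (avg ignores nulls)
    mean_cost = df_data.select(F.avg("cost")).first()[0]
    df_filled = df_data.withColumn("cost", F.coalesce(F.col("cost"), F.lit(mean_cost)))

    # Alternative: fill with the average of the rows sharing the same id
    w = Window.partitionBy("id")
    df_filled_by_id = df_data.withColumn(
        "cost", F.coalesce(F.col("cost"), F.avg("cost").over(w)))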

Filter array column content

Submitted by 99封情书 on 2019-12-01 12:31:28
Question: I am using pyspark 2.3.1 and would like to filter array elements with an expression and not using a udf:

    >>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])], ["col1", "col2", "col3"])
    >>> df.show()
    +----+----+---------------+
    |col1|col2|           col3|
    +----+----+---------------+
    |   1|   A|   [1, 2, 3, 4]|
    |   2|   B|[1, 2, 3, 4, 5]|
    +----+----+---------------+

The expression shown below is wrong; I wonder how to tell Spark to remove any values from the array in col3 which are
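The excerpt cuts off before stating the predicate, so the condition below (keep values greater than col1) is an assumption. On Spark 2.4+ the filter higher-order function avoids a udf; on 2.3.x, one udf-free workaround is explode and re-collect:

    from pyspark.sql import functions as F

    # Spark 2.4+: higher-order function; the predicate "x > col1" is illustrative
    df_24 = df.withColumn("col3", F.expr("filter(col3, x -> x > col1)"))

    # Spark 2.3.x: explode, filter, re-collect (rows whose array becomes empty are
    # dropped, and collect_list does not guarantee element order)
    df_23 = (df.withColumn("x", F.explode("col3"))
               .where(F.col("x") > F.col("col1"))
               .groupBy("col1", "col2")
               .agg(F.collect_list("x").alias("col3")))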

Does Apache Spark load the entire dataset from the target database?

Submitted by ☆樱花仙子☆ on 2019-12-01 10:49:05
Question: I want to use Apache Spark and connect to Vertica via JDBC. The Vertica database has 100 million records, and the Spark code runs on another server. When I run the query in Spark and monitor network usage, the traffic between the two servers is very high. It seems Spark loads all the data from the target server. This is my code:

    test_df = (spark.read.format("jdbc")
               .option("url", url).option("dbtable", "my_table")
               .option("user", "user").option("password", "pass").load())
    test_df.createOrReplaceTempView('tb')
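Without a pushed-down predicate the JDBC source does end up reading the whole table; a sketch of pushing the work down to Vertica instead (the subquery's columns, WHERE clause, and fetchsize value are illustrative assumptions):

    # Push a subquery down so Vertica, not Spark, does the filtering and projection
    pushdown_query = "(select id, amount from my_table where created >= '2019-01-01') as t"
    test_df = (spark.read.format("jdbc")
               .option("url", url)
               .option("dbtable", pushdown_query)
               .option("user", "user")
               .option("password", "pass")
               .option("fetchsize", 10000)   # rows fetched per round trip
               .load())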

Converting yyyymmdd to MM-dd-yyyy format in pyspark

Submitted by 拈花ヽ惹草 on 2019-12-01 10:45:45
I have a large data frame df containing a column for date in the format yyyymmdd; how can I convert it into MM-dd-yyyy in pySpark?

    from datetime import datetime
    from pyspark.sql.functions import col, udf, date_format
    from pyspark.sql.types import DateType

    rdd = sc.parallelize(['20161231', '20140102', '20151201', '20161124'])
    df1 = sqlContext.createDataFrame(rdd, ['old_col'])

    # UDF to convert string to date ('%Y%m%d': lower-case m is month, upper-case M would be minutes)
    func = udf(lambda x: datetime.strptime(x, '%Y%m%d'), DateType())
    df = df1.withColumn('new_col', date_format(func(col('old_col')), 'MM-dd-yyyy'))
    df.show()

This is also working:

    from datetime import
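A udf-free alternative, sketched with built-in functions only (applied to the same df1):

    from pyspark.sql import functions as F

    # Parse yyyyMMdd with unix_timestamp, then re-format as MM-dd-yyyy
    df_builtin = df1.withColumn(
        "new_col",
        F.date_format(F.from_unixtime(F.unix_timestamp("old_col", "yyyyMMdd")),
                      "MM-dd-yyyy"))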

Is there any pyspark function for adding a month, like DATE_ADD(date, month(int type))?

Submitted by 感情迁移 on 2019-12-01 08:16:09
Question: I am new to Spark. Is there any built-in function which will return the date one month after the current date? For example, if today is 27-12-2016, the function should return 27-01-2017. I have used date_add(), but found no function for adding a month. I have tried date_add(date, 31), but what if the month has 30 days?

    spark.sql("select date_add(current_date(), 31)").show()

Could anyone help me with this problem? Do I need to write a custom function for that, since I didn't find any built-in one? Thanks in advance.
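There is a built-in for this: add_months handles varying month lengths. A short sketch (the DataFrame name df is illustrative):

    from pyspark.sql import functions as F

    # SQL form: 2016-12-27 + 1 month -> 2017-01-27
    spark.sql("select add_months(current_date(), 1) as next_month").show()

    # DataFrame API form
    df = df.withColumn("next_month", F.add_months(F.current_date(), 1))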

How to cast string to ArrayType of dictionary (JSON) in PySpark

Submitted by 北慕城南 on 2019-12-01 07:06:23
Question: Trying to cast StringType to ArrayType of JSON for a dataframe generated from a CSV, using pyspark on Spark 2. The CSV file I am dealing with is as follows:

    date,attribute2,count,attribute3
    2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
    2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'

As shown above, it contains one attribute, "attribute3", as a literal string, which is
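A sketch of one way to parse it with from_json, assuming the single quotes around the JSON are literal characters left over from the CSV and that every element has the key/key2 shape shown above (an ArrayType schema for from_json may require a newer Spark 2.x release than plain structs):

    from pyspark.sql import functions as F
    from pyspark.sql.types import (ArrayType, StructType, StructField,
                                   StringType, IntegerType)

    schema = ArrayType(StructType([
        StructField("key", StringType()),
        StructField("key2", IntegerType()),
    ]))

    # Strip the wrapping single quotes, then parse the string into an array of structs
    df = df.withColumn(
        "attribute3",
        F.from_json(F.regexp_replace("attribute3", "^'|'$", ""), schema))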

How to implement auto increment in Spark SQL (PySpark)

Submitted by 丶灬走出姿态 on 2019-12-01 05:51:49
Question: I need to implement an auto-increment column in my Spark SQL table; how could I do that? Kindly guide me. I am using pyspark 2.0. Thank you, Kalyan

Answer 1: I would write/reuse a stateful Hive UDF and register it with pySpark, as Spark SQL has good support for Hive. Check the line @UDFType(deterministic = false, stateful = true) in the code below to make sure it's a stateful UDF.

    package org.apache.hadoop.hive.contrib.udf;

    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive
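The answer above relies on a stateful Hive UDF. For reference, a pure-PySpark sketch (not the answer's approach): monotonically_increasing_id gives unique but non-consecutive ids, while row_number over a window gives consecutive ids at the cost of a single-partition sort. The DataFrame and ordering column names are illustrative.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Unique, increasing, but NOT consecutive ids (fully distributed)
    df = df.withColumn("id", F.monotonically_increasing_id())

    # Consecutive 1, 2, 3, ... ids (the whole dataset is sorted through one partition)
    w = Window.orderBy("some_column")   # hypothetical ordering column
    df = df.withColumn("id", F.row_number().over(w))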

Date difference between consecutive rows - Pyspark Dataframe

Submitted by 拜拜、爱过 on 2019-12-01 04:35:54
I have a table with the following structure:

    USER_ID  Tweet_ID  Date
    1        1001      Thu Aug 05 19:11:39 +0000 2010
    1        6022      Mon Aug 09 17:51:19 +0000 2010
    1        1041      Sun Aug 19 11:10:09 +0000 2010
    2        9483      Mon Jan 11 10:51:23 +0000 2012
    2        4532      Fri May 21 11:11:11 +0000 2012
    3        4374      Sat Jul 10 03:21:23 +0000 2013
    3        4334      Sun Jul 11 04:53:13 +0000 2013

Basically, what I would like to do is have a PysparkSQL query that calculates the date difference (in seconds) for consecutive records with the same user_id number. The expected result would be:

    1  Sun Aug 19 11:10:09 +0000 2010 - Mon Aug 09 17:51:19 +0000 2010  839930
    1  Mon
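A sketch of one way to compute this with a window function, assuming the Date strings follow the Twitter-style format shown above and that the table is loaded into a DataFrame named df (an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    fmt = "EEE MMM dd HH:mm:ss Z yyyy"   # matches e.g. "Thu Aug 05 19:11:39 +0000 2010"
    w = Window.partitionBy("USER_ID").orderBy("ts")

    df_diff = (df.withColumn("ts", F.unix_timestamp("Date", fmt))
                 .withColumn("prev_ts", F.lag("ts").over(w))
                 .withColumn("diff_seconds", F.col("ts") - F.col("prev_ts")))
    # diff_seconds is null for each user's first tweet; later rows hold the gap in
    # seconds to the previous tweet (e.g. 839930 for the third row of user 1)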