pyspark-sql

Spark ML Pipeline Causes java.lang.Exception: failed to compile … Code … grows beyond 64 KB

Submitted by 不打扰是莪最后的温柔 on 2019-12-01 17:53:27
Using Spark 2.0, I am trying to run a simple VectorAssembler in a pyspark ML pipeline, like so:

    feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                        outputCol="features")
    pipeline = Pipeline(stages=[feature_assembler])
    model = pipeline.fit(df_train)
    model_output = model.transform(df_train)

When I try to look at the output using model_output.select("features").show(1), I get the error:

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-95-7a3e3d4f281c> in <module>()
          2
          3
    ----> 4 model_output.select("features").show(1)

    /usr/local/spark20/python/pyspark/sql
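A hedged sketch of two common workarounds for the 64 KB codegen limit, neither taken from the excerpt above: disable whole-stage code generation, or truncate the plan lineage with a checkpoint before fitting. The checkpoint directory path is an illustrative assumption.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    # Workaround 1: keep the generated Java methods small
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    # Workaround 2: cut the logical-plan lineage before fitting
    # (DataFrame.checkpoint needs Spark >= 2.1; the directory is hypothetical)
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    df_train = df_train.checkpoint()

    feature_assembler = VectorAssembler(inputCols=["category_count", "name_count"],
                                        outputCol="features")
    model = Pipeline(stages=[feature_assembler]).fit(df_train)
    model.transform(df_train).select("features").show(1)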

Fill Pyspark dataframe column null values with average value from same column

Submitted by 走远了吗. on 2019-12-01 15:45:36
Question: With a dataframe like this,

    rdd_2 = sc.parallelize([(0, 10, 223, "201601"), (0, 10, 83, "2016032"),
                            (1, 20, None, "201602"), (1, 20, 3003, "201601"),
                            (1, 20, None, "201603"), (2, 40, 2321, "201601"),
                            (2, 30, 10, "201602"), (2, 61, None, "201601")])
    df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
    df_data.show()

    +---+----+----+-------+
    | id|type|cost|   date|
    +---+----+----+-------+
    |  0|  10| 223| 201601|
    |  0|  10|  83|2016032|
    |  1|  20|null| 201602|
    |  1|  20|3003| 201601|
    |  1|  20|null| 201603|
    |
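A minimal sketch of one way to do this, assuming the whole-column mean is wanted; a per-id variant is shown as an alternative:

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    # Fill nulls in 'cost' with the overall column average (avg ignores nulls)
    mean_cost = df_data.select(F.avg("cost")).first()[0]
    df_filled = df_data.withColumn("cost", F.coalesce(F.col("cost"), F.lit(mean_cost)))

    # Alternative: fill with the average of the rows sharing the same id
    w = Window.partitionBy("id")
    df_filled_by_id = df_data.withColumn(
        "cost", F.coalesce(F.col("cost"), F.avg("cost").over(w)))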

Filter array column content

Submitted by 99封情书 on 2019-12-01 12:31:28
Question: I am using pyspark 2.3.1 and would like to filter array elements with an expression and not using a udf:

    >>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])], ["col1", "col2", "col3"])
    >>> df.show()
    +----+----+---------------+
    |col1|col2|           col3|
    +----+----+---------------+
    |   1|   A|   [1, 2, 3, 4]|
    |   2|   B|[1, 2, 3, 4, 5]|
    +----+----+---------------+

The expression shown below is wrong; I wonder how to tell Spark to remove any values from the array in col3 which are
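The excerpt cuts off before stating the predicate, so the condition below (keep values greater than col1) is an assumption. On Spark 2.4+ the filter higher-order function avoids a udf; on 2.3.x, one udf-free workaround is explode and re-collect:

    from pyspark.sql import functions as F

    # Spark 2.4+: higher-order function; the predicate "x > col1" is illustrative
    df_24 = df.withColumn("col3", F.expr("filter(col3, x -> x > col1)"))

    # Spark 2.3.x: explode, filter, re-collect (rows whose array becomes empty are
    # dropped, and collect_list does not guarantee element order)
    df_23 = (df.withColumn("x", F.explode("col3"))
               .where(F.col("x") > F.col("col1"))
               .groupBy("col1", "col2")
               .agg(F.collect_list("x").alias("col3")))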

Does Apache Spark load the entire dataset from the target database?

Submitted by ☆樱花仙子☆ on 2019-12-01 10:49:05
Question: I want to use Apache Spark and connect to Vertica via JDBC. The Vertica database has 100 million records, and the Spark code runs on another server. When I run the query in Spark and monitor network usage, the traffic between the two servers is very high. It seems Spark loads all the data from the target server. This is my code:

    test_df = (spark.read.format("jdbc")
               .option("url", url).option("dbtable", "my_table")
               .option("user", "user").option("password", "pass").load())
    test_df.createOrReplaceTempView('tb')
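Without a pushed-down predicate the JDBC source does end up reading the whole table; a sketch of pushing the work down to Vertica instead (the subquery's columns, WHERE clause, and fetchsize value are illustrative assumptions):

    # Push a subquery down so Vertica, not Spark, does the filtering and projection
    pushdown_query = "(select id, amount from my_table where created >= '2019-01-01') as t"
    test_df = (spark.read.format("jdbc")
               .option("url", url)
               .option("dbtable", pushdown_query)
               .option("user", "user")
               .option("password", "pass")
               .option("fetchsize", 10000)   # rows fetched per round trip
               .load())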

Converting yyyymmdd to MM-dd-yyyy format in pyspark

Submitted by 拈花ヽ惹草 on 2019-12-01 10:45:45
I have a large data frame df containing a column for date in the format yyyymmdd; how can I convert it into MM-dd-yyyy in pySpark?

    from datetime import datetime
    from pyspark.sql.functions import col, udf, date_format
    from pyspark.sql.types import DateType

    rdd = sc.parallelize(['20161231', '20140102', '20151201', '20161124'])
    df1 = sqlContext.createDataFrame(rdd, ['old_col'])

    # UDF to convert string to date ('%Y%m%d': lower-case m is month, upper-case M would be minutes)
    func = udf(lambda x: datetime.strptime(x, '%Y%m%d'), DateType())
    df = df1.withColumn('new_col', date_format(func(col('old_col')), 'MM-dd-yyyy'))
    df.show()

This is also working:

    from datetime import
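A udf-free alternative, sketched with built-in functions only (applied to the same df1):

    from pyspark.sql import functions as F

    # Parse yyyyMMdd with unix_timestamp, then re-format as MM-dd-yyyy
    df_builtin = df1.withColumn(
        "new_col",
        F.date_format(F.from_unixtime(F.unix_timestamp("old_col", "yyyyMMdd")),
                      "MM-dd-yyyy"))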

Is there any pyspark function for adding a month, like DATE_ADD(date, month(int type))?

Submitted by 感情迁移 on 2019-12-01 08:16:09
Question: I am new to Spark. Is there any built-in function which will return the date one month after the current date? For example, if today is 27-12-2016, the function should return 27-01-2017. I have used date_add(), but found no function for adding a month. I have tried date_add(date, 31), but what if the month has 30 days?

    spark.sql("select date_add(current_date(), 31)").show()

Could anyone help me with this problem? Do I need to write a custom function for that, since I didn't find any built-in one? Thanks in advance.
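There is a built-in for this: add_months handles varying month lengths. A short sketch (the DataFrame name df is illustrative):

    from pyspark.sql import functions as F

    # SQL form: 2016-12-27 + 1 month -> 2017-01-27
    spark.sql("select add_months(current_date(), 1) as next_month").show()

    # DataFrame API form
    df = df.withColumn("next_month", F.add_months(F.current_date(), 1))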

How to cast string to ArrayType of dictionary (JSON) in PySpark

Submitted by 北慕城南 on 2019-12-01 07:06:23
Question: Trying to cast StringType to ArrayType of JSON for a dataframe generated from a CSV, using pyspark on Spark 2. The CSV file I am dealing with is as follows:

    date,attribute2,count,attribute3
    2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
    2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'

As shown above, it contains one attribute, "attribute3", as a literal string, which is
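A sketch of one way to parse it with from_json, assuming the single quotes around the JSON are literal characters left over from the CSV and that every element has the key/key2 shape shown above (an ArrayType schema for from_json may require a newer Spark 2.x release than plain structs):

    from pyspark.sql import functions as F
    from pyspark.sql.types import (ArrayType, StructType, StructField,
                                   StringType, IntegerType)

    schema = ArrayType(StructType([
        StructField("key", StringType()),
        StructField("key2", IntegerType()),
    ]))

    # Strip the wrapping single quotes, then parse the string into an array of structs
    df = df.withColumn(
        "attribute3",
        F.from_json(F.regexp_replace("attribute3", "^'|'$", ""), schema))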

How to implement auto increment in Spark SQL (PySpark)

Submitted by 丶灬走出姿态 on 2019-12-01 05:51:49
Question: I need to implement an auto-increment column in my Spark SQL table; how could I do that? Kindly guide me. I am using pyspark 2.0. Thank you, Kalyan

Answer 1: I would write/reuse a stateful Hive UDF and register it with pySpark, as Spark SQL has good support for Hive. Check the line @UDFType(deterministic = false, stateful = true) in the code below to make sure it's a stateful UDF.

    package org.apache.hadoop.hive.contrib.udf;

    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive
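The answer above relies on a stateful Hive UDF. For reference, a pure-PySpark sketch (not the answer's approach): monotonically_increasing_id gives unique but non-consecutive ids, while row_number over a window gives consecutive ids at the cost of a single-partition sort. The DataFrame and ordering column names are illustrative.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Unique, increasing, but NOT consecutive ids (fully distributed)
    df = df.withColumn("id", F.monotonically_increasing_id())

    # Consecutive 1, 2, 3, ... ids (the whole dataset is sorted through one partition)
    w = Window.orderBy("some_column")   # hypothetical ordering column
    df = df.withColumn("id", F.row_number().over(w))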

Date difference between consecutive rows - Pyspark Dataframe

Submitted by 拜拜、爱过 on 2019-12-01 04:35:54
I have a table with the following structure:

    USER_ID  Tweet_ID  Date
    1        1001      Thu Aug 05 19:11:39 +0000 2010
    1        6022      Mon Aug 09 17:51:19 +0000 2010
    1        1041      Sun Aug 19 11:10:09 +0000 2010
    2        9483      Mon Jan 11 10:51:23 +0000 2012
    2        4532      Fri May 21 11:11:11 +0000 2012
    3        4374      Sat Jul 10 03:21:23 +0000 2013
    3        4334      Sun Jul 11 04:53:13 +0000 2013

Basically, what I would like to do is have a PysparkSQL query that calculates the date difference (in seconds) for consecutive records with the same user_id number. The expected result would be:

    1  Sun Aug 19 11:10:09 +0000 2010 - Mon Aug 09 17:51:19 +0000 2010  839930
    1  Mon
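A sketch of one way to compute this with a window function, assuming the Date strings follow the Twitter-style format shown above and that the table is loaded into a DataFrame named df (an assumption):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    fmt = "EEE MMM dd HH:mm:ss Z yyyy"   # matches e.g. "Thu Aug 05 19:11:39 +0000 2010"
    w = Window.partitionBy("USER_ID").orderBy("ts")

    df_diff = (df.withColumn("ts", F.unix_timestamp("Date", fmt))
                 .withColumn("prev_ts", F.lag("ts").over(w))
                 .withColumn("diff_seconds", F.col("ts") - F.col("prev_ts")))
    # diff_seconds is null for each user's first tweet; later rows hold the gap in
    # seconds to the previous tweet (e.g. 839930 for the third row of user 1)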