spark-dataframe

Pyspark Unsupported literal type class java.util.ArrayList [duplicate]

十年热恋 submitted on 2019-12-08 12:38:54
Question: This question already has answers here: Passing a data frame column and external list to udf under withColumn (3 answers). Closed last year.

I am using Python 3 on Spark 2.2.0. I want to apply my UDF to a specified list of strings.

df = ['Apps A', 'Chrome', 'BBM', 'Apps B', 'Skype']

def calc_app(app, app_list):
    browser_list = ['Chrome', 'Firefox', 'Opera']
    chat_list = ['WhatsApp', 'BBM', 'Skype']
    sum = 0
    for data in app:
        name = data['name']
        if name in app_list:
            sum += 1
    return sum

calc_appUDF
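For context, the "Unsupported literal type class java.util.ArrayList" error appears when a plain Python list is handed to a UDF call as if it were a Column. A minimal PySpark sketch of the closure-based workaround, using invented toy data and a simplified array column rather than the question's list of dicts:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Toy rows: each row carries an array of app names (simplified from the question's data).
df = spark.createDataFrame([(['Apps A', 'Chrome', 'BBM'],), (['Skype', 'Apps B'],)], ['apps'])

# The external list is bound inside the closure instead of being passed as a column argument,
# which is what would otherwise be turned into a java.util.ArrayList literal and rejected.
app_list = ['Chrome', 'Firefox', 'Opera', 'WhatsApp', 'BBM', 'Skype']
count_known = udf(lambda apps: sum(1 for name in apps if name in app_list), IntegerType())

df.withColumn('known_apps', count_known('apps')).show()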

How to get a list of files from Azure Blob using Spark/Scala?

可紊 submitted on 2019-12-08 12:31:00
Question: How do I get a list of files from Azure Blob Storage in Spark and Scala? I have no idea how to approach this.

Answer 1: I don't know whether the Spark you use runs on Azure or locally, so there are two cases, but they are similar. For Spark running locally, there is an official blog post that introduces how to access Azure Blob Storage from Spark. The key is that you need to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two jars hadoop-azure & azure-storage to your
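The answer above is cut off; as a rough PySpark sketch of the same setup (the account, container, and key below are placeholders, and the hadoop-azure and azure-storage jars still need to be on the classpath), the blobs can be listed through the Hadoop FileSystem API once the storage key is set on the Hadoop configuration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

account = "myaccount"          # placeholder storage account name
container = "mycontainer"      # placeholder container name
key = "<storage-account-key>"  # placeholder key

# Register the account key so WASB paths resolve as an HDFS-compatible filesystem.
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.%s.blob.core.windows.net" % account, key)

Path = sc._jvm.org.apache.hadoop.fs.Path
uri = "wasbs://%s@%s.blob.core.windows.net/" % (container, account)
fs = Path(uri).getFileSystem(sc._jsc.hadoopConfiguration())

# listStatus returns one FileStatus per blob/directory under the container root.
for status in fs.listStatus(Path(uri)):
    print(status.getPath().toString())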

Adding previous row with current row using Window function

匆匆过客 submitted on 2019-12-08 11:44:20
Question: I have a Spark dataframe where I want to calculate a running total based on the current row's Amount value and the previous rows' sum of Amount, per groupid and id. Let me lay out the df:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext
data1 = {'date': {0: '2018-04-03', 1: '2018-04-04', 2: '2018-04-05', 3: '2018-04-06', 4: '2018-04-07'},
         'id': {0: 'id1', 1: 'id2', 2: 'id1
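The snippet above is truncated; a minimal sketch of the usual running-total pattern with a window frame from the start of each (groupid, id) partition up to the current row, using invented toy data and the column names from the question:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", "id1", "2018-04-03", 10.0),
     ("g1", "id1", "2018-04-04", 5.0),
     ("g1", "id2", "2018-04-05", 7.0)],
    ["groupid", "id", "date", "Amount"])

# Sum every Amount from the first row of the partition through the current row.
w = (Window.partitionBy("groupid", "id")
           .orderBy("date")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("running_total", F.sum("Amount").over(w)).show()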

How to append keys to values for a {Key,Value} pair RDD, and how to convert it to an RDD? [duplicate]

青春壹個敷衍的年華 submitted on 2019-12-08 11:42:20
Question: This question already has answers here: Spark - Obtaining file name in RDDs (7 answers). Closed 2 years ago.

Suppose I have 2 files, file1 and file2, in the dataset directory:

val file = sc.wholeTextFiles("file:///root/data/dataset").map((x,y) => y + "," + x)

In the code above I am trying to get an RDD whose values combine value,key into a single value. Suppose the filename is file1 and it has 2 records:

file1:
1,30,ssr
2,43,svr

And file2:
1,30,psr
2,43,pvr

The desired RDD output is: (1,30,ssr,file1),
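A PySpark sketch of one way to get that shape (the question itself is Scala): wholeTextFiles yields (path, content) pairs, so each file's content can be split into lines and the file's base name appended to every record. The directory path is the one from the question; the exact output format is an assumption based on the desired (1,30,ssr,file1) example.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# kv[0] is the full file path, kv[1] is the whole file content.
rdd = (sc.wholeTextFiles("file:///root/data/dataset")
         .flatMap(lambda kv: [line + "," + os.path.basename(kv[0])
                              for line in kv[1].splitlines() if line.strip()]))
# Each element now looks like "1,30,ssr,file1".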

pyspark.sql.utils.IllegalArgumentException: u'Field “features” does not exist.'

安稳与你 submitted on 2019-12-08 11:35:36
Question: I am trying to run a Random Forest classifier and evaluate the model using cross-validation. I work with PySpark. The input CSV file is loaded as a Spark DataFrame. But I face an issue while constructing the model. Below is the code.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import
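The code above is cut off, but this error usually means the pipeline never produced a vector column named "features". A minimal sketch of the usual fix, adding a VectorAssembler stage before the classifier; the column names (f1, f2, label) and toy data are placeholders:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)], ["f1", "f2", "label"])

# Build the "features" vector column the classifier expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, rf]).fit(df)
model.transform(df).select("label", "prediction").show()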

Select latest timestamp record after a window operation for every group in the data with Spark Scala

懵懂的女人 submitted on 2019-12-08 11:33:22
Question: I ran a count of attempts by (user, app) over a time window of one day (86400 seconds). I want to extract the rows with the latest timestamp together with the count, and remove the unnecessary previous counts. Make sure your answer considers the time window. One user with one device can make multiple attempts a day or a week; I want to be able to retrieve those particular moments with the final count in every specific window. My initial dataset is like this:

val df = sc.parallelize(Seq(
  ("user1", "iphone", "2017-12-22 10:06
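The dataset above is truncated; a PySpark sketch of one way to read the requirement (the thread uses Scala), grouping by a one-day time window and keeping the final count together with the latest timestamp in each window. The rows and the ts column name are invented for illustration:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("user1", "iphone", "2017-12-22 10:06:18"),
     ("user1", "iphone", "2017-12-22 11:15:12"),
     ("user2", "android", "2017-12-23 09:58:23")],
    ["user", "app", "ts"]).withColumn("ts", F.to_timestamp("ts"))

# One row per (user, app) per 86400-second window: the attempt count plus the latest timestamp.
result = (df.groupBy("user", "app", F.window("ts", "86400 seconds").alias("day"))
            .agg(F.count("*").alias("attempts"), F.max("ts").alias("latest_ts")))
result.show(truncate=False)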

Spark Data Frames - Check if column is of type integer

扶醉桌前 submitted on 2019-12-08 10:52:18
Question: I am trying to figure out what data type a column in a Spark data frame is and manipulate the column based on that deduction. Here is what I have so far:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MyApp').getOrCreate()
df = spark.read.csv('Path To csv File', inferSchema=True, header=True)
for x in df.columns:
    if type(x) == 'integer':
        print(x + ": inside if loop")

The print(x + ": inside if loop") statement never seems to get executed, but I am sure
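The text above is cut off, but the observed behaviour follows from df.columns returning only column-name strings, so type(x) can never equal 'integer'. A minimal sketch of the common alternative using df.dtypes, with invented toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MyApp').getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["num", "txt"])

# df.dtypes pairs each column name with Spark's simple type name, e.g. ('num', 'bigint').
for name, dtype in df.dtypes:
    if dtype in ('int', 'bigint'):
        print(name + ": integer-typed column")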

Partition pyspark dataframe based on the change in column value

折月煮酒 submitted on 2019-12-08 07:54:39
Question: I have a dataframe in PySpark. Say it has some columns a, b, c... I want to group the data into groups as the value of a column changes. Say:

A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y

There will be 3 groups, (1x,1y), (0x,0y,0x), (1y,1x,1y), and the corresponding row data.

Answer 1: If I understand correctly, you want to create a distinct group every time column A changes value. First we'll create a monotonically increasing id to keep the row order as it is:

import pyspark.sql.functions as psf
df = sc.parallelize(
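The answer is truncated above; a sketch of where it is presumably heading, which is the standard change-point trick: keep the original order with a monotonically increasing id, flag every row where A differs from the previous A, and take a running sum of the flags as the group id. The toy data mirrors the question's table.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'x'), (1, 'y'), (0, 'x'), (0, 'y'), (0, 'x'), (1, 'y'), (1, 'x'), (1, 'y')],
    ['A', 'B']).withColumn('rid', psf.monotonically_increasing_id())

w = Window.orderBy('rid')
# 1 where A changed compared to the previous row, null on the first row, 0 otherwise.
df = df.withColumn('changed', (psf.col('A') != psf.lag('A').over(w)).cast('int'))
# A running sum of the change flags gives a distinct group id per run of equal A values.
df = df.withColumn('group', psf.sum(psf.coalesce(psf.col('changed'), psf.lit(0))).over(w))
df.show()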

Spark Dataframe count function and many more functions throw IndexOutOfBoundsException

核能气质少年 submitted on 2019-12-08 07:37:23
Question:
1) Initially filtered null values out of the RDD:
val rddWithOutNull2 = rddSlices.filter(x => x(0) != null)
2) Then converted this RDD to an RDD of Row.
3) After converting the RDD to a DataFrame using Scala:
val df = spark.createDataFrame(rddRow, schema)
df.printSchema()
Output:
root
 |-- name: string (nullable = false)
println(df.count())
Output: Error:
count : : [Stage 11:==================================> (3 + 2) / 5][error] o.a.s.e.Executor - Exception in task 4.0 in stage 11.0 (TID 16) java.lang
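The stack trace is cut off above. As a point of reference, here is a PySpark re-creation of the three steps with placeholder data (the question itself is in Scala); because Spark evaluates lazily, any mismatch between the rows and the declared schema only surfaces when an action such as count() runs, which is consistent with printSchema succeeding while count fails.

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd_slices = sc.parallelize([["alice"], [None], ["bob"]])          # placeholder data
rdd_without_null = rdd_slices.filter(lambda x: x[0] is not None)   # step 1
rdd_row = rdd_without_null.map(lambda x: Row(name=x[0]))           # step 2

schema = StructType([StructField("name", StringType(), False)])
df = spark.createDataFrame(rdd_row, schema)                        # step 3
df.printSchema()
print(df.count())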

Spark to read a big file as inputstream

拥有回忆 submitted on 2019-12-08 06:19:27
Question: I know Spark's built-in textFile method can partition a huge file, read it in chunks, and distribute it as an RDD. However, I am reading from a customized encrypted filesystem which Spark does not support natively. One way I can think of is to read an input stream instead, load multiple lines at a time, and distribute them to the executors, and keep reading until the whole file is loaded, so that no executor blows up with an out-of-memory error. Is it possible to do this in Spark?

Answer 1: You can try lines.take(n) for
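The answer is truncated above. A rough sketch of the chunked approach the question describes: read the stream a batch of lines at a time on the driver, parallelize each batch, and union the pieces into one RDD. read_encrypted_stream is a hypothetical stand-in for the custom filesystem API and the batch size is arbitrary.

from itertools import islice
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def read_encrypted_stream():
    # Hypothetical: yield decrypted lines one at a time from the custom filesystem.
    yield from ["line1", "line2", "line3"]

BATCH = 100000
rdds = []
lines = read_encrypted_stream()
while True:
    batch = list(islice(lines, BATCH))
    if not batch:
        break
    rdds.append(sc.parallelize(batch))

rdd = sc.union(rdds) if rdds else sc.emptyRDD()
print(rdd.count())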