pyspark

Using Spark to expand JSON string by rows and columns

倖福魔咒の submitted on 2020-04-30 05:45:50
Question: I'm new to Spark and to working with JSON, and I'm having trouble doing something fairly simple (I think). I've tried using parts of solutions to similar questions but can't quite get it right. I currently have a Spark dataframe with several columns representing variables. Each row is a unique combination of variable values. I then have a UDF that is applied to every row, which takes each of the columns as input, does some analysis, and outputs a summary table as a JSON string for each row, and
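
A minimal sketch of one way to do this, assuming the UDF's output is a JSON array of objects with a known, fixed schema; the column names group_col and summary_json and the fields metric and value below are hypothetical. The idea is to parse the string with from_json, explode the resulting array into rows, then select the struct fields into columns.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-in: one JSON summary string per row, as the UDF would produce
    df = spark.createDataFrame(
        [("a", '[{"metric": "mean", "value": 1.5}, {"metric": "std", "value": 0.3}]')],
        ["group_col", "summary_json"],
    )

    # Schema of the JSON payload, assumed to be known in advance
    summary_schema = ArrayType(StructType([
        StructField("metric", StringType()),
        StructField("value", DoubleType()),
    ]))

    expanded = (
        df.withColumn("summary", F.from_json("summary_json", summary_schema))  # string -> array<struct>
          .withColumn("summary", F.explode("summary"))                         # one row per array element
          .select("group_col", "summary.metric", "summary.value")              # one column per struct field
    )
    expanded.show()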

Hourly Aggregation in PySpark

假如想象 submitted on 2020-04-30 05:20:35
Question: I'm looking for a way to aggregate my data by hour. First, I want to keep only the hour part of my evtTime. My DataFrame looks like this: Row(access=u'WRITE', agentHost=u'xxxxxx50.haas.xxxxxx', cliIP=u'192.000.00.000', enforcer=u'ranger-acl', event_count=1, event_dur_ms=0, evtTime=u'2017-10-01 23:03:51.337', id=u'a43d824c-1e53-439b-b374-96b76bacf714', logType=u'RangerAudit', policy=699, reason=u'/project-h/xxxx/xxxx/warehouse/rocq.db/f_crcm_res_temps_retrait', repoType=1, reqUser=u'rocqphadm',
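
A minimal sketch of an hourly aggregation, assuming evtTime is a string in the format shown and that summing event_count per hour is the goal; the sample rows and the extra grouping by access are assumptions (requires Spark 2.3+ for date_trunc).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical subset of the audit rows, keeping only the columns needed here
    df = spark.createDataFrame(
        [("2017-10-01 23:03:51.337", "WRITE", 1),
         ("2017-10-01 23:45:12.004", "WRITE", 1),
         ("2017-10-02 00:02:09.120", "READ", 1)],
        ["evtTime", "access", "event_count"],
    )

    hourly = (
        df.withColumn("evtTime", F.to_timestamp("evtTime", "yyyy-MM-dd HH:mm:ss.SSS"))
          .withColumn("evtHour", F.date_trunc("hour", F.col("evtTime")))  # truncate timestamp to the hour
          .groupBy("evtHour", "access")
          .agg(F.sum("event_count").alias("event_count"))
          .orderBy("evtHour")
    )
    hourly.show(truncate=False)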

Pyspark alter column with substring

冷眼眸甩不掉的悲伤 submitted on 2020-04-29 12:13:32
Question: Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string. from pyspark.sql.functions import substring import pandas as pd pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']}) # this is what i'm looking for... pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1] df = sqlContext.createDataFrame(pdf) # following not working... COLUMN_NAME_fix is blank df.withColumn('COLUMN_NAME_fix',
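
A minimal sketch of the Spark equivalent of pandas' str[1:-1], using a SQL expression so the substring length can vary per row; the example data mirrors the snippet above.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("_string_",), ("_another string_",)], ["COLUMN_NAME"])

    # SQL substring() is 1-based, so dropping the first and last characters means
    # starting at position 2 and taking length(col) - 2 characters.
    df = df.withColumn(
        "COLUMN_NAME_fix",
        F.expr("substring(COLUMN_NAME, 2, length(COLUMN_NAME) - 2)"),
    )
    df.show(truncate=False)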

Spark Data Frame Random Splitting

青春壹個敷衍的年華 submitted on 2020-04-28 05:52:52
Question: I have a Spark data frame which I want to divide into train, validation, and test sets in the ratio 0.60, 0.20, 0.20. I used the following code for this: def data_split(x): global data_map_var d_map = data_map_var.value data_row = x.asDict() import random rand = random.uniform(0.0,1.0) ret_list = () if rand <= 0.6: ret_list = (data_row['TRANS'] , d_map[data_row['ITEM']] , data_row['Ratings'] , 'train') elif rand <=0.8: ret_list = (data_row['TRANS'] , d_map[data_row['ITEM']] , data_row['Ratings']
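
A minimal sketch of a simpler route than the row-by-row random assignment: DataFrame.randomSplit performs the 60/20/20 split directly. The resulting sizes are approximate, since each row is assigned independently at random, and the sample dataframe below is only a stand-in.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1000)  # stand-in for the real dataframe

    # Weights are normalized internally; a seed makes the split reproducible.
    train, validation, test = df.randomSplit([0.6, 0.2, 0.2], seed=42)
    print(train.count(), validation.count(), test.count())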

Filtering a pyspark dataframe using isin by exclusion [duplicate]

可紊 submitted on 2020-04-27 19:46:51
Question: This question already has answers here: Pyspark dataframe operator "IS NOT IN" (6 answers). Closed last year. I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion). As an example:

    df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')], schema=('id','bar'))

I get the data frame:

    +---+---+
    | id|bar|
    +---+---+
    |  1|  a|
    |  2|  b|
    |  3|  b|
    |  4|  c|
    |  5|  d|
    +---+---+

I only want to exclude rows where bar is ('a
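
A minimal sketch of the exclusion filter: negate isin() with ~. The preview above is cut off, so the exact exclusion list ['a', 'b'] used here is an assumption.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
        schema=('id', 'bar'),
    )

    # ~ negates the isin() condition, keeping only rows whose bar is NOT in the list
    excluded = ['a', 'b']
    df.filter(~F.col('bar').isin(excluded)).show()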

Pyspark retain only distinct (drop all duplicates)

放肆的年华 submitted on 2020-04-18 08:40:27
Question: After joining two dataframes (which each have their own IDs), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicated on either ID (so not retain even a single occurrence of a duplicate). I could group by the first ID, do a count and filter for count == 1, then repeat that for the second ID, then inner join these outputs back to the original joined dataframe, but this feels a bit long. Is there a simpler method like dropDuplicates() but where none of the
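
A minimal sketch of one way to do it without the group-count-join round trip, using window counts over each ID and keeping only rows where both counts equal 1; the column names id_a and id_b are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical joined dataframe with one ID column from each source
    df = spark.createDataFrame(
        [(1, 10), (1, 11), (2, 12), (3, 12), (4, 13)],
        ["id_a", "id_b"],
    )

    # Count how often each ID occurs, then keep rows where both IDs are unique,
    # i.e. drop every row that is duplicated on either ID (no occurrence kept).
    result = (
        df.withColumn("cnt_a", F.count("*").over(Window.partitionBy("id_a")))
          .withColumn("cnt_b", F.count("*").over(Window.partitionBy("id_b")))
          .filter((F.col("cnt_a") == 1) & (F.col("cnt_b") == 1))
          .drop("cnt_a", "cnt_b")
    )
    result.show()  # only the (4, 13) row survives in this example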