pyspark

PySpark: converting an RDD to a dict

Submitted by 孤人 on 2020-01-17 13:02:24
In day-to-day data processing you often need to build a dict. The dictionary data usually comes from one of two sources: a Hive table or a file on HDFS.

1. Read data from a Hive table and convert it to a dict

from pyspark import SparkContext
from pyspark.sql import HiveContext, SparkSession

sc = SparkContext()
sql_context = HiveContext(sc)
sql_data = sql_context.sql("SELECT key, value FROM db.table")
sql_data_rdd = sql_data.rdd.map(lambda x: (x[0], x[1]))
my_dict = sql_data_rdd.collectAsMap()

2. Read a file from HDFS and convert it to a dict

def map_2_dic(r):
    # r is one line of text
    filds = r.strip().split('\t')
    # filds[0] is the key, filds[1] is the value
    return filds[0], filds[1]

textRDD = sc.textFile("《your hdfs file path》")
my
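
The excerpt cuts off mid-line above; a minimal sketch of how the HDFS half would typically finish (assuming the tab-separated file and the map_2_dic helper shown above):

my_dict = textRDD.map(map_2_dic).collectAsMap()

collectAsMap() pulls the whole pair RDD back to the driver as a plain Python dict, so this is only safe when the dictionary fits in driver memory.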

How to maintain sort order in PySpark collect_list and collect multiple lists

Submitted by 余生颓废 on 2020-01-17 00:28:44
Question: I want to maintain the date sort order while using collect_list on multiple columns, all with the same date ordering. I need them in the same DataFrame so I can use it to build a time-series model input. Below is a sample of the "train_data". I'm using a Window with partitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Syscode_Stn')
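
A minimal sketch of one common way to keep the lists date-ordered (the DataFrame name df and the value columns col1 and col2 are assumptions, since the sample data is not shown in the truncated excerpt): order the window by the date column, make it span the whole partition, collect each column over that window, then keep one row per Syscode_Stn.

from pyspark.sql import functions as F
from pyspark.sql import Window

# window ordered by date; rowsBetween lets collect_list see the entire partition
w = (Window.partitionBy('Syscode_Stn')
           .orderBy('tuning_evnt_start_dt')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = (df
    .withColumn('col1_list', F.collect_list('col1').over(w))
    .withColumn('col2_list', F.collect_list('col2').over(w))
    .select('Syscode_Stn', 'col1_list', 'col2_list')
    .dropDuplicates(['Syscode_Stn']))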

PySpark Read CSV reading incorrectly

Submitted by 老子叫甜甜 on 2020-01-16 20:18:34
Question: I am trying to read a CSV file into a PySpark DataFrame. However, for some reason the PySpark CSV load methods are loading significantly more rows than expected. I have tried both the spark.read method and the spark.sql method for reading the CSV.

df = pd.read_csv("preprocessed_data.csv")
len(df)  # out: 318477

spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("preprocessed_data.csv")
spark_df.count()  # out: 6422020

df_test = spark.sql(
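
A row count that balloons like this is most often caused by quoted fields containing embedded newlines, which Spark's line-based CSV reader splits into extra rows unless told otherwise. A minimal sketch of the usual workaround (an assumption that the file uses standard double-quote quoting; the options are from the Spark 2.2+ CSV reader):

spark_df = (spark.read.format("csv")
    .option("header", "true")
    .option("multiLine", "true")   # keep quoted fields that span several lines in one row
    .option("quote", '"')
    .option("escape", '"')
    .load("preprocessed_data.csv"))
spark_df.count()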

change values of structure dataframe

Submitted by 风流意气都作罢 on 2020-01-16 19:37:06
Question: I want to fill a struct field from another existing struct: A11 of my data1 should get the value of x1.f2. I tried different approaches and didn't succeed. Does anyone have an idea?

schema = StructType([
    StructField('data1', StructType([
        StructField('A1', StructType([
            StructField('A11', StringType(), True),
            StructField('A12', IntegerType(), True)
        ])),
        StructField('A2', IntegerType(), True)
    ]))
])
df = sqlCtx.createDataFrame([], schema)

# Creation of df1
schema1 = StructType([
    StructField('x1',
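
Spark columns (including struct fields) are immutable, so the usual pattern is to rebuild the struct with the new value in place. A minimal sketch, assuming both columns end up in one DataFrame (for example after a join, which the truncated excerpt does not show) and that x1 carries a field f2:

from pyspark.sql import functions as F

updated = df.withColumn(
    'data1',
    F.struct(
        F.struct(
            F.col('x1.f2').alias('A11'),          # A11 now takes the value of x1.f2
            F.col('data1.A1.A12').alias('A12')    # keep the remaining fields unchanged
        ).alias('A1'),
        F.col('data1.A2').alias('A2')
    )
)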

display DataFrame when using pyspark aws glue

Submitted by 自古美人都是妖i on 2020-01-16 19:34:29
Question: How can I show a DataFrame from an AWS Glue ETL job? I tried df.show() with the code below, but it doesn't display anything.

Code:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "flux-test", table_name = "tab1", transformation_ctx = "datasource0")
sourcedf = ApplyMapping.apply(frame = datasource0, mappings = [("id", "long", "id", "long"), ("Rd.Id_Releve", "string", "Rd.Id_R", "string")])
sourcedf = sourcedf.toDF()
data = []
schema = StructType([
    StructField('PM', StructType([
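
One thing worth checking (a hedged note, not taken from the original post): in a Glue ETL job, whatever show() prints goes to the job run's CloudWatch output logs rather than to the Glue console, so the output may simply be sitting in the log stream. A minimal sketch of printing what was read from the catalog:

df = datasource0.toDF()          # convert the DynamicFrame to a Spark DataFrame
df.printSchema()
df.show(20, truncate=False)      # appears in the job's CloudWatch output logs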

How to find highly similar observations in another dataset using Spark

Submitted by 爱⌒轻易说出口 on 2020-01-16 18:23:50
Question: I have two CSV files.

File 1:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2:

PID,FNAME,MNAME,LNAME,GENDER,DOB,FNAMELNAMEMNAMEGENDERDOB
S2,66M,J,Rock,F
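
The excerpt cuts off before the matching criteria are described, so the following is only a hedged sketch of one way to "find highly similar observations": cross-join the two files and score each pair with Levenshtein distance over the concatenated name/gender/DOB string (the file paths and the threshold of 3 are assumptions):

from pyspark.sql import functions as F

df1 = spark.read.csv("file1.csv", header=True)
df2 = spark.read.csv("file2.csv", header=True)

# Build the same concatenated key that File 2 already carries, then compare every pair.
df1 = df1.withColumn("key1", F.concat_ws("", "FNAME", "LNAME", "MNAME", "GENDER", "DOB"))
pairs = (df1.crossJoin(df2)
            .withColumn("dist", F.levenshtein("key1", "FNAMELNAMEMNAMEGENDERDOB"))
            .filter(F.col("dist") <= 3)
            .select("D", "PID", "dist"))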

Is there an inbuilt function to compare RDDs on specific criteria, or is it better to write a UDF

Submitted by 自作多情 on 2020-01-16 13:13:13
Question: How do I count, for each element of a child RDD, how many times it occurs in the parent RDD? Say I have two RDDs:

Parent RDD - ['2 3 5'] ['4 5 7'] ['5 4 2 3']
Child RDD - ['2 3', '5 3', '4 7', '5 7', '5 3', '2 3']

I need something like:

[['2 3', 2], ['5 3', 2], ['4 7', 1], ['5 7', 1], ['5 3', 2] ...]

It's actually finding the frequent-itemset candidate set from the parent set. Now, the child RDD can initially contain string elements or even lists, i.e. ['1 2', '2 3'] or [[1, 2], [2, 3]], as that's the data structure that I
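
A minimal sketch of one way to compute these support counts (assumptions: parent rows and child candidates are space-separated item strings as in the example, and sc is an existing SparkContext); the parent transactions here are small enough to broadcast as Python sets:

parent = sc.parallelize(['2 3 5', '4 5 7', '5 4 2 3'])
child = sc.parallelize(['2 3', '5 3', '4 7', '5 7', '5 3', '2 3'])

# Broadcast the parent transactions as sets so every executor can scan them cheaply.
transactions = sc.broadcast([set(t.split()) for t in parent.collect()])

def support(candidate):
    items = set(candidate.split())
    return [candidate, sum(1 for t in transactions.value if items.issubset(t))]

print(child.map(support).collect())
# [['2 3', 2], ['5 3', 2], ['4 7', 1], ['5 7', 1], ['5 3', 2], ['2 3', 2]]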

Pyspark in Flask

Submitted by 和自甴很熟 on 2020-01-16 11:11:12
Question: I was trying the solution for accessing PySpark from the post "Access to Spark from Flask app", but when I ran this in my cmd:

./bin/spark-submit yourfilename.py

I get:

'.' is not recognized as an internal or external command, operable program or batch file.

Is there any solution to this? I tried placing the .py file inside the bin folder and running spark-submit app.py; here is the result:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/03/21 01:52:00 INFO SparkContext:
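
The ./bin/ prefix is Unix shell syntax, which Windows cmd does not understand, hence the error. One hedged alternative (a sketch, not from the original post; it assumes the findspark package is installed and SPARK_HOME points at the local Spark installation) is to sidestep spark-submit and create the SparkSession inside the Flask app itself:

import findspark
findspark.init()                      # put the local Spark installation on sys.path

from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = SparkSession.builder.master("local[*]").appName("flask-spark").getOrCreate()

@app.route("/count")
def count():
    # trivial endpoint just to confirm Spark is reachable from Flask
    return str(spark.range(1000).count())

if __name__ == "__main__":
    app.run()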

Oozie pyspark action using Spark 1.6 instead of 2.2

Submitted by 孤者浪人 on 2020-01-16 09:43:49
Question: When run from the command line using spark2-submit, the job runs under Spark version 2.2.0. But when I use an Oozie Spark action it runs under Spark version 1.6.0 and fails with the error TypeError: 'JavaPackage' object is not callable. My Oozie Spark action is below:

<!-- Spark action first -->
<action name="foundationorder" cred="hcat">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>${hiveConfig}</job-xml>
        <master
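
A commonly cited cause (a hedged note, not from the truncated excerpt) is that the Oozie Spark action picks up the default Spark 1.6 sharelib. Pointing the action at a Spark 2 sharelib usually resolves the version mismatch; a minimal sketch of the job.properties entries, assuming a sharelib named spark2 has been installed for Oozie:

oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark2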