pyspark

PySpark: converting an RDD to a dict

Submitted by 孤人 on 2020-01-17 13:02:24
In day-to-day data processing you often need to build a dict. The dictionary data usually comes from one of two sources: a Hive table or a file on HDFS.

1. Read data from a Hive table and convert it to a dict

from pyspark import SparkContext
from pyspark.sql import HiveContext, SparkSession

sc = SparkContext()
sql_context = HiveContext(sc)
sql_data = sql_context.sql("SELECT key, value FROM db.table")
sql_data_rdd = sql_data.rdd.map(lambda x: (x[0], x[1]))
my_dict = sql_data_rdd.collectAsMap()

2. Read a file from HDFS and convert it to a dict

def map_2_dic(r):
    # r is one line of text
    filds = r.strip().split('\t')
    # filds[0] is the key, filds[1] is the value
    return filds[0], filds[1]

textRDD = sc.textFile("《your hdfs file path》")
my
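
The excerpt cuts off mid-line above; a minimal sketch of how the HDFS half would typically finish (assuming the tab-separated file and the map_2_dic helper shown above):

my_dict = textRDD.map(map_2_dic).collectAsMap()

collectAsMap() pulls the whole pair RDD back to the driver as a plain Python dict, so this is only safe when the dictionary fits in driver memory.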

How to maintain sort order in PySpark collect_list and collect multiple lists

Submitted by 余生颓废 on 2020-01-17 00:28:44
Question: I want to maintain the date sort order while using collect_list on multiple columns, all with the same date ordering. I need them in the same DataFrame so I can use it to build a time-series model input. Below is a sample of the "train_data". I'm using a Window with partitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Syscode_Stn')
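
A minimal sketch of one common way to keep the lists date-ordered (the DataFrame name df and the value columns col1 and col2 are assumptions, since the sample data is not shown in the truncated excerpt): order the window by the date column, make it span the whole partition, collect each column over that window, then keep one row per Syscode_Stn.

from pyspark.sql import functions as F
from pyspark.sql import Window

# window ordered by date; rowsBetween lets collect_list see the entire partition
w = (Window.partitionBy('Syscode_Stn')
           .orderBy('tuning_evnt_start_dt')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = (df
    .withColumn('col1_list', F.collect_list('col1').over(w))
    .withColumn('col2_list', F.collect_list('col2').over(w))
    .select('Syscode_Stn', 'col1_list', 'col2_list')
    .dropDuplicates(['Syscode_Stn']))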

PySpark Read CSV reading incorrectly

Submitted by 老子叫甜甜 on 2020-01-16 20:18:34
Question: I am trying to read a CSV file into a PySpark DataFrame. However, for some reason the PySpark CSV load methods are loading significantly more rows than expected. I have tried both the spark.read method and the spark.sql method for reading the CSV.

df = pd.read_csv("preprocessed_data.csv")
len(df)  # out: 318477

spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("preprocessed_data.csv")
spark_df.count()  # out: 6422020

df_test = spark.sql(
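
A row count that balloons like this is most often caused by quoted fields containing embedded newlines, which Spark's line-based CSV reader splits into extra rows unless told otherwise. A minimal sketch of the usual workaround (an assumption that the file uses standard double-quote quoting; the options are from the Spark 2.2+ CSV reader):

spark_df = (spark.read.format("csv")
    .option("header", "true")
    .option("multiLine", "true")   # keep quoted fields that span several lines in one row
    .option("quote", '"')
    .option("escape", '"')
    .load("preprocessed_data.csv"))
spark_df.count()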

change values of structure dataframe

Submitted by 风流意气都作罢 on 2020-01-16 19:37:06
Question: I want to fill a struct field from another existing struct: A11 of my data1 should get the value of x1.f2. I tried different approaches and didn't succeed. Does anyone have an idea?

schema = StructType([
    StructField('data1', StructType([
        StructField('A1', StructType([
            StructField('A11', StringType(), True),
            StructField('A12', IntegerType(), True)
        ])),
        StructField('A2', IntegerType(), True)
    ]))
])
df = sqlCtx.createDataFrame([], schema)

# Creation of df1
schema1 = StructType([
    StructField('x1',
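
Spark columns (including struct fields) are immutable, so the usual pattern is to rebuild the struct with the new value in place. A minimal sketch, assuming both columns end up in one DataFrame (for example after a join, which the truncated excerpt does not show) and that x1 carries a field f2:

from pyspark.sql import functions as F

updated = df.withColumn(
    'data1',
    F.struct(
        F.struct(
            F.col('x1.f2').alias('A11'),          # A11 now takes the value of x1.f2
            F.col('data1.A1.A12').alias('A12')    # keep the remaining fields unchanged
        ).alias('A1'),
        F.col('data1.A2').alias('A2')
    )
)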

display DataFrame when using pyspark aws glue

Submitted by 自古美人都是妖i on 2020-01-16 19:34:29
Question: How can I show a DataFrame from an AWS Glue ETL job? I tried df.show() with the code below, but it doesn't display anything.

Code:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "flux-test", table_name = "tab1", transformation_ctx = "datasource0")
sourcedf = ApplyMapping.apply(frame = datasource0, mappings = [("id", "long", "id", "long"), ("Rd.Id_Releve", "string", "Rd.Id_R", "string")])
sourcedf = sourcedf.toDF()
data = []
schema = StructType([
    StructField('PM', StructType([
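
One thing worth checking (a hedged note, not taken from the original post): in a Glue ETL job, whatever show() prints goes to the job run's CloudWatch output logs rather than to the Glue console, so the output may simply be sitting in the log stream. A minimal sketch of printing what was read from the catalog:

df = datasource0.toDF()          # convert the DynamicFrame to a Spark DataFrame
df.printSchema()
df.show(20, truncate=False)      # appears in the job's CloudWatch output logs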

How to find highly similar observations in another dataset using Spark

Submitted by 爱⌒轻易说出口 on 2020-01-16 18:23:50
Question: I have two CSV files.

File 1:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0

File 2:

PID,FNAME,MNAME,LNAME,GENDER,DOB,FNAMELNAMEMNAMEGENDERDOB
S2,66M,J,Rock,F
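
The excerpt cuts off before the matching criteria are described, so the following is only a hedged sketch of one way to "find highly similar observations": cross-join the two files and score each pair with Levenshtein distance over the concatenated name/gender/DOB string (the file paths and the threshold of 3 are assumptions):

from pyspark.sql import functions as F

df1 = spark.read.csv("file1.csv", header=True)
df2 = spark.read.csv("file2.csv", header=True)

# Build the same concatenated key that File 2 already carries, then compare every pair.
df1 = df1.withColumn("key1", F.concat_ws("", "FNAME", "LNAME", "MNAME", "GENDER", "DOB"))
pairs = (df1.crossJoin(df2)
            .withColumn("dist", F.levenshtein("key1", "FNAMELNAMEMNAMEGENDERDOB"))
            .filter(F.col("dist") <= 3)
            .select("D", "PID", "dist"))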

Is there an inbuilt function to compare RDDs on specific criteria, or is it better to write a UDF

Submitted by 自作多情 on 2020-01-16 13:13:13
Question: How do I count, for each element of a child RDD, how many times it occurs in the parent RDD? Say I have two RDDs:

Parent RDD - ['2 3 5'] ['4 5 7'] ['5 4 2 3']
Child RDD - ['2 3', '5 3', '4 7', '5 7', '5 3', '2 3']

I need something like:

[['2 3', 2], ['5 3', 2], ['4 7', 1], ['5 7', 1], ['5 3', 2] ...]

It's actually finding the frequent-itemset candidate set from the parent set. Now, the child RDD can initially contain string elements or even lists, i.e. ['1 2', '2 3'] or [[1, 2], [2, 3]], as that's the data structure that I
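
A minimal sketch of one way to compute these support counts (assumptions: parent rows and child candidates are space-separated item strings as in the example, and sc is an existing SparkContext); the parent transactions here are small enough to broadcast as Python sets:

parent = sc.parallelize(['2 3 5', '4 5 7', '5 4 2 3'])
child = sc.parallelize(['2 3', '5 3', '4 7', '5 7', '5 3', '2 3'])

# Broadcast the parent transactions as sets so every executor can scan them cheaply.
transactions = sc.broadcast([set(t.split()) for t in parent.collect()])

def support(candidate):
    items = set(candidate.split())
    return [candidate, sum(1 for t in transactions.value if items.issubset(t))]

print(child.map(support).collect())
# [['2 3', 2], ['5 3', 2], ['4 7', 1], ['5 7', 1], ['5 3', 2], ['2 3', 2]]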

Pyspark in Flask

Submitted by 和自甴很熟 on 2020-01-16 11:11:12
Question: I was trying the solution for accessing PySpark from the post "Access to Spark from Flask app", but when I ran this in my cmd:

./bin/spark-submit yourfilename.py

I get:

'.' is not recognized as an internal or external command, operable program or batch file.

Is there any solution to this? I tried placing the .py file inside the bin folder and running spark-submit app.py; here is the result:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/03/21 01:52:00 INFO SparkContext:
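
The ./bin/ prefix is Unix shell syntax, which Windows cmd does not understand, hence the error. One hedged alternative (a sketch, not from the original post; it assumes the findspark package is installed and SPARK_HOME points at the local Spark installation) is to sidestep spark-submit and create the SparkSession inside the Flask app itself:

import findspark
findspark.init()                      # put the local Spark installation on sys.path

from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = SparkSession.builder.master("local[*]").appName("flask-spark").getOrCreate()

@app.route("/count")
def count():
    # trivial endpoint just to confirm Spark is reachable from Flask
    return str(spark.range(1000).count())

if __name__ == "__main__":
    app.run()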

Oozie pyspark action using Spark 1.6 instead of 2.2

Submitted by 孤者浪人 on 2020-01-16 09:43:49
Question: When run from the command line using spark2-submit, the job runs under Spark version 2.2.0. But when I use an Oozie Spark action it runs under Spark version 1.6.0 and fails with the error TypeError: 'JavaPackage' object is not callable. My Oozie Spark action is below:

<!-- Spark action first -->
<action name="foundationorder" cred="hcat">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>${hiveConfig}</job-xml>
        <master
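
A commonly cited cause (a hedged note, not from the truncated excerpt) is that the Oozie Spark action picks up the default Spark 1.6 sharelib. Pointing the action at a Spark 2 sharelib usually resolves the version mismatch; a minimal sketch of the job.properties entries, assuming a sharelib named spark2 has been installed for Oozie:

oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark2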