rdd

pyspark join rdds by a specific key

末鹿安然 submitted on 2019-12-23 16:49:01
Question: I have two RDDs that I need to join together. They look like the following:

RDD1: [(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)]
RDD2: [(u'1', u'2'), (u'1', u'3')]

My desired output is:

[(u'1', u'2', u'100', 2)]

So I would like to select the entries from RDD2 whose second value matches the first value of an entry in RDD1. I have tried join and also cartesian, and neither gets me even close to what I am looking for. I am new to Spark and would appreciate any help from you guys. Thanks

Answer 1:
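The answer is cut off before it starts. A rough sketch of the usual approach (not necessarily what the original answer proposed) is to key both RDDs on the field you want to join on and then call join. It is shown in Scala here, although the question uses PySpark, where keyBy, map and join work analogously; the snippet assumes a spark-shell session where sc is the active SparkContext.

// Sample data from the question (types simplified to strings and ints).
val rdd1 = sc.parallelize(Seq(("2", "100", 2), ("1", "300", 1), ("1", "200", 1)))
val rdd2 = sc.parallelize(Seq(("1", "2"), ("1", "3")))

// Key rdd1 by its first field; key rdd2 by its second field, keeping the other value.
val keyed1 = rdd1.keyBy(_._1)                        // (joinKey, (id, value, count))
val keyed2 = rdd2.map { case (a, b) => (b, a) }      // (joinKey, otherId)

// Inner join on the key, then flatten to the requested shape.
val joined = keyed2.join(keyed1)
  .map { case (k, (otherId, (_, value, count))) => (otherId, k, value, count) }

joined.collect().foreach(println)   // (1,2,100,2)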

Spark - Group by Key then Count by Value

别等时光非礼了梦想. submitted on 2019-12-23 15:43:54
Question: I have non-unique key-value pairs that I have created using the map function from an RDD of Array[String]:

val kvPairs = myRdd.map(line => (line(0), line(1)))

This produces data of the format:

1, A
1, A
1, B
2, C

I would like to group all of the keys by their values and provide the counts for these values, like so:

1, {(A, 2), (B, 1)}
2, {(C, 1)}

I have tried many different approaches, but the closest I can get is with something like this:

kvPairs.sortByKey().countByValue()

This gives 1, (A, 2) 1, (B,
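The attempt above is truncated; one common way to get per-key value counts (a sketch, not necessarily the accepted answer) is to count the (key, value) pairs first and then regroup by the original key. This assumes a spark-shell session; the sample data mirrors the question.

val kvPairs = sc.parallelize(Seq(("1", "A"), ("1", "A"), ("1", "B"), ("2", "C")))

// Count each (key, value) pair, then regroup by the original key.
val countsPerKey = kvPairs
  .map(pair => (pair, 1))
  .reduceByKey(_ + _)                         // ((key, value), count)
  .map { case ((k, v), n) => (k, (v, n)) }    // (key, (value, count))
  .groupByKey()                               // (key, Iterable[(value, count)])

countsPerKey.collect().foreach(println)
// e.g. (1,CompactBuffer((A,2), (B,1)))
//      (2,CompactBuffer((C,1)))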

spark - scala: not a member of org.apache.spark.sql.Row

筅森魡賤 submitted on 2019-12-23 08:25:33
Question: I am trying to convert a data frame to an RDD and then perform the operation below to return tuples:

df.rdd.map { t => (t._2 + "_" + t._3, t) }.take(5)

Then I get the error below. Does anyone have any ideas? Thanks!

<console>:37: error: value _2 is not a member of org.apache.spark.sql.Row
       (t._2 + "_" + t._3 , t)
          ^

Answer 1: When you convert a DataFrame to an RDD, you get an RDD[Row], so when you use map, your function receives a Row as its parameter. Therefore, you must use the Row methods to access its members
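The answer is cut off mid-sentence; a minimal, self-contained illustration of what it describes (using Row accessors instead of tuple fields) might look like the sketch below. The column layout and types are assumed for the example.

import spark.implicits._

// A small stand-in DataFrame with an assumed three-column layout.
val df = Seq((1, "a", "x"), (2, "b", "y")).toDF("c1", "c2", "c3")

// Row is 0-indexed, so the 2nd and 3rd columns are positions 1 and 2.
// Use getString / getAs[T] (or row.get(i)) rather than _2 / _3.
df.rdd.map { row =>
  (row.getString(1) + "_" + row.getString(2), row)   // or row.getAs[String]("c2")
}.take(5)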

Parsing a text file to split at specific positions using pyspark

微笑、不失礼 submitted on 2019-12-23 04:39:07
Question: I have a text file which is not delimited by any character, and I want to split it at specific positions so that I can convert it to a dataframe. Example data in file1.txt:

1JITENDER33
2VIRENDER28
3BIJENDER37

I want to split the file so that positions 0 to 1 go into the first column, positions 2 to 9 go into the second column, and positions 10 to 11 go into the third column, so that I can finally convert it into a Spark dataframe.

Answer 1: you can use the below python code to read your input file and make it
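The answer is truncated; a sketch of the fixed-width split is below. It is shown in Scala, although the question is about PySpark, where string slicing works the same way. The slice boundaries follow the positions given in the question (end index exclusive) and may need adjusting to the file's real layout, since the sample rows suggest a one-character id.

// Read the fixed-width file and slice each line at the stated positions.
val raw = spark.sparkContext.textFile("file1.txt")

val parsed = raw.map { line =>
  // slice clamps out-of-range indices, so short lines do not throw.
  (line.slice(0, 2), line.slice(2, 10), line.slice(10, 12))
}

import spark.implicits._
val df = parsed.toDF("id", "name", "age")
df.show()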

create a hive table from list of case class using spark

假装没事ソ submitted on 2019-12-23 04:30:33
Question: I am trying to create a Hive table from a list of case class instances, but it does not allow me to specify the database name. The error below is being thrown.

Spark version: 1.6.2

Error:
diagnostics: User class threw exception: org.apache.spark.sql.AnalysisException: Table not found: mytempTable; line 1 pos 58

Please let me know a way to save the output of the map method to a Hive table with the same structure as the case class. Note: the recordArray list is being populated in the map method (in the getElem() method, in fact
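The entry is cut off before any answer. As a hedged sketch of one common Spark 1.6 pattern (not necessarily the accepted fix), the temporary table has to be registered on the same HiveContext that later runs the CREATE TABLE statement, and the target database can be qualified directly in the SQL. The case class, data, database and table names below are placeholders.

// Sketch for Spark 1.6.x, assuming sc is the active SparkContext.
import org.apache.spark.sql.hive.HiveContext

case class Record(id: Int, name: String)           // placeholder case class

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val recordArray = Seq(Record(1, "a"), Record(2, "b"))   // stand-in for the real list
val df = sc.parallelize(recordArray).toDF()

// Register on the same HiveContext that will run the CREATE TABLE below.
df.registerTempTable("mytempTable")
hiveContext.sql("CREATE TABLE mydb.mytable AS SELECT * FROM mytempTable")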

Spark (Hive) SQL Data Types Explained in Detail (Python)

被刻印的时光 ゝ submitted on 2019-12-23 03:15:33
Spark SQL needs a number of "tables" to work with. These tables can come from Hive or be "temporary tables". If a table comes from Hive, its schema (column names, column types and so on) was fixed when the table was created, and in most cases we can analyze its data directly with Spark SQL. If a table is a "temporary table", we have to consider two questions: (1) where does the temporary table's data come from? (2) what is the temporary table's schema?

From the official Spark documentation we learn that creating a temporary table requires two ingredients: (1) an RDD holding the data; (2) a data schema. In other words, we apply the schema to the RDD that holds the data, and we can then register that RDD as a temporary table. In this process the most important thing is the data type of each field in the schema, because it directly affects how Spark SQL computes and whether the results are correct. The data types currently supported by pyspark.sql.types are: NullType, StringType, BinaryType, BooleanType, DateType, TimestampType, DecimalType, DoubleType, FloatType, ByteType, IntegerType, LongType, ShortType, ArrayType, MapType and StructType (StructField). Among these, ArrayType, MapType and StructType are referred to as "composite types", and the rest as "basic types".
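As a brief illustration of the two ingredients described above, the sketch below builds a schema, applies it to an RDD of rows, and registers the result as a temporary view. It is shown in Scala using org.apache.spark.sql.types (the post itself discusses the analogous pyspark.sql.types), and the column names, types and view name are made up for the example.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Ingredient 1: an RDD holding the data, as rows.
val data = spark.sparkContext.parallelize(Seq(Row(1, "Alice"), Row(2, "Bob")))

// Ingredient 2: the schema describing each field's name, type and nullability.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Apply the schema to the RDD and register the temporary view.
val df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("people_tmp")
spark.sql("SELECT name FROM people_tmp WHERE id = 1").show()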

Spark SQL: Converting an RDD to a Dataset

六月ゝ 毕业季﹏ submitted on 2019-12-23 01:44:46
Overview: Spark SQL provides two ways to convert an RDD into a Dataset.

Inferring the RDD's schema through reflection: this approach can be used when the Spark application can infer the RDD's structure; the reflection-based method makes the code more concise and effective.

Constructing a schema through the programmatic interface and mapping it onto the RDD: this approach is used when the Spark application cannot infer the RDD's structure in advance.

Reflection approach (Scala):

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
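The excerpt stops before showing the programmatic approach; a sketch of that second variant, following the same pattern as the official Spark documentation (the file path and column names are assumptions carried over from the reflection example), looks like this:

// Programmatic approach: build a StructType at runtime and apply it to an RDD[Row].
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is described by a simple string here just for illustration.
val schemaString = "name age"
val fields = schemaString.split(" ").map(name => StructField(name, StringType, nullable = true))
val schema = StructType(fields)

// Convert each line into a Row that matches the schema.
val rowRDD = peopleRDD.map(_.split(",")).map(attrs => Row(attrs(0), attrs(1).trim))

val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()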

Save and load Spark RDD from local binary file - minimal working example

不打扰是莪最后的温柔 submitted on 2019-12-22 17:51:44
Question: I am working on a Spark app in which an RDD is first computed, then needs to be stored to disk, and then loaded into Spark again. To this end, I am looking for a minimal working example of saving an RDD to a local file and then loading it. The file format is not suitable for text conversion, so saveAsTextFile won't fly. The RDD can be either a plain RDD or a pair RDD; that is not crucial. The file can live on HDFS or not, and the example can be in either Java or Scala. Thanks!

Answer 1: As
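The answer is truncated just as it begins. One minimal way to do this, sketched below under the assumption that Java serialization is acceptable (this may or may not be what the original answer went on to suggest), is Spark's built-in object-file support. The output path is a placeholder and the directory must not already exist; it can be a local path or an HDFS URI.

// Save an RDD as a binary (Java-serialized) object file and load it back.
val numbers = sc.parallelize(1 to 100).map(i => (i, i * i))

numbers.saveAsObjectFile("/tmp/numbers-objfile")

// Reload: the element type must be supplied explicitly.
val reloaded = sc.objectFile[(Int, Int)]("/tmp/numbers-objfile")
reloaded.take(5).foreach(println)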

Joining two RDD[String] -Spark Scala

我与影子孤独终老i submitted on 2019-12-22 13:49:58
Question: I have two RDDs:

rdd1 [String, String, String]: Name, Address, Zipcode
rdd2 [String, String, String]: Name, Address, Landmark

I am trying to join these two RDDs using the function:

rdd1.join(rdd2)

But I am getting an error:

error: value fullOuterJoin is not a member of org.apache.spark.rdd.RDD[String]

The join should join the RDD[String]s, and the output RDD should be something like:

rddOutput: Name, Address, Zipcode, Landmark

And I want to save these files as a JSON file in the end. Can someone
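The question is cut off, but the error already points at the core issue: join and fullOuterJoin are only defined on pair RDDs (RDD[(K, V)]), not on RDD[String]. The sketch below assumes the records can be turned into tuples keyed on (Name, Address); the sample data, column names and output path are invented for illustration.

// Assume each RDD holds 3-tuples; key both on (name, address) so the pair-RDD join is available.
val rdd1 = sc.parallelize(Seq(("Ana", "12 Main St", "94107"), ("Bob", "5 Oak Ave", "10001")))
val rdd2 = sc.parallelize(Seq(("Ana", "12 Main St", "Near the park")))

val byKey1 = rdd1.map { case (name, addr, zip)      => ((name, addr), zip) }
val byKey2 = rdd2.map { case (name, addr, landmark) => ((name, addr), landmark) }

// Inner join on (name, address); switch to fullOuterJoin to keep unmatched rows.
val joined = byKey1.join(byKey2)
  .map { case ((name, addr), (zip, landmark)) => (name, addr, zip, landmark) }

// To save as JSON, convert to a DataFrame first.
import spark.implicits._
val outDF = joined.toDF("name", "address", "zipcode", "landmark")
outDF.write.mode("overwrite").json("/tmp/joined-json")   // output path is a placeholder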

what is the difference between spark javardd methods collect() & collectAsync()?

不问归期 submitted on 2019-12-22 12:21:34
Question: I am exploring the Spark 2.0 Java API and have a doubt about collect() and collectAsync(), which are both available on JavaRDD.

Answer 1: The collect action is basically used to view the contents of an RDD. It is synchronous, while collectAsync() is asynchronous, meaning it returns a future for retrieving all elements of this RDD; this allows other jobs to run in parallel. For better job scheduling you can use the fair scheduler.

Answer 2: collect(): It returns an array that contains all of the
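A small sketch of the difference described in the answers (shown in Scala; JavaRDD.collectAsync behaves the same way and returns a JavaFutureAction):

// collect() blocks until the whole RDD is materialized on the driver.
val rdd = sc.parallelize(1 to 1000)
val all: Array[Int] = rdd.collect()

// collectAsync() returns a FutureAction immediately; the job runs in the background.
import scala.concurrent.Await
import scala.concurrent.duration.Duration

val futureResult = rdd.collectAsync()          // org.apache.spark.FutureAction[Seq[Int]]
// ... the driver could submit other jobs here while the collect runs ...
val asyncAll: Seq[Int] = Await.result(futureResult, Duration.Inf)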