rdd

Insert a Spark DataFrame into HBase

折月煮酒 · submitted on 2019-12-13 00:18:31
Question: I have a DataFrame and I want to insert it into HBase. I am following this documentation. This is what my DataFrame looks like:

    ------------------------
    | id | name  | address |
    ------------------------
    | 23 | marry | france  |
    | 87 | zied  | italie  |
    ------------------------

I create an HBase table using this code:

    val tableName = "two"
    val conf = HBaseConfiguration.create()
    if (!admin.isTableAvailable(tableName)) {
      print("----------------------------------------------------------------------------
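Not part of the original question, but for orientation: when no dedicated connector is available, a DataFrame is often written to HBase by mapping each row to a `Put` and saving through `TableOutputFormat`. The sketch below is only an illustration under assumptions: the HBase table is named `two` with a single column family `cf`, the `id` column is used as the row key, and `df` is the DataFrame shown above.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Assumed layout: HBase table "two" with column family "cf"; "id" becomes the row key.
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "two")
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val puts = df.rdd.map { row =>
  val put = new Put(Bytes.toBytes(row.get(0).toString))                           // row key = id
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(row.get(1).toString))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("address"), Bytes.toBytes(row.get(2).toString))
  (new ImmutableBytesWritable, put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
```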

In Apache Spark, how to make an RDD/DataFrame operation lazy?

感情迁移 · submitted on 2019-12-12 18:36:10
Question: Assuming that I would like to write a function foo that transforms a DataFrame:

    object Foo {
      def foo(source: DataFrame): DataFrame = {
        ...complex iterative algorithm with a stopping condition...
      }
    }

Since the implementation of foo contains many "actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution. This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation
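One way to get the deferred behaviour (a sketch, not taken from the original thread) is to stop calling the expensive function eagerly and instead return a thunk, or wrap the call site in a lazy val, so the actions inside foo only run when the result is actually needed. The names fooDeferred and inputDf below are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object Foo {
  def foo(source: DataFrame): DataFrame = ???   // the original iterative algorithm with its actions

  // Hypothetical wrapper: nothing runs until the returned function is invoked.
  def fooDeferred(source: DataFrame): () => DataFrame =
    () => foo(source)
}

// Caller side: building the thunk is cheap...
val deferred = Foo.fooDeferred(inputDf)
// ...and the expensive work only happens when it is forced.
lazy val result: DataFrame = deferred()
```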

Is there a better way to do a reduce operation on RDD[Array[Double]]?

走远了吗. · submitted on 2019-12-12 18:22:03
Question: I want to reduce an RDD[Array[Double]] so that each element of an array is added to the corresponding element of the next array. For the moment I use this code:

    var rdd1 = RDD[Array[Double]]
    var coord = rdd1.reduce( (x, y) => { (x, y).zipped.map(_ + _) })

Is there a better way to make this more efficient, because it is costly.

Answer 1: Using zipped.map is very inefficient, because it creates a lot of temporary objects and boxes the doubles. If you use spire, you can just do this > import spire
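For comparison, a plain-Scala alternative in the same spirit (a sketch, not the quoted answer): summing into a primitive array with an imperative loop avoids the tuples and boxed Doubles; it assumes all arrays in the RDD have the same length.

```scala
val coord: Array[Double] = rdd1.reduce { (x, y) =>
  // Element-wise sum into a fresh primitive array: no tuples, no boxing.
  val out = new Array[Double](x.length)
  var i = 0
  while (i < x.length) {
    out(i) = x(i) + y(i)
    i += 1
  }
  out
}
```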

Filter from Cassandra table by RDD values

↘锁芯ラ · submitted on 2019-12-12 16:26:46
Question: I'd like to query some data from Cassandra based on values I have in an RDD. My approach is the following:

    val userIds = sc.textFile("/tmp/user_ids").keyBy( e => e )
    val t = sc.cassandraTable("keyspace", "users").select("userid", "user_name")
    val userNames = userIds.flatMap { userId =>
      t.where("userid = ?", userId).take(1)
    }
    userNames.take(1)

While the Cassandra query works in the Spark shell, it throws an exception when I use it inside flatMap:

    org.apache.spark.SparkException: Job aborted due
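The usual cause of this failure is that one RDD (the Cassandra table scan) is referenced inside another RDD's closure, which Spark cannot serialize or run on the workers. A common alternative, sketched below under the assumption that the DataStax spark-cassandra-connector is on the classpath and that userid is the table's partition key, is joinWithCassandraTable, which performs the per-key lookups inside the connector:

```scala
import com.datastax.spark.connector._

// Wrap each id in a Tuple1 so it can be matched against the table's partition key.
val userIds = sc.textFile("/tmp/user_ids").map(Tuple1(_))

val userNames = userIds
  .joinWithCassandraTable("keyspace", "users")
  .select("userid", "user_name")

userNames.take(1)
```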

How to iterate over large Cassandra table in small chunks in Spark

妖精的绣舞 · submitted on 2019-12-12 15:52:12
Question: In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over an apparently large table that has about 200k rows, each roughly 20-50 KB.

    CREATE TABLE foo (
      uid timeuuid,
      events blob,
      PRIMARY KEY ((uid))
    )

Here is the Scala code that is executed on the Spark cluster:

    val rdd = sc.cassandraTable("test", "foo")
    // This pulls records in memory, taking ~6.3GB
    var count = rdd.select("events").count()
    // Fails nearly immediately with
    // NoHostAvailableException: All host(s)
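One commonly suggested direction (a sketch, not the accepted answer; the property names below belong to the DataStax spark-cassandra-connector and should be verified against the connector version in use) is to make each Spark partition fetch a smaller chunk by shrinking the input split size and the per-request fetch size, then process rows partition by partition instead of materializing the table:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "cassandra-host")     // hypothetical host name
  .set("spark.cassandra.input.split.size_in_mb", "16")          // smaller splits -> more, lighter partitions
  .set("spark.cassandra.input.fetch.size_in_rows", "100")       // rows fetched per round trip

val sc = new SparkContext(conf)

sc.cassandraTable("test", "foo")
  .select("events")
  .foreachPartition { rows =>
    rows.foreach { row =>
      // process one row at a time; nothing is collected back to the driver
    }
  }
```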

How to filter an RDD according to a function based on another RDD in Spark?

╄→尐↘猪︶ㄣ · submitted on 2019-12-12 15:30:36
Question: I am a beginner with Apache Spark. I want to filter out all groups whose sum of weights is larger than a constant value in an RDD. The "weight" map is also an RDD. Here is a small demo; the groups to be filtered are stored in "groups", and the constant value is 12:

    val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
    val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
    val wm = weights.toArray.toMap
    def isheavy(inp: String):
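A common way to finish this (a sketch, not the original answer): collect the small weights RDD on the driver and broadcast it, so the filter closure only captures a plain Map instead of another RDD:

```scala
// Broadcast the small lookup table rather than referring to the weights RDD inside a closure.
val wm = sc.broadcast(weights.collectAsMap())

def isHeavy(inp: String): Boolean =
  inp.split(",").map(wm.value.getOrElse(_, 0)).sum > 12

// Keep only the groups whose summed weight does not exceed 12.
val light = groups.filter(g => !isHeavy(g))
light.collect().foreach(println)
```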

Reading a table into an RDD with Spark and Phoenix: partition number is 1

久未见 · submitted on 2019-12-12 12:57:13
Question: When I run my Spark code:

    val sqlContext = spark.sqlContext
    val noact_table = primaryDataProcessor.getTableData(sqlContext, zookeeper, tableName)
    println("noact_table.rdd:" + noact_table.rdd.partitions.size)
    val tmp = noact_table.rdd
    println(tmp.partitions.size)
    val out = tmp.map(x => x(0) + "," + x(1))
    HdfsOperator.writeHdfsFile(out, "/tmp/test/push")

getTableData:

    def getTableData(sqlContext: SQLContext, zkUrl: String, tableName: String): DataFrame = {
      val tableData = sqlContext.read.format(
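If the source only hands back a single input partition, one pragmatic workaround (a sketch continuing the code above, not necessarily the accepted fix; the partition count 32 is arbitrary) is to repartition right after the read so the downstream map and write are spread across the cluster:

```scala
// Continuing from the code above: spread the single input partition across (say) 32 partitions.
// repartition always shuffles; coalesce can only reduce the partition count, so it does not help here.
val tmp = noact_table.rdd.repartition(32)
println(tmp.partitions.size)   // now prints 32

val out = tmp.map(x => x(0) + "," + x(1))
HdfsOperator.writeHdfsFile(out, "/tmp/test/push")
```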

Apache Spark Method returning an RDD (with Tail Recursion)

…衆ロ難τιáo~ · submitted on 2019-12-12 11:32:05
Question: An RDD has a lineage and therefore does not exist until an action is performed on it; so, if I have a method that performs numerous transformations on an RDD and returns the transformed RDD, what am I actually returning? Am I returning nothing until that RDD is required for an action? If I cache an RDD in the method, does it persist in the cache? I think I know the answer: the method will only be run when an action is called on the RDD that is returned? But I could be
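To make this concrete, here is a small sketch (not from the original post) of a tail-recursive method that only chains transformations: the method itself returns immediately with an RDD plus its lineage, nothing executes until the caller runs an action, and a cache() placed on the result is populated by that first action.

```scala
import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

// Chains n map() transformations; no Spark job is launched inside this method.
@tailrec
def addN(rdd: RDD[Int], n: Int): RDD[Int] =
  if (n <= 0) rdd else addN(rdd.map(_ + 1), n - 1)

val chained = addN(sc.parallelize(1 to 10), 3).cache()  // still just a plan: an RDD plus its lineage

val total = chained.count()   // first action: executes the whole lineage and fills the cache
val added = chained.sum()     // second action: served from the cache, the maps are not recomputed
```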

Difference between RDDs and Batches in Spark?

大城市里の小女人 · submitted on 2019-12-12 11:32:00
Question: An RDD is a collection of elements partitioned across the nodes of the cluster; it is Spark's core component and abstraction. Batches: the Spark Streaming API simply divides the data into batches, and those batches are likewise collections of streaming objects/elements. Depending on the requirement, a set of batches is defined in the form of a time-based batch window or an intensive-online-activity-based batch window. What is the difference between RDDs and batches exactly?

Answer 1: RDDs and batches are essentially different but related
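A short sketch (not from the quoted answer) that shows the relationship directly: in Spark Streaming a DStream chops the input into batches by time interval, and each batch is handed to user code as an ordinary RDD.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source

// Every batch interval produces one RDD; a DStream is essentially a sequence of such RDDs.
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()
```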

Fetching esJsonRDD from elasticsearch with complex filtering in Spark

大城市里の小女人 · submitted on 2019-12-12 10:23:19
Question: I am currently fetching the Elasticsearch RDD in our Spark job, filtering based on a one-line Elasticsearch query like this (example):

    val elasticRdds = sparkContext.esJsonRDD(esIndex, s"?default_operator=AND&q=director.name:DAVID + \n movie.name:SEVEN")

Now, what if our search query becomes complex, like:

    {
      "query": {
        "filtered": {
          "query": {
            "query_string": {
              "default_operator": "AND",
              "query": "director.name:DAVID + \n movie.name:SEVEN"
            }
          },
          "filter": {
            "nested": {
              "path": "movieStatus.boxoffice.status",
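For reference (a sketch, not the accepted answer, and worth checking against the elasticsearch-hadoop version in use): esJsonRDD also accepts a full query-DSL JSON body as the query argument instead of the ?q=... URI form, so a complex query can be passed as a multi-line string. The body below is deliberately simplified and omits the nested filter from the question.

```scala
import org.elasticsearch.spark._

// Simplified query-DSL body; in practice this would be the full filtered/nested query.
val queryDsl =
  """{
    |  "query": {
    |    "query_string": {
    |      "default_operator": "AND",
    |      "query": "director.name:DAVID AND movie.name:SEVEN"
    |    }
    |  }
    |}""".stripMargin

val elasticRdds = sparkContext.esJsonRDD(esIndex, queryDsl)
```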