rdd

How to transpose an RDD in Spark

纵饮孤独 submitted on 2019-12-17 05:08:32
Question: I have an RDD like this:

1 2 3
4 5 6
7 8 9

It is a matrix. Now I want to transpose the RDD like this:

1 4 7
2 5 8
3 6 9

How can I do this?

Answer 1: Say you have an N×M matrix. If both N and M are small enough that you can hold N×M items in memory, it doesn't make much sense to use an RDD, but transposing it is easy:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)

If N or M is so large that you cannot hold N or M
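
The snippet above is cut off where the answer turns to the large case. A minimal sketch of the usual distributed approach, not taken verbatim from the answer: tag every value with its row and column index, then group by column.

val rows = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val byColumn = rows.zipWithIndex.flatMap { case (row, rowIdx) =>
  row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
}
val transposed = byColumn.groupByKey            // one group per output row (an original column)
  .sortByKey()
  .values
  .map(_.toSeq.sortBy(_._1).map(_._2))          // restore the original row order inside each group
transposed.collect.foreach(println)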

How do I get a SQL row_number equivalent for a Spark RDD?

我只是一个虾纸丫 submitted on 2019-12-17 05:03:26
Question: I need to generate a full list of row_numbers for a data table with many columns. In SQL, this would look like:

select key_value, col1, col2, col3,
       row_number() over (partition by key_value order by col1, col2 desc, col3)
from temp;

Now, let's say in Spark I have an RDD of the form (K, V), where V = (col1, col2, col3), so my entries are like

(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))

etc. I want to order these using commands like sortBy
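
The question is truncated at the sortBy step. A hedged sketch of one way to get a per-key row_number with plain RDD operations (groupByKey plus an in-memory sort per key, which assumes each key's group fits in memory):

val data = sc.parallelize(Seq(
  ("key1", (1, 2, 3)), ("key1", (1, 4, 7)), ("key1", (2, 2, 3)),
  ("key2", (5, 5, 5)), ("key2", (5, 5, 9)), ("key2", (7, 5, 5))))
val numbered = data.groupByKey.flatMap { case (key, values) =>
  values.toSeq
    .sortBy { case (c1, c2, c3) => (c1, -c2, c3) }   // col1 asc, col2 desc, col3 asc
    .zipWithIndex
    .map { case (v, i) => (key, v, i + 1) }          // row_number() starts at 1
}
numbered.collect.foreach(println)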

PySpark DataFrames - way to enumerate without converting to Pandas?

若如初见. submitted on 2019-12-17 04:07:08
Question: I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record with a certain index (or select a group of records within an index range). In pandas, I could simply do

indexes = [2,3,6,7]
df[indexes]

Here I want something similar, without converting the dataframe to pandas. The closest I can get is enumerating all the objects in the original dataframe with:

indexes = np.arange(df.count())
df_indexed = df.withColumn('index', indexes)
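
A hedged sketch of the usual alternative, shown here in the Scala API (the PySpark calls have the same names); df is assumed to be the DataFrame from the question. zipWithIndex pairs each row with a stable Long index without collecting anything to the driver:

val wanted = Set(2L, 3L, 6L, 7L)
val selected = df.rdd.zipWithIndex
  .filter { case (_, idx) => wanted.contains(idx) }  // keep only the requested positions
  .map { case (row, _) => row }
selected.collect.foreach(println)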

Spark read file from S3 using sc.textFile ("s3n://…)

陌路散爱 submitted on 2019-12-17 02:29:37
Question: Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200
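
The stack trace is cut off above. As a hedged aside, not taken from the original answer: "No FileSystem for scheme: s3n" usually means the hadoop-aws classes are missing from the classpath, and a typical remedy is to launch with a matching hadoop-aws package and hand the credentials to the Hadoop configuration. The version number and key values below are placeholders.

// launched e.g. as: spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<YOUR_ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<YOUR_SECRET_KEY>")
val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
println(myRdd.count)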

SparkCore Series (2): aggregation operations on an RDD and across RDDs

限于喜欢 submitted on 2019-12-15 23:59:21
Part 1: aggregation operations on a single RDD

count

val conf = new SparkConf().setAppName("HelloWorld").setMaster("local")
val sc = new JavaSparkContext(conf).sc
val dataLength = sc.textFile("/software/java/idea/data")
  .flatMap(x => x.split("\\|")).count() // equivalent to an array's length
println(dataLength)

countByValue

val initialScores1: Array[(String, Double)] = Array(("A", 88.0), ("B", 95.0), ("C", 91.0), ("D", 93.0))
val data1 = sc.parallelize(initialScores1)
println(data1.countByValue) // counts occurrences using each element as the key

reduce

val conf = new SparkConf().setAppName("HelloWorld").setMaster("local")
val sc = new JavaSparkContext(conf).sc
val dataLength = sc.textFile
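
The reduce example is cut off above. A hedged guess at a minimal reduce in the same spirit (the path and delimiter are reused from the count example; the exact original code is unknown):

val totalTokens = sc.textFile("/software/java/idea/data")
  .flatMap(x => x.split("\\|"))
  .map(_ => 1)
  .reduce(_ + _)   // same result as count(), but written as an aggregation
println(totalTokens)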

Spark RDD

让人想犯罪 __ submitted on 2019-12-15 02:32:37
scala> val rdd1 = sc.parallelize(List(63,45,89,23,144,777,888))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:15

Check the RDD's number of partitions:

scala> rdd1.partitions.length
res0: Int = 1

Specify the number of partitions at creation time:

scala> val rdd1 = sc.parallelize(List(63,45,89,23,144,777,888),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:15

Check the number of partitions:

scala> rdd1.partitions.length
res1: Int = 3

map/filter

scala> val rdd1 = sc.parallelize(List(1,2,100,3,4))
scala> val rdd2 = rdd1.map(x => x*2).collect
rdd2: Array[Int] = Array(2, 4, 200, 6, 8)

Sort ascending (smallest to largest):

scala> val
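
The transcript breaks off at the ascending sort. A hedged sketch of the step it was most likely heading toward (sortBy with ascending = true), reconstructed rather than copied from the original post:

scala> val rdd3 = rdd1.map(x => x * 2).sortBy(x => x, true).collect
rdd3: Array[Int] = Array(2, 4, 6, 8, 200)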

Spark learning notes

懵懂的女人 submitted on 2019-12-14 04:21:57
I always learn this and then forget it; I've gone through Spark several times but never managed to go deep. Sigh. So here I'm learning it again; if anyone has a better way to study Spark in depth, please share it.

I interviewed for big-data positions; here are a few of the questions I was asked:

Does Spark always keep intermediate results in memory? Of course not; they can live in memory or on disk.

Spark consists of workers and a master; workers and the master communicate over the network via RPC.

Copy the config to the other nodes: for i in {5..7}; do scp -r /bigdata/spark/conf/spark-env.sh node-$i:$PWD; done

Spark moves the computation rather than the data, because moving large volumes of data is expensive.

Spark is written in Scala, and the Spark distribution ships with the Scala compiler and libraries, but Spark runs on the JVM, so a JDK must be installed.

ZooKeeper can be used to build a highly available cluster. ZooKeeper is used to (1) elect the master, (2) store information about the active master, and (3) store the workers' resource information and usage (for failover).

In spark-env.sh, add export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER ..." together with the ZooKeeper-related settings.

A highly available Spark cluster requires manually starting the other (standby) Spark master; it is not started by spark-all.sh along with the rest of the cluster.
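
As a hedged illustration of what the elided ZooKeeper settings typically look like in spark-env.sh (the host names and the znode directory below are placeholders, not values from the original post):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk-1:2181,zk-2:2181,zk-3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"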

Dataframe state before save and after load - what's different?

巧了我就是萌 submitted on 2019-12-14 03:53:17
Question: I have a DF that contains some SQL expressions (coalesce, case/when, etc.). When I later try to map/flatMap this DF, I get a Task not serializable error due to the fields that contain the SQL expressions. (Why I need to map/flatMap this DF is a separate question.) When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert it to an RDD and run transformations without problems. How is the DF different before saving and after loading? In some way, the SQL expressions
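
The question is cut off above. A minimal sketch of the round-trip it describes, assuming a SparkSession named spark and a DataFrame named df; after the reload the DataFrame is backed only by the materialized Parquet data rather than by whatever the original expressions referenced:

df.write.mode("overwrite").parquet("/tmp/df_roundtrip")      // materialize the DF, expressions and all
val reloaded = spark.read.parquet("/tmp/df_roundtrip")       // plain columnar data read back
val mapped = reloaded.rdd.map(row => row.mkString("|"))      // the map that failed before now runs
mapped.take(5).foreach(println)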

How can I further reduce my Apache Spark task size

懵懂的女人 submitted on 2019-12-14 03:49:32
Question: I'm trying to run the following code in Scala on the Spark framework, but I get an extremely large task size (8MB):

tidRDD: RDD[ItemSet]
mh: MineHelper
x: ItemSet
broadcast_tid: Broadcast[Array[ItemSet]]
count: Int

tidRDD.flatMap(x => mh.mineFreqSets(x, broadcast_tid.value, count)).collect()

The reason I added the MinerHelper class was to make it serialisable, and it only contains the given method. An ItemSet is a class with 3 private members and a few getter/setter methods, nothing out of the ordinary
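
The snippet is cut off above. As a hedged sketch of a common way to shrink task size in this situation (reusing the question's own names, which are assumed rather than defined here): make sure the closure only captures small local vals, so Spark does not serialize the whole enclosing object into every task.

val helper = mh              // local val: only the helper itself is captured, not the outer class
val tids = broadcast_tid     // the broadcast handle is tiny; its value ships once per executor
val freq = tidRDD.flatMap(x => helper.mineFreqSets(x, tids.value, count))
freq.collect()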