rdd

How to transpose an RDD in Spark

纵饮孤独 submitted on 2019-12-17 05:08:32
Question: I have an RDD like this:

1 2 3
4 5 6
7 8 9

It is a matrix. Now I want to transpose the RDD like this:

1 4 7
2 5 8
3 6 9

How can I do this?

Answer 1: Say you have an N×M matrix. If both N and M are small enough that you can hold N×M items in memory, it doesn't make much sense to use an RDD, but transposing it is easy:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)

If N or M is so large that you cannot hold N or M
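
The snippet above is cut off where the answer turns to the large case. A minimal sketch of the usual distributed approach, not taken verbatim from the answer: tag every value with its row and column index, then group by column.

val rows = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val byColumn = rows.zipWithIndex.flatMap { case (row, rowIdx) =>
  row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
}
val transposed = byColumn.groupByKey            // one group per output row (an original column)
  .sortByKey()
  .values
  .map(_.toSeq.sortBy(_._1).map(_._2))          // restore the original row order inside each group
transposed.collect.foreach(println)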

How do I get a SQL row_number equivalent for a Spark RDD?

我只是一个虾纸丫 submitted on 2019-12-17 05:03:26
Question: I need to generate a full list of row_numbers for a data table with many columns. In SQL, this would look like:

select key_value, col1, col2, col3,
       row_number() over (partition by key_value order by col1, col2 desc, col3)
from temp;

Now, let's say in Spark I have an RDD of the form (K, V), where V = (col1, col2, col3), so my entries are like

(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))

etc. I want to order these using commands like sortBy
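
The question is truncated at the sortBy step. A hedged sketch of one way to get a per-key row_number with plain RDD operations (groupByKey plus an in-memory sort per key, which assumes each key's group fits in memory):

val data = sc.parallelize(Seq(
  ("key1", (1, 2, 3)), ("key1", (1, 4, 7)), ("key1", (2, 2, 3)),
  ("key2", (5, 5, 5)), ("key2", (5, 5, 9)), ("key2", (7, 5, 5))))
val numbered = data.groupByKey.flatMap { case (key, values) =>
  values.toSeq
    .sortBy { case (c1, c2, c3) => (c1, -c2, c3) }   // col1 asc, col2 desc, col3 asc
    .zipWithIndex
    .map { case (v, i) => (key, v, i + 1) }          // row_number() starts at 1
}
numbered.collect.foreach(println)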

PySpark DataFrames - way to enumerate without converting to Pandas?

若如初见. submitted on 2019-12-17 04:07:08
Question: I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record with a certain index (or select a group of records within an index range). In pandas, I could simply do

indexes = [2,3,6,7]
df[indexes]

Here I want something similar, without converting the dataframe to pandas. The closest I can get is enumerating all the objects in the original dataframe with:

indexes = np.arange(df.count())
df_indexed = df.withColumn('index', indexes)
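
A hedged sketch of the usual alternative, shown here in the Scala API (the PySpark calls have the same names); df is assumed to be the DataFrame from the question. zipWithIndex pairs each row with a stable Long index without collecting anything to the driver:

val wanted = Set(2L, 3L, 6L, 7L)
val selected = df.rdd.zipWithIndex
  .filter { case (_, idx) => wanted.contains(idx) }  // keep only the requested positions
  .map { case (row, _) => row }
selected.collect.foreach(println)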

Spark read file from S3 using sc.textFile ("s3n://…)

陌路散爱 submitted on 2019-12-17 02:29:37
Question: Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200
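
The stack trace is cut off above. As a hedged aside, not taken from the original answer: "No FileSystem for scheme: s3n" usually means the hadoop-aws classes are missing from the classpath, and a typical remedy is to launch with a matching hadoop-aws package and hand the credentials to the Hadoop configuration. The version number and key values below are placeholders.

// launched e.g. as: spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<YOUR_ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<YOUR_SECRET_KEY>")
val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
println(myRdd.count)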

SparkCore Series (2): aggregation operations on an RDD and across RDDs

限于喜欢 submitted on 2019-12-15 23:59:21
Part 1: aggregation operations on a single RDD

count

val conf = new SparkConf().setAppName("HelloWorld").setMaster("local")
val sc = new JavaSparkContext(conf).sc
val dataLength = sc.textFile("/software/java/idea/data")
  .flatMap(x => x.split("\\|")).count() // equivalent to an array's length
println(dataLength)

countByValue

val initialScores1: Array[(String, Double)] = Array(("A", 88.0), ("B", 95.0), ("C", 91.0), ("D", 93.0))
val data1 = sc.parallelize(initialScores1)
println(data1.countByValue) // counts occurrences using each element as the key

reduce

val conf = new SparkConf().setAppName("HelloWorld").setMaster("local")
val sc = new JavaSparkContext(conf).sc
val dataLength = sc.textFile
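
The reduce example is cut off above. A hedged guess at a minimal reduce in the same spirit (the path and delimiter are reused from the count example; the exact original code is unknown):

val totalTokens = sc.textFile("/software/java/idea/data")
  .flatMap(x => x.split("\\|"))
  .map(_ => 1)
  .reduce(_ + _)   // same result as count(), but written as an aggregation
println(totalTokens)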

Spark RDD

让人想犯罪 __ submitted on 2019-12-15 02:32:37
scala> val rdd1 = sc.parallelize(List(63,45,89,23,144,777,888))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:15

Check the RDD's number of partitions:

scala> rdd1.partitions.length
res0: Int = 1

Specify the number of partitions at creation time:

scala> val rdd1 = sc.parallelize(List(63,45,89,23,144,777,888),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:15

Check the number of partitions:

scala> rdd1.partitions.length
res1: Int = 3

map/filter

scala> val rdd1 = sc.parallelize(List(1,2,100,3,4))
scala> val rdd2 = rdd1.map(x => x*2).collect
rdd2: Array[Int] = Array(2, 4, 200, 6, 8)

Sort ascending (smallest to largest):

scala> val
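
The transcript breaks off at the ascending sort. A hedged sketch of the step it was most likely heading toward (sortBy with ascending = true), reconstructed rather than copied from the original post:

scala> val rdd3 = rdd1.map(x => x * 2).sortBy(x => x, true).collect
rdd3: Array[Int] = Array(2, 4, 6, 8, 200)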

Spark learning notes

懵懂的女人 submitted on 2019-12-14 04:21:57
I always learn this and then forget it; I've gone through Spark several times but never managed to go deep. Sigh. So here I'm learning it again; if anyone has a better way to study Spark in depth, please share it.

I interviewed for big-data positions; here are a few of the questions I was asked:

Does Spark always keep intermediate results in memory? Of course not; they can live in memory or on disk.

Spark consists of workers and a master; workers and the master communicate over the network via RPC.

Copy the config to the other nodes: for i in {5..7}; do scp -r /bigdata/spark/conf/spark-env.sh node-$i:$PWD; done

Spark moves the computation rather than the data, because moving large volumes of data is expensive.

Spark is written in Scala, and the Spark distribution ships with the Scala compiler and libraries, but Spark runs on the JVM, so a JDK must be installed.

ZooKeeper can be used to build a highly available cluster. ZooKeeper is used to (1) elect the master, (2) store information about the active master, and (3) store the workers' resource information and usage (for failover).

In spark-env.sh, add export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER ..." together with the ZooKeeper-related settings.

A highly available Spark cluster requires manually starting the other (standby) Spark master; it is not started by spark-all.sh along with the rest of the cluster.
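
As a hedged illustration of what the elided ZooKeeper settings typically look like in spark-env.sh (the host names and the znode directory below are placeholders, not values from the original post):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk-1:2181,zk-2:2181,zk-3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"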

Dataframe state before save and after load - what's different?

巧了我就是萌 submitted on 2019-12-14 03:53:17
Question: I have a DF that contains some SQL expressions (coalesce, case/when, etc.). When I later try to map/flatMap this DF, I get a Task not serializable error due to the fields that contain the SQL expressions. (Why I need to map/flatMap this DF is a separate question.) When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert it to an RDD and run transformations without problems. How is the DF different before saving and after loading? In some way, the SQL expressions
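
The question is cut off above. A minimal sketch of the round-trip it describes, assuming a SparkSession named spark and a DataFrame named df; after the reload the DataFrame is backed only by the materialized Parquet data rather than by whatever the original expressions referenced:

df.write.mode("overwrite").parquet("/tmp/df_roundtrip")      // materialize the DF, expressions and all
val reloaded = spark.read.parquet("/tmp/df_roundtrip")       // plain columnar data read back
val mapped = reloaded.rdd.map(row => row.mkString("|"))      // the map that failed before now runs
mapped.take(5).foreach(println)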

How can I further reduce my Apache Spark task size

懵懂的女人 submitted on 2019-12-14 03:49:32
Question: I'm trying to run the following code in Scala on the Spark framework, but I get an extremely large task size (8MB):

tidRDD: RDD[ItemSet]
mh: MineHelper
x: ItemSet
broadcast_tid: Broadcast[Array[ItemSet]]
count: Int

tidRDD.flatMap(x => mh.mineFreqSets(x, broadcast_tid.value, count)).collect()

The reason I added the MinerHelper class was to make it serialisable, and it only contains the given method. An ItemSet is a class with 3 private members and a few getter/setter methods, nothing out of the ordinary
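
The snippet is cut off above. As a hedged sketch of a common way to shrink task size in this situation (reusing the question's own names, which are assumed rather than defined here): make sure the closure only captures small local vals, so Spark does not serialize the whole enclosing object into every task.

val helper = mh              // local val: only the helper itself is captured, not the outer class
val tids = broadcast_tid     // the broadcast handle is tiny; its value ships once per executor
val freq = tidRDD.flatMap(x => helper.mineFreqSets(x, tids.value, count))
freq.collect()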