Sort by a key, but value has more than one element using Scala


Question


I'm very new to Scala on Spark and am wondering how you might create key-value pairs with the key having more than one element. For example, I have this dataset of baby names:

Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45

And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?

I did accomplish this using a loop; refer to the code below. But I'm wondering if there is a shorter way to do this that takes advantage of Spark and Scala together (i.e., can I decrease computation time?).

val names = sc.textFile("names.csv").map(l => l.split(","))

val uniqueCounty = names.map(x => x(2)).distinct.collect

// Number is the 4th column (index 3); parse it so the sums are numeric
// and the descending sort works (assumes names.csv has no header row)
for (i <- 0 to uniqueCounty.length - 1) {
  val county = uniqueCounty(i).toString
  val eachCounty = names.filter(x => x(2) == county)
    .map(l => (l(1), l(3).trim.toInt))
    .reduceByKey((a, b) => a + b)
    .sortBy(-_._2)
  println("County:" + county + eachCounty.first)
}

Answer 1:


Here is a solution using RDDs. I am assuming you need the top occurring name per county.

val data = Array((2000, "JOHN", "KINGS", 50), (2000, "BOB", "KINGS", 40), (2000, "MARY", "NASSAU", 60), (2001, "JOHN", "KINGS", 14), (2001, "JANE", "KINGS", 30), (2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
// Sum the numbers per (county, name) combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3, x._2), x._4)).reduceByKey(_ + _)
// Group the (name, total) pairs per county
val countyNameRdd = uniqNamePerCountyRdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
// Sort descending and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect

Output:

res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
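
If you only need the single top name per county, the groupByKey plus full sort can be avoided. Here is a minimal alternative sketch (not from the original answer) that keeps a running maximum per county with reduceByKey, reusing uniqNamePerCountyRdd from above:

// Keep only the highest-total (name, total) pair per county,
// avoiding the shuffle of groupByKey followed by a full sort
val topNamePerCounty = uniqNamePerCountyRdd
  .map { case ((county, name), total) => (county, (name, total)) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
topNamePerCounty.collect
// e.g. Array((KINGS,(JOHN,64)), (NASSAU,(MARY,60)))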



Answer 2:


You could use spark-csv and the DataFrame API. If you are using a newer version of Spark (2.0), it is slightly different: Spark 2.0 has a native csv data source based on spark-csv.

Use spark-csv to load your csv file into a DataFrame.

import java.io.File

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show

Gives output:

+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS|    50|
|2000| BOB| KINGS|    40|
|2000|MARY|NASSAU|    60|
|2001|JOHN| KINGS|    14|
|2001|JANE| KINGS|    30|
|2001| BOB|NASSAU|    45|
+----+----+------+------+
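
For reference, the same load with Spark 2.0's built-in csv source would look like this; a sketch, assuming a SparkSession named spark:

// Spark 2.0+: the native csv reader replaces the spark-csv package
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("names.csv")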

DataFrames provide a set of operations for structured data manipulation. You could use some basic operations to get your result.

import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show

Gives output:

+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU|         60|
| KINGS|         50|
+------+-----------+

Is this what you are trying to achieve?

Notice the import org.apache.spark.sql.functions._, which provides the max() function used inside agg().

More information can be found in the DataFrames API documentation.

EDIT

For the output you were actually asking for (the most frequent name per county):

df.registerTempTable("names")

//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
  "count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show

Gives output:

+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN|              2|
|NASSAU|MARY|              1|
+------+----+---------------+
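
A window function is another way to express "top name per county" without the nested query; a sketch, assuming window-function support (a HiveContext in Spark 1.x, native in 2.0+):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, desc}

// Count occurrences per (County, Name), then rank names within each county
val occurrences = df.groupBy("County", "Name").count().withColumnRenamed("count", "Occurrence")
val w = Window.partitionBy("County").orderBy(desc("Occurrence"))
occurrences.withColumn("rank", row_number().over(w))
  .filter("rank = 1")
  .drop("rank")
  .show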


Source: https://stackoverflow.com/questions/40011756/sort-by-a-key-but-value-has-more-than-one-element-using-scala
