Question:
I'm very new to Scala on Spark and am wondering how you might create key-value pairs with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop; see below. But I'm wondering if there is a shorter way to do this that makes better use of Spark and Scala (i.e., can I decrease computation time?).
// split on commas, trim stray spaces, and drop the header row
val names = sc.textFile("names.csv").map(l => l.split(",").map(_.trim)).filter(x => x(0) != "Year")
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 until uniqueCounty.length) {
  val county = uniqueCounty(i)
  // sum Number (column index 3, not 4) per name, then sort by total descending
  val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(3).toInt)).reduceByKey(_ + _).sortBy(-_._2)
  println("County: " + county + " " + eachCounty.first)
}
Answer 1:
Here is a solution using RDDs. I am assuming you need the top-occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50), (2000, "BOB", "KINGS", 40), (2000, "MARY", "NASSAU", 60), (2001, "JOHN", "KINGS", 14), (2001, "JANE", "KINGS", 30), (2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
// Sum the values for each unique (county, name) combination key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3, x._2), x._4)).reduceByKey(_ + _)
// Group the (name, total) pairs per county
val countyNameRdd = uniqNamePerCountyRdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
// Sort descending and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
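Since groupByKey pulls every name for a county onto one executor, you can also skip it entirely and reduce straight to the top pair per county. A minimal sketch reusing uniqNamePerCountyRdd from above (the variable name topNamePerCounty is mine):
// Keep only the highest-total (name, total) pair per county; avoids shuffling full groups
val topNamePerCounty = uniqNamePerCountyRdd
  .map { case ((county, name), total) => (county, (name, total)) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
topNamePerCounty.collect  // Array((KINGS,(JOHN,64)), (NASSAU,(MARY,60)))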
Answer 2:
You could use spark-csv and the DataFrame API. If you are using the newer version of Spark (2.0), it is slightly different, since Spark 2.0 has a native csv data source based on spark-csv.
Use spark-csv to load your csv file into a DataFrame.
import java.io.File

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
+----+----+------+------+
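For reference, in Spark 2.0+ the external package is unnecessary; a minimal sketch of the equivalent load, assuming the file sits at a plain path names.csv:
// Spark 2.0+: SparkSession replaces SQLContext and csv is a built-in source
val spark = org.apache.spark.sql.SparkSession.builder().appName("babynames").getOrCreate()
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("names.csv")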
DataFrames provide a set of operations for structured data manipulation. You could use some basic operations to obtain your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU| 60|
| KINGS| 50|
+------+-----------+
Is this what you are trying to achieve? Notice the import org.apache.spark.sql.functions._, which is needed for the agg() function.
More information about the DataFrames API
EDIT
For the correct output:
df.registerTempTable("names")
// there is probably a better query for this
sqlContext.sql(
  "SELECT * FROM (SELECT Name, County, count(1) AS Occurrence FROM names " +
  "GROUP BY Name, County ORDER BY count(1) DESC) n"
).groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN| 2|
|NASSAU|MARY| 1|
+------+----+---------------+
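One such better query, sketched with window functions (assuming Spark 2.0+, and that "most frequent" means the largest total Number per county, as in the question's own loop):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank names by their summed Number within each county and keep the top one
val byCounty = Window.partitionBy("County").orderBy(desc("total"))
df.groupBy("County", "Name")
  .agg(sum("Number").as("total"))
  .withColumn("rank", row_number().over(byCounty))
  .filter(col("rank") === 1)
  .drop("rank")
  .show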
Source: https://stackoverflow.com/questions/40011756/sort-by-a-key-but-value-has-more-than-one-element-using-scala