Removing duplicates

Submitted by 送分小仙女 on 2019-12-02 07:23:53

Essentially you want a version of distinct where you can specify what makes an object (row) unique (the second column).

Given the following code (a modified version of SeqLike.distinct):

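import scala.collection.mutable

type Row = (Int, String)
def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
   val b = Seq.newBuilder[Row]            // collects the rows we keep, in input order
   val seen = mutable.HashSet[AnyRef]()   // keys encountered so far
   for (x <- rows) {
     val key = f(x)                       // the key must be computed per row, inside the loop
     if (!seen(key)) {
       b += x
       seen += key
     }
   }
   b.result()
 }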

If you had a list of rows (where a row is a tuple) you could get the filtered/unique ones based on the second column with

distinct(rows, (_._2))
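As a minimal, self-contained sketch of that call: the sample rows below are made up, and distinctByKey is a hypothetical generic variant of the function above, not something from the original answer. It keeps the first row seen for each key.

```scala
import scala.collection.mutable

object DistinctByKeyExample {
  type Row = (Int, String)

  // Hypothetical generic helper: keeps the first element seen for each key.
  def distinctByKey[A, K](items: Seq[A])(key: A => K): Seq[A] = {
    val seen = mutable.HashSet[K]()
    val out = Seq.newBuilder[A]
    for (item <- items) {
      // HashSet.add returns true only when the key was not already present
      if (seen.add(key(item))) out += item
    }
    out.result()
  }

  // Made-up (year, sentence) rows, for illustration only
  val rows: Seq[Row] = Seq(
    (2001, "water is wet"),
    (2002, "fire is hot"),
    (2003, "water is wet")
  )

  // Uniqueness is decided by the second column only
  val uniqueRows: Seq[Row] = distinctByKey(rows)(_._2)

  def main(args: Array[String]): Unit = uniqueRows.foreach(println)
}
```

With this data the 2003 row is dropped because its sentence already appeared in 2001. On Scala 2.13 and later, the built-in rows.distinctBy(_._2) should give the same result without a helper.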

Does your solution need to be reproducible in code? If not, then in Excel, click the little box directly above "1" and to the left of "A" to select everything, open the "Data" tab, and click "Remove Duplicates". Make sure "My data has headers" is checked if you have headers, then uncheck the column that holds the years, leaving only the column that holds the sentences checked. This removes duplicate sentences but keeps the row with the first occurrence of each one, along with its year.

As sets naturally eliminate duplicates, a simple approach would be to fill the rows into a TreeSet, using a custom ordering which only takes into account the text part of each row.

Update

Here is a sample script to demonstrate the above:

import collection.immutable.TreeSet
import scala.io.Source

val lines = Source.fromFile("science.csv").getLines()
val uniques = lines.foldLeft(TreeSet[String]()(Ordering.by(_.split(',')(1)))) {
  (s, l) =>
    if (s contains l) s
    else s + l
}
uniques.toList.sorted foreach println

The script folds the sequence of lines into a TreeSet with a custom ordering based on the second field of each comma-separated line. The simplest fold function would be (s, l) => s + l; however, that would result in lines with a later year overwriting lines with the same text from earlier years. This is why I had to test for containment first.

We are now almost done: we just need to sort the collection by year again before printing (this assumes the input was ordered by year).
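To try the approach without a file on disk, here is a hedged sketch of the same fold over in-memory data. The sample lines and the names byText and result are assumptions made for illustration; they stand in for the contents of science.csv.

```scala
import scala.collection.immutable.TreeSet

object TreeSetDedupExample {
  // Made-up "year,sentence" lines standing in for the contents of science.csv
  val lines: List[String] = List(
    "2001,water is wet",
    "2002,fire is hot",
    "2003,water is wet"
  )

  // Compare lines by the text after the first comma only, so two lines
  // with the same sentence are treated as equal by the set.
  val byText: Ordering[String] =
    Ordering.by((line: String) => line.split(',')(1))

  // Keep the first line seen for each sentence; later duplicates are skipped.
  val uniques: TreeSet[String] =
    lines.foldLeft(TreeSet.empty[String](byText)) { (set, line) =>
      if (set.contains(line)) set else set + line
    }

  // Default String ordering compares whole lines, which sorts by the leading
  // year as long as all years have the same number of digits.
  val result: List[String] = uniques.toList.sorted

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

Note that the sketch assumes every line contains at least one comma; a line without one would make split(',')(1) throw.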
