How to find duplicates in a list?

问题

I have a list of unsorted integers and I want to find those elements which have duplicates.

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

I can find the distinct elements of the set with dup.distinct, so I wrote my answer as follows.

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
val distinct = dup.distinct
val elementsWithCounts = distinct.map( (a:Int) => (a, dup.count( (b:Int) => a == b )) )
val duplicatesRemoved = elementsWithCounts.filter( (pair: Pair[Int,Int]) => { pair._2 <= 1 } )
val withDuplicates = elementsWithCounts.filter( (pair: Pair[Int,Int]) => { pair._2 > 1 } )

Is there an easier way to solve this?

回答1:

Try this:

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup.groupBy(identity).collect { case (x, List(_,_,_*)) => x }

The groupBy associates each distinct integer with a list of its occurrences. The collect is basically map where non-matching elements are ignored. The match pattern following case will match integers x that are associated with a list that fits the pattern List(_,_,_*), a list with at least two elements, each represented by an underscore since we don't actually need to store those values (and those two elements can be followed by zero or more elements: _*).

You could also do:

dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => x }

It's much faster than the approach you provided since it doesn't have to repeatedly pass over the data.

回答2:

A bit late to the party, but here's another approach:

dup.diff(dup.distinct).distinct

diff gives you all the extra items above those in the argument (dup.distinct), which are the duplicates.

回答3:

Another approach is to use foldLeft and do it the hard way.

We start with two empty sets. One is for elements that we have seen at least once. The other for elements that we have seen at least twice (aka duplicates).

We traverse the list. When the current element has already been seen (seen(cur)) it is a duplicate and therefore added to duplicates. Otherwise we add it to seen. The result is now the second set that contains the duplicates.

We can also write this as a generic method.

def dups[T](list: List[T]) = list.foldLeft((Set.empty[T], Set.empty[T])){ case ((seen, duplicates), cur) => 
  if(seen(cur)) (seen, duplicates + cur) else (seen + cur, duplicates)      
}._2

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

dups(dup) //Set(1,5,101)

回答4:

Summary: I've written a very efficient function which returns both List.distinct and a List consisting of each element which appeared more than once and the index at which the element duplicate appeared.

Details: If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List (as ordering was significant) exactly once and returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct) and a second List showing each duplicate and an Int index at which it occurred within the original List.

I have implemented the function twice based on the general performance characteristics of the Scala collections; filterDupesL (where the L is for Linear) and filterDupesEc (where the Ec is for Effectively Constant).

Here's the "Linear" function:

def filterDupesL[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(
      remaining: List[A]
    , index: Int =
        0
    , accumulator: (List[A], List[(A, Int)]) =
        (Nil, Nil)): (List[A], List[(A, Int)]
  ) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head)) //contains is linear
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}

An below is an example which might make it a bit more intuitive. Given this List of String values:

val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")

...and then performing the following:

val (deduped, dupeAndIndexs) =
  filterDupesL(withDupes)

...the results are:

deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))

And if you just want the duplicates, you simply map across dupeAndIndexes and invoke distinct:

val dupesOnly =
  dupeAndIndexs.map(_._1).distinct

...or all in a single call:

val dupesOnly =
  filterDupesL(withDupes)._2.map(_._1).distinct

...or if a Set is preferred, skip distinct and invoke toSet...

val dupesOnly2 =
  dupeAndIndexs.map(_._1).toSet

...or all in a single call:

val dupesOnly2 =
  filterDupesL(withDupes)._2.map(_._1).toSet

For very large Lists, consider using this more efficient version (which uses an additional Set to change the contains check in effectively constant time):

Here's the "Effectively Constant" function:

def filterDupesEc[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(
      remaining: List[A]
    , index: Int =
        0
    , seenAs: Set[A] =
        Set()
    , accumulator: (List[A], List[(A, Int)]) =
        (Nil, Nil)): (List[A], List[(A, Int)]
  ) =
    if (remaining.isEmpty)
      accumulator
    else {
      val (isInSeenAs, seenAsNext) = {
        val isInSeenA =
          seenAs.contains(remaining.head) //contains is effectively constant
        (
            isInSeenA
          , if (!isInSeenA)
              seenAs + remaining.head
            else
              seenAs
        )
      }
      recursive(
          remaining.tail
        , index + 1
        , seenAsNext
        , if (isInSeenAs)
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )
    }
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}

Both of the above functions are adaptations of the filterDupes function in my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.

回答5:

My favorite is

def hasDuplicates(in: List[Int]): Boolean = {
  val sorted = in.sortWith((i, j) => i < j)
  val zipped = sorted.tail.zip(sorted)
  zipped.exists(p => p._1 == p._2)
}

来源：https://stackoverflow.com/questions/24729544/how-to-find-duplicates-in-a-list

标签

list

scala

scala-collections