Question
I have a list of unsorted integers and I want to find those elements which have duplicates.
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
I can find the distinct elements of the set with dup.distinct, so I wrote my answer as follows.
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
val distinct = dup.distinct
val elementsWithCounts = distinct.map( (a:Int) => (a, dup.count( (b:Int) => a == b )) )
val duplicatesRemoved = elementsWithCounts.filter( (pair: Pair[Int,Int]) => { pair._2 <= 1 } )
val withDuplicates = elementsWithCounts.filter( (pair: Pair[Int,Int]) => { pair._2 > 1 } )
Is there an easier way to solve this?
Answer 1:
Try this:
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup.groupBy(identity).collect { case (x, List(_,_,_*)) => x }
The groupBy associates each distinct integer with a list of its occurrences. The collect is basically map where non-matching elements are ignored. The match pattern following case will match integers x that are associated with a list fitting the pattern List(_,_,_*): a list with at least two elements, each represented by an underscore since we don't actually need those values (and those two elements can be followed by zero or more elements: _*).
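To see the intermediate step, here is a quick REPL-style sketch (the printed Map ordering is unspecified, so your output may differ):

val grouped = dup.groupBy(identity)
// grouped: Map[Int, List[Int]] = Map(1 -> List(1, 1, 1), 2 -> List(2), 5 -> List(5, 5), 101 -> List(101, 101), ...)
dup.groupBy(identity).collect { case (x, List(_, _, _*)) => x }
// an Iterable containing the duplicated values 1, 5 and 101, in some order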
You could also do:
dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => x }
It's much faster than the approach you provided since it doesn't have to repeatedly pass over the data.
Answer 2:
A bit late to the party, but here's another approach:
dup.diff(dup.distinct).distinct
diff gives you all the extra items over and above those in the argument (dup.distinct), and those extras are exactly the duplicates.
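A quick worked example with the list from the question makes the mechanics clear:

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
// diff removes ONE occurrence of each element of the argument, so only the "extra" copies survive
dup.diff(dup.distinct)            // List(1, 1, 5, 101)
dup.diff(dup.distinct).distinct   // List(1, 5, 101)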
Answer 3:
Another approach is to use foldLeft and do it the hard way.
We start with two empty sets. One is for elements that we have seen at least once. The other for elements that we have seen at least twice (aka duplicates).
We traverse the list. When the current element has already been seen (seen(cur)), it is a duplicate and is therefore added to duplicates. Otherwise we add it to seen.
The result is now the second set that contains the duplicates.
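For the list from the question, that fold can be written directly; this is a concrete sketch of the same logic as the generic method below, just specialized to Int:

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup.foldLeft((Set.empty[Int], Set.empty[Int])) { case ((seen, duplicates), cur) =>
  if (seen(cur)) (seen, duplicates + cur)   // already seen once: record as duplicate
  else           (seen + cur, duplicates)   // first sighting: remember it
}._2
// Set(1, 5, 101)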
We can also write this as a generic method.
def dups[T](list: List[T]) = list.foldLeft((Set.empty[T], Set.empty[T])) { case ((seen, duplicates), cur) =>
  if (seen(cur)) (seen, duplicates + cur) else (seen + cur, duplicates)
}._2
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dups(dup) //Set(1,5,101)
Answer 4:
Summary:
I've written a very efficient function which returns both List.distinct and a List consisting of each element which appeared more than once, together with the index at which that duplicate appeared.
Details:
If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List (as ordering was significant) exactly once and returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct) and a second List showing each duplicate and the Int index at which it occurred within the original List.
I have implemented the function twice based on the general performance characteristics of the Scala collections: filterDupesL (where the L is for Linear) and filterDupesEc (where the Ec is for Effectively Constant).
Here's the "Linear" function:
def filterDupesL[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(
      remaining: List[A]
    , index: Int = 0
    , accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
  ): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head)) //contains is linear
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}
And below is an example which might make it a bit more intuitive. Given this List of String values:
val withDupes =
List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")
...and then performing the following:
val (deduped, dupeAndIndexs) =
filterDupesL(withDupes)
...the results are:
deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))
And if you just want the duplicates, you simply map across dupeAndIndexs and invoke distinct:
val dupesOnly =
dupeAndIndexs.map(_._1).distinct
...or all in a single call:
val dupesOnly =
filterDupesL(withDupes)._2.map(_._1).distinct
...or if a Set is preferred, skip distinct and invoke toSet...
val dupesOnly2 =
dupeAndIndexs.map(_._1).toSet
...or all in a single call:
val dupesOnly2 =
filterDupesL(withDupes)._2.map(_._1).toSet
For very large Lists, consider using this more efficient version (which uses an additional Set to make the contains check effectively constant time):
Here's the "Effectively Constant" function:
def filterDupesEc[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(
      remaining: List[A]
    , index: Int = 0
    , seenAs: Set[A] = Set()
    , accumulator: (List[A], List[(A, Int)]) = (Nil, Nil)
  ): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else {
      val (isInSeenAs, seenAsNext) = {
        val isInSeenA = seenAs.contains(remaining.head) //contains is effectively constant
        (
            isInSeenA
          , if (!isInSeenA) seenAs + remaining.head
            else seenAs
        )
      }
      recursive(
          remaining.tail
        , index + 1
        , seenAsNext
        , if (isInSeenAs)
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
      )
    }
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}
Both of the above functions are adaptations of the filterDupes function in my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.
Answer 5:
My favorite is
def hasDuplicates(in: List[Int]): Boolean = {
val sorted = in.sortWith((i, j) => i < j)
val zipped = sorted.tail.zip(sorted)
zipped.exists(p => p._1 == p._2)
}
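Note that this function only reports whether any duplicate exists. If you also want the duplicate values themselves, here is a small sketch along the same sort-and-zip lines (the helper name findDuplicates is not from the answer, just illustrative):

// After sorting, equal neighbours are duplicates; collect one representative per pair.
def findDuplicates(in: List[Int]): List[Int] = {
  val sorted = in.sorted
  sorted.tail.zip(sorted).collect { case (a, b) if a == b => a }.distinct
}

findDuplicates(List(1,1,1,2,3,4,5,5,6,100,101,101,102)) // List(1, 5, 101)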
Source: https://stackoverflow.com/questions/24729544/how-to-find-duplicates-in-a-list