In Scala how do I remove duplicates from a list?

前端 未结 8 599
旧巷少年郎
旧巷少年郎 2020-12-13 22:57

Suppose I have

val dirty = List(\"a\", \"b\", \"a\", \"c\")

Is there a list operation that returns \"a\", \"b\", \"c\"

相关标签:
8条回答
  • 2020-12-13 23:23

    scala.collection.immutable.List now has a .distinct method.

    So calling dirty.distinct is now possible without converting to a Set or Seq.

    0 讨论(0)
  • 2020-12-13 23:23

    For Already-Sorted Lists

    If you happen to want the distinct items of a list that you know is already sorted, as I have often needed, the following performs about twice the speed as .distinct:

      def distinctOnSorted[V](seq: List[V]): List[V] =
        seq.foldLeft(List[V]())((result, v) =>
          if (result.isEmpty || v != result.head) v :: result else result)
        .reverse
    

    Performance results on a list of 100,000,000 random Ints from 0-99:

    distinct        : 0.6655373s
    distinctOnSorted: 0.2848134s
    

    Performance with MutableList or ListBuffer

    While it would seem that a more mutable / non-functional programming approach might be faster than prepending to an immutable list, practice shows otherwise. The immutable implementation consistently performs better. My guess for the reason is that scala focuses its compiler optimizations on immutable collections, and does a good job at it. (I welcome others to submit better implementations.)

    List size 1e7, random 0 to 1e6
    ------------------------------
    distinct            : 4562.2277ms
    distinctOnSorted    : 201.9462ms
    distinctOnSortedMut1: 4399.7055ms
    distinctOnSortedMut2: 246.099ms
    distinctOnSortedMut3: 344.0758ms
    distinctOnSortedMut4: 247.0685ms
    
    List size 1e7, random 0 to 100
    ------------------------------
    distinct            : 88.9158ms
    distinctOnSorted    : 41.0373ms
    distinctOnSortedMut1: 3283.8945ms
    distinctOnSortedMut2: 54.4496ms
    distinctOnSortedMut3: 58.6073ms
    distinctOnSortedMut4: 51.4153ms
    

    Implementations:

    object ListUtil {
      def distinctOnSorted[V](seq: List[V]): List[V] =
        seq.foldLeft(List[V]())((result, v) =>
          if (result.isEmpty || v != result.head) v :: result else result)
        .reverse
    
      def distinctOnSortedMut1[V](seq: List[V]): Seq[V] = {
        if (seq.isEmpty) Nil
        else {
          val result = mutable.MutableList[V](seq.head)
          seq.zip(seq.tail).foreach { case (prev, next) =>
            if (prev != next) result += next
          }
          result //.toList
        }
      }
    
      def distinctOnSortedMut2[V](seq: List[V]): Seq[V] = {
        val result = mutable.MutableList[V]()
        if (seq.isEmpty) return Nil
        result += seq.head
        var prev = seq.head
        for (v <- seq.tail) {
          if (v != prev) result += v
          prev = v
        }
        result //.toList
      }
    
      def distinctOnSortedMut3[V](seq: List[V]): List[V] = {
        val result = mutable.MutableList[V]()
        if (seq.isEmpty) return Nil
        result += seq.head
        var prev = seq.head
        for (v <- seq.tail) {
          if (v != prev) v +=: result
          prev = v
        }
        result.reverse.toList
      }
    
      def distinctOnSortedMut4[V](seq: List[V]): Seq[V] = {
        val result = ListBuffer[V]()
        if (seq.isEmpty) return Nil
        result += seq.head
        var prev = seq.head
        for (v <- seq.tail) {
          if (v != prev) result += v
          prev = v
        }
        result //.toList
      }
    }
    

    Test:

    import scala.util.Random
    
    class ListUtilTest extends UnitSpec {
      "distinctOnSorted" should "return only the distinct elements in a sorted list" in {
        val bigList = List.fill(1e7.toInt)(Random.nextInt(100)).sorted
    
        val t1 = System.nanoTime()
        val expected = bigList.distinct
        val t2 = System.nanoTime()
        val actual = ListUtil.distinctOnSorted[Int](bigList)
        val t3 = System.nanoTime()
        val actual2 = ListUtil.distinctOnSortedMut1(bigList)
        val t4 = System.nanoTime()
        val actual3 = ListUtil.distinctOnSortedMut2(bigList)
        val t5 = System.nanoTime()
        val actual4 = ListUtil.distinctOnSortedMut3(bigList)
        val t6 = System.nanoTime()
        val actual5 = ListUtil.distinctOnSortedMut4(bigList)
        val t7 = System.nanoTime()
    
        actual should be (expected)
        actual2 should be (expected)
        actual3 should be (expected)
        actual4 should be (expected)
        actual5 should be (expected)
    
        val distinctDur = t2 - t1
        val ourDur = t3 - t2
    
        ourDur should be < (distinctDur)
    
        print(s"distinct            : ${distinctDur / 1e6}ms\n")
        print(s"distinctOnSorted    : ${ourDur / 1e6}ms\n")
        print(s"distinctOnSortedMut1: ${(t4 - t3) / 1e6}ms\n")
        print(s"distinctOnSortedMut2: ${(t5 - t4) / 1e6}ms\n")
        print(s"distinctOnSortedMut3: ${(t6 - t5) / 1e6}ms\n")
        print(s"distinctOnSortedMut4: ${(t7 - t6) / 1e6}ms\n")
      }
    }
    
    0 讨论(0)
  • 2020-12-13 23:23

    You can also use recursion and pattern matching:

    def removeDuplicates[T](xs: List[T]): List[T] = xs match {
      case Nil => xs
      case head :: tail => head :: removeDuplicates(for (x <- tail if x != head) yield x)
    }
    
    
    0 讨论(0)
  • 2020-12-13 23:25

    Using Set in the first place is the right way to do it, of course, but:

    scala> List("a", "b", "a", "c").toSet.toList
    res1: List[java.lang.String] = List(a, b, c)
    

    Works. Or just toSet as it supports the Seq Traversable interface.

    0 讨论(0)
  • 2020-12-13 23:26

    The algorithmic way...

    def dedupe(str: String): String = {
      val words = { str split " " }.toList
    
      val unique = words.foldLeft[List[String]] (Nil) {
        (l, s) => {
          val test = l find { _.toLowerCase == s.toLowerCase } 
          if (test == None) s :: l else l
        }
      }.reverse
    
      unique mkString " "
    }
    
    0 讨论(0)
  • 2020-12-13 23:30

    Before using Kitpon's solution, think about using a Set rather than a List, it ensures each element is unique.

    As most list operations (foreach, map, filter, ...) are the same for sets and lists, changing collection could be very easy in the code.

    0 讨论(0)
提交回复
热议问题