Generating a frequency map for a string in Scala

后端 未结 4 1138
爱一瞬间的悲伤
爱一瞬间的悲伤 2020-12-29 04:46

Let\'s say I have a string, \"hello\", and I want to generate a character frequency map:

Map[Char,Int] = Map(h -> 1, e -> 1, o -> 1, l -> 2)


        
相关标签:
4条回答
  • 2020-12-29 04:57

    Extending Axel's answer.

    Your groupBy solution is already functional. There's just a tiny-tiny correction to it which could make it cleaner:

    str.groupBy(_.toChar).mapValues(_.size)
    

    The Scala's alternative to inject is foldLeft, foldRight, reduce, reduceOption depending on how you use it. The way you've used inject in Ruby is not functional, since your solution is based on mutating h and in functional world mutability is a "no-no". Here's how you'd do the solution close to your inject but in functional style in Scala:

    str.foldLeft( Map[Char, Int]() ){ (m, c) => m + (c -> (m.getOrElse(c, 0) + 1)) }
    

    Obviously groupBy looks much better.

    0 讨论(0)
  • 2020-12-29 05:08

    1) What does p mean?

    groupBy takes a function which maps an elements to a key of type K. When invoked on some collection Coll, it returns a Map[K, Coll] which contains mappings from keys K to all the elements which mapped to the same key.

    So, in your case, str.groupBy(_.toChar) yields a map mapping from a key k (which is a character) to a string with all the elements (characters) c such that k == c.toChar. You get this:

    Map(e -> "e", h -> "h", l -> "ll", o -> "o")
    

    A Map is an iterable of pairs of keys and values. In this case, each pair is a character and a string of elements. Calling the map operation on a Map involves mapping on these pairs - p is a pair where p._1 is a character, and p._2 is the associated string (on which you can call length, as you did above).

    2) How to do this idiomatically

    The above is how to do it idiomatically - using groupBy and map. Alternatively, you can use an immutable map and recursion on the string length to compute the frequencies, or an immutable map and a foldLeft.

    3) Performance characteristic

    Best to benchmark to see the differences. Here are a couple of microbenchmark for a highly-repetitive string (~3GHz iMac, JDK7, Scala 2.10.0 nightly):

    object Imperative extends testing.Benchmark {
      val str = "abc" * 750000
    
      def run() {
        var counts = new scala.collection.mutable.HashMap[Char,Int]
        var i = 0
        val until = str.length
        while (i < until) {
          var c = str(i)
          if (counts.contains(c))
            counts.put(c, counts(c) + 1)
          else
            counts.put(c, 1)
          i += 1
        }
    
        //println(f)
      }
    }
    
    
    object Combinators extends testing.Benchmark {
      val str = "abc" * 750000
    
      def run() {
        val f = str.groupBy(_.toChar).map(p => (p._1, p._2.length))
      }
    }
    
    
    object Fold extends testing.Benchmark {
      val str = "abc" * 750000
    
      def run() {
        val f = str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}
      }
    }
    

    Results:

    • Imperative: $ 103 57 53 58 53 53 53 53 53 53

    • Combinators: $ 72 51 63 56 53 52 52 54 53 53

    • Fold: $ 163 62 71 62 57 57 57 58 57 57

    Note that changing the imperative version to use withDefaultValue:

    var counts = new scala.collection.mutable.HashMap[Char,Int].withDefaultValue(0)
    var i = 0
    val until = str.length
    while (i < until) {
      var c = str(i)
      counts.put(c, counts(c) + 1)
      i += 1
    }
    

    is apparently terribly slow due to forwarding each put call:

    • withDefaultValue: $ 133 87 109 106 101 100 101 100 101 101

    Conclusion: the boxing and unboxing of characters in this case is high-enough so that the differences in performance between these approaches are hard to observe.

    EDIT:

    Update: You may want to use ScalaMeter inline benchmarking in place of the Benchmark trait.

    0 讨论(0)
  • 2020-12-29 05:13

    Your example on ruby can be almost directly translated to Scala using foldLeft and immutable Map.

    Here is one of possible solutions:

    str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}
    

    Actually, if you are ok with local mutability, you can make something like this:

    def charFrequencies(str: String): collection.Map[Char, Int] = {
      val hash = collection.mutable.HashMap.empty[Char, Int] withDefaultValue 0
      str foreach { hash(_) += 1 }
      hash
    }
    

    Expression hash(_) += 1 will be desugared to c => hash(c) = hash(c) + 1 and then to c => hash.update(c, hash.apply(c) + 1)

    This solution should be more efficient than functional ones, because it don't create intermediate collections. Also because method returns immutable collection.Map[Char, Int], result will be treated as immutable (as long as no one will perform unsafe downcasting on it).

    0 讨论(0)
  • 2020-12-29 05:14

    Starting in Scala 2.13, we can use the groupMapReduce method which is (as its name suggests) an equivalent of a groupBy followed by mapValues and a reduce step:

    "hello".groupMapReduce(identity)(_ => 1)(_ + _)
    // immutable.Map[Char,Int] = Map(e -> 1, h -> 1, l -> 2, o -> 1)
    

    This:

    • groups characters (group part of groupMapReduce)

    • maps each grouped value occurrence to 1 (map part of groupMapReduce)

    • reduces values within a group of values (_ + _) by summing them (reduce part of groupMapReduce).

    This is an equivalent version performed in one pass through the sequence of chars of:

    "hello".groupBy(identity).mapValues(_.map(_ => 1).reduce(_+_))
    
    0 讨论(0)
提交回复
热议问题