Algorithm to find the most common substrings in a string

耶瑟儿~ 2020-12-01 03:36

Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most

5 Answers
  •  离开以前
    2020-12-01 04:28

    Every substring of length >= 2 contains a substring of length 2, and that shorter substring occurs in the text at least as often as the longer one. So to find a most frequent substring it is enough to count the substrings of length 2.

    val s = "hello world this is hello world. hello world repeats three times in this string!"
    
    val li = s.sliding (2, 1).toList
    // li: List[String] = List(he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " t", th, hi, is, "s ", " i", is, "s ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, d., ". ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " r", re, ep, pe, ea, at, ts, "s ", " t", th, hr, re, ee, "e ", " t", ti, im, me, es, "s ", " i", in, "n ", " t", th, hi, is, "s ", " s", st, tr, ri, in, ng, g!)
    
    val uniques = li.toSet
    uniques.toList.map (u => li.count (_ == u))
    // res18: List[Int] = List(1, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 1, 3, 2, 1, 3, 1, 3, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 2, 4, 1, 2, 2, 1)
    
    // The highest count (5) sits at index 6 of the list above, so look up the pair at that index:
    uniques.toList(6)
    // res19: String = "s "
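
The lookup above reads the index of the highest count off the REPL output by hand. A minimal sketch of doing the same thing in one pass follows; it is not part of the original answer. The names `MostCommonSubstring`, `mostCommon` and the window length `n` are made up for illustration, and the result is wrapped in an `Option` so that a text shorter than `n` has a defined answer.

```scala
object MostCommonSubstring {
  /** Most frequent substring of length n together with its count,
    * or None when the text is shorter than n.
    * Hypothetical helper for illustration; ties are broken arbitrarily.
    */
  def mostCommon(text: String, n: Int): Option[(String, Int)] = {
    require(n > 0, "n must be positive")
    if (text.length < n) None
    else {
      // All windows of length n, stepping one character at a time.
      val windows = text.sliding(n, 1).toList
      // Group equal windows together and count the size of each group.
      val counts = windows.groupBy(identity).map { case (sub, occ) => (sub, occ.size) }
      // Pick the entry with the largest count.
      Some(counts.maxBy(_._2))
    }
  }
}
```

With the string `s` from the answer, `MostCommonSubstring.mostCommon(s, 2)` gives the pair `"s "` with a count of 5, which matches the result found by hand above. Passing `n = 11` gives `"hello world"` with a count of 3.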
    
