How to determine a string's DNA for likeness to another

爱一瞬间的悲伤 2020-12-28 20:44

I am hoping I am wording this correctly to get across what I am looking for.

I need to compare two pieces of text. If the two strings are alike I would like to get s

6 answers
  •  不知归路
    2020-12-28 21:03

    Many people have suggested distance- or metric-based approaches, and I think the wording of the question leads that way. (Incidentally, a hash like md5 is designed to do almost the opposite of what a metric does: a small change to the input produces a completely different digest, so it's hardly surprising that it didn't work for you. There are similar ideas that don't change much under small deltas, but I suspect they don't encode enough information for what you want to do.)
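To make the hash-versus-metric point concrete, here is a minimal sketch using only the Python standard library. The two sample sentences and the helper names `md5_hex` and `similarity` are made up for illustration, and `difflib.SequenceMatcher` stands in for "a metric-like approach" in general; it is not something the question or the other answers prescribe.

```python
import hashlib
from difflib import SequenceMatcher


def md5_hex(text: str) -> str:
    """Return the md5 digest of text as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical inputs that differ by a single word.
a = "Your order has shipped and will arrive on Monday."
b = "Your order has shipped and will arrive on Tuesday."

digest_a, digest_b = md5_hex(a), md5_hex(b)

# Count hex positions where the two digests happen to agree.
matching_hex = sum(x == y for x, y in zip(digest_a, digest_b))

# A similarity score reflects how much of the text is shared.
score = similarity(a, b)

print(digest_a)
print(digest_b)
print(f"matching hex positions: {matching_hex} of 32")
print(f"similarity ratio: {score:.3f}")
```

The digests tell you only whether the inputs are byte-for-byte equal, while the ratio stays high for near-duplicates, which is the behaviour the question seems to be after.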

    Particularly given your update in the comments though, I think this type of approach is not very helpful.

    What you are describing is closer to a clustering problem: you want to generate a signature (i.e. a feature vector) from each email and later compare new inputs against it. So essentially you have a machine learning problem. Deciding what "close" means may be a bit of a challenge. To get started, though, assuming it really is emails you're looking at, you would do well to look at the kinds of feature generation used by many spam filters. That will give you a space (probably Euclidean, at least to start) in which to measure distances between signatures.
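As a rough sketch of the signature idea, the following turns each email into a small feature vector of word frequencies and picks the known signature with the smallest Euclidean distance. Everything here is an assumption for illustration: the `vocabulary` list, the sample emails, and the function names are hypothetical, and real spam filters use far richer features than this.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary: each word is one dimension of the feature space.
vocabulary = ["free", "offer", "click", "meeting", "report", "invoice"]


def features(text, vocab):
    """Map text to a vector of relative word frequencies over vocab."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocab]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical emails whose signatures we already know.
known = {
    "promo": "Free offer! Click now to claim your free gift.",
    "work": "Please send the report before the meeting on Friday.",
}
signatures = {label: features(text, vocabulary) for label, text in known.items()}


def closest(text):
    """Return the label of the stored signature nearest to text."""
    vec = features(text, vocabulary)
    return min(signatures, key=lambda label: euclidean(vec, signatures[label]))


new_email = "Click here for a limited free offer."
print(closest(new_email))
```

With more than a handful of examples you would cluster the vectors or train a classifier on them instead of comparing against one signature per label, but the distance computation stays the same.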

    Without knowing more about your problem it's hard to be more specific.
