How python-Levenshtein.ratio is computed

庸人自扰 2020-12-13 06:50

According to the python-Levenshtein.ratio source:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722

it's computed a

4 answers
  •  甜味超标
    2020-12-13 07:25

    By looking more carefully at the C code, I found that this apparent contradiction arises because ratio gives the "replace" edit operation a cost of 2, whereas distance gives every operation, replace included, a cost of 1.

    This can be seen in the calls to the internal levenshtein_common function made within the ratio_py function:


    https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L727

    static PyObject*
    ratio_py(PyObject *self, PyObject *args)
    {
      size_t lensum;
      long int ldist;
    
      if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0) // third argument (xcost) is 1: replace costs 2
        return NULL;
    
      if (lensum == 0)
        return PyFloat_FromDouble(1.0);
    
      return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
    }
    

    and within the distance_py function, which passes 0 instead:

    https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L715

    static PyObject*
    distance_py(PyObject *self, PyObject *args)
    {
      size_t lensum;
      long int ldist;
    
      if ((ldist = levenshtein_common(args, "distance", 0, &lensum)) < 0)
        return NULL;
    
      return PyInt_FromLong((long)ldist);
    }
    

    which ultimately results in different cost arguments being sent to another internal function, lev_edit_distance, which has the following doc snippet:

    @xcost: If nonzero, the replace operation has weight 2, otherwise all
            edit operations have equal weights of 1.
    

    Code of lev_edit_distance():

    /**
     * lev_edit_distance:
     * @len1: The length of @string1.
     * @string1: A sequence of bytes of length @len1, may contain NUL characters.
     * @len2: The length of @string2.
     * @string2: A sequence of bytes of length @len2, may contain NUL characters.
     * @xcost: If nonzero, the replace operation has weight 2, otherwise all
     *         edit operations have equal weights of 1.
     *
     * Computes Levenshtein edit distance of two strings.
     *
     * Returns: The edit distance.
     **/
    _LEV_STATIC_PY size_t
    lev_edit_distance(size_t len1, const lev_byte *string1,
                      size_t len2, const lev_byte *string2,
                      int xcost)
    {
      size_t i;
    

    [ANSWER]

    So in my example,

    ratio('ab', 'ac') implies one replacement operation (cost of 2), so ldist = 2, and the total length of the two strings is lensum = 4; hence (lensum - ldist)/lensum = (4 - 2)/4 = 0.5.
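
To check that arithmetic without the C extension, here is a minimal pure-Python sketch of the same idea. It is not the library's code: the names edit_distance and replace_cost are made up for illustration, and replace_cost=2 is assumed to mirror what a nonzero xcost does in lev_edit_distance.

```python
def edit_distance(a: str, b: str, replace_cost: int = 1) -> int:
    """Levenshtein distance where a substitution costs `replace_cost`.

    Insertions and deletions always cost 1. This is an illustrative
    re-implementation, not the code from Levenshtein.c.
    """
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else replace_cost)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def ratio(a: str, b: str) -> float:
    """(lensum - ldist) / lensum, with replacements weighted 2 (assumed)."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0  # same special case as in ratio_py
    ldist = edit_distance(a, b, replace_cost=2)
    return (lensum - ldist) / lensum


print(edit_distance("ab", "ac"))                  # 1
print(edit_distance("ab", "ac", replace_cost=2))  # 2
print(ratio("ab", "ac"))                          # 0.5
```

With replace_cost=1 the sketch behaves like distance, and with replace_cost=2 it reproduces the 0.5 from the example.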

    That explains the "how". The only remaining aspect would be the "why", but for the moment I'm satisfied with this understanding.
