.NET library for text algorithms?

放肆的年华 提交于 2019-12-18 10:21:37

问题


Do you know any .NET library for text algorithms??
Especially I'm interested in strings match, and full-text-search algorithms like

  • Bitap algorithm
  • Levenshtein distance
  • Damerau–Levenshtein distance

I know the one I have mentioned are pretty simple to code, but there are hundreds of text algorithms, i don't want to code them all by myself.
If there is no such .NET library known, you can mention C, C++ library, coding wrapper will be easer than coding from zero.


回答1:


You may be interested in checking out the google-diff-match-patch library on Google Code. They have an implementation of Myer's diff algorithm and it claims to also implement a Bitap algorithm "at the heart".

It has the C# source that you're looking for as well as implementations in Java, C++, Lua & Python. Although I don't have the best understanding of how to use Bitap in practice (there are demos in the Google Code project) I think you'll be most interested in the match functions starting around line 1476 of the current version.

UPDATE:

A little digging found an implementation of Levenshtein in C# on CodeProject.

Also, this C# class file contains an implementation of Levenshtein on SourceForge. The implementation is part of the Corsis (aka Tenka Text) project. Author claims that the YetiLevenshtein method (around line 741) is 2x to 10x faster than the implementation used in the CodeProject version of the algorithm referenced above.

UPDATE #2:

I just discovered the wikibook Algorithm implementation with it's C# version of Levenshtein Distance and had to include it because it looks pretty straight and to the point. This wikibook looks like a great reference to keep on hand in general.

Levenshtein Distance in C# (courtesy of Wikibooks)

    private Int32 levenshtein(String a, String b)
    {

        if (string.IsNullOrEmpty(a))
        {
            if (!string.IsNullOrEmpty(b))
            {
                return b.Length;
            }
            return 0;
        }

        if (string.IsNullOrEmpty(b))
        {
            if (!string.IsNullOrEmpty(a))
            {
                return a.Length;
            }
            return 0;
        }

        Int32 cost;
        Int32[,] d = new int[a.Length + 1, b.Length + 1];
        Int32 min1;
        Int32 min2;
        Int32 min3;

        for (Int32 i = 0; i <= d.GetUpperBound(0); i += 1)
        {
            d[i, 0] = i;
        }

        for (Int32 i = 0; i <= d.GetUpperBound(1); i += 1)
        {
            d[0, i] = i;
        }

        for (Int32 i = 1; i <= d.GetUpperBound(0); i += 1)
        {
            for (Int32 j = 1; j <= d.GetUpperBound(1); j += 1)
            {
                cost = Convert.ToInt32(!(a[i-1] == b[j - 1]));

                min1 = d[i - 1, j] + 1;
                min2 = d[i, j - 1] + 1;
                min3 = d[i - 1, j - 1] + cost;
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];

    }



回答2:


I managed to find implementations of most algorithms i need using combination of WikiPedia + Google Code search.

http://en.wikipedia.org/wiki/Category:Algorithms_on_strings
http://www.google.com/codesearch

Though it's strange that no one has created project on this subject, where interested people could collaborate on this.




回答3:


If you're doing string matching, Lucene.Net is worth a look.

However, I know this is not exactly what you're after, and while you can find most of those algorithms in some C# form around, I know of no library incorporating them (I've tended to keep a couple of these in my personal library).

Out of interest, why would you ever need more than one of these full-match algorithms with a couple of threshold parameters?




回答4:


here is one I implemented for Levenshtein / Damerau–Levenshtein distance:

    public static int GetDistance(string left, string right, bool isDamerauDistanceApplied)
    {
        if (left.Length == 0) return right.Length;
        if (right.Length == 0) return left.Length;

        var lenLeft = left.Length;
        var lenRight = right.Length;

        var matrix = new int[lenLeft + 1, lenRight + 1];

        for (var i = 0; i <= lenLeft; i++)
            matrix[i, 0] = i;

        for (var i = 0; i <= lenRight; i++)
            matrix[0, i] = i;

        for (var i = 1; i <= lenLeft; i++)
        {
            for (var j = 1; j <= lenRight; j++)
            {
                var cost = (left[i - 1] == right[j - 1]) ? 0 : 1;

                matrix[i, j] = Math.Min(Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1), matrix[i - 1, j - 1] + cost);

                if (isDamerauDistanceApplied)
                {
                    // Fixed for string base 0 index.
                    if (i > 1 && j > 1 && left[i - 1] == right[j - 2] && left[i - 2] == right[j - 1])
                    {
                        matrix[i, j] = Math.Min(matrix[i, j], matrix[i - 2, j - 2] + cost);
                    }
                }
            }
        }

        return matrix[lenLeft, lenRight];
    }



回答5:


I suggest SimMetrics library, it has many different algorithms for string matching. Available also on NuGet.

Short description:

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).

GPLv2 license.




回答6:


I found and used the following .NET library implementing Aho-Corasick text mathcing from Tom Petricek on a problem I had. It worked great for me.



来源:https://stackoverflow.com/questions/4508307/net-library-for-text-algorithms

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!