Extract keywords from text in .NET

Posted by 非 Y 不嫁゛ on 2019-12-06 04:39:10

Question


I need to count how many times each keyword occurs in a string and sort the results by count, highest first. What's the fastest algorithm available in .NET for this purpose?


Answer 1:


EDIT: code below groups unique tokens with count

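// Split on spaces, dropping the empty tokens that repeated spaces would produce.
string[] target = src.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

// Group identical tokens once and count each group, rather than rescanning
// the whole array for every token; then sort by count, highest first.
var results = target
    .GroupBy(t => t)
    .Select(g => new { str = g.Key, count = g.Count() })
    .OrderByDescending(r => r.count);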

This is finally starting to make more sense to me...

EDIT: code below results in count correlated with target substring:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select((t, index) => new {str = t, 
    count = src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))});

Results is now:

+       [0] { str = "string", count = 4 }   <Anonymous Type>
+       [1] { str = "the", count = 4 }  <Anonymous Type>
+       [2] { str = "in", count = 6 }   <Anonymous Type>
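For longer inputs, building a new substring at every character position gets expensive. A minimal sketch of an alternative, assuming ordinal (case-sensitive) matching is acceptable, is to walk the string with `IndexOf` instead. `CountOccurrences` is a hypothetical helper name, not something from the original answer, and the top-level statements assume a recent C# compiler:

```csharp
using System;
using System.Linq;

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = { "string", "the", "in" };

// Pair each target with its count and sort by count, highest first.
var results = target
    .Select(t => new { str = t, count = CountOccurrences(src, t) })
    .OrderByDescending(r => r.count)
    .ToList();

foreach (var r in results)
    Console.WriteLine($"{r.str}: {r.count}");

// Hypothetical helper: counts occurrences of word in text, overlaps included,
// by repeatedly searching from one character past the previous match.
int CountOccurrences(string text, string word)
{
    if (string.IsNullOrEmpty(word)) return 0; // an empty word would match everywhere

    int count = 0;
    int index = text.IndexOf(word, StringComparison.Ordinal);
    while (index >= 0)
    {
        count++;
        index = text.IndexOf(word, index + 1, StringComparison.Ordinal);
    }
    return count;
}
```

Like the `StartsWith` version, this matches raw substrings, so "in" is also counted inside "string" and "starting". Split the text into words first if only whole-word matches should count.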

Original code below:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select(t => src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))).OrderByDescending(t => t);

With grateful acknowledgement to this previous response.

Results from debugger (which need extra logic to include the matching string with its count):

-       results {System.Linq.OrderedEnumerable<int,int>}    
-       Results View    Expanding the Results View will enumerate the IEnumerable   
        [0] 6   int
        [1] 4   int
        [2] 4   int



Answer 2:


Dunno about fastest, but Linq is probably the most understandable:

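var myListOfKeywords = new [] {"struct", "public", ...};

// Split(string[]) needs a StringSplitOptions argument; RemoveEmptyEntries also
// drops the empty tokens produced by adjacent separators.
var keywordCount = from keyword in myProgramText.Split(new [] {" ", "(", ...}, StringSplitOptions.RemoveEmptyEntries)
   group keyword by keyword into g
   where myListOfKeywords.Contains(g.Key)
   orderby g.Count() descending
   select new { g.Key, Count = g.Count() };

foreach (var element in keywordCount)
   Console.WriteLine("Keyword: {0}, Count: {1}", element.Key, element.Count);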

You can write this in a non-Linq-y way, but the basic premise is the same; split the string up into words, and count the occurrences of each word of interest.




Answer 3:


Simple algorithm: Split the string into an array of words, iterate over this array, and store the count of each word in a hash table. Sort by count when done.
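A minimal sketch of that approach, assuming words are separated by single spaces and that case should be ignored (both are assumptions, as is the sample text). It uses a `Dictionary<string, int>` as the hash table and top-level statements, so it needs a recent C# compiler:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample input; any text works.
string text = "the cat and the dog and the bird";

// Hash table from word to count; the comparer makes "The" and "the" one key.
var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
{
    // TryGetValue leaves current at 0 when the word has not been seen yet.
    counts.TryGetValue(word, out int current);
    counts[word] = current + 1;
}

// Sort by count when done, highest first.
var sorted = counts.OrderByDescending(pair => pair.Value).ToList();

foreach (var pair in sorted)
    Console.WriteLine($"{pair.Key}: {pair.Value}");
```

Counting is a single pass over the words, so the sort at the end dominates only when there are many distinct words. Real text would also need punctuation handled in the split step.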




Answer 4:


You could break the string into a collection of strings, one for each word, and then do a LINQ query on the collection. While I doubt it would be the fastest, it would probably be faster than regex.



Source: https://stackoverflow.com/questions/4035563/extract-keywords-from-text-in-net
