Count the number of unique words and occurrence of each word from txt file

两盒软妹~` posted on 2019-12-06 16:29:46

Question


I am currently trying to create an application that does some text processing: it reads in a text file, and I then use a dictionary to build an index of words. The program should read the text file and, for each word, check whether the word is already in the dictionary and what its unique word id is. If it is, the program should print the index number and the total number of appearances for each word it meets, then continue checking through the entire file, producing something like this: http://pastebin.com/CjtcYchF

Here is an example of the text file I'm inputting: http://pastebin.com/ZRVbhWhV A quick Ctrl+F shows that "not" occurs 2 times and "that" occurs 4 times. What I need to do is index each word and report it like this:

sample input : "that I have not that place sunrise beach like not good dirty beach trash beach" 

    dictionary :            output.txt / output.dat:
    index word                     
      1    I                4:2 1:1 2:1 3:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1
      2   have                   
      3   not                    
      4   that                   
      5   place                  
      6   sunrise
      7   beach
      8   like
      9   good
      10  dirty
      11  trash                  

I've tried to implement some code to create the dictionary. Here is what I have so far:

   private void bagofword_Click(object sender, EventArgs e)
            {
                //creating dictionary in background
                    //Dictionary<string, int> dict = new Dictionary<string, int>();
                    string rawinputbow = File.ReadAllText(textBox31.Text);
                    //string[] inputbow = rawinputbow.Split(' ');

                    var inputbow = rawinputbow.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
                                   .ToList();
                    var dict = new OrderedDictionary();
                    var output = new List<int>();

                    foreach (var element in inputbow.Select((word, index) => new { word, index }))
                    {

                        if (dict.Contains(element.word))
                        {
                            var count = (int)dict[element.word];
                            dict[element.word] = ++count;
                            output.Add(GetIndex(dict, element.word));
                            //textBoxfile.Text = output.ToString();
                           // textBoxfile.Text = inputbow.ToString();
                            string result = string.Join(",", output);
                            textBoxfile.Text = result.ToString();
                        }
                        else
                        {
                            dict[element.word] = 1;
                            output.Add(GetIndex(dict, element.word));
                            //textBoxfile.Text = dict.ToString();
                            string result = string.Join(",", output);
                            textBoxfile.Text = result.ToString();
                        }

                    }
    }

    public int GetIndex(OrderedDictionary dictionary, string key)
            {
                for (int index = 0; index < dictionary.Count; index++)
                {
                    if (dictionary[index] == dictionary[key])                   
                        return index; // We found the item       
                        //textBoxfile.Text = index.ToString();
                }

                return -1;
            }

Does anyone know how to complete that code? Any help is much appreciated!


Answer 1:


Use this code:

    // Requires: using System; using System.Linq;
    string input = "that I have not that place sunrise beach like not good dirty beach trash beach";
    // A null separator splits on any whitespace.
    var wordList = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    // Group identical words, count each group, and sort by ascending count.
    var output = wordList.GroupBy(x => x)
                         .Select(x => new Word { Text = x.Key, Count = x.Count() })
                         .OrderBy(x => x.Count);
    foreach (var item in output)
    {
        textBoxfile.Text += item.Text + " : " + item.Count + Environment.NewLine;
    }

Class for holding the data:

    public class Word
    {
        public string Text { get; set; }
        public int Count { get; set; }
    }
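
The snippet above reports each word with its count, but not the "index:count" pairs the question asks for. A minimal sketch of that output follows, under two assumptions that are not in the question: ids are 1-based and assigned in order of first appearance, and words are compared case-insensitively. `WordIndexDemo` and `IndexAndCount` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordIndexDemo
{
    // Returns "id:count" pairs, ordered by each word's first appearance.
    // Assumption: ids are 1-based and handed out in order of first appearance.
    public static string IndexAndCount(string text)
    {
        // Assumption: case-insensitive comparison of words.
        var ids = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var counts = new List<int>(); // counts[id - 1] holds the count for that id

        foreach (var word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            if (ids.TryGetValue(word, out int id))
            {
                counts[id - 1]++;      // word seen before: bump its count
            }
            else
            {
                int newId = ids.Count + 1;
                ids.Add(word, newId);  // new word: give it the next free id
                counts.Add(1);
            }
        }

        return string.Join(" ", counts.Select((count, i) => $"{i + 1}:{count}"));
    }

    public static void Main()
    {
        string input = "that I have not that place sunrise beach like not good dirty beach trash beach";
        Console.WriteLine(IndexAndCount(input));
        // prints: 1:2 2:1 3:1 4:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1
    }
}
```

The table in the question numbers the words differently ("I" is 1 and "that" is 4), which suggests the ids come from an existing dictionary. In that case the ids would have to be looked up from that dictionary instead of being assigned on the fly.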



Answer 2:


Splitting on whitespace is not enough. Your file contains tokens such as "temple,", "photos." or "cafes/restaraunts", where punctuation is attached to the words. A better approach is to use a regex like \w+. The words should also be compared case-insensitively.
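
To make the difference concrete, here is a small sketch comparing a whitespace split with \w+ matching. The sample sentence is made up for illustration and is not taken from the asker's file; `TokenizeDemo` and its methods are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizeDemo
{
    // Splits on whitespace only, so punctuation stays glued to the words.
    public static string[] SplitOnWhitespace(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Treats every run of word characters as one token.
    public static string[] MatchWords(string text) =>
        Regex.Matches(text, @"\w+").Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Hypothetical sample text, not from the asker's file.
        string text = "Nice temple, great photos. Many cafes/restaurants";

        Console.WriteLine(string.Join("|", SplitOnWhitespace(text)));
        // prints: Nice|temple,|great|photos.|Many|cafes/restaurants

        Console.WriteLine(string.Join("|", MatchWords(text)));
        // prints: Nice|temple|great|photos|Many|cafes|restaurants
    }
}
```

Note that \w also matches digits and underscores, and that an apostrophe ends a match, so "don't" would come out as "don" and "t". Whether that is acceptable depends on the text being indexed.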

My approach would be:

var words = Regex.Matches(File.ReadAllText(filename), @"\w+").Cast<Match>()
            .Select((m, pos) => new { Word = m.Value, Pos = pos })
            .GroupBy(s => s.Word, StringComparer.CurrentCultureIgnoreCase)
            .Select(g => new { Word = g.Key, PosInText = g.Select(z => z.Pos).ToList() })
            .ToList();


foreach(var item in words)
{
    Console.WriteLine("{0,-15} POS:{1}", item.Word, string.Join(",", item.PosInText));
}


for (int i = 0; i < words.Count; i++)
{
    Console.Write("{0}:{1} ", i, words[i].PosInText.Count);
} 



Answer 3:


### Sample code for you to tweak for your needs:
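touch test.txt
echo "ravi chandran marappan 30" > test.txt
echo "ramesh kumar marappan 24" >> test.txt
echo "ram lakshman marappan 22" >> test.txt
# Put one word per line, de-duplicate, then count the matches for each word with grep.
# Note: grep -wc counts matching lines, not individual occurrences.
sed -e 's/ /\n/g' test.txt | sort | uniq | awk '{print "echo " $1 " -`grep -wc " $1 " test.txt`"}' | sh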

Results:
22 -1
24 -1
30 -1
chandran -1
kumar -1
lakshman -1
marappan -3
ram -1
ramesh -1
ravi -1


Source: https://stackoverflow.com/questions/32362427/count-the-number-of-unique-words-and-occurrence-of-each-word-from-txt-file
