word-frequency

word count frequency in document

 ̄綄美尐妖づ submitted on 2019-12-05 15:04:28
I have a directory containing 1000 .txt files. For every word, I want to know how many of the 1000 documents it occurs in. So even if the word "cow" occurred 100 times in document X, it would still be counted only once; if it also occurred in a different document, the count is incremented by one. The maximum is therefore 1000, reached if "cow" appears in every single document. How do I do this the easy way, without using any external library? Here's what I have so far: private Hashtable<String, Integer> getAllWordCount() { Hashtable<String, Integer> result = new
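What the question describes is usually called document frequency: each document contributes at most 1 to a word's count. The following is a minimal sketch of that idea in Python rather than the asker's Java, using only the standard library. The function names, the `\w+` tokenization, the lower-casing, and the `*.txt` layout are assumptions made for illustration, not part of the original post.

```python
import re
from collections import Counter
from pathlib import Path


def document_frequency(texts):
    """Count, for each word, how many of the given texts contain it at least once."""
    df = Counter()
    for text in texts:
        # set() collapses repeats inside one document, so a document adds at most 1 per word.
        df.update(set(re.findall(r"\w+", text.lower())))
    return df


def document_frequency_in_dir(directory):
    """Apply document_frequency to every *.txt file in a directory (assumed layout)."""
    paths = sorted(Path(directory).glob("*.txt"))
    return document_frequency(p.read_text(encoding="utf-8") for p in paths)
```

With three documents `"cow cow cow"`, `"a cow"` and `"no bovines"`, the count for `cow` is 2, because it appears in two documents regardless of how often it repeats in the first.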

Combining Lists of Word Frequency Data

早过忘川 submitted on 2019-12-05 00:58:10
This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results. I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands: {{"the
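The combining step can be sketched independently of Mathematica. The example below is in Python and assumes each list is a sequence of (word, count) pairs, which is what `Tally[ ]` output looks like; the excerpt is cut off before the full list is shown, so that shape and the function name `merge_frequency_lists` are assumptions.

```python
from collections import Counter


def merge_frequency_lists(lists):
    """Sum the counts of identical words across several (word, count) lists."""
    total = Counter()
    for pairs in lists:
        for word, count in pairs:
            total[word] += count
    # most_common() returns (word, count) pairs sorted by descending count.
    return total.most_common()
```

Because only the running totals are kept in memory, the roughly 40 partial lists can be fed in one at a time instead of being loaded together.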

Awk: Characters-frequency from one text file?

穿精又带淫゛_ submitted on 2019-12-04 18:14:54
Given a multilingual .txt file such as: But where is Esope the holly Bastard But where is 생 지 옥 이 군 지 옥 이 지 옥 지 我 是 你 的 爸 爸 ! 爸 爸 ! ! ! 你 不 會 的 ! I counted the frequency of space-separated words using this Awk command: $ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort Getting the elegant: 1 생 1 군 1 Bastard 1 Esope 1 holly 1 the 1 不 1 我 1 是 1 會 2 이 2 But 2 is 2 where 2 你 2 的 3 옥 4 지 4 爸 5 ! How can I change it to count character frequency? EDIT: For character frequency, I used (@Sudo_O's answer): $ grep -o '\S' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}'
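For comparison, the same character count can be sketched in Python. This is not the Awk answer from the post; it is an equivalent under the assumption that "character" means any non-whitespace Unicode code point, which mirrors what `grep -o '\S'` selects. The function name `char_frequency` is hypothetical.

```python
from collections import Counter


def char_frequency(text):
    """Count every non-whitespace character in the text."""
    return Counter(ch for ch in text if not ch.isspace())
```

Python strings are sequences of code points, so Hangul and CJK characters are counted one by one without any locale setup.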

Python nltk counting word and phrase frequency

喜欢而已 submitted on 2019-12-04 15:09:18
Question: I am using NLTK and trying to get the word phrase count up to a certain length for a particular document as well as the frequency of each phrase. I tokenize the string to get the data list. from nltk.util import ngrams from nltk.tokenize import sent_tokenize, word_tokenize from nltk.collocations import * data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"] bigrams = ngrams(data, 2) bigrams_c = {} for b in

To count the frequency of each word

时光毁灭记忆、已成空白 submitted on 2019-12-03 14:06:32
Question: There's a directory with a few text files. How do I count the frequency of each word in each file? A word means a sequence of characters that can contain letters, digits and underscore characters. Answer 1: Here is a solution that should count all the word frequencies in a file: private void countWordsInFile(string file, Dictionary<string, int> words) { var content = File.ReadAllText(file); var wordPattern = new Regex(@"\w+"); foreach (Match match in wordPattern.Matches(content)) { int
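The C# answer is cut off above, so the following is not a completion of it. It is a separate minimal sketch in Python of the same approach: match `\w+` (letters, digits, underscore) and keep one counter per file. The function name `count_words_per_file` and the choice to read every regular file as UTF-8 are assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# \w+ matches runs of letters, digits and underscores (Unicode-aware for str patterns).
WORD = re.compile(r"\w+")


def count_words_per_file(directory):
    """Return a dict mapping each file name in the directory to a Counter of its words."""
    result = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8")
            result[path.name] = Counter(WORD.findall(text))
    return result
```

Keeping a separate `Counter` per file answers the "in each file" part of the question; summing the counters would give directory-wide totals instead.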

Python nltk counting word and phrase frequency

为君一笑 submitted on 2019-12-03 08:51:40
I am using NLTK and trying to get the word phrase count up to a certain length for a particular document as well as the frequency of each phrase. I tokenize the string to get the data list. from nltk.util import ngrams from nltk.tokenize import sent_tokenize, word_tokenize from nltk.collocations import * data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"] bigrams = ngrams(data, 2) bigrams_c = {} for b in bigrams: if b not in bigrams_c: bigrams_c[b] = 1 else: bigrams_c[b] += 1 the above code gives an output
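Counting phrases "up to a certain length" amounts to counting n-grams for every n from 1 to some maximum. Here is a minimal sketch that does this with plain slicing and `collections.Counter` instead of NLTK, so it runs without extra packages; the function name `ngram_counts` and the `max_n` parameter are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n, keyed by a tuple of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # A window of length n fits at len(tokens) - n + 1 starting positions.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

On the `data` list from the question with `max_n=3`, `("this", "is")` is counted 4 times and `("not", "a", "test")` 3 times.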

Awk: Words frequency from one text file, how to output into myFile.txt?

左心房为你撑大大i submitted on 2019-12-02 06:35:48
Given a .txt file with space-separated words such as: But where is Esope the holly Bastard But where is And the following command: cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}' I get the following output in my console: 1 Bastard 1 Esope 1 holly 1 the 2 But 2 is 2 where How do I get this printed into myFile.txt? I actually have 300,000 lines and nearly 2 million words, so it would be better to output the result into a file. EDIT: Used answer (by @Sudo_O): $ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt Your pipeline isn't very
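In the shell, the `> myfileout.txt` redirection shown in the edit is all that is needed. For readers who want the whole step in one script, the sketch below does the counting and the file output in Python; the function name `write_word_counts` and the "count word" line format are assumptions modelled on the console output quoted in the question.

```python
from collections import Counter


def write_word_counts(in_path, out_path):
    """Count whitespace-separated words in in_path and write 'count word' lines to out_path."""
    counts = Counter()
    with open(in_path, encoding="utf-8") as src:
        for line in src:  # reading line by line keeps memory use low on large inputs
            counts.update(line.split())
    with open(out_path, "w", encoding="utf-8") as dst:
        # Sort by count, then by word, so the file has a stable order.
        for word, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
            dst.write(f"{n} {word}\n")
    return counts
```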

String to Dictionary Word Count

╄→尐↘猪︶ㄣ submitted on 2019-12-02 01:16:16
Question: So I'm having trouble with a homework question. Write a function word_counter(input_str) which takes a string input_str and returns a dictionary mapping words in input_str to their occurrence counts. So the code I have so far is: def word_counter(input_str): '''function that counts occurrences of words in a string''' sentence = input_str.lower().split() counts = {} for w in sentence: counts[w] = counts.get(w, 0) + 1 items = counts.items() sorted_items = sorted(items) return sorted_items Now

String to Dictionary Word Count

谁说胖子不能爱 submitted on 2019-12-01 20:59:06
So I'm having trouble with a homework question. Write a function word_counter(input_str) which takes a string input_str and returns a dictionary mapping words in input_str to their occurrence counts. So the code I have so far is: def word_counter(input_str): '''function that counts occurrences of words in a string''' sentence = input_str.lower().split() counts = {} for w in sentence: counts[w] = counts.get(w, 0) + 1 items = counts.items() sorted_items = sorted(items) return sorted_items Now when I run the code with a test case such as word_counter("This is a sentence") in the Python shell I
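The excerpt ends before the asker states the problem, so the actual issue is not known. One thing visible in the posted code is that it returns `sorted(items)`, which is a list of (word, count) tuples, while the assignment asks for a dictionary. A minimal sketch that returns the dictionary itself, keeping the asker's lower-casing and whitespace splitting, could look like this:

```python
def word_counter(input_str):
    """Return a dict mapping each lower-cased word in input_str to its occurrence count."""
    counts = {}
    for w in input_str.lower().split():
        counts[w] = counts.get(w, 0) + 1
    return counts  # a dict, not a sorted list of tuples
```

Calling `word_counter("This is a sentence")` then yields a dictionary with each of the four words mapped to 1.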

Word frequency in Solr

不羁的心 submitted on 2019-11-30 22:07:49
I am trying to get the frequency of words using Solr. When I give this query: localSolr/solr/select?q=someQuery&rows=0&facet=true&facet.field=content&wt=xml Solr gives me the frequencies like: <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="content"> <int name="word1">24</int> <int name="word2">12</int> <int name="word3">8</int> But when I count the words, I find that word2's actual count is 13. Solr counts repeated words within the same field as one. For example, if the field text consists of: word2 word5 word7 word9 word2 . Solr doesn't return word2's count number 2